Years ago Dan Hughes wrote a great blog post explaining how expensive TCP is. His web site is long gone, but I managed to grab the blog post before it disappeared and he kindly allowed me to republish it.
If you ask a CIO which part of their infrastructure costs them the most, I’m sure they’ll mention power, cooling, server hardware, support costs, getting the right people and all the usual answers. I’d argue one the the biggest costs is TCP, or more accurately badly implemented TCP.
Approximately two years ago I tried to figure out whether aggressive marketing of deep buffer data center switches makes sense, recorded a few podcasts on the topic and organized a webinar with JR Rivers.
Not surprisingly, the question keeps popping up, so it seems it’s time for another series of TL&DR articles. Let’s start with the basics:
One of my readers sent me this email after reading my Loop Avoidance in VXLAN Networks blog post:
Not much has changed really! It’s still a flood/learn bridged network, at least in parts. We count 2019 and talk a lot about “fabrics” but have 1980’s networks still.
The networking fundamentals haven’t changed in the last 40 years. We still use IP (sometimes with larger addresses and augmentations that make it harder to use and more vulnerable), stream-based transport protocol on top of that, leak addresses up and down the protocol stack, and rely on technology that was designed to run on 500 meters of thick yellow cable.
I mentioned Multipath TCP (MP-TCP) numerous times in the past but I never managed to get beyond “this is the thing that might solve some TCP multihoming challenges” We fixed this omission in Episode 100 of Software Gone Wild with Christoph Paasch (software engineer @ Apple) and Mat Martineau from Open Source Technology Center @ Intel.
A while ago I found an interesting analysis of HTTP/2 behavior under adverse network conditions. Not surprisingly:
When there is packet loss on the network, congestion controls at the TCP layer will throttle the HTTP/2 streams that are multiplexed within fewer TCP connections. Additionally, because of TCP retry logic, packet loss affecting a single TCP connection will simultaneously impact several HTTP/2 streams while retries occur. In other words, head-of-line blocking has effectively moved from layer 7 of the network stack down to layer 4.
In the second half of his Networks, Buffers and Drops webinar JR Rivers focused on end systems: what tools could you use to measure end-to-end TCP throughput, or monitor performance of an individual socket or the whole TCP stack?
You’ll need at least free ipSpace.net subscription to watch the video.
Here’s the question I got from one of my readers:
Do you have any data available to show the benefits of jumbo frames in 40GE/100GE networks?
In case you’re wondering why he went down this path, here’s the underlying problem:
In autumn 2016 I embarked on a quest to figure out how TCP really works and whether big buffers in data center switches make sense. One of the obvious stops on this journey was a chat with Thomas Graf, Linux Core Team member and a founding member of the Cilium project.
One of my readers watched my TCP, HTTP and SPDY webinar and disagreed with my assertion that shaping sometimes works better than policing.
TL&DR summary: policing = dropping excess packets, shaping = delaying excess packets.
When someone tells you that “TCP is a lossy protocol” during a job interview, don’t throw him out immediately – he was just trusting the Internet a bit too much (click to enlarge).
Everyone has a bad hair day, and it really doesn’t matter who published that text… but if you’re publishing technical information, at least try to do no harm.
A long time ago in a podcast far, far away one of the hosts saddled his pony unicorn and started explaining how stateful firewalls work:
Stateful firewall is a way to imply trust… because it’s possible to hijack somebody’s flows […] and if the application changes its port numbers… my source port changes when I’m communicating with my web server - even though I’m connected to port 80, my source port might change from X to Y. Once I let the first one through, I need to track those port changes […]
WAIT, WHAT? Was that guy really trying to say “someone can change a source port number of an established TCP session”?
Achieving 40 Gbps of forwarding performance on an Intel server is no longer a big deal - Juniper got to 160 Gbps with finely tuned architecture - but can you do real-time optimization of a million concurrent TCP sessions on that same box at 20 Gbps?
My “Was it bufferbloat?” blog post generated an unexpected amount of responses, most of them focusing on a side note saying “it looks like there really are service providers out there that are clueless enough to reorder packets within a TCP session”. Let’s walk through them.