Is It Time to Replace TCP in Data Centers?
One of my readers asked for my opinion about the provocative “It’s Time to Replace TCP in the Datacenter” article by prof. John Ousterhout. I started reading it, found too many things that didn’t make sense, and decided to ignore it as another attempt by a proverbial physicist to solve hard problems in someone else’s field.
However, pointers to that article kept popping up, and I eventually realized it was a position paper in a long-term process that included conference talks, interviews and keynote speeches, so I decided to take another look at the technical details.
Let’s start with the abstract:
Every significant element of TCP, from its stream orientation to its expectation of in-order packet delivery, is wrong for the data center. It is time to recognize that TCP’s problems are too fundamental and interrelated to be fixed; the only way to harness the full performance potential of modern networks is to introduce a new transport protocol into the data center.
If that doesn’t sound like a manifesto rallying point, I don’t know what does. Everyone is fed up with TCP in one way or another (quite often because TCP gets blamed for unrelated problems like the lack of a session layer), so it’s not hard to react to that message with a “hear, hear!!!” TCP specifications are also full of arcane nerd knobs resulting in bizarre behavior due to incompatible features being enabled by default in different TCP/IP protocol stacks.
However, most attempts to persuade networking engineers to ignore the laws of physics inevitably fail when marketing encounters reality. Let’s see how well the It’s Time to Replace TCP in the Datacenter article fares in that respect.
Prof. Ousterhout compares TCP with Homa, a new message-based connectionless transport protocol. The claims in the position paper seem to be based on measurements described in A Linux Kernel Implementation of the Homa Transport Protocol article, which used the following setup:
- The majority of the results were produced in a cluster of 40 nodes with a full mesh of bidirectional client-server traffic.
- It looks like the setup used a small number of parallel TCP sessions between each pair of nodes and tried to send multiple independent messages across a single TCP session, resulting in transport-layer head-of-line blocking (more about that later).
For starters, it’s worth pointing out that the above scenario does not correspond to how TCP is used in most application stacks or production environments and is heavily biased toward connectionless alternatives to TCP, making the measurement results potentially inapplicable to real life.
Transport Protocol Requirements
Ignoring the biased abstract, let’s dig into the technical details, starting with the transport protocol requirements (section 2):
- Reliable delivery
- Low latency
- High throughput (including high message throughput)
- Congestion control
- Efficient load balancing across server cores
- NIC offload
So far so good, but there’s a conspicuously missing requirement in that list: the recipient must know when it’s safe to start working on a message, and it might not be safe to handle incoming messages out-of-order. While the order of execution might not matter for read-only transactions with no side effects, it’s crucial if the messages cause a state change in the recipient application.
While read-only transactions are usually more common than read-write ones, the latter tend to impact data consistency and raise the interesting challenge of providing consistent data to read-after-write transactions. Someone has to reshuffle incoming out-of-order messages into the sequence in which they were sent, and if the transport protocol is not doing that, everyone else has to reinvent the wheel. Look at the socket API or happy eyeballs if you want to know how well that ends.
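If the transport protocol doesn’t resequence messages, every application has to carry its own version of the same logic. Here’s a minimal sketch of such a resequencing buffer (a hypothetical API, for illustration only): messages that arrive early are parked until all their predecessors have been delivered.

```python
# Minimal sketch of application-level message resequencing: a hypothetical
# receiver buffers messages that arrive out of order and hands them to the
# application only once every earlier message is in.

def make_reorderer():
    buffer = {}            # seq -> payload for messages that arrived early
    state = {"next": 0}    # sequence number the application expects next

    def deliver(seq, payload):
        """Accept (seq, payload); return payloads now deliverable in order."""
        buffer[seq] = payload
        ready = []
        while state["next"] in buffer:
            ready.append(buffer.pop(state["next"]))
            state["next"] += 1
        return ready

    return deliver

deliver = make_reorderer()
print(deliver(1, "b"))   # [] -- message 0 is still missing
print(deliver(0, "a"))   # ['a', 'b'] -- gap filled, both released
```

Every RPC framework riding on an unordered transport ends up embedding a variant of this, which is exactly the wheel-reinvention the paragraph above complains about.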
The efficient load balancing across server cores (as opposed to across the network) is also worrying – it implies the CPU cores are the bottleneck, and might indicate unfamiliarity with state-of-the-art high-speed packet processing. The focus on CPU cores and hotspots described in the deep-dive article might be an artifact of using a small number of TCP sessions, which is an unnecessary artificial restriction – systems running Nginx or NodeJS were scaled to a million connections per server a decade ago. The statistical distribution of TCP sessions across cores would also be much better with a larger number of TCP sessions.
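For what it’s worth, Linux has had a standard mechanism for spreading many TCP sessions across per-core worker processes for years: SO_REUSEPORT lets several workers bind the same address, and the kernel distributes incoming connections among them. A minimal sketch (two listeners standing in for two per-core workers):

```python
# Sketch: SO_REUSEPORT lets multiple worker processes (typically one per
# core) listen on the same address; the kernel spreads incoming
# connections across them. Linux-specific; two sockets in one process
# stand in for two workers here.
import socket

def make_worker_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

first = make_worker_listener(0)        # let the kernel pick a free port
port = first.getsockname()[1]
second = make_worker_listener(port)    # second "worker" binds the same port
print(first.getsockname()[1] == second.getsockname()[1])
```

The statistical spread across workers gets better as the number of sessions grows, which is another reason the small-session-count test setup is an odd choice.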
Speaking of high-speed networking: most optimized implementations use CPU cores dedicated to packet processing, sometimes reaching a throughput of tens of Gbps per CPU core – see for example Snap: a Microkernel Approach to Host Networking, or what fd.io is doing. I have no idea where the author got “Driving a 100 Gbps network at 80% utilization in both directions consumes 10–20 cores just in the networking stack” – the Snap paper quoted as a reference definitely does not support that claim.
Everything about TCP Is Wrong
The next section of the article identifies everything that’s wrong with TCP:
- Stream orientation (as opposed to messages)
- Connection-oriented protocol
- Bandwidth sharing
- Sender-driven congestion control
- In-order packet delivery
The first two points are valid – TCP was designed to be a connection-oriented protocol transporting a stream of bytes [1]. The article goes into long lamentations about why that’s bad and why a message-oriented transport protocol would be so much better, but never asks even one of these simple questions (let alone tries to answer them):
- What’s wrong with Infiniband? Environments focused on low latency often use Infiniband. Why should we use Homa to solve the challenges it is supposed to solve if we have a ready-to-use proven technology?
- What’s wrong with RoCE? Maybe you dislike Infiniband and prefer Ethernet as the transport fabric. In that case, you could use RDMA over Converged Ethernet (RoCE) as a low-latency high-throughput mechanism.
- What’s wrong with existing alternatives? Even if you don’t like RDMA, you could choose between several existing message-oriented transport protocols. SCTP and QUIC immediately come to mind. There’s also SRD used by AWS.
- What happened to other RPC frameworks? We had solutions implementing Remote Procedure Calls (RPC) on top of UDP, including Sun RPC, DNS and NFS. Some of them faded away; others started using TCP as an alternate transport mechanism instead of improving UDP. gRPC (the new cool RPC kid) uses HTTP/2 [2], which rides on top of TCP. Why should that be the case, considering how bad TCP supposedly is?
- Why is everyone still using TCP? There’s no regulatory requirement to use TCP, and everyone is unhappy with it. Why are we still using it? Why is nobody using alternatives like SCTP? Why is there no QUIC-like protocol but without encryption (which is often not needed within data centers)? Why is nobody (apart from giants like Google and AWS) investing significant resources in developing an alternative protocol? Are we all lazy, or incompetent, or is TCP just good enough to shrug and move on?
Anyway, not mentioning any of the alternative transport protocols is a significant omission that makes the paper seem biased, and misses the opportunity to discuss their potential shortcomings and how Homa might address them.
But wait, it gets better when we get to the networking details.
Starting with bandwidth sharing, the article claims that:
In TCP, when a host’s link is overloaded (either for incoming or outgoing traffic), TCP attempts to share the available bandwidth equally among the active connections. This approach is also referred to as “fair scheduling.”
That claim ignores a few minor details:
- It’s really hard to influence incoming traffic.
- Bandwidth sharing is just a side-effect of running many independent TCP sessions across a network using FIFO queueing, but even then it’s far from ideal.
- While a TCP/IP stack might try to provide fair scheduling of outbound traffic (but often does not), all bets are off once the traffic enters the network.
- Obviously, one could use QoS mechanisms (including priority queuing and weighted round-robin queuing) to change that behavior should one wish to do so.
The next claim is even worse (the sender-driven congestion control section makes similar claims):
TCP’s approach discriminates heavily against short messages. Figure 1 shows how round-trip latencies for messages of different sizes slow down when running on a heavily loaded network compared to messages of the same size on an unloaded network.
I checked Figure 1, and it uses a workload “based on message size distribution measured on a Hadoop cluster at Facebook” – not exactly the most common data center scenario, but one that might exacerbate TCP shortcomings due to heavy incast. Even worse, the deep-dive article (the source for Figure 1) compares Homa (a connectionless messaging protocol) with a scenario in which a single TCP session seems to carry numerous messages generated in parallel:
Streams also have the disadvantage of enforcing FIFO ordering on their messages. As a result, long messages can severely delay short ones that follow them. This head-of-line blocking is one of the primary sources of tail latency measured for TCP in Section 5.
Sending independent messages over a single TCP session obviously results in head-of-line blocking – behavior that has been well-understood since the early days of HTTP. The solutions to head-of-line blocking are also well-known, from parallel sessions to QUIC.
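The FIFO effect is easy to quantify with a toy model (illustrative numbers only, not measurements): on a single byte stream, a short message queued behind a long one inherits the long message’s entire transmission time, while the same short message on its own stream completes almost immediately.

```python
# Toy model of transport-layer head-of-line blocking: independent
# messages multiplexed onto one FIFO byte stream vs. a message on its
# own stream. Link speed and message sizes are illustrative.

LINK_BYTES_PER_MS = 1000  # assumed link drain rate

def completion_times_fifo(messages):
    """One stream: each message waits for every byte queued before it."""
    t, out = 0.0, []
    for size in messages:
        t += size / LINK_BYTES_PER_MS
        out.append(t)
    return out

# A 1 MB message followed by a 1 KB message on a single shared stream:
shared = completion_times_fifo([1_000_000, 1_000])
# The same 1 KB message alone on its own stream:
alone = completion_times_fifo([1_000])
print(shared[1], alone[0])   # 1001.0 ms vs 1.0 ms
```

Note this toy model ignores bandwidth sharing between parallel streams; its only point is that the delay is a property of multiplexing independent messages onto one FIFO stream, not of TCP itself.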
Let me also reiterate that this is not how TCP is commonly used. Environments in which a single application client would send multiple independent competing messages to a single server tend to be rare.
Anyway, ignoring the “apples to mushroom soup” comparison, which skews the results so far that they become meaningless, let’s focus on incast-generated congestion. Yet again, the countermeasures have been known for decades, from flow-based queuing to priority queuing or data center TCP (DCTCP) that reduces the average queue length.
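For illustration, enabling DCTCP on a Linux sender is a two-line configuration change (assuming the switches along the path do ECN marking with DCTCP-style thresholds; kernel support and module availability vary):

```shell
# Switch the congestion-control algorithm to DCTCP and enable ECN
# negotiation; both sysctls exist in mainline Linux (DCTCP since 3.18).
sysctl -w net.ipv4.tcp_congestion_control=dctcp
sysctl -w net.ipv4.tcp_ecn=1
```

In other words, the incast countermeasure is a deployment decision, not a research problem.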
In any case, once we accept that it doesn’t make sense to use a single high-volume TCP session between two nodes, what’s stopping an optimized TCP implementation from setting a higher priority on traffic belonging to low-bandwidth conversations [3], similar to a proof-of-concept implemented in Open vSwitch years ago [4]?
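An application (as opposed to the TCP stack) can already approximate that today with DSCP marking on its low-bandwidth control connections. A minimal sketch, assuming the network trusts host marking (the DSCP class chosen here is just an example):

```python
# Hedged sketch: mark a (hypothetical) low-bandwidth control connection
# with a higher DSCP value so network QoS -- if it trusts host marking --
# can prioritize it over bulk transfers. CS5 is an arbitrary example.
import socket

DSCP_CS5 = 40
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# DSCP occupies the upper six bits of the former TOS byte.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_CS5 << 2)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 160
```

Of course, whether the network honors the marking is another story entirely (see the footnote above about trusting host QoS marking).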
Finally, there’s the in-order packet delivery – the article claims that:
TCP assumes that packets will arrive at a receiver in the same order they were transmitted by the sender, and it assumes that out-of-order arrivals indicate packet drops.
IP never guaranteed in-order delivery [5], so TCP never assumed in-order packet arrival. The second part of the claim might have been valid in certain TCP implementations decades ago but is no longer true [6].
But wait, it gets worse. When AWS introduced SRD as the transport protocol for overlay virtual networks, they claimed they observed significant TCP performance improvements due to SRD packet spraying. At the same time, the Homa position paper claims that:
However, packet spraying cannot be used with TCP since it could change the order in which packets arrive at their destination.
One of them must be wrong.
The ultimate gem in this section is the following handwaving argument:
I hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of data center networks.
As the server-to-network links tend to be an order of magnitude slower than the core links, I hypothesize that it takes a significant amount of bad luck to get core congestion solely based on flow-consistent routing. The statistical details are left as an exercise for the reader.
Regardless of the validity of that claim, flow-consistent routing might result in imbalance in edge or core link utilization, but even there we have numerous well-known solutions. I wrote about flowlets in 2015 when there were already several production-grade implementations, while the FlowBender paper was published in 2014. There’s also the Conga paper implemented in Cisco ACI, and a number of other well-known mechanisms that can be used to alleviate the imbalance.
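The flowlet idea itself is simple enough to sketch: packets of a single flow separated by an idle gap larger than the worst-case path-delay difference can be moved to another path without risking reordering. A toy illustration (the gap threshold is an assumed parameter, not a recommendation):

```python
# Toy illustration of flowlet switching: within one flow, an idle gap
# larger than the maximum path-delay skew starts a new flowlet, which
# the fabric may safely steer onto a different path.

FLOWLET_GAP = 0.5  # ms; assumed to exceed the worst-case path-delay skew

def flowlet_ids(arrival_times_ms):
    """Assign a flowlet id to each packet of a single flow."""
    ids, current, last = [], 0, None
    for t in arrival_times_ms:
        if last is not None and t - last > FLOWLET_GAP:
            current += 1   # idle gap -> new flowlet, may take a new path
        ids.append(current)
        last = t
    return ids

print(flowlet_ids([0.0, 0.1, 0.2, 1.5, 1.6]))  # [0, 0, 0, 1, 1]
```

The point being: finer-than-flow load balancing without reordering was a solved problem well before the position paper was written.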
Is It Time to Replace TCP in Data Centers?
I would love to see a reliable (as opposed to UDP) message-based transport protocol between components of an application stack and have absolutely no problem admitting that TCP is not the best tool for that job.
Also, based on the A Linux Kernel Implementation of the Homa Transport Protocol article, in which the author described the proposed solution, it looks like they spent a lot of time implementing Homa as a Linux kernel module. I wish them all the best, but unfortunately I can’t take Homa seriously based on the arguments made in this position paper – why should I trust the claimed benefits of a proposed solution if the position paper favoring it misidentified the problems and ignored all prior art?
Anyway, is this a problem worth solving? In many cases, the answer is a resounding no. Application stacks often include components that generate much higher latency than what TCP could cause. From that perspective, Homa looks a lot like another entry in the long list of solutions in desperate search of a problem.
Finally, the It’s Time to Replace TCP in the Datacenter article focuses solely on application-to-application messaging, which might be relevant for the “niche” applications like large microservices-based web properties, high-performance computing, or map-reduce clusters that use in-memory data structures or local storage. At the same time, it completely ignores long-lived high-volume connections that represent the majority of traffic in most data centers: storage traffic, be it access to remote disk volumes or data synchronization between nodes participating in a distributed file system or distributed database.
1. Also, complaining about a screwdriver being bad at hammering nails just tells everyone you’re misusing the tools. ↩︎
2. Which totally surprised me as it kills some really cool ideas you implement with streaming telemetry. It’s quite possible I’m missing something fundamental. ↩︎
3. At this point I’m almost expecting a comment from prof. Olivier Bonaventure saying “we implemented this a decade ago, and here’s the link to the corresponding paper.” ↩︎
4. Reality might intervene: the network devices often ignore those signals as most networks don’t trust host QoS marking, usually for an excellent reason – everyone thinks whatever they’re doing must be high priority. ↩︎
5. That’s why it was so cumbersome to implement Token Ring bridging on top of IP networks. ↩︎
6. Or so I was told by people who worked on TCP/IP implementations for ages. Please feel free to correct my ignorance. ↩︎
Thank you for this article.
While "expectation of in-order packet delivery" was an eyebrow-raiser for me (it's literally the opposite!), the complaint that TCP has HoL blocking when unrelated messages are fired into a single socket was my signal to put this paper down and do something else.
I hope those fanning the hype on this topic can reorganize around reasonable criticisms and actual problems experienced by sensibly written applications.
Oh... And Golang absolutely needs more jabs for its ridiculous violation of both Postel's law and the Principle of Least Surprise with its insane TCP_NODELAY defaults. While it's probably too late to un-ring that bell, the situation sure smells like a similar bad call based on somebody's narrow view on what constitutes a widespread problem.
Thanks for this article! I have enjoyed reading it...
I regularly get such academic ideas without seeing proper experience in real-life networking. I am so tired of them... Since the early nineties... :-)
Most of those research guys have no ideas about realities. They should read your blog to capture some stuff... :-) Keep going!
Hmmm, I read the three relevant papers (2 homa papers + the position paper) and John's response (see comments below) after this post. My impression is quite the opposite: it appears that the "research guys" actually went into great depth in understanding the problems and measuring the performance, while some "real engineers" are too prone to dismiss interesting things that go against their intuition or get bogged down with secondary issues.
I agree that the position paper could have been worded better and it definitely has some inaccurate/incorrect claims about TCP. But, overall, I find the author's response rather convincing.
B. Montazeri, Y. Li, M. Alizadeh, and J. Ousterhout, "Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities," Proc. ACM SIGCOMM 2018, August 2018, pp. 221–235. Complete version
J. Ousterhout, "A Linux Kernel Implementation of the Homa Transport Protocol," 2021 USENIX Annual Technical Conference (USENIX ATC '21), July 2021, pp. 773–787.
As far as I'm concerned, Homa might be the best transport protocol ever invented, and they could have done a great implementation job.
However, the measurement methodology used in the deep dive papers makes the comparison between TCP and Homa useless, as does ignoring a half-dozen other technologies that I mentioned in the blog post.
Version 1 of the position paper published on October 3rd 2022 makes several obviously invalid claims, ignores prior art and real-life deployment of production-grade solutions like RDMA within HPC clusters, SRD within AWS, or RoCE within Azure, and fails to explain why Homa would be better than any of the obviously-successful alternatives. Please don't claim that the "networking engineers are too prone to dismiss interesting things that go against their intuition or get bogged down with secondary issues" if the authors proposing new ideas don't get the obvious stuff right.
I don't have problem with people wanting to push protocols. My thinking on this is similar to that of Rodney Brooks: try thousands of things, a lot of them will fail, but we'll learn a lot and realize what's working to keep. But people should go beyond talking about protocols on papers; it rings hollow. Have the guts to implement them in the real world (or are they afraid of risking their own money?) If a protocol can prove itself, surely it'll catch on in time.
Having said that, there's no need to use the typical tactic of talking down another one to boost your own, and if you do so, stick to the right metrics. I agree with the majority of what you wrote in this blog. TCP has its own flaws, serious ones at that, but a lot of what John brought up was bogus. For ex:
"In TCP, when a host’s link is overloaded (either for incoming or outgoing traffic), TCP attempts to share the available bandwidth equally among the active connections. This approach is also referred to as “fair scheduling.”
BS, a transport protocol does none of that, neither should it. TCP doesn't do it either. Why should TCP or any transport protocol care about scheduling? Queuing/scheduling is a network function, be it done in the NIC or in the routers/switches. If I'm wrong, pls point out.
TCP needs in-order delivery: utter nonsense. What are its sequence numbers and retransmission mechanism for?
"it assumes that out-of-order arrivals indicate packet drops." Don't think so. When a packet doesn't come within its expected time window, then TCP assumes it's lost in transit/dropped and attempts rectification. If it arrives out of order but doesn't get lost, TCP couldn't care less.
"packet spraying cannot be used with TCP": per-packet LB has existed for so long. It's used within networks (not on Internet where flow-based LB is used) to promote better LB than flow-based and because within a network, it's easier to control QoS.
"I hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of data center networks": congestion often happens due to hotspots, which can be due to sudden overload (link failure, flapping), or incast, LRD/self-similar traffic...Whether you use flow-based or packet-based routing, you can't get over these inconveniences. That's why trivial traffic modeling can't speak for reality, and we need to put protocols in action instead of just testing them in some simplistic lab scenarios.
Homa has its merit, so John and his team should spend their effort pushing it in real networks and see how it fares in reality. That will be more productive and helpful, than writing paper talking down TCP on faults it's not responsible for.
I think you are accusing technical incompetence where you should be criticising weak phrasing.
"BS, a transport protocol does none of that, neither should it. TCP doesn't do it either. Why should TCP or any transport protocol care about scheduling? Queuing/scheduling is a network function, be it done in the NIC or in the routers/switches. If I'm wrong, pls point out."
Fair scheduling probably is not the most precise term, but congestion control basically trickles down to trying to achieve something approaching fairness.
"TCP needs in-order delivery: utter nonsense. What are its segment number and retransmission mechanism for?"
Avoiding much out-of-order delivery is important still if you care about performance, which is what this paper is all about.
"Don't think so. When a packet doesn't come within its expected time window, then TCP assumes it's lost in transit/dropped and attempts rectification. If it arrives out of order but doesn't get lost, TCP couldn't care less."
However, with many in-between packets, as can occur in the described context (packet-granular link load balancing), fast retransmit may cause additional traffic and reduce the congestion window/sending rate.
"per-packet LB has existed for so long. It's used within networks (not on Internet where flow-based LB is used) to promote better LB than flow-based and because within a network, it's easier to control QoS." I do however agree with this. The prevalent notion that packet spraying causes heavy out-of-order delivery is usually wrong on paths with symmetric link capacities and similar delay.
"congestion often happens due to hotspots, which can be due to sudden overload (link failure, flapping), or incast, LRD/self-similar traffic" again, the argument is not really that flow-consistent routing is actually the root cause, but that, if it were not for that, the available link capacity could be used more effectively.
All in all, while I do agree that with a title like that he is just asking for angry responses, I also wonder whether the critics could be a bit fairer and maybe even more polite. The notion that this guy has just no idea what he is talking about because he is an academic is cheap and just plain wrong.
Thx for your thought. No, I'm not angry with John, otherwise I'd not have said Homa has merits and John and his team should spend more effort trying to implement it in real networks that they can build/get their hands on. That's a literal comment, not sarcasm.
"Fair scheduling probably is not the most precise term, but congestion control basically trickles down to trying to achieve something approaching fairness."
A man of John's stature should know the terms better, esp. if he wants his ideas to be taken more seriously. Congestion control is, as its name suggests, about relieving congestion, not about fair go between flows. For fair go between flows, that's technically a network problem and should be solved in the network, using QoS, with superior results. Putting scheduling into the transport protocol, is far from ideal, to put it mildly.
Also, besides the point, but adding crappy kludges like fast retransmit and delayed ack is a classic way of generating more complexity due to an incomplete understanding of the problem's (congestion control) nature. DCTCP ECN is a step in the right direction, but taken at the wrong layer, hence the limited effect. In DC environment though, due to the small size of the network (vs the big Internet), DCTCP ECN should be "good enough" for the most use cases though.
"Avoiding much out-of-order delivery is important still if you care about performance, which is what this paper is all about."
Yes, I agree. However, the wording is wrong: "However, packet spraying cannot be used with TCP since it could change the order in which packets arrive at their destination." Here John implied that TCP needs in-order delivery. TCP doesn't NEED in-order delivery. It can be good to have it if latency is of concerns, but NEED implies something essential, which is not the case here, as packet spraying/per-packet LB has existed for ages, and TCP ran on top of it no problem. Again, a man like John should have known the difference. Did he not know, or did he intentionally mislead, in order to promote Homa over TCP? That's my beef.
"the argument is not really that flow-consistent routing is actually the root cause"
John's wording "I hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of data center networks," suggests to me he thinks it's the root cause :)) .
Like I said, I don't think people should be restricted to 1 or 2 transport protocols. They should try more, and keep what works. Protocol reverence and holy wars are just dumb. But when criticizing, stick to the relevant problems. Making additional sockets to send unrelated msg is a task that should be the responsibility of the applications, which are aware of what they are sending and when. Dumping more complexity into the transport protocol is the wrong way to do things. The transport protocol should be generic enough and simple enough to scale well, as should the network. Policy should be independent of mechanism, and the transport protocol, like the network, should worry about the mechanism, the how, not the what. The L4 OS, for ex, gets its breathtaking speed from stripping off all non-essential elements. Simplicity can be very performance-enhancing. Want a fast network and fast transport? Do the same.
"A man of John's stature should know the terms better, esp. if he wants his ideas to be taken more seriously."
No reason to purposely "misunderstand" what he is writing though. Also, I still take him seriously, because he seems to be doing interesting and solid work and some appreciate that.
"Congestion control is, as its name suggests, about relieving congestion, not about fair go between flows. For fair go between flows, that's technically a network problem and should be solved in the network, using QoS, with superior results. Putting scheduling into the transport protocol, is far from ideal, to put it mildly."
That is arguable at best and more importantly totally irrelevant to what the paper was alluding to.
"However, the wording is wrong. [...]"
That's what I've been saying. The wording could be better, but implying that this is actually what was meant, and extending further argumentation based on that, serves nothing but self-gratification.
John's wording "I hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of data center networks," suggests to me he thinks it's the root cause :)) .
Again. Obviously the routing in itself is seldom the root cause. It's an argument for the benefit of being able to allow for more fine-granular routing. He claims that the benefits are larger compared to improvements based on other routing techniques, i.e. without it you are less efficient and therefore "create congestion" when traffic load is high. The claim of that being the biggest improvement is in most cases not true imo, but it depends on topology, traffic and what you compare to. But implying he thinks that flow-consistent routing would somehow create congestion, not bc of being less efficient, but as a root cause, is either ill-natured or lacks empathy.
Ivan raises lots of interesting points, as do the comments above. I've written a response and posted it on the Homa Wiki: https://homa-transport.atlassian.net/l/cp/27FLingq. Additional comments are welcome.
The proliferation of scientific papers that are flawed, lack evidence or facts, and ignore logic is a growing concern in the scientific community. While the format of scientific papers may lend them an air of authenticity, the reality is that many of these papers are little more than opinionated tweets masquerading as serious research.
One of the primary reasons for the rise in flawed scientific papers is the pressure to publish. In academia, publishing research is often seen as a key metric of success, and this pressure can lead to researchers cutting corners or rushing to publish without proper peer review or fact-checking. As a result, many papers lack rigor and fail to meet the basic standards of scientific inquiry.
Another factor is the increasing influence of politics and ideology on scientific research. In some cases, researchers may be more interested in advancing a particular political or ideological agenda than in conducting rigorous research. This can lead to biased research, cherry-picking of data, and other practices that undermine the credibility of scientific inquiry.
The consequences of flawed scientific papers can be severe, ranging from wasted resources to public health risks. In some cases, flawed research can lead to false conclusions that can have a real-world impact on policy decisions, medical treatments, and public health interventions. This highlights the urgent need for more rigorous standards and oversight of scientific research, as well as greater transparency and accountability in the publishing process.
In summary, the prevalence of flawed scientific papers is a serious issue that undermines the credibility of scientific inquiry and can have real-world consequences. To address this issue, it's important to promote rigorous standards, greater transparency, and increased oversight of the publishing process.
Late to the party, but I'd like to comment on Homa's insistence on spraying. I think it is just a dead-end optimization in data centers going forward. Spraying basically relies on two things:

- the network is very symmetrical and all we'll ever need is ECMP or WCMP at most (think Clos)
- everybody else is doing spraying – otherwise we'll have transient hotspots
But what we see now is some ideas from HPC world slowly percolating into IP/Ethernet data center world, including new topologies. And all interesting (as in smaller diameter, larger network with the same number of switches and links) topologies are less symmetrical than Clos and utilize non-minimal paths. In fact, most of their capacity is via non-minimal paths. Topologies with non minimal paths require some forwarding and routing tricks, but they also require global adaptive routing for efficient utilization. ECMP just can't describe desired behavior like utilizing shortest paths first and then non-minimal ones and trying alternative paths instead of reducing rate as a possible reaction to congestion. Global adaptive routing in turn needs some identifiable long-lived traffic artefacts, otherwise there is nothing to move to an alternative path and no way to adapt.
PS. Google had a series of papers at SIGCOMM 2022 about their Dragonfly+-like data center topology and adaptive routing.
Thanks for the comment. I still think that Clos is the optimal any-to-any fabric, both from the switch- and bandwidth utilization perspective.
Obviously Google had to depart from that when they effectively reinvented what Plexxi has been selling a decade ago (although Google claims they can flip the spine mirrors whereas Plexxi had fixed topology), and in scenarios where you have a bunch of cables (physical, virtual, or mirrored) between leaf switches you have to get creative.