Do Packet Drops Matter for TCP Performance?
Approximately two years ago I tried to figure out whether aggressive marketing of deep buffer data center switches makes sense, recorded a few podcasts on the topic and organized a webinar with JR Rivers.
Not surprisingly, the question keeps popping up, so it seems it’s time for another series of TL&DR articles. Let’s start with the basics:
- Every network will eventually experience congestion. What matters is how the network deals with the congestion.
- Network congestion results in either packet drops or increased latency, and you can’t avoid both of them at the same time without alleviating congestion.
- Don’t believe in the magic powers of QoS - it’s a zero-sum game.
- The only way to alleviate network congestion is to reduce the amount of data injected into the network;
- You can control the network load by dropping excess traffic before it enters the network (ingress policing), or by persuading the senders to reduce the transmission rate.
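To make the ingress policing option concrete, here is a minimal sketch of a single-rate token-bucket policer. It is an illustration only: the class name, its parameters, and the traffic numbers are assumptions made for this example, not the behavior of any particular device.

```python
from dataclasses import dataclass


@dataclass
class TokenBucketPolicer:
    """Hypothetical single-rate policer: excess packets are dropped, not queued."""
    rate_bps: float        # contracted rate in bits per second (assumed parameter)
    burst_bytes: float     # bucket depth in bytes (assumed parameter)
    tokens: float = 0.0    # current bucket fill, in bytes
    last_ts: float = 0.0   # arrival time of the previous packet, in seconds

    def __post_init__(self) -> None:
        self.tokens = self.burst_bytes  # start with a full bucket

    def allow(self, now: float, size_bytes: int) -> bool:
        # Refill the bucket for the time elapsed since the previous packet.
        elapsed = now - self.last_ts
        self.last_ts = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bps / 8)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True    # conforming packet: forward it
        return False       # excess packet: drop it before it enters the network


# 1500-byte packets arriving every millisecond offer 12 Mbps to an 8 Mbps contract.
policer = TokenBucketPolicer(rate_bps=8_000_000, burst_bytes=4200)
verdicts = [policer.allow(now=i * 0.001, size_bytes=1500) for i in range(10)]
print(verdicts)
```

In this run the first six packets ride on the burst allowance, after which roughly one packet in three is dropped. A traffic shaper would use the same bucket but delay the excess packets instead of dropping them.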
If you’re worried about the impact of packet drops you might want to avoid policing (= packet drops) at the network edge and focus on adjusting the senders’ transmission rate. You could use static mechanisms (traffic shaping) or try to guesstimate how much traffic the network can carry. Time for another set of facts:
- Early networking technologies implemented a plethora of backpressure mechanisms that a network node could use to reduce the ingress traffic. These technologies include per-link flow control (X.25 and Fibre Channel buffer-to-buffer credits) and stop-and-start mechanisms (XON/XOFF, PAUSE frames, Priority Flow Control).
- Hop-by-hop backpressure mechanisms usually aim to implement a lossless transport network, resulting in suboptimal performance or deep buffer requirements on links with large round-trip times (RTT). In-depth analysis of this claim is left as an exercise for the reader.
- There are tons of other problems with hop-by-hop backpressure mechanisms, including an increased amount of network state and head-of-line blocking. Yet again, we won’t go into details (but feel free to figure them out).
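As a rough back-of-the-envelope illustration of why lossless hop-by-hop backpressure needs more buffering as the link RTT grows: after a node signals "stop", the sender keeps transmitting until the signal arrives, and everything already on the wire still has to land somewhere. The sketch below assumes roughly 5 µs of propagation delay per kilometer of fiber and ignores reaction time and frames already being serialized, so treat the numbers as a lower bound. The function name and parameters are made up for this example.

```python
def pause_headroom_bytes(link_bps: float, distance_km: float,
                         us_per_km: float = 5.0) -> float:
    """Bytes that can still arrive after a node asks its neighbor to stop.

    Assumption: one link round-trip time of traffic is in flight, with
    about 5 microseconds of one-way propagation delay per kilometer.
    """
    rtt_s = 2 * distance_km * us_per_km * 1e-6
    return link_bps * rtt_s / 8


for km in (0.1, 10, 100):
    kbytes = pause_headroom_bytes(100e9, km) / 1e3
    print(f"100 Gbps link, {km:>5} km: at least {kbytes:,.1f} KB of headroom")
```

The headroom grows linearly with both link speed and distance, and it is needed per port (and per priority class when using Priority Flow Control).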
Designers of IP decided not to deal with this particular can of worms:
- IP networks have no hop-by-hop backpressure mechanism apart from what the underlying layer-2 technology might have;
- Commonly used layer-2 technologies (Ethernet, PPP, or SONET/SDH links) have no backpressure, the only exceptions being lossless Ethernet and reliable PPP;
- The domain of a potential backpressure mechanism is limited to a single layer-2 domain (until the first moment a packet is routed). That’s why it’s total nonsense to talk about lossless packet forwarding in routed networks, including VXLAN-based layer-2 networks.
- The transport protocol in the sending node has to guesstimate the state of the network (end-to-end bandwidth and current congestion);
- UDP is an unreliable transport protocol and therefore (A) does not care whether the packets sent into the network are delivered or not and (B) does not adjust the sending rate.
- TCP (as a reliable transport protocol) has to estimate available network resources to minimize retransmissions and optimize goodput.
Finally we’re getting somewhere. How does TCP do it?
- Lacking an explicit backpressure mechanism in IP networks (with ECN being somewhat of an exception), the only signals available to TCP are packet drops and increased latency.
- Algorithms responding to increased latency are always starved when competing with algorithms responding to packet loss as output queues in network devices cause increased latency way before network devices start dropping packets.
- Most TCP congestion avoidance algorithms therefore respond to packet drops, not latency increase. The usual response to a packet drop is to halve the sending rate.
- Unless you know better (see also: Understanding Your Environment) it’s better to have a reasonable amount of packet drops instead of increased latency. Not surprisingly, not everyone got that memo (see also: Bufferbloat).
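The "halve the sending rate on a drop" behavior is easiest to see in a toy model. The sketch below is a deliberately simplified additive-increase/multiplicative-decrease loop, not a real TCP implementation: it assumes one step per RTT, a fixed path capacity measured in packets, and a drop whenever the congestion window exceeds that capacity. All names are hypothetical.

```python
from typing import List


def aimd(capacity: int, rounds: int, cwnd: int = 1) -> List[int]:
    """Toy AIMD model: one congestion-window update per round-trip time."""
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            # The path (link plus queue) overflowed: a packet was dropped,
            # so the sender halves its congestion window.
            cwnd = max(1, cwnd // 2)
        else:
            # No loss seen: probe for more bandwidth, one packet per RTT.
            cwnd += 1
    return history


trace = aimd(capacity=8, rounds=12, cwnd=6)
print(trace)
```

The resulting trace is the familiar sawtooth: the window climbs until it overshoots the capacity, is cut in half, and starts climbing again.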
As always, there are tons of exceptions:
- BBR congestion control uses a more complex algorithm that includes estimated round-trip time;
- Network devices could set ECN bits in the IP header to indicate impending congestion (the receiver echoes them back to the sender in the TCP header). Unfortunately most traditional TCP implementations treat ECN bits in the same way as packet drops;
- Data Center TCP modifies the response to ECN bits, resulting in a gradual sending rate reduction;
- The negative impact of deep buffers can be alleviated with advanced Active Queue Management (AQM) algorithms like CoDel. I haven’t seen a hardware implementation of CoDel yet, and would love to understand how much buffering CoDel really needs.
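To illustrate the difference between the traditional ECN response and the Data Center TCP response, here is a small sketch based on the update rules described in RFC 8257: the sender keeps a running estimate (alpha) of the fraction of marked packets and reduces its window by alpha/2 instead of always halving it. The function names are made up, the gain of 1/16 is the value the RFC suggests, and the 10% marking rate is an arbitrary assumption.

```python
from typing import Tuple


def classic_ecn_update(cwnd: float, marked_fraction: float) -> float:
    """Traditional response: any ECN mark in a window is treated like a drop."""
    return cwnd / 2 if marked_fraction > 0 else cwnd


def dctcp_update(cwnd: float, alpha: float, marked_fraction: float,
                 g: float = 1 / 16) -> Tuple[float, float]:
    """DCTCP-style response: scale the reduction by the estimated marking rate."""
    # Exponentially weighted moving average of the fraction of marked packets.
    alpha = (1 - g) * alpha + g * marked_fraction
    if marked_fraction > 0:
        cwnd = cwnd * (1 - alpha / 2)
    return cwnd, alpha


# Assumed scenario: 10% of the packets in every window carry a congestion mark.
alpha = 1.0  # RFC 8257 starts with a conservative estimate
for _ in range(200):
    cwnd_next, alpha = dctcp_update(100.0, alpha, marked_fraction=0.1)

print(f"estimated marking fraction: {alpha:.3f}")
print(f"DCTCP window after a marked round: {cwnd_next:.1f}")
print(f"classic ECN window after a marked round: {classic_ecn_update(100.0, 0.1):.1f}")
```

With light marking the DCTCP sender backs off by a few percent, while the traditional sender halves its window every time. When every packet is marked, both responses converge to halving.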
Now for the crucial questions:
- Are packet drops bad? According to engineers familiar with modern TCP implementations (see below for details) the answer is NO (due to selective retransmission).
- Do they affect performance? Of course (see above), but then it’s unrealistic to expect unlimited performance;
- Can they reduce the throughput of a single TCP session below the available network capacity? Yes, assuming that they happen so frequently that the flow of data is interrupted because the receiver cannot send selective ACKs soon enough. Welcome to the Bandwidth-Delay Product twilight zone.
- Can they cause underutilization of networking infrastructure? Yes, assuming your network carries a small number of high-bandwidth TCP sessions. In more mainstream environments you’d see tens of thousands of TCP sessions on every high speed link and the statistics would take care of the problem.
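One way to get a feeling for how much a given drop rate hurts a single session is the well-known Mathis et al. approximation for loss-based (Reno-style) TCP: throughput is roughly (MSS / RTT) * sqrt(3/2) / sqrt(p), where p is the packet loss probability. The model is not from this article, and it ignores timeouts, selective ACKs, and newer congestion control algorithms, so treat the sketch below as an order-of-magnitude estimate with assumed input values.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state throughput of one Reno-style TCP session."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)


# Assumed values: 1460-byte segments and one drop per 10,000 packets.
data_center = mathis_throughput_bps(1460, rtt_s=100e-6, loss_rate=1e-4)
wide_area = mathis_throughput_bps(1460, rtt_s=50e-3, loss_rate=1e-4)

print(f"100 µs RTT: about {data_center / 1e9:.1f} Gbps")
print(f" 50 ms RTT: about {wide_area / 1e6:.1f} Mbps")
```

The same loss rate that barely matters at data center latencies limits a single long-distance session to a small fraction of a high-speed link, which is why the impact of drops depends so heavily on the environment.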
Speaking of a small number of TCP sessions… it’s worth mentioning that measurements on low-speed links show a clear advantage of delaying TCP traffic (traffic shaping) versus dropping excess traffic. However, you should also keep in mind that we’re expecting TCP to deal with drastically different environments, from 2400-baud modems and WiFi networks to 100 Gbps data center links, and you cannot expect measurements done at one extreme set of conditions to be relevant across the whole range of environments.
To summarize (again, based on what people who should know better are telling me):
- Drops are not a big deal in low-latency environments;
- Shallow buffers (and corresponding drops) might even be beneficial because they keep latency low;
- Congested low-speed links require smarter AQM algorithms like CoDel - not a problem because in those environments you can do packet scheduling in software;
- We can do better than respond to drops, but there’s no simple solution;
- Mobile networks that perform buffering and retransmission in the radio network are a totally different story and require extensive optimizations.
Have I missed anything? Please write a comment!
Want to know more? Start with:
- To Drop or To Delay with Juho Snellman;
- TCP in the Data Center with Thomas Graf;
- On Lossiness of TCP blog post
- TCP, HTTP and SPDY free webinar;
- Networks, Buffers and Drops free webinar with JR Rivers;
- QoS Fundamentals webinar with Ethan Banks;
Finally, check out the How Networks Really Work series starting on June 18th. You could join the live session for free if you send me a really good question.
Also, all switches I've seen from Arista and Cisco that were targeting HFT environments were shallow-buffer switches focused on minimizing latency.
But then, of course, everyone might have been missing something really important. Wouldn't be the first time.
Thanks for the reply. The feed is usually sent over MoldUDP in multicast. But for order entry, OUCH is the protocol used by Nasdaq systems, and it runs on top of SoupBinTCP. So it is as vulnerable to packet loss as it is to latency.
However, considering how much money is made in HFT, wouldn't it be simpler to increase link speeds to make sure they're not congested?
It would be really nice to know whether things like ECN and DC-TCP would make things better... and having a pointer to real-life test results would be awesome.
I know the HCI designs for VMware vSAN and all out-of-the-box solutions like VxRail recommend big-buffer switches when using 6+ nodes. My guess is VMware might be a bit slow, and having a packet traverse its large network stack causes an artificial problem, coupled with the large ingestion of synchronized data. Again, if only I had a lab to get the facts. Sorry man, as always I enjoy your write-ups. :)
Parallel applications present communication phases which can last for some microseconds, and in some cases (e.g. synchronization barriers) they need to be completed before the next computation phase starts. Because of packet losses, we found traces of the execution in which these phases were delayed several milliseconds. This problem increases with the amount of traffic in the network, so larger executions (applications running on many more nodes) typically have more traffic, more losses and more delays, increasing overall execution time. Eventually, this problem restricts application scalability (the number of nodes that can run the application in parallel with a proportional increase in performance).
We considered some alternatives already suggested in your text, such as using DC-TCP, but it was not available in our kernels (it would have had to be backported; not a problem nowadays) and some embedded Broadcom Ethernet switches used in our boards did not support ECN marking, so it would have been useless anyway. BTW, using Ethernet flow control (pauses) did not really prevent packet losses (we speculated about packet losses at the NIC or OS level, or defective implementations, since these were not 802.1Qbb with a lossless implementation in mind).
Eventually, the obtained scalability was suboptimal because of these issues; I believe this is a clear motivation for most HPC environments using lossless technologies (InfiniBand, Intel Omni-Path, Bull BXI, Cray Aries & Slingshot, etc.), apart from RDMA support and low switching latency.
One of these years Ivan will mention "fq_"codel in a sentence. From a network operator's perspective:
* fq_codel ensures statistical multiplexing
* fq_codel ensures that congestion control algorithms across many flows fate share faster
* fq_codel is more robust against packet floods
* fq_codel is safer to use ecn with
* codel drops from the head of the queue, not the tail, which makes flows where the most important data is the most current data (voip, gaming and dns) work better, with less latency.
* fq_codel makes delay based and loss based TCPs co-exist better
It's only the default on roughly 100% linux nowadays for all devices, with sch_fq a distant second. It's now available for freebsd also.
The specialized version we did for wifi has taken off like a skyrocket (default in many qca based products, like google wifi and eero, intel just added support for iwl in linux 5.1), and we got it on OSX 2 years back.
fq_codel is now the default QoS system for I think about 4/5ths the home router market - especially for inbound shaping. Etc.
We're really not huge on running codel standalone on a single queue - it's too gentle.
As single queued AQMs go, pie is better. sch_cake, even better, even in single queue mode (paper due out next month). (In no case am I recommending ecn at present due to the l4s/sce dispute.)
"fq_"codel (and for that matter "fq_pie") allows delay based and loss based TCPs to co-exist better, and will in the end help break the logjam here. bbr works great with it.
A very relevant paper on the futility of conventional congestion control came out recently ( https://arxiv.org/pdf/1903.03852.pdf ) and is being discussed on the BBR mailing list: https://groups.google.com/forum/#!topic/bbr-dev/chcftJgJ3vA
But we have not seen *any* of the new aqm or fq technologies appear in hardware offloads yet. It seemed to be straightforward - most cards/switches have a 5 tuple hw hash already, drr was long ago (2008) made to work on netfpga - I think it's mostly that market demand and awareness need to keep being raised.
"fq_codel provides great isolation… if you’ve got low-rate videoconferencing and low rate web traffic they never get dropped. A lot of the issues with iw10 go away, because all that other traffic sees is the front of the queue and you don’t know how big its window is and you don’t care because you are not affected by it."
And: "fq_codel increases utilization across your entire networking fabric especially for bidirectional traffic… If we’re sticking code into boxes to deploy codel, don’t do that. Deploy fq_codel. It’s just an across the board win." - Van Jacobson, IETF 84
Which we were hoping more folk would have done by now.