Do Packet Drops Matter for TCP Performance?

Approximately two years ago I tried to figure out whether the aggressive marketing of deep-buffer data center switches makes sense, recorded a few podcasts on the topic, and organized a webinar with JR Rivers.

Not surprisingly, the question keeps popping up, so it seems it’s time for another series of TL&DR articles. Let’s start with the basics:

  • Every network will eventually experience congestion. What matters is how the network deals with the congestion.
  • Network congestion results in either packet drops or increased latency, and you can’t avoid both of them at the same time without alleviating congestion.
  • Don’t believe in the magic powers of QoS - it’s a zero-sum game.
  • The only way to alleviate network congestion is to reduce the amount of data injected into the network.
  • You can control the network load by dropping excess traffic before it enters the network (ingress policing), or by persuading the senders to reduce the transmission rate.

If you’re worried about the impact of packet drops, you might want to avoid policing (= packet drops) at the network edge and focus on adjusting the senders’ transmission rate. You could use static mechanisms (traffic shaping, which delays excess traffic instead of dropping it) or try to guesstimate how much traffic the network can carry.
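
To make the difference between the two edge behaviors concrete, here is a minimal Python sketch (an illustration only, not a model of any particular platform; the rates, burst size, and packet sizes are made-up numbers): a token-bucket policer drops packets exceeding the contracted rate, while a shaper delays them, trading packet loss for latency.

from collections import namedtuple

Pkt = namedtuple("Pkt", "arrival size")        # arrival time in seconds, size in bytes

def police(packets, rate, burst):
    """Token-bucket policer: non-conforming packets are dropped (TCP sees loss)."""
    tokens, last, accepted = burst, 0.0, []
    for p in packets:
        tokens = min(burst, tokens + (p.arrival - last) * rate)   # refill the bucket
        last = p.arrival
        if tokens >= p.size:
            tokens -= p.size
            accepted.append(p)                 # conforming packet goes through
    return accepted

def shape(packets, rate):
    """Shaper: excess packets are delayed instead of dropped (TCP sees latency)."""
    delays, link_free = [], 0.0
    for p in packets:
        start = max(p.arrival, link_free)      # wait for earlier packets to drain
        link_free = start + p.size / rate      # serialization at the shaped rate
        delays.append(link_free - p.arrival)   # per-packet queuing + serialization delay
    return delays

# A burst of a hundred 1500-byte packets arriving almost back-to-back,
# squeezed into a 1 MB/s contract:
microburst = [Pkt(arrival=i * 0.0001, size=1500) for i in range(100)]
print(len(police(microburst, rate=1_000_000, burst=15_000)), "packets survive the policer")
print(round(max(shape(microburst, rate=1_000_000)), 3), "seconds of worst-case shaping delay")

Time for another set of facts: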

  • Early networking technologies implemented a plethora of backpressure mechanisms that a network node could use to reduce the ingress traffic. These technologies include per-link flow control (X.25 and Fibre Channel buffer-to-buffer credits) and stop-and-start mechanisms (XON/XOFF, PAUSE frames, Priority Flow Control). A minimal sketch of the credit-based approach follows this list.
  • Hop-by-hop backpressure mechanisms usually aim to implement a lossless transport network, resulting in suboptimal performance or deep-buffer requirements on links with large round-trip times (RTT). In-depth analysis of this claim is left as an exercise for the reader.
  • There are tons of other problems with hop-by-hop backpressure mechanisms, including an increased amount of network state and head-of-line blocking. Yet again, we won’t go into details (but feel free to figure them out).
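
Here’s what per-link credit-based flow control boils down to, as a minimal Python sketch loosely modeled on Fibre Channel buffer-to-buffer credits (the class and the numbers are made up for illustration): the sender may only transmit while it holds credits, so a slow receiver stalls the sender instead of dropping frames.

class CreditLink:
    """Toy buffer-to-buffer credit model: one credit per receiver buffer."""
    def __init__(self, credits):
        self.credits = credits              # advertised by the receiver at link bring-up
        self.receiver_buffers = []

    def send(self, frame):
        if self.credits == 0:
            return False                    # sender must pause: backpressure, not a drop
        self.credits -= 1
        self.receiver_buffers.append(frame)
        return True

    def receiver_frees_one_buffer(self):
        if self.receiver_buffers:
            self.receiver_buffers.pop(0)
            self.credits += 1               # credit returned to the sender (R_RDY-style)

link = CreditLink(credits=8)
sent = sum(link.send(f"frame-{i}") for i in range(20))
print(sent, "frames sent before the sender stalls")     # 8: the link is lossless but stalled

The catch hinted at in the second bullet above: to keep a link busy you need roughly a round-trip time worth of credits (and receiver buffers) in flight, so the buffer requirement grows with bandwidth × RTT. That’s tolerable within a data center and painful on long-haul links.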

The designers of IP decided not to deal with this particular can of worms:

  • IP networks have no hop-by-hop backpressure mechanism apart from what the underlying layer-2 technology might have;
  • Commonly used layer-2 technologies (Ethernet, PPP, or SONET/SDH links) have no backpressure, the only exceptions being lossless Ethernet and reliable PPP;
  • The domain of a potential backpressure mechanism is limited to a single layer-2 domain (until the first moment a packet is routed). That’s why it’s total nonsense to talk about lossless packet forwarding in routed networks, including VXLAN-based layer-2 networks.
  • The transport protocol in the sending node has to guesstimate the state of the network (end-to-end bandwidth and current congestion);
  • UDP is an unreliable transport protocol and therefore (A) does not care whether the packets sent into the network are delivered and (B) does not adjust the sending rate.
  • TCP (as a reliable transport protocol) has to estimate available network resources to minimize retransmissions and optimize goodput.

Finally, we’re getting somewhere. How does TCP do it?

  • Lacking an explicit backpressure mechanism in IP networks (with ECN being somewhat of an exception), the only signals available to TCP are packet drops and increased latency.
  • Algorithms responding to increased latency are always starved when competing with algorithms responding to packet loss, as output queues in network devices cause increased latency way before the devices start dropping packets.
  • Most TCP congestion avoidance algorithms therefore respond to packet drops, not latency increases. The usual response to a packet drop is to halve the sending rate (see the sketch after this list).
  • Unless you know better (see also: Understanding Your Environment), it’s better to have a reasonable amount of packet drops than increased latency. Not surprisingly, not everyone got that memo (see also: Bufferbloat).
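
Here’s the “halve the sending rate” behavior as a deliberately simplified Python sketch of additive-increase/multiplicative-decrease (roughly what Reno-style congestion avoidance does; slow start, fast recovery and all other refinements are ignored):

def aimd(initial_cwnd, loss_per_rtt, mss=1460):
    """Toy AIMD loop: grow the congestion window by one segment per RTT,
    halve it whenever a packet drop is detected in that RTT."""
    cwnd, history = initial_cwnd, []
    for rtt, lost in enumerate(loss_per_rtt):
        if lost:
            cwnd = max(1, cwnd // 2)        # multiplicative decrease on a drop
        else:
            cwnd += 1                       # additive increase (congestion avoidance)
        history.append((rtt, cwnd, cwnd * mss))
    return history

# One drop every 20 RTTs produces the familiar sawtooth:
for rtt, cwnd, in_flight in aimd(10, [r % 20 == 19 for r in range(60)])[::10]:
    print(f"RTT {rtt:3d}: cwnd = {cwnd:3d} segments ({in_flight} bytes in flight)")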

As always, there are tons of exceptions:

  • BBR congestion control uses a more complex algorithm that includes the estimated round-trip time;
  • Network devices can set ECN bits to indicate impending congestion. Unfortunately, most traditional TCP implementations treat ECN marks the same way as packet drops;
  • Data Center TCP (DCTCP) modifies the response to ECN marks, resulting in a gradual sending-rate reduction (see the sketch after this list);
  • The negative impact of deep buffers can be alleviated with advanced Active Queue Management (AQM) algorithms like CoDel. I haven’t seen a hardware implementation of CoDel yet, and would love to understand how much buffering CoDel really needs.
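
To illustrate the DCTCP bullet above, here’s a rough Python sketch (simplified from the DCTCP idea; real implementations track marks per ACK and reduce the window at most once per window of data): the classic ECN reaction halves the window on any mark, while a DCTCP-style sender shrinks it in proportion to the fraction of marked packets.

def classic_ecn_reaction(cwnd, marked_fraction):
    """Traditional TCP treats an ECN mark like a packet drop: halve the window."""
    return max(1.0, cwnd / 2) if marked_fraction > 0 else cwnd + 1

def dctcp_reaction(cwnd, alpha, marked_fraction, g=1 / 16):
    """DCTCP-style reaction: keep a moving average (alpha) of the fraction of
    ECN-marked packets and cut the window by cwnd * alpha / 2."""
    alpha = (1 - g) * alpha + g * marked_fraction
    if marked_fraction > 0:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    else:
        cwnd += 1
    return cwnd, alpha

cwnd_classic, cwnd_dctcp, alpha = 100.0, 100.0, 0.0
for _ in range(10):                         # ten windows with 5% of packets ECN-marked
    cwnd_classic = classic_ecn_reaction(cwnd_classic, 0.05)
    cwnd_dctcp, alpha = dctcp_reaction(cwnd_dctcp, alpha, 0.05)
print(round(cwnd_classic, 1), "segments left with the classic ECN reaction")
print(round(cwnd_dctcp, 1), "segments left with the DCTCP-style reaction")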

Now for the crucial questions:

  • Are packet drops bad? According to engineers familiar with modern TCP implementations (see below for details), the answer is NO (thanks to selective retransmission).
  • Do they affect performance? Of course (see above), but then it’s unrealistic to expect unlimited performance;
  • Can they reduce the throughput of a single TCP session below the available network capacity? Yes, assuming they happen so frequently that the flow of data is interrupted because the receiver cannot send selective ACKs soon enough. Welcome to the Bandwidth-Delay Product twilight zone (see the back-of-the-envelope numbers after this list).
  • Can they cause underutilization of the networking infrastructure? Yes, assuming your network carries a small number of high-bandwidth TCP sessions. In more mainstream environments you’d see tens of thousands of TCP sessions on every high-speed link, and statistics would take care of the problem.
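
To put some back-of-the-envelope numbers on the Bandwidth-Delay Product argument, here’s a quick Python calculation (illustrative numbers only). It uses the well-known Mathis et al. approximation - achievable Reno-style throughput is roughly MSS / (RTT × sqrt(p)) for loss rate p - which ignores newer congestion-control refinements but gets the order of magnitude right:

from math import sqrt

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: how much data must be in flight to fill the pipe."""
    return bandwidth_bps / 8 * rtt_s

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. approximation of loss-limited TCP throughput."""
    return mss_bytes * 8 / (rtt_s * sqrt(loss_rate))

# A single session on a 10 Gbps path with 1 ms RTT (data-center scale):
print(f"BDP: {bdp_bytes(10e9, 0.001) / 1e6:.2f} MB must be in flight")
# Even a 0.01% drop rate caps a single Reno-style session well below 10 Gbps:
print(f"Loss-limited rate: {mathis_throughput_bps(1460, 0.001, 1e-4) / 1e9:.2f} Gbps")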

Speaking of a small number of TCP sessions… it’s worth mentioning that measurements on low-speed links show a clear advantage of delaying TCP traffic (traffic shaping) over dropping excess traffic. However, you should also keep in mind that we expect TCP to deal with drastically different environments, from 2400-baud modems and WiFi networks to 100 Gbps data center links, and you cannot expect measurements done at one extreme set of conditions to be relevant across the whole range of environments.

To summarize (again, based on what people who should know better are telling me):

  • Drops are not a big deal in low-latency environments;
  • Shallow buffers (and corresponding drops) might even be beneficial because they keep latency low;
  • Congested low-speed links require smarter AQM algorithms like CoDel - not a problem because in those environments you can do packet scheduling in software;
  • We can do better than respond to drops, but there’s no simple solution;
  • Mobile networks that perform buffering and retransmission in the radio network are a totally different story and require extensive optimizations.

Have I missed anything? Please write a comment!

Further information

Want to know more? Start with the podcasts and the webinar mentioned at the beginning of this article.

Finally, check out the How Networks Really Work series starting on June 18th. You could join the live session for free if you send me a really good question.

12 comments:

  1. Especially in HFT environments, the very first packets after the market opens are very important; they need to be delivered with minimal latency and without loss. There is microburst traffic to be handled, and in that case it is better to buffer the packets instead of dropping them and relying on TCP recovery. If a packet is dropped, the retransmission timeout and the fate of the retransmitted packet can cost millions of dollars.
    Replies
    1. Minimal latency and buffering are competing agendas in that case, aren't they?
    2. I've never worked in HFT environments, but I was told by several people who have that (A) the trading data is transferred using UDP over IP multicast, and (B) they don't do retransmissions because retransmissions take too much time. They switch over to a backup feed the moment data is lost from the primary feed (and hope the backup feed will survive the day).

      Also, all switches I've seen from Arista and Cisco that were targeting HFT environments were shallow-buffer switches focused on minimizing latency.

      But then, of course, everyone might have been missing something really important. Wouldn't be the first time.
    3. Hi Ivan,

      Thanks for the reply. The feed is usually sent over MoldUDP in multicast, but for order entry, OUCH is the protocol used by Nasdaq systems, and it runs on top of SoupBinTCP. So it is as vulnerable to packet loss as it is to latency.
    4. Thank you... and so I learn something new every day ;))

      However, considering how much money is made in HFT, wouldn't it be simpler to increase link speeds to make sure they're not congested?
    5. Yes, for sure. But due to the fairness rule, every competitor trying to buy or sell first needs to be queued on the same physical link of the gateway server, in FIFO order. This link can be a 100 Gbps NIC with kernel-bypass support in hardware, which is one expensive solution. On the other hand, a 10 Gbps NIC with kernel bypass and a large-buffer switch with the lowest latency possible can achieve nearly similar results in terms of trading. There are other parameters of course - the matching capacity of the trading system, the instrument diversity, etc. By the way, let me introduce myself: Serdar Kut. You may remember me from the EVPN EBGP next-hop discussion we had over email. My best regards.
  2. I've seen large farms of MongoDB VMs used in a microservices world, utilizing 25 Gbps iSCSI across 9 hosts, where shallow buffers were worse off than big buffers. Big buffers helped take 200+ ms of network-induced storage latency at 260 Mbps (which was causing datastore loss) down to something friendlier - around 43 ms of storage latency and beyond 10 Gbps of throughput - by significantly reducing the discards that were causing the storage latency. The only other option I would have is to introduce more VM hosts to spread the microburst of 280 MongoDB VMs doing a log rotate and gzip at the same nanosecond. If I was running a cloud I wouldn't want to tempt fate with crazy customer workloads like mine.
    Replies
    1. Do I understand correctly that you had ~300 hosts doing mostly writes to 9 iSCSI targets - a typical incast scenario where buffers matter most?

      It would be really nice to know whether things like ECN and DC-TCP would make things better... and having a pointer to real-life test results would be awesome.
    2. Very close understanding. It was 9 VMware hosts (2x v4 Xeons, DDR4 2600) running ~300 MongoDB VMs on Ubuntu, with the default big-block filesystem size configured. The VMware hosts were tied to a cut-through 25 Gbps leaf (where the discarding occurred). Everything was deployed with orchestration using Cloud Foundry, so t-shirt sizing without much customization of what MongoDB did, nor the file-system tuning that also could have helped. Also, the iSCSI storage was Kaminario, tied to a big-buffer leaf (zero discards) to help with incast buffering from many hosts on many other leafs in the environment. During the discard storm you would see storage latency on the Kaminario reported as 'fabric', indicating a pipe or host issue. VMware would get bad enough that it would lose its datastore. Within the Mongo VMs it appeared as a blink of an eye and didn't look like anything was wrong. I'd love to go back in time and create a detailed post showing it all, including the ECN/DCTCP breakdown of what it was doing.

      I know the HCI designs for VMware vSAN and out-of-the-box solutions like VxRail recommend big-buffer switches when using 6+ nodes. My guess is VMware might be a bit slow, and a packet traversing its large network stack causes an artificial problem, coupled with the large ingestion of synchronized data. Again, if only I had a lab to get the facts. Sorry man, as always I enjoy your write-ups. :)
  3. We identified a case in which packet losses mattered a lot for performance. In a research project, we built an HPC prototype that, because of reasons, had to employ Ethernet + IP + TCP instead of InfiniBand or some other lossless technology. This network was used for inter-process communication (IPC) in parallel applications. Losing a packet means that one message of the application's communication is lost and you have to wait for the Retransmission TimeOut (RTO); note that SACK does not help with the last segment of a flow, because the loss is only detected when the next segment is received with a gap. No matter the RTT, RTO has a minimal value (RTOmin) which might depend on the available timers in your CPU hardware or OS kernel; in our case, we couldn't configure this value below 5 ms (the default is 200 ms, obviously set for WAN networks, not data center environments).

    Parallel applications present communication phases which can last for a few microseconds and, in some cases (e.g. synchronization barriers), need to be completed before the next computation phase starts. Because of packet losses, we found traces of executions in which these phases were delayed by several milliseconds. This problem increases with the amount of traffic in the network, so larger executions (applications running on many more nodes) typically have more traffic, more losses and more delays, increasing overall execution time. Eventually, this problem restricts application scalability (the number of nodes that can run the application in parallel with a proportional increase in performance).

    We considered some alternatives already suggested in your text, such as using DC-TCP, but it was not available in our kernels (it would have had to be backported; not a problem nowadays), and some embedded Broadcom Ethernet switches used on our boards did not support ECN marking, so it would have been useless anyway. BTW, using Ethernet flow control (pauses) did not really prevent packet losses (we speculated about packet losses at the NIC or OS level, or defective implementations, since these were not 802.1Qbb devices designed with a lossless implementation in mind).

    Eventually, the obtained scalability was suboptimal because of these issues; I believe this is a clear motivation for most HPC environments using lossless technologies (Infiniband, Intel Omnipath, Bull BXI, Cray Aries & Slingshot, etc.), apart from RDMA support and low switching latency.

  4. One of these years Ivan will mention "fq_"codel in a sentence. From a network operator's perspective:

    * fq_codel ensures statistical multiplexing
    * fq_codel ensures that congestion control algorithms across many flows fate-share faster
    * fq_codel is more robust against packet floods
    * fq_codel is safer to use ECN with
    * codel drops from the head of the queue, not the tail, which makes flows where the most important data is the most current data - VoIP, gaming and DNS - work better, with less latency
    * fq_codel makes delay-based and loss-based TCPs co-exist better

    It's the default on roughly 100% of Linux devices nowadays, with sch_fq a distant second. It's now available for FreeBSD as well.

    The specialized version we did for WiFi has taken off like a skyrocket (it's the default in many QCA-based products, like Google Wifi and eero; Intel just added support for iwl in Linux 5.1), and we got it into OSX two years back.

    fq_codel is now the default QoS system for, I think, about 4/5ths of the home router market - especially for inbound shaping, etc.

    We're really not huge on running codel standalone on a single queue - it's too gentle. As single-queue AQMs go, PIE is better; sch_cake is even better, even in single-queue mode (paper due out next month). (In no case am I recommending ECN at present, due to the L4S/SCE dispute.)

    "fq_"codel (and for that matter "fq_pie") allows delay-based and loss-based TCPs to co-exist better, and will in the end help break the logjam here - BBR works great with it, in particular.

    A very relevant paper on the futility of conventional congestion control came out recently ( https://arxiv.org/pdf/1903.03852.pdf ) and is being discussed on the BBR mailing list: https://groups.google.com/forum/#!topic/bbr-dev/chcftJgJ3vA

    But we have not seen *any* of the new AQM or FQ technologies appear in hardware offloads yet. It seems it should be straightforward - most cards/switches have a 5-tuple hardware hash already, and DRR was made to work on NetFPGA long ago (2008) - so I think it's mostly that market demand and awareness need to continue to be raised.

    "fq_codel provides great isolation… if you’ve got low-rate videoconferencing and low rate web traffic they never get dropped. A lot of the issues with iw10 go away, because all that other traffic sees is the front of the queue and you don’t know how big its window is and you don’t care because you are not affected by it.

    And: fq_codel increases utilization across your entire networking fabric especially for bidirectional traffic… If we’re sticking code into boxes to deploy codel, don’t do that. Deploy fq_codel. It’s just an across the board win.” - Van Jacobson, IETF 84

    Which we were hoping more folks would have done by now.
  5. This comment has been removed by the author.