Per-packet Load Balancing Interferes with TCP Offload

A reader left the following comment on my Does Multipath TCP Matter blog post: “Why would I use MP-TCP in a data center? Couldn’t you use packet spraying at each hop and take care of re-ordering at the destination?”

Short answer: You could, but you might not want to.

Packet reordering can cause several interesting problems:

  • There are (badly written but highly optimized) applications running over UDP that cannot tolerate packet reordering because they don’t buffer packets (example: VoIP);
  • Out-of-order packets reduce TCP performance;
  • Out-of-order packets kill receive segment coalescing.

Impact on TCP performance

According to this paper, packet reordering causes (at least) these performance problems:

  • TCP receiver sends duplicate ACK packets (to trigger the fast retransmit algorithm), wasting CPU cycles and bandwidth;
  • TCP sender reduces the TCP window size after receiving duplicate ACKs (assuming packets were lost in transit), effectively reducing TCP throughput;
  • TCP receiver has to buffer and reorder TCP packets within the TCP stack, wasting buffer memory and CPU cycles (see the sketch below).
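
Here is a toy model (in Python; an illustration, not a real TCP stack) that makes the first and third bullet concrete: a single late segment produces a burst of duplicate ACKs, and everything behind the gap sits in the receiver's reorder buffer until the late segment shows up.

```python
def receiver_acks(segments, seg_size=1):
    """Return the cumulative ACK sent for each arriving segment."""
    expected = 0       # next in-order sequence number the receiver wants
    buffered = set()   # out-of-order segments held in the reorder buffer
    acks = []
    for seq in segments:
        if seq == expected:
            expected += seg_size
            # drain buffered segments that are now in order
            while expected in buffered:
                buffered.remove(expected)
                expected += seg_size
        else:
            buffered.add(seq)   # costs buffer memory and CPU cycles
        acks.append(expected)   # repeated values = duplicate ACKs
    return acks

# In-order delivery: no duplicate ACKs.
print(receiver_acks([0, 1, 2, 3, 4]))  # [1, 2, 3, 4, 5]

# Segment 1 arrives last: three duplicate ACKs, enough to trigger the
# sender's fast retransmit and window reduction although nothing was lost.
print(receiver_acks([0, 2, 3, 4, 1]))  # [1, 1, 1, 1, 5]
```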

More information on this topic would definitely be most welcome. Please share in the comments section. Thank you!

Impact on TCP offload

On top of the performance problems listed above, packet reordering interferes with TCP offload (in particular with the receive segment coalescing functionality).

Receive segment coalescing is not relevant to traditional data center workloads (with most of the traffic being sent from servers toward remote clients), but can significantly improve the performance of elephant flows sent toward the server (example: iSCSI or NFS traffic). I don’t think you want to play with that, do you?
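
To see why, consider this toy model of segment coalescing (a sketch under simplified assumptions, not actual NIC or driver code): any gap in the arriving sequence numbers flushes the coalescing run, so even mild reordering hands the TCP stack many small segments instead of a few large ones.

```python
def coalesce(segments, seg_size=1):
    """Group in-order runs of sequence numbers; return the sizes of the
    coalesced segments handed to the TCP stack."""
    handed_up = []
    run = 0           # segments accumulated in the current coalescing run
    next_seq = None   # sequence number that would extend the run
    for seq in segments:
        if next_seq is not None and seq != next_seq:
            handed_up.append(run)   # gap: flush what we have so far
            run = 0
        run += 1
        next_seq = seq + seg_size
    if run:
        handed_up.append(run)
    return handed_up

print(coalesce([0, 1, 2, 3, 4, 5, 6, 7]))  # [8] -> one large segment
print(coalesce([0, 1, 3, 2, 4, 5, 6, 7]))  # [2, 1, 1, 4] -> one swap,
                                           # four times the per-segment work
```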

Summary

There are several really good reasons almost nobody enables per-packet ECMP load sharing by default (a notable exception being Brocade, which solves the packet reordering challenge in its switch hardware).
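
A quick sketch of the difference (hypothetical helper functions, not any vendor's implementation): hashing the 5-tuple pins all packets of a session to one uplink, so ordering is preserved but an elephant flow can never use more than one link; per-packet round-robin balances perfectly but guarantees nothing about ordering.

```python
import zlib
from itertools import count

N_LINKS = 4

def per_flow_link(src_ip, dst_ip, proto, src_port, dst_port):
    """Same 5-tuple -> same uplink, so a flow can never be reordered."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % N_LINKS

_rr = count()
def per_packet_link():
    """Round-robin spraying: perfect balance, no ordering guarantee."""
    return next(_rr) % N_LINKS

flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 2049)  # e.g. an NFS session
print([per_flow_link(*flow) for _ in range(6)])   # same link six times
print([per_packet_link() for _ in range(6)])      # [0, 1, 2, 3, 0, 1]
```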

12 comments:

  1. See http://tools.ietf.org/agenda/89/slides/slides-89-tcpm-7.pdf for a recent discussion of the impact of reordering on TCP performance in recent stacks.
  2. Multipath TCP can deal efficiently with the reordering caused by different paths through the network. See http://multipath-tcp.org/pmwiki.php?n=Main.50Gbps on how to reach 50 Gbps with a single Multipath TCP connection over six 10 Gbps interfaces.
    Replies
    1. That would be reordering across multiple paths (i.e. TCP sessions), not within a single TCP session, right?

      Thanks for the links, they're great!
    2. Yes, MPTCP handles reordering across different subflows/paths. The tcpm link above deals with reordering on a single TCP connection.
  3. Also, when you have a problem (for example, a faulty transceiver/SFP/fiber), per-packet load balancing may lead to situations that are very hard to troubleshoot. Often enough I get inaccurate/wrong information about an issue from the customer/admin, so adding another layer that complicates reproducing the problem's effects is just bad.
  4. Regarding TCP in the data center, though not specifically related to packet reordering, there is a specific variant called Data Center TCP (DCTCP) that leverages the ECN bits marked by switches when queue occupancy is high. DCTCP adjusts the transmitter's congestion window based on the number of packets received with the ECN bits marked per RTT, and obtains much better performance (in terms of latency, response to incast traffic, and fairness between flows sharing links) than traditional TCP implementations. DCTCP is active by default in Windows Server 2012 (the server automatically detects which connections stay within the DC and can use DCTCP, and which ones are external and must rely on traditional TCP), and there are patches for Linux. Additionally, some switches (see Nexus 3548, for example) are advertised as "DCTCP compatible", or simply put, they mark the ECN bits. I am not involved with it; I found the info on their website: http://simula.stanford.edu/~alizade/Site/DCTCP.html

    Knowing that TCP and DCTCP adjust the transmission window only once per RTT, it would be interesting to know how this mechanism (or a similar one) reacts to packet reordering.
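
    A minimal sketch of the per-RTT adjustment described above, following the formulas in the DCTCP paper (illustration only, not a kernel implementation; the gain value is the paper's suggestion):

    ```python
    # Per-RTT DCTCP window adjustment (illustration, not kernel code).
    # F = fraction of ACKs carrying ECN-Echo in the last RTT; alpha is
    # smoothed with gain g; the window is cut in proportion to alpha
    # instead of being halved as in classic TCP.

    G = 1 / 16  # smoothing gain suggested in the DCTCP paper

    def dctcp_update(cwnd, alpha, marked_acks, total_acks):
        """Apply one per-RTT DCTCP adjustment; return (new_cwnd, new_alpha)."""
        f = marked_acks / total_acks        # fraction marked this RTT
        alpha = (1 - G) * alpha + G * f     # alpha <- (1 - g)*alpha + g*F
        cwnd = cwnd * (1 - alpha / 2)       # full halving only when alpha == 1
        return cwnd, alpha

    # Light congestion (10% marked) barely dents the window, while classic
    # TCP would halve it on any loss signal.
    cwnd, alpha = 100.0, 0.0
    for _ in range(5):
        cwnd, alpha = dctcp_update(cwnd, alpha, marked_acks=10, total_acks=100)
    print(round(cwnd, 1), round(alpha, 3))  # ~95.8 0.028
    ```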
  5. I've seen a couple of articles mentioning that the Cisco 9k will do something similar to Brocade regarding per-packet load-balancing:
    http://keepingitclassless.net/2013/11/insieme-and-cisco-aci-part-2-aci-and-programmability/
    http://lamejournal.com/2013/11/21/ciscos-aci-insieme-launch/

    A few searches later, it looks like Cisco will be using "Flowlet Switching (Kandula et al ’04)", as can be seen on page 20:
    http://www.cisco.com/web/strategy/docs/gov/application_centric_infrastructure.pdf

    I had never heard of "flowlet switching" before, but it does sound interesting:
    http://nms.lcs.mit.edu/~kandula/data/FLARE_HotNets04_web.ppt
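
    The core idea of flowlet switching from the FLARE paper can be sketched in a few lines (illustrative parameter values and helper names, not Cisco's or Juniper's actual logic): if the idle gap within a flow exceeds the worst-case delay difference between parallel paths, the next burst can safely take a different link.

    ```python
    import random

    FLOWLET_GAP = 0.0005  # 500 us; must exceed max delay skew between paths
    N_LINKS = 4

    last_seen = {}  # flow-id -> timestamp of the flow's previous packet
    assigned = {}   # flow-id -> link currently carrying the flow's flowlet

    def pick_link(flow_id, now):
        """Keep a flowlet on its link; re-balance only across idle gaps."""
        gap = now - last_seen.get(flow_id, float("-inf"))
        if gap > FLOWLET_GAP or flow_id not in assigned:
            assigned[flow_id] = random.randrange(N_LINKS)  # safe to re-pick
        last_seen[flow_id] = now
        return assigned[flow_id]

    # A back-to-back burst stays on one link; after a long enough pause the
    # flow may move without risking reordering at the receiver.
    print([pick_link("flow-1", t) for t in (0.0, 0.0001, 0.0002)])  # one link
    print(pick_link("flow-1", 0.0010))  # gap > 500 us: link may change
    ```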
    Replies
    1. Thanks a million for figuring this out!!! Windows Server 2012 R2 is also using flowlets, but I never found time to chase that down.

      See "Dynamic NIC teaming" section in this document:

      http://blogs.technet.com/b/networking/archive/2013/07/31/transforming-your-datacenter-networking.aspx
  6. Hi Ivan, this raises an interesting point regarding the port channel load-balancing algorithm for PortChannels in vPCs on FlexPods. Essentially we have been advocating algorithms that use SRC port to spread the storage Ethernet traffic across available paths, to get full use of 20, 30, 40 Gbps channels (egress) from/to NetApp filers and Cisco UCS for NFS and iSCSI, so the jumbo storage frames would not monopolize a single link when all NFS flows go from one SRC IP to one DST IP NFS mount. What we did see a lot of in packet dumps for either VMware host storage or guest iSCSI traffic was out-of-order packets, which I didn't think much of at the time... in retrospect, however, the default LRO/GRO and LSO/GSO behaviour on hosts and guests would have been slowed down, which is another reason to turn it off completely, I guess? Thoughts?
    Replies
    1. 5-tuple load balancing should _not_ cause packet reordering. Per-packet load balancing will.

      Opinions on the usefulness of LRO/LSO are mixed, but when it works well, it can save a significant amount of CPU cycles or improve the throughput of high-bandwidth TCP sessions.
  7. Last year Juniper introduced an interesting feature for their VC-Fabric called Adaptive Flowlet Splicing which inspects the metrics of flows and links to determine when packets within a flow may be "safely sprayed" without risk of reordering on the receiving end.

    I believe it measures link utilization, queue depth, latency, and several flow-specific metrics to determine a flow's "burstiness" in making this decision.

    This Juniper post explains it very well.
    http://forums.juniper.net/t5/Data-Center-Technologists/Adaptive-Flowlet-Splicing-VCF-s-Fine-Grained-Adaptive-Load/ba-p/251674
    Replies
    1. Please search my blog for recent posts on ECMP load balancing and flowlets - I covered both topics.

      Thank you!