Per-packet Load Balancing Interferes with TCP Offload
A reader left the following comment on my Does Multipath TCP Matter blog post: “Why would I use MP-TCP in a data center? Couldn’t you use packet spraying at each hop and take care of re-ordering at the destination?”
Short answer: You could, but you might not want to.
Packet reordering can cause several interesting problems:
- There are (~~badly written~~ highly optimized) applications running over UDP that cannot tolerate packet reordering because they don't buffer packets (example: VoIP);
- Out-of-order packets reduce TCP performance;
- Out-of-order packets kill receive side coalescing.
Impact on TCP performance
According to this paper packet reordering causes (at least) these performance problems:
- TCP receiver sends duplicate ACK packets (to trigger fast retransmit algorithm), wasting CPU cycles and bandwidth;
- TCP sender reduces TCP window size after receiving duplicate ACKs (assuming packets were lost in transit), effectively reducing TCP throughput;
- TCP receiver has to buffer and reorder TCP packets within the TCP stack, wasting buffer memory and CPU cycles.
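The duplicate-ACK effect is easy to see in a toy model. The sketch below is not a real TCP stack; it only mimics cumulative ACKs and the classic three-duplicate-ACK fast-retransmit rule to show how a single sprayed packet arriving late halves the sender's window even though nothing was lost:

```python
# Toy sketch (NOT a real TCP stack): out-of-order delivery generates
# duplicate ACKs, which trigger fast retransmit and a window reduction.

def receiver_acks(segments):
    """Cumulative-ACK receiver: every arrival ACKs the next expected segment."""
    expected, buffered, acks = 0, set(), []
    for seg in segments:
        if seg == expected:
            expected += 1
            while expected in buffered:     # drain out-of-order buffer
                buffered.remove(expected)
                expected += 1
        else:
            buffered.add(seg)               # hold out-of-order segment
        acks.append(expected)               # one ACK per arrival
    return acks

def sender_reaction(acks, cwnd=10, dupthresh=3):
    """Classic fast retransmit: three duplicate ACKs halve the window."""
    last, dups = None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == dupthresh:           # spurious here -- nothing was lost
                cwnd = max(cwnd // 2, 1)
        else:
            last, dups = ack, 0
    return cwnd

# Segment 0 is "sprayed" onto a longer path and arrives last:
acks = receiver_acks([1, 2, 3, 4, 0])
print(acks)                   # [0, 0, 0, 0, 5] -- three duplicate ACKs
print(sender_reaction(acks))  # window halved although no packet was lost
```

With in-order delivery the same sender keeps its full window, which is exactly why per-packet spraying hurts single-flow throughput.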
More information on this topic would definitely be most welcome. Please share in the comments section. Thank you!
Impact on TCP offload
On top of the performance problems listed above, packet reordering interferes with TCP offload (in particular with the receive segment coalescing functionality).
Receive segment coalescing is not relevant to traditional data center workloads (with most of the traffic being sent from servers toward remote clients), but can significantly improve the performance of elephant flows sent toward the server (example: iSCSI or NFS traffic). I don’t think you want to play with that, do you?
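The coalescing effect can also be illustrated with a toy model (this is not how any particular NIC implements RSC, just the core idea): consecutive in-order segments of a flow are merged into one large buffer, and an out-of-order arrival flushes the run, so the stack processes many small segments instead of a few large ones:

```python
# Toy model of receive segment coalescing (RSC/GRO): the NIC merges runs of
# in-order segments into one "super segment"; reordering breaks the runs.

def coalesce(seqs, mss=1460):
    """Return the sizes (in segments) of each coalesced chunk."""
    chunks, run, next_seq = [], 0, seqs[0]
    for s in seqs:
        if s == next_seq and run > 0:
            run += 1                    # segment extends the current run
        else:
            if run:
                chunks.append(run)      # flush: out-of-order arrival ends the run
            run = 1
        next_seq = s + mss              # sequence number we expect next
    chunks.append(run)
    return chunks

in_order  = [i * 1460 for i in range(8)]
reordered = [0, 1460, 4380, 2920, 5840, 7300, 10220, 8760]

print(coalesce(in_order))   # [8] -> one stack traversal for eight segments
print(coalesce(reordered))  # runs broken up -> per-packet overhead returns
```

Fewer, larger chunks mean fewer interrupts and protocol-stack traversals per byte, which is where the CPU savings for elephant flows come from.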
Summary
There are several really good reasons almost nobody does per-packet ECMP load sharing by default (Brocade being a notable exception solving the packet reordering challenges in their switch hardware).
Thanks for the links, they're great!
Knowing that TCP (or DCTCP) adjusts the transmission window only once per RTT, it would be interesting to know how this mechanism (or a similar one) reacts to packet reordering.
http://keepingitclassless.net/2013/11/insieme-and-cisco-aci-part-2-aci-and-programmability/
http://lamejournal.com/2013/11/21/ciscos-aci-insieme-launch/
A few searches later, it looks like Cisco will be using "Flowlet Switching (Kandula et al ’04)", as can be seen on page 20:
http://www.cisco.com/web/strategy/docs/gov/application_centric_infrastructure.pdf
I had never heard of "flowlet switching" before, but it does sound interesting:
http://nms.lcs.mit.edu/~kandula/data/FLARE_HotNets04_web.ppt
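The core flowlet idea is simple enough to sketch. The following is an illustrative toy (the gap threshold, path count, and flow-key handling are made-up assumptions, not taken from the FLARE paper or any Cisco implementation): if the gap between two packets of a flow exceeds the worst-case delay difference between paths, the next burst can be hashed onto a different path without risking reordering:

```python
# Toy sketch of flowlet switching (FLARE idea): a sufficiently long gap
# between packets of a flow lets us rebalance the flow onto another path
# without reordering. Threshold and path count are illustrative only.

import random

FLOWLET_GAP_MS = 0.5   # must exceed the max path-delay difference (assumed)
N_PATHS = 4            # number of equal-cost paths (assumed)

flow_state = {}        # flow-id -> (last_seen_ms, assigned_path)

def pick_path(flow_id, now_ms):
    last, path = flow_state.get(flow_id, (None, None))
    if last is None or now_ms - last > FLOWLET_GAP_MS:
        path = random.randrange(N_PATHS)   # new flowlet: free to rebalance
    flow_state[flow_id] = (now_ms, path)   # same flowlet: keep the path
    return path

# Packets 0.1 ms apart stay on one path; a longer pause starts a new flowlet.
p1 = pick_path("flowA", 0.0)
p2 = pick_path("flowA", 0.1)   # same flowlet -> guaranteed same path as p1
p3 = pick_path("flowA", 1.5)   # gap > threshold -> may move to another path
print(p1, p2, p3)
```

Because every packet inside a flowlet sticks to one path, TCP never sees reordering, yet the load balancer still gets rebalancing opportunities far more often than with per-flow hashing.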
See "Dynamic NIC teaming" section in this document:
http://blogs.technet.com/b/networking/archive/2013/07/31/transforming-your-datacenter-networking.aspx
Opinions on the usefulness of LRO/LSO are mixed, but if it works well, it can save a significant amount of CPU cycles or improve the throughput of high-bandwidth TCP sessions.
I believe it measures link utilization, queue depth, latency, and several flow-specific metrics to determine a flow's "burstiness" in making this decision.
This Juniper post explains it very well.
http://forums.juniper.net/t5/Data-Center-Technologists/Adaptive-Flowlet-Splicing-VCF-s-Fine-Grained-Adaptive-Load/ba-p/251674
Thank you!