Olivier Hault sent me an interesting challenge:
I cannot find any simple network-layer solution that would allow me to use total available bandwidth between a Hypervisor with multiple uplinks and a Network Attached Storage (NAS) box.
Network attached storage uses TCP-based protocols (iSCSI, NFS, CIFS/SMB) to communicate with the attached hosts. Trying to solve the problem at the network or TCP layer thus transforms the problem into: how can I load-balance packets from a single TCP session across multiple links?
The simple answer is: you cannot, unless you’re willing to tolerate packet reordering, which will interfere with receive-side TCP offload (Receive Segment Coalescing – RSC) and thus impact TCP performance (see also the comments in that blog post).
Link aggregation group (LAG, aka EtherChannel or Port Channel) is not a solution. Most LAG implementations will not send packets from the same TCP session across multiple links to avoid packet reordering.
Brocade solved the problem by keeping track of packet arrival times (and delaying packets that would arrive out-of-order), but it only works if all links are connected to the same ASIC – you cannot use the same trick across multiple ASICs, let alone across multiple switches (for hosts connected to two ToR switches for redundancy).
Conclusion: The problem MUST be solved above the network layer.
There are several solutions one can use to solve this challenge:
- Multipath TCP: insert a shim inside the TCP layer that allows a single application-facing TCP session to be spread across multiple network-facing TCP sessions. All sessions could use the same IP endpoints, and rely on 5-tuple ECMP load balancing between the network and the server (MP-TCP could open new sessions if it figures out the existing sessions hash to the same link)
- Multipath I/O: use multiple application-level TCP sessions, each one of them running across one of the uplinks. In most implementations the server-to-network links cannot be in a LAG or ECMP group, and the server uses a dedicated IP address (tied to a specific uplink) for each storage session in the MPIO group. MPIO is available in many iSCSI implementations; SMB3 has a similar feature called SMB Multichannel (HT: JP. Papillon).
- Slice the elephant into smaller chunks. I got the impression that at least in some OpenStack Cinder implementations each VM uses a separate session with the iSCSI/NFS target, which nicely slices the fat one-session-per-hypervisor-host elephant into dozens of smaller bits. In the ideal case, the number of smaller sessions is large enough to make load balancing work well without tweaking the TCP port number selection or load balancing algorithms.
There’s more to come
I’ll describe the storage challenges in a dedicated storage networking webinar sometime later this year.