Chris Wahl claimed in one of his recent blog posts that vSphere doesn't need LAG band-aids. He's absolutely right - vSphere's loop prevention logic alleviates the need for STP-blocked links, allowing you to use full server uplink bandwidth without the complexity of link aggregation. Now let’s consider the networking perspective.
The conclusions Chris reached are perfectly valid in a classic data center or VDI environment with the majority of the VM-generated traffic leaving the data center; the situation is drastically different in data centers with predominantly east-west traffic (be it inter-server or IP-based storage traffic).
Let’s start with a simple scenario:
- Data center is small enough to have only two switches (for a total of 4 Tbps or more of throughput – that should be enough for most use cases).
- Two vSphere servers are connected to the two data center switches in a fully redundant setup (each server has one uplink to each ToR switch).
- Load-Based Teaming (LBT) is used within vSphere instead of IP-based hash (vSphere terminology for a Link Aggregation Group / LAG).
The two ToR switches are not aware of the exact VM placement, resulting in traffic flowing across the inter-switch link even when it could be exchanged locally (yeah, I was writing about this issue almost exactly three years ago).
[Figure: VM MAC reachability as seen by the switches]
[Figure: Traffic flow between VMs on adjacent hypervisor hosts]
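Here's a minimal sketch of that behavior, assuming each VM is pinned to exactly one uplink (and thus one ToR switch) at any given moment and that all VMs are in the same VLAN. The VM, host and switch names and the placement table are made up for illustration; nothing here is vSphere or switch code.

```python
from itertools import combinations

# Hypothetical placement: VM -> (hypervisor host, ToR switch its uplink is pinned to).
PINNING = {
    "vm-a": ("esx1", "sw1"),
    "vm-b": ("esx1", "sw2"),
    "vm-c": ("esx2", "sw1"),
    "vm-d": ("esx2", "sw2"),
}

def path(src, dst):
    """Return the devices a frame traverses from the src VM to the dst VM."""
    src_host, src_sw = PINNING[src]
    dst_host, dst_sw = PINNING[dst]
    if src_host == dst_host:
        # Same host, same VLAN: the virtual switch forwards the frame locally.
        return [src_host]
    hops = [src_host, src_sw]
    if dst_sw != src_sw:
        # The source-side switch learned the destination MAC over the
        # inter-switch link, so that is where it sends the frame.
        hops.append(dst_sw)
    hops.append(dst_host)
    return hops

def crosses_isl(src, dst):
    """True when the frame has to use the inter-switch link."""
    hops = path(src, dst)
    return "sw1" in hops and "sw2" in hops

def isl_share():
    """Fraction of inter-host VM pairs whose traffic crosses the inter-switch link."""
    pairs = [
        (a, b)
        for a, b in combinations(PINNING, 2)
        if PINNING[a][0] != PINNING[b][0]
    ]
    return sum(crosses_isl(a, b) for a, b in pairs) / len(pairs)
```

With this particular placement, path("vm-a", "vm-d") goes through both switches even though both hosts have an uplink to sw1, and half of the inter-host VM pairs end up on the inter-switch link.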
Can we fix it?
You can fix this problem by making most endpoints equidistant. You could introduce a second layer of switches (resulting in a full-blown leaf-and-spine fabric) or you could connect the servers to a layer of fabric extenders, which would also ensure the traffic between any two endpoints gets equal treatment.
Traffic between VMs attached to the same leaf switch in a leaf-and-spine fabric gets better treatment than any other traffic (it never hits the leaf-to-spine oversubscription); in the FEX scenario all traffic gets identical treatment because FEX still doesn’t support local switching.
The leaf-and-spine (or FEX) solution obviously costs way more than a simple two-switch solution, so you just might consider using link aggregation and LACP with vSphere.
But LBT works so much better than IP hash mechanisms
Sure it does (and Chris provided some very good arguments for that claim in his blog post), but there’s nothing in the 802.1AX standard dictating the traffic distribution mechanism on a LAG. VMware could have decided to use LBT with a LAG, but they didn’t (because deep inside the virtual switch they tie the concept of a link aggregation group to the IP hash load balancing mechanism). Don’t blame standards and concepts for suboptimal implementations ;)
But aren’t static port channels unreliable?
Of course they are; I wouldn’t touch them with a 10-foot pole. You should always use LACP to form a LAG, but VMware supports LACP only in the distributed switch, which requires an Enterprise Plus license. Yet again, don’t blame the standards or network design requirements, blame a vendor that wants to charge extra for baseline layer-2 functionality.
Is there another way out of this morass?
Buying extra switches (or fabric extenders) is too expensive. Buying an Enterprise Plus license for every vSphere host just to get LACP support is clearly out of the question. Is there something else we could do? Of course – you can make sure the inter-switch link gets enough bandwidth.
Typical high-end ToR switches (from almost any vendor) have 48 10GE ports and four 40GE ports. Using the 40GE ports for inter-switch links results in worst-case 3:1 oversubscription (48 10GE ports on the left-hand switch communicate exclusively with the 48 10GE ports on the right-hand switch over an equivalent of 16 10GE ports). Problem solved (until you have to add more servers).
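The arithmetic behind that ratio is simple enough to sketch. The function below is only an illustration; its name and parameters are my own, and the defaults match the port counts and speeds mentioned above.

```python
def oversubscription(server_ports, isl_ports=4, server_gbps=10, isl_gbps=40):
    """Worst-case ratio of server-facing bandwidth to inter-switch bandwidth.

    Assumes every server port on one switch talks only to ports on the
    other switch, so all of that traffic has to cross the inter-switch link.
    """
    server_bandwidth = server_ports * server_gbps
    isl_bandwidth = isl_ports * isl_gbps
    return server_bandwidth / isl_bandwidth

fully_populated = oversubscription(48)  # 480 Gbps over 160 Gbps -> 3.0
line_rate_limit = oversubscription(16)  # 160 Gbps over 160 Gbps -> 1.0
```

In other words, the inter-switch link is never the bottleneck while no more than 16 server-facing 10GE ports per switch are in use; beyond that the worst-case ratio climbs toward 3:1.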