vSwitch in Multi-chassis Link Aggregation (MLAG) environment

Yesterday I described how the lack of LACP support in VMware’s vSwitch and vDS can limit the load balancing options offered by the upstream switches. The situation gets totally out-of-hand when you connect an ESX server with two uplinks to two (or more) switches that are part of a Multi-chassis Link Aggregation (MLAG) cluster.

Let’s expand the small network described in the previous post a bit, adding a second ESX server and another switch. Both ESX servers are connected to both switches (resulting in a fully redundant design) and the switches have been configured as an MLAG cluster (using VSS with Catalyst 6500, vPC with Nexus 7000 or Nexus 5000, or IRF with the HP switches). Link aggregation is not used between the physical switches and ESX servers due to the lack of LACP support in ESX.

The physical switches are unaware of the physical connectivity the ESX servers have. Assuming that vSwitches use per-VM load balancing and each VM is pinned to one of the uplinks, source MAC address learning in the physical switches produces the following logical topology:

Each VM appears to be single-homed to one of the switches. The traffic between VM A and VM C is thus forwarded locally by the left-hand switch; the traffic between VM A and VM D has to traverse the inter-switch link because neither switch knows VM D can also be reached by a shorter path.
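
The effect of source MAC learning in this topology can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the switch names `left` and `right`, the `PINNED_SWITCH` table and all function names are hypothetical, and the pinning of VM B is assumed (the text above only determines A, C and D).

```python
# Toy model of source MAC learning with per-VM uplink pinning (no port channels).
# Assumption: each VM's frames enter the physical network through exactly one
# switch, so that switch learns the MAC on a server-facing port and the other
# switch learns it on the inter-switch link (ISL).

# Which switch each VM's pinned uplink lands on. A, C and D follow the example
# in the text; B is an assumed value.
PINNED_SWITCH = {"A": "left", "B": "right", "C": "left", "D": "right"}


def learn_mac_tables(pinning):
    """Build each switch's MAC table: VM -> 'server-port' or 'isl'."""
    switches = sorted(set(pinning.values()))
    return {
        sw: {vm: ("server-port" if home == sw else "isl")
             for vm, home in pinning.items()}
        for sw in switches
    }


def switch_path(src, dst, pinning=PINNED_SWITCH):
    """Return the list of physical switches a frame from src to dst visits."""
    tables = learn_mac_tables(pinning)
    ingress = pinning[src]
    hops = [ingress]
    if tables[ingress][dst] == "isl":
        # The ingress switch only knows dst via the ISL, even though the
        # destination ESX server is also directly attached to it.
        hops.append(pinning[dst])
    return hops


if __name__ == "__main__":
    for src, dst in [("A", "C"), ("A", "D")]:
        print(src, "->", dst, "via", switch_path(src, dst))
```

In this model the A-to-C frame stays on the left-hand switch, while the A-to-D frame visits both switches, which is the extra hop across the inter-switch link.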

In a Multi-chassis Link Aggregation scenario it’s thus almost mandatory to configure a static port channel on the switches to which the vSphere servers are connected; otherwise you risk overloading the inter-switch link, as the traffic between adjacent ESX servers is needlessly sent across that link. When doing that, you (probably) have to configure IP-hash-based load balancing in vSwitch (more information about the vSwitch-side implications if the NICs in ESX are configured in active/standby configuration).
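
Why static port channels and IP-hash load balancing go together can be seen from a simplified model of the two uplink-selection policies. The sketch below is an illustration under assumptions, not VMware's actual implementation: it models IP hash as the XOR of the source and destination IPv4 addresses modulo the number of uplinks, and per-VM pinning as a fixed uplink per virtual port. The function names, port ID and addresses are hypothetical.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, n_uplinks):
    """Simplified IP-hash model: XOR both IPv4 addresses, modulo uplink count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % n_uplinks


def pinned_uplink(virtual_port_id, n_uplinks):
    """Simplified per-VM pinning model: the uplink depends only on the VM's port."""
    return virtual_port_id % n_uplinks


if __name__ == "__main__":
    vm_ip = "10.0.0.1"  # hypothetical VM address
    for peer in ("10.0.0.2", "10.0.0.3"):
        print(peer,
              "ip-hash uplink:", ip_hash_uplink(vm_ip, peer, 2),
              "pinned uplink:", pinned_uplink(0, 2))
```

In this model a single VM uses different uplinks for different destinations, so its source MAC address shows up on both links and the upstream switches have to treat those links as one port channel. With per-VM pinning the MAC address stays on a single link.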

It's much better (at least from Cisco’s perspective) to use Nexus 1000V – it supports LACP, ESX servers start behaving like access-layer switches and the traffic flow always remains optimal (at least within the boundaries of hot-potato switching).

More information

Interaction of vSwitch with link aggregation is just one of many LAG-related topics covered in my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).

12 comments:

  1. It's not necessarily as bad as the diagram indicates...yes, this is what happens when the VMs are on different ESX hosts, but any VMs in the same port group on the same host will be switched within the vSwitch. So in the diagram, if A and D are running on the same ESX and are on the same VLAN, they can be in the same port group and none of the traffic between them will leave the host. For this reason, we often add DRS affinity rules to keep VMs that exchange large amounts of data on the same host.

  2. I think Ivan's example is purposely built around the fact that A and D are on different ESX hosts. DRS affinity rules are nice, but somewhat manual...

  3. Ivan,

    Is this a workable scenario?

    If I understand correctly, you're saying that we've got two static MLAGs configured: one to each ESX host.

    Your second drawing appears to show MAC addresses being learned on *physical*ethernet*ports*, rather than the logical aggregate interfaces.

    Assuming the static aggregations are Po1 (left ESX) and Po2 (right ESX), then the resulting MAC->port mapping on *both* pSwitches should be:

    A -> Po1
    B -> Po1
    C -> Po2
    D -> Po2

    ...Because MACs don't get learned on link members of an aggregation.

    A frame from A to C will be okay, because its path is:
    - ingress left pSwitch on Po1
    - egress left pSwitch on Po2
    - ingress right ESX on the *correct* pNIC

    A frame from A to D will fail, because its path is:
    - ingress left pSwitch on Po1
    - egress left pSwitch on Po2
    - ingress right ESX on the *wrong* pNIC.

    A->D frames will ingress the ESX host on the pNIC that's pinned to VM C. I expect that the vSwitch split horizon bridging will drop this frame. Maybe it doesn't?

  4. The point is that you NEED static LAGs configured unless you want to get weird traffic flow shown in the diagrams. Without static LAGs you might get a lot of traffic across inter-switch link.

    And since you HAVE TO HAVE static LAGs, you also need IP-hash-based load balancing in ESX, otherwise the incoming frames arriving through the wrong port will get dropped.

  5. Also - reworded the introductory description to make lack of link aggregation more explicit. Thank you!

  6. Also - reworded the introductory description to make lack of link aggregation more explicit. Thank you, it was more than just a bit vague.

  7. Hrm, I don't think we're on the same page yet.

    Certainly ESX hash-based and static LAG *must* go together. There's no disagreement there.

    But I disagree with "Each VM appears to be single-homed to one of the switches."

    From the pSwitch perspective, each VM is homed to an *aggregation*, and MAC learning will happen on the aggregation, regardless of the link member where they arrive.

    Taking just the left pSwitch, all of A's frames will arrive on link member 0 and all of C's frames will arrive on link member 1... But the pSwitch won't notice this. The pSwitch will associate both MACs with the aggregate interface, and will forward downstream frames according to whichever hashing method is configured, totally ignoring which MAC showed up on which link member.

    Traffic won't flow across the inter-switch link. It will flow *down* the aggregate, and only *maybe* get delivered to the VMs.

  8. Chris, I think Ivan's statement, "Each VM appears to be single-homed to one of the switches," refers to the default port ID vSwitch NIC teaming policy, not the IP-hash policy, so there are no port channels to the ESX hosts in the scenario where that statement applies...

  9. Exactly!

  10. "the switches have been configured as a MLAG cluster"..."link aggregation is not used"

    Ah! Okay, I'd missed the "aggregation is not used" sentence until just now. Looking back at your previous comment, I guess this is the part that got clarified (after my misunderstanding of the topology was firmly cemented in my brain)

    The extra east-west hop would be nice to avoid.

  11. VM A and D are connected to different vSwitches if you look at the above diagram, so traffic will still need to go through the physical switch stack if those VMs need to communicate, even if they're on the same physical host and on the same VLAN. There is no way the traffic can flow between different vSwitches within the same host unless you purposefully introduce bridging between two vSwitches.

  12. Ivan Pepelnjak, 07 March, 2011 19:36

    VM-A and VM-D are in two different ESX servers.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.