
vSphere Does Not Need LAG Bandaids – the Network Does

Chris Wahl claimed in one of his recent blog posts that vSphere doesn't need LAG band-aids. He's absolutely right: vSphere's loop prevention logic removes the need for STP-blocked links, allowing you to use the full server uplink bandwidth without the complexity of link aggregation. Now let’s consider the networking perspective.

The conclusions Chris reached are perfectly valid in a classic data center or VDI environment where the majority of the VM-generated traffic leaves the data center; the situation is drastically different in data centers with predominantly east-west traffic (be it inter-server or IP-based storage traffic).

Let’s start with a simple scenario:

  • Data center is small enough to have only two switches (for a total of 4 Tbps or more of throughput – that should be enough for most use cases).
  • Two vSphere servers are connected to the two data center switches in a fully redundant setup (each server has one uplink to each ToR switch).
  • Load-Based Teaming (LBT) is used within vSphere instead of IP-based hashing (the vSphere load-balancing policy used with a Link Aggregation Group / LAG).
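The difference between the two policies in the last bullet can be sketched in a few lines. This is an illustrative simplification (hypothetical function names, toy hash), not VMware's actual implementation: IP hash pins every flow between an IP pair to one uplink regardless of load, while LBT moves a VM port off a congested uplink.

```python
import zlib

def ip_hash_uplink(src_ip, dst_ip, num_uplinks):
    """IP-hash policy (sketch): a flow between a given IP pair is
    pinned to one uplink by the hash, no matter how loaded it is."""
    return zlib.crc32(f"{src_ip}-{dst_ip}".encode()) % num_uplinks

def lbt_uplink(current_uplink, utilization, threshold=0.75):
    """Load-Based Teaming (sketch): keep the VM port on its current
    uplink until utilization crosses the threshold, then move the
    port to the least-loaded uplink."""
    if utilization[current_uplink] <= threshold:
        return current_uplink
    return min(range(len(utilization)), key=utilization.__getitem__)

# With IP hash, repeated flows between the same hosts share one uplink...
assert ip_hash_uplink("10.0.0.1", "10.0.0.2", 2) == \
       ip_hash_uplink("10.0.0.1", "10.0.0.2", 2)
# ...while LBT moves a VM port off an uplink running at 90% load.
print(lbt_uplink(0, [0.9, 0.2]))  # moves to uplink 1
```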

The two ToR switches are not aware of the exact VM placement, resulting in traffic flowing across the inter-switch link even when it could be exchanged locally (yes, I wrote about this issue almost exactly three years ago).


Physical connectivity


VM MAC reachability as seen by the switches


Traffic flow between VMs on adjacent hypervisor hosts

Can we fix it?

You can fix this problem by making most endpoints equidistant. You could introduce a second layer of switches (resulting in a full-blown leaf-and-spine fabric) or you could connect the servers to a layer of fabric extenders, which would also ensure the traffic between any two endpoints gets equal treatment.

Traffic between VMs appearing near to each other in a leaf-and-spine fabric gets better treatment than any other traffic (no leaf-to-spine oversubscription); in the FEX scenario all traffic gets identical treatment as FEX still doesn’t support local switching.

The leaf-and-spine (or FEX) solution obviously costs way more than a simple two-switch solution, so you just might consider using link aggregation and LACP with vSphere.

But LBT works so much better than IP hash mechanisms

Sure it does (and Chris provided some very good arguments for that claim in his blog post), but there’s nothing in the 802.1AX standard dictating the traffic distribution mechanism on a LAG. VMware could have decided to use LBT with a LAG, but they didn’t (because deep inside the virtual switch they tie the concept of a link aggregation group to the IP hash load balancing mechanism). Don’t blame standards and concepts for suboptimal implementations ;)

But aren’t static port channels unreliable?

Of course they are; I wouldn’t touch them with a 10-foot pole. You should always use LACP to form a LAG, but VMware supports LACP only in the distributed switch, which requires Enterprise Plus license. Yet again, don’t blame the standards or network design requirements, blame a vendor that wants to charge extra for baseline layer-2 functionality.

Is there another way out of this morass?

Buying extra switches (or fabric extenders) is too expensive. Buying an Enterprise Plus license for every vSphere host just to get LACP support is clearly out of the question. Is there something else we could do? Of course – you can make sure the inter-switch link gets enough bandwidth.

Typical high-end ToR switches (from almost any vendor) have 48 10GE ports and four 40GE ports. Using the 40GE ports for inter-switch links results in worst-case 3:1 oversubscription (48 10GE ports on the left-hand switch communicate exclusively with the 48 10GE ports on the right-hand switch over an equivalent of 16 10GE ports). Problem solved (until you have to add more servers).
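The arithmetic behind that worst-case ratio is easy to double-check (port counts as assumed in the paragraph above):

```python
# Per-switch edge capacity: 48 x 10GE ports facing the servers.
edge_gbps = 48 * 10   # 480 Gbps

# Inter-switch link: 4 x 40GE ports, the equivalent of 16 x 10GE.
isl_gbps = 4 * 40     # 160 Gbps

# Worst case: every edge port talks only to the other switch,
# so all 480 Gbps must cross the 160 Gbps inter-switch link.
oversubscription = edge_gbps / isl_gbps
print(f"{oversubscription:.0f}:1")  # 3:1
```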

Need more?

I’ll discuss typical challenges of building reasonably sized data centers in my Building a Small Private Cloud webinar.

15 comments:

  1. Hi Ivan,

    What's the max reasonable oversubscription you recommend? Would Chassis Trunking from Brocade be interesting as Aggregation L3 Switch Solution ?

    Thanks

    ReplyDelete
  2. There's no "reasonable" number - you should know your traffic (and use 3:1 when you're clueless).

    ReplyDelete
  3. Okay, so just set all hosts to prefer port 1, problem solved..

    ReplyDelete
    Replies
    1. ... and waste half the ports and half the switching bandwidth. Congratulations.

      Delete
  4. Thanks for taking the time to go deeper into this topic, Ivan. Your 5K/2K architecture seems most common in the field, although many 10 Gb installs opt to use a 5K as the ToR switch. Perhaps this gives a further nod to the 1000v's vPC pinning using the "free" edition?

    ReplyDelete
  5. LACP is included in the free edition of Nexus 1000v. There is no reason NOT to use LACP for your vhost uplinks, except if you "hate" Cisco, which isn't an argument.

    Personally I would be much happier if the Nexus 1000v implemented FabricPath, alleviating the need for any kind of LAG on the vhost uplinks. But that would be a logical and efficient way to connect vhosts to a network, so I'm not expecting Cisco to go that way!

    ReplyDelete
  6. The ToRs can get connected with a cheap 40 Gbps twinax cable. The extra switch hop adds less than a microsecond of latency. Who cares about optimizing that?

    ReplyDelete
    Replies
    1. That's perfectly fine - as long as the 40 Gbps E-W bandwidth is enough, and you have the extra ports. Once you get closer to the physical limits, optimization becomes more important.

      Delete
  7. We bumped into a problem that caused a 30-second outage, probably spanning tree.
    The vSwitch is dual-connected to N4Ks that have multiple uplinks to a vPC N5K pair.

    When we reload an N4K, the VM failover is perfect, but when the N4K comes back we lose half the VMs for 30 seconds.

    My understanding is that the vSwitch starts using the new uplink immediately as the physical port comes up, but spanning tree on the uplink ports is still learning the topology.

    ReplyDelete
    Replies
    1. Thanks for sharing this one. You need "portfast" or an equivalent on the switch to prevent the ports from going through the "listening" phase.

      Delete
    2. Pieter, check your vSwitch Failback setting. To avoid such a failure scenario you may want to set it to No.

      To understand the issue, you may want to read this blog post: http://frankdenneman.nl/2010/10/22/vswitch-failover-and-high-availability/

      Delete
  8. What about iSCSI traffic?
    Isn't MPIO preferred over LACP, since MPIO can use all the links simultaneously?

    ReplyDelete
    Replies
    1. MPIO is definitely better than LACP. Separate (logical) interfaces with Adapter FEX might be the answer.

      Delete
  9. Hi Ivan,

    I'm currently working on an environment where VMware ESXi servers are dual-homed (and statically LAGged) to a pair of FEX 2232. These extenders are currently single-homed to a pair of Nexus 5548 on vPC. I want to move away from the LAGs at the server level but by doing this I'll start running into the problems you describe above (potential over-subscription of the vPC peer-link, etc.). So, dual-homing the FEXes seems like the logical thing to do.

    I've been doing some research into the reasons why I would like to avoid dual-homing the FEXes and so far I've come up with three:

    a) Config of the FEX's interfaces needs to be defined on both Nexus switches... for instance port 100/1/25 exists on both parent Nexus switches, so the config needs to be consistent across both --> not a big deal in my book... plus using config-sync and switch profiles should help simplify or avoid this problem

    b) The total number of FEXes supported by the vPC pair is cut in half, since every parent Nexus switch will be connected to all the FEXes --> not a big deal either for the environment I'm working on

    c) On a single-homed deployment every FEX is dependent upon a single parent Nexus. If the parent Nexus "A" fails or crashes, only the "A" FEXes will be affected. On a dual-homed deployment all the FEXes are dependent upon both parent Nexus switches... under certain circumstances this could be defined as a SPOF. Of course an argument could be had regarding the vPC concept as being a SPOF itself... at some point you have to trust the vendor and its code, right?

    What's your take on these problems, particularly on the third one. Do you think there are any other limitations worth considering before migrating from a single-homed to a dual-homed config?

    Thanks!

    ReplyDelete
    Replies
    1. Dual-homed FEXes might be the best option, as they ensure the load from server uplinks is (somewhat) evenly distributed across both switches.

      Never went into the details of how dual-homed FEXes handle upstream switch failure, but AFAIK it's not a SPOF.

      Delete

You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.