VMware vSwitch does not support LACP

This is very old news to any seasoned system or network administrator dealing with VMware/vSphere: the vSwitch and vNetwork Distributed Switch (vDS) do not support Link Aggregation Control Protocol (LACP). Multiple uplinks from the same physical server cannot be bundled into a Link Aggregation Group (LAG, also known as port channel) unless you configure a static port channel on the adjacent switch’s ports.

When you use the default (per-VM) load balancing mechanism offered by vSwitch, the drawbacks caused by lack of LACP support are usually negligible, so most engineers are not even aware of what’s (not) going on behind the scenes.

Let’s start with the simplest possible topology: an ESX server connected to a switch with two parallel links. Ideally, the two parallel links would be placed in a LAG, or one of them would be blocked by STP. As vSwitch supports neither LACP nor STP, both links are active and forwarding loops in the network are prevented by vSwitch’s split horizon switching.

The upstream switch is not aware that the two parallel links terminate in the same physical device. It considers them connected to two separate hosts (or switches) and uses the standard source-MAC-address-based learning to figure out how to forward data to virtual machines A-D. Assuming that the VMs A and B use the first uplink and C and D use the second one, the switch builds a split view of the network in its MAC address table: the MAC addresses of A and B are learned on the first port, and those of C and D on the second.

The split view of the ESX server is not a problem as long as the vSwitch performs per-VM load balancing – the MAC address table is stable and all traffic flows are symmetrical. The only drawback is the limited load balancing capability: a single VM can never use both links.
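The stable MAC address table can be reproduced with a few lines of Python. This is only a sketch of source-MAC learning; the port names (`Gi0/1`, `Gi0/2`), the MAC labels and the `learn` helper are hypothetical and chosen for illustration.

```python
# Minimal model of source-MAC learning on the upstream switch.
# Port names, MAC labels and the helper name are hypothetical.

def learn(frames):
    """Return (mac_table, moves) after processing (src_mac, ingress_port) frames."""
    mac_table = {}
    moves = 0
    for src_mac, port in frames:
        previous = mac_table.get(src_mac)
        if previous is not None and previous != port:
            moves += 1  # the MAC address was relearned on another port
        mac_table[src_mac] = port
    return mac_table, moves

# Per-VM load balancing: every VM is pinned to exactly one uplink.
pinning = {"mac-A": "Gi0/1", "mac-B": "Gi0/1", "mac-C": "Gi0/2", "mac-D": "Gi0/2"}

# Each VM sends three frames; all of them leave through its pinned uplink.
frames = [(mac, port) for _ in range(3) for mac, port in pinning.items()]

table, moves = learn(frames)
print(table)
print("MAC moves:", moves)
```

With per-VM pinning each MAC address is always seen on the same port, so the table never changes after the first frame from each VM.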

Do we really need static LAG?

The behavior of our small network becomes more erratic if we enable IP-hash-based load balancing on the vSwitch. All of a sudden the same source MAC address starts appearing on both links (the same VM can use both links for different TCP or UDP sessions) and the MAC address table on the switch becomes “somewhat” more dynamic.
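To see why the same MAC address starts moving between ports, here is a rough sketch of per-session uplink selection. It assumes the hash is a XOR of the source and destination IPv4 addresses taken modulo the number of uplinks; the exact function used by vSwitch may differ, and the addresses, port names and `ip_hash_uplink` helper are made up for the example.

```python
import ipaddress

def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index from a XOR of source and destination IPv4 addresses.

    Assumption: XOR-then-modulo merely illustrates a per-session hash.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count

uplinks = ["Gi0/1", "Gi0/2"]          # hypothetical pSwitch ports
vm_ip = "10.0.0.10"                   # a single VM with a single MAC address
sessions = ["10.0.1.20", "10.0.1.21", "10.0.1.22", "10.0.1.23"]

mac_table = {}
moves = 0
for dst in sessions:
    port = uplinks[ip_hash_uplink(vm_ip, dst, len(uplinks))]
    # Without a LAG, the pSwitch relearns the MAC whenever the port changes.
    if mac_table.get("mac-A", port) != port:
        moves += 1
    mac_table["mac-A"] = port
    print(f"{vm_ip} -> {dst}: {port}")

print("MAC moves seen by the switch:", moves)
```

A single VM talking to four destinations ends up on both uplinks, so a switch that treats the two ports as independent keeps relearning the same MAC address.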

VMware recommends enabling static LAG on the switch in combination with per-session vSwitch load balancing. This recommendation makes perfect sense, as it prevents MAC address table thrashing, but it also disables detection of LAG wiring/configuration errors.

Update 2011-01-26 (based on readers’ comments)

Without synchronized ESX-switch configuration you can experience one of the following two symptoms:

  • Enabling static LAG on the physical switch (pSwitch), but not using IP-hash-based load balancing on vSwitch: frames from the pSwitch will arrive at the ESX host through an unexpected interface and will be ignored by vSwitch. This is definitely true if you use an active/standby NIC configuration in vSwitch, and probably also true in an active/active per-VM-load-balancing configuration (I need to test it, but I suspect loop prevention checks in vSwitch might kick in).
  • Enabling IP-hash-based load balancing in vSwitch without corresponding static LAG on the pSwitch: pSwitch will go crazy with MACFLAP messages and might experience performance issues and/or block traffic from the offending MAC addresses (Duncan Epping has experienced a nice network meltdown in a similar situation).
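How a switch decides that a MAC address is flapping is platform dependent. One reader quotes a Cisco TAC reply in the comments saying that Cat4k declares a host flapping after 4 or more moves of a single source MAC within a window of about 15 seconds. A minimal sliding-window sketch of such a check could look like this; the `FlapDetector` class, its parameters and the port names are hypothetical, and other platforms use different thresholds.

```python
from collections import deque

class FlapDetector:
    """Sliding-window MAC move counter (illustrative, not any vendor's code)."""

    def __init__(self, max_moves=4, window=15.0):
        self.max_moves = max_moves   # moves needed to declare a flap
        self.window = window         # window length in seconds
        self.last_port = {}          # MAC -> port where it was last seen
        self.move_times = {}         # MAC -> timestamps of recent moves

    def observe(self, mac, port, now):
        """Record a frame; return True if the MAC is considered flapping."""
        previous = self.last_port.get(mac)
        self.last_port[mac] = port
        times = self.move_times.setdefault(mac, deque())
        if previous is not None and previous != port:
            times.append(now)
        # Forget moves that fell out of the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.max_moves

# The same MAC alternates between two ports once per second.
detector = FlapDetector()
ports = ["Gi0/1", "Gi0/2", "Gi0/1", "Gi0/2", "Gi0/1"]
results = [detector.observe("mac-A", port, float(t)) for t, port in enumerate(ports)]
print(results)
```

In this run the fourth move arrives four seconds after the first frame, so the last observation is reported as flapping; moves spaced ten seconds apart would never reach the threshold.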

More information

Interaction of vSwitch with link aggregation is just one of many LAG-related topics covered in my Data Center 3.0 for Networking Engineers webinar (buy a recording or yearly subscription).

22 comments:

  1. Can you define "per-session load balancing" in VMware terms? Do you mean the "Route based on IP-hash" policy? If so, then yes, you need to configure the port-channel interface and use "channel-group X mode on" in the member interface configs. It also works best if you configure global command "port-channel load-balance src-dst-ip". There are a couple items/issues I know of that it helps to be aware of if you go this route, linked below.

    http://kb.vmware.com/kb/1001938
    http://www.yellow-bricks.com/2010/08/06/standby-nics-in-an-ip-hash-configuration/

    If you're using vSphere 4.1 and a distributed vSwitch, you might also be interested in the Load Based Teaming option: http://kb.vmware.com/kb/1022590.

    Cheers,
    -Loren

  2. Indeed I had the "IP-hash-based load balancing" in mind. Fixed the terminology. Thanks.

    Thank you for all the links. They give interesting insight into how vSwitch actually works, but the fundamental question remains: what happens (apart from MAC table thrashing in the pSwitch) if you enable "IP-hash-based" LB in vSwitch but do not configure the EtherChannel on the pSwitch?

  3. I use 'mode on' on the access switch and src-dst IP hashing on both the vSwitch and the access switch. I also enable bpdu-guard on the VMware-facing ports.

    I wish that you could do port-security on etherchannels so that you can limit MACs to some reasonable number and sticky MAC learning to prevent MAC thrashing.

    While we're in the neighborhood I'd like to remind everyone that you can run the port-channel hash with the 'test etherchannel load-balance interface' command. Fun.

  4. Hey Ivan,

    Something funny happened - sorry if this shows up as a double post.

    "Are you aware of the specific drawbacks of using per-session load balancing without static LAG on the switch?"

    You're not suggesting that someone would run this way, ignore the "MAC is flapping" messages and let the CAM table thrash, are you?

    Some Cisco platforms will drop frames for "flapping" destinations each time that message is logged. It's a loop prevention thing: don't forward frames that might loop endlessly.

    The duration of the drop interval is in the tens of seconds each time the "flap" threshold (moves/interval) is exceeded.

    Frustratingly, exactly what constitutes a "flap", whether traffic is dropped and for how long is platform and OS dependent.

    Static LAG (channel-group X mode on) comes with its own set of drawbacks, of course. There are ways for things to go wrong that LACP would notice, but "on" will not.

  5. I'm not suggesting you should do that. Maybe I need to reword the question ;)

    I did not know what exactly the reaction to CAM table thrashing would be (never tested it in the lab) and you provided just the answer I needed. Thank you!

  6. I may have made this up, but...

    I have the idea that programming TCAM is an expensive operation. Expensive relative to moving frames around anyway. I don't think it's something you want the switch to be doing thousands of times per second. :-)

    ...Never mind the logging overhead it creates.

    I opened a TAC case a few years ago to find out exactly what constitutes "a flap". The answer was: "what platform?"

    After a little digging, TAC replied:

    ------------------------------
    The host flapping detection behavior is somewhat different between Cat4k
    CatOS and Cat4k IOS. The big difference between Cat4k CatOS and Cat4k IOS
    is, in CatOS, the cat4k drops traffic from the flapping host for
    approximately 15 seconds. In IOS, the cat4k does not drop traffic because
    of host flapping.

    Both Cat4k CatOS and IOS use the following algorithm to declare a host
    flapping:

    If the supervisor sees 4 or more moves between ports from a single source mac
    in a window of around 15 seconds, then it declares the host to be flapping.
    ------------------------------

  7. Does the "IP-hash" vSwitch policy preclude access switch diversity?

    Assuming vPC/VSS/SMLT style MLAG isn't available, can the traditional vSwitch create two "IP-hash" aggregations to two different switches?

    We'd need the vSwitch algorithm to do split-horizon bridging between two uplink bundles, and then do IP-hash link selection within these "mode on" aggregations.

    If this sort of scheme isn't possible, then "IP-hash" balancing suggests that you can only have a single access switch for an ESX server. Not very resilient!

  8. Oh the frustration of %SW_MATM-4-MACFLAP_NOTIF message :D Honestly, I am not aware of any precise flap suppression limits in different Cisco switches as this hasn't been well documented. I wonder if you could disable MAC-address learning on the upstream switch :D This, of course, will make it utilize both downstream ports equally, but will remove any flapping MAC address learning issues.

  9. That Guest was me, apparently :)

  10. Yes, that's one reason we chose not to use it. We have an absolute requirement for device redundancy at every layer... If the vSwitch interfaces are connected to different (non-VSS-connected) northbound switches, then you cannot use the IP-hash vswitch policy.

    And no, a vSwitch cannot create two LAGs to different switches...that's been on my wishlist for a couple years now...

  11. Ivan, all that happens is exactly what you say...the vSwitch starts distributing packets across multiple northbound interfaces (based on a src-dst-ip hash). So the MAC shows up on multiple interfaces on the switch, confusing the heck out of the switch.

    If you do this on the vswitch with the Service Console interface, the ESX host will likely become unmanageable via the network and you'll have to console into it to fix things.

  12. To me your article should have said "Dynamic LACP is not supported on vSwitches". To my understanding, to use aggregates you must use IP hash on the NICs in the portgroup on the vSwitch, and then on the physical switch ports that the NICs connect to you can use EtherChannel (it is static by definition), static LACP or static 802.3ad. You cannot use dynamic LACP or dynamic 802.3ad on the physical switch ports since LACP/802.3ad are not supported on the vSwitch. Dynamic means the protocol (LACP/802.3ad) is only enabled if the other side supports it.

  13. Forgot to include this link from Scott Lowe that does a nice job of explaining NIC utilization: http://blog.scottlowe.org/2008/07/16/understanding-nic-utilization-in-vmware-esx/

  14. There is no "dynamic LACP". A bundle of links (what is otherwise known as EtherChannel or Port Channel) is officially called Link Aggregation Group and is standardized in 802.3ad/802.1AX. A LAG can be statically configured or negotiated dynamically with LACP.

    There is no LACP in vSwitch, but it does have something that resembles static LAG.

  15. The vSwitch has been able to use an LACP connection to the physical switch for years now...

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048

    http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/vmware/VMware.html

  16. Michael, please note there is a "slight" distinction between being able to send packets across two or more links (Link Aggregation Group = LAG, thus the term "static LAG") and supporting the __standard signaling protocol__ defined in the 802.3ad standard (LACP).

    VMware supports static LAG (or EtherChannel or Port Channel), but not LACP.

  17. Hi Ivan,

    So my question may be redundant, but is this why, when looking at the port statistics of the LAG group on a 48-port DGS-1210-48 switch, one of the members of the LAG group (2 Intel NICs on a Dell VMware server) is receiving and transmitting, while the second member is ONLY transmitting packets? Is this because the MAC address of the vSwitch can really only be assigned to one of the NICs?

    The vSwitch is set to Load Balance, route based on ip hash. The switch's LAG configuration is static. If we set it to LACP, we lose connection to the internal VMs. A third nic keeps us connected to the management interface.

    Luke

  18. Ivan Pepelnjak, 28 March, 2012 23:51

    The setup you have seems OK (BTW, you can't use LACP. ESX does not support it, so the link will never come up). I'm guessing your problem might be the load balancing algorithm the switch (or VMware) uses. Just because you've enabled "ip hash based" load balancing doesn't mean that you'll get per-packet load sharing (you won't; that would break some applications). According to VMware's documentation they select the outbound link based on a hash of source+destination IP address.

    Also, just because you have a LAG on the switch doesn't mean that the switch won't do load balancing solely based on destination MAC address (or a combination of source+destination MAC).

    Hope this helps
    Ivan

  19. Is IP Hash the best option here?



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.