VMware Virtual Switch: no need for STP

During the Data Center 3.0 webinar (register here) I always mention that you can connect a VMware ESX server (with embedded virtual switch) to the network through multiple active uplinks without link aggregation. The response is very predictable: I get a few “how does that work” questions within the next few seconds.

VMware did a great job with the virtual switch embedded in the VMware hypervisor (vNetwork Standard Switch – vSS – or vNetwork Distributed Switch – vDS): it uses special forwarding rules (I call them split horizon switching, Cisco UCS documentation uses the term End Host Mode) that prevent forwarding loops without resorting to STP or port blocking.

Physical and virtual NICs

The ports of a virtual switch (Nexus 1000V uses the term Virtual Ethernet Module – VEM) are connected to either physical Network Interface Cards (uplinks) or virtual NICs of the virtual machines. The physical ports can be connected to one or more physical switches and don’t have to be aggregated into a port channel.

Ports are not equal

In a traditional Ethernet switch, the same forwarding rules are used for all ports. The virtual switch uses different forwarding rules for vNICs and uplinks.

No MAC address learning

The hypervisor knows the MAC addresses of all virtual machines running in the ESX server; there’s no need to perform MAC address learning.
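
A minimal Python sketch of the idea, not VMware's implementation: the MAC table is populated when the hypervisor attaches a vNIC, and traffic never changes it. The class and method names (`VirtualSwitch`, `attach_vnic`, `receive`) are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VirtualSwitch:
    """Toy model: MAC entries come from VM configuration, never from traffic."""
    mac_table: Dict[str, str] = field(default_factory=dict)  # MAC -> vNIC port

    def attach_vnic(self, mac: str, port: str) -> None:
        # Called when the hypervisor connects a VM's virtual NIC.
        self.mac_table[mac.lower()] = port

    def detach_vnic(self, mac: str) -> None:
        # Called when the VM is powered off or moved to another host.
        self.mac_table.pop(mac.lower(), None)

    def receive(self, src_mac: str, ingress_port: str) -> None:
        # Deliberately no learning: the source MAC is not recorded.
        return None


vswitch = VirtualSwitch()
vswitch.attach_vnic("00:50:56:AA:00:01", "vnic-web")
vswitch.receive("00:11:22:33:44:55", "uplink0")  # outside MAC is not learned
print(vswitch.mac_table)
```

A learning bridge would have added the outside MAC address to its table after the `receive` call; here the table still holds only the configured vNIC.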

Spanning Tree Protocol is ignored

The virtual switch does not run Spanning Tree Protocol (STP) and does not send STP Bridge Protocol Data Units (BPDUs). STP BPDUs received by the virtual switch are ignored. Uplinks are never blocked based on STP information.

The switch ports to which you connect the ESX servers should be configured with bpduguard to prevent forwarding loops due to wiring errors. As ESX doesn’t run STP, you should also configure spanning-tree portfast on these ports.

Split-horizon forwarding

Packets received through one of the uplinks are never forwarded to other uplinks. This rule prevents forwarding loops through the virtual switch.
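
The split-horizon rule can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (port names are plain strings, and the function name `egress_candidates` is hypothetical), not the actual vSwitch code.

```python
def egress_candidates(ingress: str, uplinks: set, vnics: set) -> set:
    """Return the ports a frame is allowed to leave through."""
    if ingress in uplinks:
        # Split horizon: a frame that arrived on an uplink can only be
        # delivered to virtual machines, never to another uplink.
        return set(vnics)
    # A frame sent by a VM may go to other VMs or toward the uplinks
    # (which single uplink is actually used is a separate decision).
    return (vnics | uplinks) - {ingress}


UPLINKS = {"uplink0", "uplink1"}
VNICS = {"vm-a", "vm-b"}

print(egress_candidates("uplink0", UPLINKS, VNICS))
print(egress_candidates("vm-a", UPLINKS, VNICS))
```

Because no uplink ever appears in the result for an uplink-received frame, the virtual switch cannot become a transit path between two physical switches.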

Limited flooding of broadcasts/multicasts

Broadcast or multicast packets originated by a virtual machine are sent to all other virtual machines in the same port group (VMware terminology for a VLAN). They are also sent through one of the uplinks like a regular unicast packet (they are not flooded through all uplinks). This ensures that the outside network receives a single copy of the broadcast.

The uplink through which the broadcast packet is sent is chosen based on the load balancing mode configured for the virtual switch or the port group.
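
As an illustration of one possible load balancing mode, the sketch below pins each source MAC address to one uplink. The hash used here (the MAC address as an integer, modulo the number of uplinks) is an assumption made for the example; the hash actually used by VMware is not described in this post.

```python
def pick_uplink(src_mac: str, uplinks: list) -> str:
    """Pin a source MAC address to one uplink (illustrative hash only)."""
    mac_value = int(src_mac.replace(":", ""), 16)
    return uplinks[mac_value % len(uplinks)]


uplinks = ["uplink0", "uplink1"]
print(pick_uplink("00:50:56:aa:00:01", uplinks))
print(pick_uplink("00:50:56:aa:00:02", uplinks))
```

The point is that the choice is deterministic per virtual machine: all frames from the same source MAC address (unicast or broadcast) leave through the same uplink, so the physical switches see each MAC address on a single port.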

Broadcasts/multicasts received through an uplink port are sent to all virtual machines in the port group (identified by VLAN tag), but not to other uplinks (see split-horizon forwarding).
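
Both directions of the broadcast/multicast handling described above can be sketched as follows. The function name `broadcast_targets` and the `pinned_uplink` parameter (the uplink already chosen by the load balancing mode) are hypothetical names for this example.

```python
def broadcast_targets(ingress, group_vnics, uplinks, pinned_uplink):
    """Return the ports that get a copy of a broadcast/multicast frame."""
    if ingress in uplinks:
        # Received from the outside: deliver to the VMs in the port group only.
        return sorted(group_vnics)
    # Sent by a VM: every other VM in the port group, plus exactly one uplink.
    local = sorted(p for p in group_vnics if p != ingress)
    return local + [pinned_uplink]


UPLINKS = {"uplink0", "uplink1"}
GROUP = {"vm-a", "vm-b", "vm-c"}

print(broadcast_targets("vm-a", GROUP, UPLINKS, "uplink1"))
print(broadcast_targets("uplink0", GROUP, UPLINKS, "uplink1"))
```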

No flooding of unknown unicasts

Unicast packets sent from virtual machines to unknown MAC addresses are sent through one of the uplinks (selected based on the load balancing mode). They are not flooded.

Unicast packets received through the uplink ports and addressed to unknown MAC addresses are dropped.
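
A rough sketch of the unicast decision, assuming a static MAC table that maps local VM MAC addresses to vNIC ports. The names (`unicast_target`, `pinned_uplink`) are hypothetical; `None` stands for "drop the frame".

```python
from typing import Dict, Optional, Set


def unicast_target(dst_mac: str, ingress: str, mac_table: Dict[str, str],
                   uplinks: Set[str], pinned_uplink: str) -> Optional[str]:
    """Return the single egress port for a unicast frame, or None to drop it."""
    port = mac_table.get(dst_mac)
    if port is not None:
        return port           # destination is a local virtual machine
    if ingress in uplinks:
        return None           # unknown destination from the outside: drop
    return pinned_uplink      # unknown destination from a VM: one uplink


MAC_TABLE = {"00:50:56:aa:00:01": "vm-a"}
UPLINKS = {"uplink0", "uplink1"}

print(unicast_target("00:50:56:aa:00:01", "uplink0", MAC_TABLE, UPLINKS, "uplink1"))
print(unicast_target("00:11:22:33:44:55", "uplink0", MAC_TABLE, UPLINKS, "uplink1"))
print(unicast_target("00:11:22:33:44:55", "vm-a", MAC_TABLE, UPLINKS, "uplink1"))
```

In no branch does the function return more than one port, which is the difference from a traditional switch that floods unknown unicasts to all ports in the VLAN.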

Reverse Path check based on source MAC address

The virtual switch sends a single copy of a broadcast/multicast/unknown unicast packet to the outside network (see the no flooding rules above), but the physical switch always performs full flooding and sends copies of the packet back to the virtual switch through all other uplinks. VMware thus has to check the source MAC addresses of packets received through the uplinks. A packet received through one of the uplinks with a source MAC address belonging to one of the local virtual machines is silently dropped.
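
The reverse path check boils down to a set lookup. The sketch below assumes the set of local VM MAC addresses is stored in lowercase; the function name `accept_from_uplink` is hypothetical.

```python
def accept_from_uplink(src_mac: str, local_vm_macs: set) -> bool:
    """Reverse path check for a frame received through an uplink.

    A frame whose source MAC belongs to a local VM must be a copy of our own
    traffic flooded back by the physical network, so it is rejected.
    """
    return src_mac.lower() not in local_vm_macs


LOCAL_VM_MACS = {"00:50:56:aa:00:01", "00:50:56:aa:00:02"}

print(accept_from_uplink("00:50:56:AA:00:01", LOCAL_VM_MACS))  # reflected copy
print(accept_from_uplink("00:11:22:33:44:55", LOCAL_VM_MACS))  # outside host
```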

Hyper-V is different

If your server administrators are used to the ESX virtual switch and start deploying Microsoft’s Hyper-V, expect a few fun moments (Microsoft is making sure you’ll have them: the Hyper-V Virtual Networking Best Practices document does not mention redundant uplinks at all).

A potential solution is to bridge multiple physical NICs in Hyper-V ... creating a nice forwarding loop that will melt down large parts of your network (more so if you do that on a powerful server with multiple 10GE uplinks like the Cisco UCS blade servers). Configuring bpduguard on all portfast switch ports is thus a must.

7 comments:

  1. Hi Ivan

    This sounds like the same technology they use in UCS integral nexus 'interconnect'. They use mac-pinning on the uplinks and you can selectively choose which VLANs actively traverse each link if you so wish. Any particular reason you didn't use the term mac-pinning?

  2. Are you talking about UCS interconnect (6100) or UCS fabric extender (2100)? UCS interconnect can work in "End host mode", which I've mentioned at the beginning.

    The "Mac pinning" is just one of the possible load balancing methods vSwitch can use. VMware uses the descriptive term "Route based on source MAC hash".

  3. I'm talking about the 6100 yeah. I'm not sure if they have different configurable load balancing methods too, I thought mac pinning was the only option to be honest.

  4. This might help: http://bradhedlund.com/2010/06/22/cisco-ucs-networking-best-practices/

  5. Ah, Brad looks like a useful man to know!

  6. I've seen some recommendations about enabling bpduguard with ESX before, but could never get an answer to my concern... What happens with spoofed BPDU frames from a VM? Assuming ESX forwards the BPDU frame, the switch will err-disable the interface due to bpduguard, then ESX will move the VM to the next interface, which err-disables as soon as it sees a BPDU, and so on until all interfaces are err-disabled. If the ESX host is in an HA cluster, when it becomes isolated because all interfaces are disabled, VMware HA will kick in, shut down the VMs, and start the VMs on another host. At that point, the VM sending BPDUs would isolate that host as well and trigger another HA event. Soon, all the interfaces connected to the entire cluster would be err-disabled and none of the VMs would be powered up (due to the final HA event).

    Of course, if ESX doesn't forward the BPDUs from a VM, then none of that happens and enabling bpduguard is safe. Definitely not worth the risk until I'm sure, though! :)

  7. A VM may send traffic with a source MAC address different from the one known to vSphere (e.g. Microsoft NLB). In such a case, I've seen the Nexus 1000V learn it and place dynamic entries in the MAC address table.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.