VXLAN: MAC-over-IP-based vCloud networking
In one of my vCloud Director Networking Infrastructure rants I wrote “if they had decided to use IP encapsulation, I would have applauded.” It’s time to applaud: Cisco has just demonstrated Nexus 1000V supporting MAC-over-IP encapsulation for vCloud Director isolated networks at VMworld, solving at least some of the scalability problems MAC-in-MAC encapsulation has.
Once the new release becomes available, the Nexus 1000V VEM will be able to encapsulate MAC frames generated by virtual machines residing in isolated segments into UDP packets exchanged between VEMs.
The MAC-in-IP encapsulation seems to be based on the VXLAN draft (released just a few days ago). The VXLAN packet header includes a 24-bit segment ID, allowing you to create 16 million virtual segments. Using pseudo-random source UDP ports (probably generated by hashing the headers of the original MAC frame), you get very good 5-tuple-based load balancing between the Nexus 1000V VEM and the physical switch while still preserving inter-VM packet order.
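The encapsulation logic is simple enough to sketch in a few lines of Python. This is just an illustration of the header format and source-port hashing described in the draft, not Cisco’s implementation; the UDP port constant and the CRC-based hash are my assumptions:

```python
import struct
import zlib

VXLAN_UDP_PORT = 8472     # placeholder; the draft leaves the port to IANA
VXLAN_FLAG_VNI = 0x08     # "I" flag: the VNI field is valid

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> tuple[int, bytes]:
    """Wrap an inner Ethernet frame in an 8-byte VXLAN header and pick
    a pseudo-random UDP source port by hashing the inner MAC addresses,
    so one VM-to-VM flow always takes the same ECMP path (preserving
    packet order) while different flows spread across paths."""
    assert 0 <= vni < 2 ** 24, "the segment ID is a 24-bit value"
    # flags (1 byte) + reserved (3) + VNI (3) + reserved (1)
    header = struct.pack("!B3s3sB", VXLAN_FLAG_VNI, b"\x00" * 3,
                         vni.to_bytes(3, "big"), 0)
    # Hash the inner dst+src MAC (first 12 bytes of the frame) into the
    # ephemeral port range; a real VEM may hash more header fields.
    src_port = 49152 + zlib.crc32(inner_frame[:12]) % 16384
    return src_port, header + inner_frame
```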
IP multicast is used to handle layer-2 flooding (broadcasts, multicasts and unknown unicasts). Support for layer-2 flooding allows everyone involved to pretend they’re still dealing with a traditional L2 broadcast domain (and use dynamic MAC learning); not an ideal solution (I would prefer an Amazon-like prohibition of flooding combined with ARP caching), but still much better than what vCDNI offers today. If a VM running in a MAC-over-IP virtual segment goes bonkers, the damage will be limited to the ESX servers hosting VMs in the same virtual segment and the multicast path between them; with MAC-in-MAC encapsulation, the whole data center is affected.
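To make the flood-or-forward behavior concrete, here’s a minimal sketch assuming one multicast group per segment and a per-segment dynamic MAC table (all names hypothetical, not Cisco’s data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    vni: int
    mcast_group: str                               # one IP multicast group per VNI
    mac_table: dict = field(default_factory=dict)  # inner MAC -> remote VTEP IP

def learn(seg: Segment, src_mac: bytes, src_vtep: str) -> None:
    # Called when decapsulating a received packet: dynamic MAC learning
    # works exactly as in a traditional L2 switch, just keyed to VTEP IPs.
    seg.mac_table[src_mac] = src_vtep

def outer_destination(seg: Segment, dst_mac: bytes) -> str:
    # Broadcasts/multicasts (group bit set in the first MAC byte) and
    # unknown unicasts are flooded to the segment's multicast group;
    # known unicasts go straight to the learned VTEP. The blast radius
    # is limited to the hosts that joined this VNI's group.
    if dst_mac[0] & 0x01:
        return seg.mcast_group
    return seg.mac_table.get(dst_mac, seg.mcast_group)
```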
As one would expect from a Nexus-based product, the new Nexus 1000V has a decent range of QoS features, allowing you to define per-tenant SLAs. With full support for 802.1p and DSCP markings, you can extend the per-tenant QoS into the physical network, giving cloud providers the ability to offer differentiated IaaS services.
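For illustration, per-tenant marking of the outer IP header could look something like this; the tenant classes and DSCP values are my assumptions, not Cisco’s configuration model:

```python
# Hypothetical tenant classes mapped to standard DSCP values
# (EF, AF31, best effort); an 802.1p CoS value could be derived
# the same way for the outer 802.1Q tag.
TENANT_DSCP = {"gold": 46, "silver": 26, "bronze": 0}

def outer_tos_byte(tenant: str) -> int:
    # DSCP occupies the top 6 bits of the outer IP header's ToS byte,
    # so the physical switches can apply the same per-tenant policy
    # to the encapsulated traffic.
    dscp = TENANT_DSCP.get(tenant, 0)
    return dscp << 2
```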
More good news: the new encapsulation is fully integrated with vCloud Director. Finally we’ll be able to roll out scalable vCloud Director-based networks.
Even more good news: goodbye, large-scale bridging and EVB, we don’t need you for VM mobility anymore; we can go back to the time-tested large-scale IP+multicast designs that have kept the Internet running for the last few decades.
However, all is not rosy in the vCloud land. Cisco has implemented scalable virtual layer-2 segments, but communication between segments still requires multi-NIC VMs (like vShield Edge) and traverses userland, traffic trombones still wind their way around the data center, and you cannot terminate the virtual segments on physical switches or tie them to physical VLANs.
Even with the remaining drawbacks, the MAC-in-IP encapsulation is way better than the VLANs or MAC-in-MAC encapsulation we’ve had so far, and I’m positive Cisco will eventually take the next logical steps.
More information
If you're new to virtual networking, you might want to start with the Introduction to Virtualized Networking webinar.
You’ll find an in-depth description of VMware networking in my VMware Networking Deep Dive webinar. Data center architectures and the basics of virtual networking are also described in Data Center 3.0 for Networking Engineers.
All three webinars are available as part of the yearly subscription.
From https://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00:
"It is recommended that the source port be a hash of the inner Ethernet frame's headers to obtain a level of entropy for ECMP/load balancing of the VM to VM traffic across the VXLAN overlay."
> cannot terminate the virtual segments on physical switches
"One deployment scenario is where the tunnel termination point is a physical server which understands VXLAN. Another scenario is where nodes on a VXLAN overlay network need to communicate with nodes on legacy networks which could be VLAN based. These nodes may be physical nodes or virtual machines. To enable this communication, a network can include VXLAN gateways (see Figure 3 below with a switch acting as a VXLAN gateway) which forward traffic between VXLAN and non-VXLAN environments."
Some exciting developments, indeed! The list of authors on the draft is also quite telling. ;)
Here is a comparable example: imagine that MS NLB had been re-implemented using IP multicast, where a client’s IP packet destined for a VIP is encapsulated into a tunnel with a multicast destination IP address and sprayed to all members of an HA cluster. Effectively, this allows stretching NLB over an IP network, but would it make NLB more scalable or easier to troubleshoot? :)