VXLAN: MAC-over-IP-based vCloud networking

In one of my vCloud Director Networking Infrastructure rants I wrote “if they had decided to use IP encapsulation, I would have applauded.” It’s time to applaud: Cisco has just demonstrated Nexus 1000V supporting MAC-over-IP encapsulation for vCloud Director isolated networks at VMworld, solving at least some of the scalability problems MAC-in-MAC encapsulation has.

Once the new release becomes available, the Nexus 1000V VEM will be able to encapsulate MAC frames generated by virtual machines residing in isolated segments into UDP packets exchanged between VEMs.

The MAC-over-IP encapsulation seems to be based on the VXLAN draft (released just a few days ago). The VXLAN packet header includes a 24-bit segment ID, allowing you to create 16 million virtual segments. With pseudo-random source UDP ports (probably hash-generated from the original MAC frame), 5-tuple-based load balancing gives you very good traffic distribution between the Nexus 1000V VEM and the physical switch while still preserving inter-VM packet order.
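To make the header and the source-port trick concrete, here's a minimal Python sketch. The 8-byte header layout (flags, reserved bits, 24-bit segment ID, reserved byte) follows the VXLAN draft; everything else is an assumption for illustration only: the function names are hypothetical, CRC32 is just a stand-in for whatever hash a real VEM would use, and the choice of the dynamic port range is mine.

```python
import struct
import zlib

VXLAN_FLAG_ID_VALID = 0x08  # "I" flag: the 24-bit segment ID field is valid


def vxlan_header(segment_id: int) -> bytes:
    """Build the 8-byte VXLAN header: flags, 24 reserved bits, 24-bit ID, 8 reserved bits."""
    if not 0 <= segment_id < 2 ** 24:
        raise ValueError("segment ID must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_ID_VALID << 24, segment_id << 8)


def source_port(inner_frame: bytes) -> int:
    """Derive a UDP source port from the inner Ethernet header (hypothetical hash)."""
    ethernet_header = inner_frame[:14]  # destination MAC, source MAC, EtherType
    # Assumption: CRC32 stands in for the real hash; ports stay in 49152-65535.
    return 49152 + zlib.crc32(ethernet_header) % 16384


# Example inner frame: destination MAC, source MAC, EtherType 0x0800, payload.
frame = bytes.fromhex("005056aabbcc" "005056ddeeff" "0800") + b"payload"
header = vxlan_header(5000)
port = source_port(frame)
```

Because the port depends only on the inner Ethernet header, every frame of one VM-to-VM conversation gets the same outer 5-tuple and stays on one path (which is what preserves packet order), while different VM pairs spread across the available links.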

IP multicast is used to handle layer-2 flooding (broadcasts, multicasts and unknown unicasts). Support for layer-2 flooding allows everyone involved to pretend they’re still dealing with a traditional L2 broadcast domain (and use dynamic MAC learning); not an ideal solution (I would like to see Amazon-like prohibition of flooding with ARP caching) but still much better than what vCDNI offers today. If a VM running in a MAC-over-IP virtual segment goes bonkers, the damage will be limited to the ESX servers hosting VMs in the same virtual segment and the multicast path between them; with MAC-in-MAC encapsulation, the whole data center is affected.
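The flood-and-learn behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SegmentTable` class, its method names, and the one-multicast-group-per-segment mapping are hypothetical and not taken from the Nexus 1000V implementation.

```python
import ipaddress


class SegmentTable:
    """MAC table one VEM might keep for a single virtual segment (hypothetical)."""

    def __init__(self, multicast_group: str):
        group = ipaddress.ip_address(multicast_group)
        if not group.is_multicast:
            raise ValueError("flooding group must be an IP multicast address")
        self.multicast_group = str(group)
        self.mac_to_vem = {}  # inner MAC address -> IP address of the remote VEM

    def learn(self, inner_src_mac: str, outer_src_ip: str) -> None:
        """Dynamic MAC learning from a received encapsulated frame."""
        self.mac_to_vem[inner_src_mac.lower()] = outer_src_ip

    def outer_destination(self, inner_dst_mac: str) -> str:
        """Pick the outer destination IP address for a frame sent by a local VM."""
        mac = inner_dst_mac.lower()
        first_octet = int(mac.split(":")[0], 16)
        if first_octet & 1:  # group bit set: broadcast or multicast
            return self.multicast_group
        # Unknown unicast is flooded as well; known unicast goes to a single VEM.
        return self.mac_to_vem.get(mac, self.multicast_group)


table = SegmentTable("239.1.1.1")
flooded = table.outer_destination("00:50:56:AA:BB:CC")   # not learned yet
table.learn("00:50:56:aa:bb:cc", "10.0.0.2")
unicast = table.outer_destination("00:50:56:AA:BB:CC")   # learned
```

Only the hosts that joined the segment's multicast group receive the flooded copies, which is why a misbehaving VM hurts its own segment and not the whole data center.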

As one would expect from a Nexus-based product, the new Nexus 1000V has a decent range of QoS features, allowing you to define per-tenant SLAs. With full support for 802.1p and DSCP markings, you can extend the per-tenant QoS into the physical network, giving cloud providers the ability to offer differentiated IaaS services.
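As a rough illustration of carrying a marking into the physical network, the sketch below maps an 802.1p priority to the matching DSCP class selector and computes the resulting ToS byte for the outer IP header. The mapping is an assumption (a common convention, not something confirmed for the Nexus 1000V), and both function names are hypothetical.

```python
def dscp_from_pcp(pcp: int) -> int:
    """Map an 802.1p priority (0-7) to the DSCP class selector CS0-CS7 (assumed mapping)."""
    if not 0 <= pcp <= 7:
        raise ValueError("802.1p priority must be between 0 and 7")
    return pcp << 3  # class selector CSn has the DSCP value n * 8


def tos_byte(dscp: int) -> int:
    """Place a 6-bit DSCP value into the IP ToS byte (the two low-order bits are ECN)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2


outer_tos = tos_byte(dscp_from_pcp(5))  # priority 5 -> CS5 (DSCP 40) -> ToS 0xA0
```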

More good news: the new encapsulation is fully integrated with vCloud Director. Finally we’ll be able to roll out scalable vCloud Director-based networks.

Even more good news: goodbye, large-scale bridging and EVB, we don’t need you for VM mobility anymore; we can go back to the time-tested large-scale IP+multicast designs that have kept the Internet running for the last few decades.

However, all is not rosy in the vCloud land. Cisco has implemented scalable virtual layer 2 segments, but the communication between segments still requires multi-NIC VMs (like vShield Edge) and traverses the userland, the traffic trombones still wind their way around the data center, and you cannot terminate the virtual segments on physical switches or tie them to physical VLANs.

Even with the remaining drawbacks, the MAC-over-IP encapsulation is way better than the VLANs or MAC-in-MAC encapsulation we’ve had so far, and I’m positive Cisco will eventually take the next logical steps.

More information

If you're new to virtual networking, you might want to start with the Introduction to Virtualized Networking webinar (register).

You’ll find an in-depth description of VMware networking and the (currently shipping) Nexus 1000V in my VMware Networking Deep Dive (recording or live session) webinar. Data center architectures and the basics of virtual networking are also described in Data Center 3.0 for Networking Engineers (recording).

All three webinars are available as part of the yearly subscription.

7 comments:

  1. Dmitri Kalintsev, 31 August, 2011 00:00

    > (probably hash-generated based on original MAC frame)

    https://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00 :

    "It is recommended that the source port be a hash of the inner Ethernet frame's headers to obtain a level of entropy for ECMP/load balancing of the VM to VM traffic across the VXLAN overlay."

    > cannot terminate the virtual segments on physical switches

    "One deployment scenario is where the tunnel termination point is a physical server which understands VXLAN. Another scenario is where nodes on a VXLAN overlay network need to communicate with nodes on legacy networks which could be VLAN based. These nodes may be physical nodes or virtual machines. To enable this communication, a network can include VXLAN gateways (see Figure 3 below with a switch acting as a VXLAN gateway) which forward traffic between VXLAN and non-VXLAN environments."

    Some exciting developments, indeed! List of authors on the draft is also quite telling. ;)

  2. Duh, still a weak solution :) Is that all the industry has to offer us today?! What they've done is just slightly leveraged softswitching and pushed the tunneling mesh directly to the VM edge. However, dynamic learning is still there and flooding still occurs; none of the core scalability problems have been properly addressed. I mean, having the tunneling rooted at the VM layer would allow the switch to directly know all VM MACs and handle broadcast messages in a distributed-directory fashion. There have been numerous proposals to address that problem, but for some reason the industry seems to ignore them :)

    Here is a comparable example: imagine that MS NLB has been re-implemented using IP multicast, where a client's IP packet destined to a VIP is encapsulated into a tunnel with a multicast destination IP address and sprayed to all members of an HA cluster. Effectively, this allows stretching NLB over an IP network, but would it make NLB more scalable or easier to troubleshoot? :)

  3. Ivan Pepelnjak, 31 August, 2011 07:08

    Absolutely agree. While it's better than what we had before, it's not even close to what Amazon EC2 does, and disappointing in its near-sightedness. However, any alternative more to our liking would require some L3 awareness in NX1K and it seems that for some L3 is a monster best avoided.

  4. Ivan Pepelnjak, 31 August, 2011 07:09

    Yeah, I know you _can_ terminate VXLAN on physical devices (in principle), but you _can't_ do it today or any time soon.

  5. Dmitri Kalintsev, 31 August, 2011 08:30

    Considering that this requirement is quite important (i.e. without it the functionality is not very useful), I was hoping it would be addressed soon... But then again, only time will tell.

  6. Ivan Pepelnjak, 31 August, 2011 09:06

    Don't count on that happening soon. They've solved the immediate problem (isolated IaaS networks) and already have a "solution" for your other problem (vShield Edge or any other VM-based L3 device).

  7. Great article... Thanks for taking the time to give your opinion about this, imo, exciting new technology...



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.