Nicira Open vSwitch inside vSphere/ESX

I got intrigued when reading Nicira’s white paper claiming their Open vSwitch can run within the vSphere/ESX hypervisor. There are three APIs you could use to get that job done: the dvFilter API (intercepting the VM NIC like vCDNI does), the API used by Cisco’s Nexus 1000V, or the device driver interface (intercepting uplink traffic). It turns out Nicira decided on a fourth approach, using nothing but publicly available APIs.


The three development APIs one could use

As I wrote in the update to the Nicira Uncloaked post, the cool trick they used relies on a few obscure properties of the Distributed vSwitch (vDS) and statically bound Distributed Ports. Let me show you how it actually works step-by-step (if you don’t want to spoil the magic, stop reading right now)... but before starting the journey, remember where we want to end: we want to have virtual machines connected to Open vSwitch, which uses the transport network (VLAN tags or MAC-over-GRE tunneling) to build virtual networks as dictated by the OpenFlow controller (Nicira’s Network Virtualization Platform – NVP).

This blog post focuses on the intra-vSphere part of the solution. For more details on the "transport" part (which I left cloudy for a reason), read my other OpenFlow/Nicira blog posts, for example What is Nicira really up to and Decouple virtual networking from the physical world. And a short summary for the differently-attentive: the "transport" cloud is almost "NVGRE/VXLAN with a centralized control plane".

Start with a distributed switch (vDS). It seems to span a number of hosts, but that’s just the management-plane perception; in reality, every vSphere host runs an independent forwarding component.

Now imagine you have a vDS (or a port group within a vDS) with no uplinks. It seems to span numerous ESX hosts and you can vMotion VMs between them, but only the VMs within the same host can actually communicate.

Next, start an Open vSwitch-hosting VM in every ESX host and connect it to the isolated port group as well as the outside transport network (another port group). The traffic between VMs connected to the isolated port group and the outside world has to pass through the OVS VM, and since there is no other way for the isolated VMs to reach the outside world, there can be no forwarding loops.
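Here’s a toy Python model of that behavior (all names invented): the uplink-less port group only forwards between ports on the same host, and the local OVS VM, being the only port that also connects to the transport port group, is the sole path to anything outside the host.

    # Toy model of an uplink-less distributed port group: each ESX host has an
    # independent forwarding component and there are no uplinks, so frames are
    # only forwarded between ports that live on the same host.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Port:
        vm: str
        host: str
        is_ovs_vm: bool = False   # the OVS VM also has a leg in the transport port group

    def can_reach_directly(a: Port, b: Port) -> bool:
        return a.host == b.host

    def can_reach_outside(a: Port, ports) -> bool:
        # The only way out of the isolated port group is through the local OVS VM.
        return any(p.is_ovs_vm and p.host == a.host for p in ports)

    ports = [
        Port("web-01", host="esx1"),
        Port("app-01", host="esx1"),
        Port("db-01", host="esx2"),
        Port("ovs-vm-1", host="esx1", is_ovs_vm=True),
    ]
    web, app, db = ports[0], ports[1], ports[2]

    print(can_reach_directly(web, db))    # False: different hosts, no uplink
    print(can_reach_outside(web, ports))  # True: but only through the local OVS VM
    print(can_reach_directly(web, app))   # True: same host -- the problem solved next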

Still, the VMs connected to the same port group within the same host can communicate directly with each other (bypassing the OVS VM), so we need another trick: the per-port properties of statically bound Distributed Ports. If you use a vDS, you can set numerous properties on individual ports (VM NICs), including the access VLAN. Yes, you can run multiple VLANs within a single port group. Mind-boggling; I never realized you could do that.

So this is what you do:

  • For every single VM connected to the port group, use Virtual Switch Tagging and set the access VLAN to a unique value (this does limit the number of VMs you can connect to the same port group to 409x, but that should be more than enough).
  • Configure the port connecting the OVS VM to the isolated port group to Virtual Guest Tagging and allow promiscuous mode.

The OVS VM will receive all traffic generated by the VMs, nicely tagged with per-VM VLAN tags.
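If it helps, here’s a minimal Python sketch of the bookkeeping behind this trick (all names invented): every VM port gets its own access VLAN, VLANs are never reused while a VM is connected (a vMotion could land it on any host sharing the vDS), and you run out at roughly 4K VMs per port group. The real assignment would of course be pushed to the statically bound distributed ports through the vSphere API, not done in a standalone script.

    # Minimal sketch of per-VM access-VLAN bookkeeping on the isolated port group.
    # Illustrative only -- the actual per-port VLAN setting lives in the vDS config.

    VLAN_MIN, VLAN_MAX = 1, 4094          # usable 802.1Q tags, hence the ~409x VM limit

    class VlanAllocator:
        """Hands out one unique access VLAN per VM port across the whole vDS."""

        def __init__(self):
            self._free = list(range(VLAN_MIN, VLAN_MAX + 1))
            self._by_vm = {}

        def attach(self, vm_name):
            # No recycling while the VM is connected: it could be vMotioned to any
            # host sharing the same vDS, taking its statically bound port (and
            # access VLAN) along.
            if not self._free:
                raise RuntimeError("port group is full -- no access VLANs left")
            vlan = self._free.pop(0)
            self._by_vm[vm_name] = vlan
            return vlan

        def detach(self, vm_name):
            self._free.append(self._by_vm.pop(vm_name))

    alloc = VlanAllocator()
    print(alloc.attach("web-01"))   # 1 -- point-to-point VLAN between web-01 and the OVS VM
    print(alloc.attach("db-01"))    # 2 -- every VM gets its own tag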

Finally, let’s take a deeper look inside the OVS VM. It needs three interfaces: a VM-facing interface, a transport interface (where it can use VLAN tags or MAC-over-GRE tunneling to send traffic between OVS switches), and a management interface (over which it communicates with the NVP OpenFlow controller).

The VM-facing interface appears as a physical interface to Linux running inside the VM; you can create VLAN subinterfaces on top of it (one per VM) and connect individual subinterfaces (point-to-point VLAN-tagged links to individual VMs) to OVS ports.
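To make the plumbing inside the OVS VM more tangible, here’s a rough sketch of the Linux/OVS commands it boils down to, generated by a small Python helper. The interface names, VLAN IDs and the transport peer address are made-up examples, and in the real product NVP programs the OVS side through OVSDB/OpenFlow rather than static commands.

    # Sketch of the OVS VM internals: one VLAN subinterface per VM on the
    # VM-facing trunk, all of them plugged into an OVS bridge, plus a GRE
    # tunnel towards another hypervisor across the transport network.

    VM_FACING_NIC = "eth1"                        # trunk towards the isolated port group (VGT)
    PER_VM_VLANS = {"web-01": 101, "db-01": 102}  # per-VM access VLANs set on the vDS ports
    TRANSPORT_PEER = "192.0.2.2"                  # transport address of another OVS instance

    def ovs_vm_setup_commands():
        yield "ovs-vsctl add-br br-int"
        for vm, vlan in sorted(PER_VM_VLANS.items()):
            sub = f"{VM_FACING_NIC}.{vlan}"
            yield f"# point-to-point VLAN-tagged link to {vm}"
            yield f"ip link add link {VM_FACING_NIC} name {sub} type vlan id {vlan}"
            yield f"ip link set {sub} up"
            yield f"ovs-vsctl add-port br-int {sub}"
        yield "# MAC-over-GRE towards the other OVS switches"
        yield ("ovs-vsctl add-port br-int gre0 -- "
               f"set interface gre0 type=gre options:remote_ip={TRANSPORT_PEER}")

    for cmd in ovs_vm_setup_commands():
        print(cmd)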

Does this make sense?

The switch-inside-a-VM solution has two obvious drawbacks: all the traffic has to pass through a VM-based appliance (you can’t push more than a few Gbps through userland), and the OVS VM becomes a single point of failure for every VM on the host.

Does such a kludge make sense? It just might in (at least) three scenarios:

  • It enables a gradual migration from a VMware environment to Xen/KVM/OpenStack.
  • It allows you to connect VMs that have to run on VMware for whatever reason to Xen/OpenStack/Quantum non-VLAN virtual networks (people complaining about VLAN limits in certain data center switches might appreciate this).
  • It makes for a nice test bed. You can test OpenFlow/OVS/NVP without fully committing to a Linux-based hypervisor.

More information

If you’re faced with the question “what is this virtual network stuff all about?” the Introduction to Virtual Networking webinar might give you the answers you need. VMware Networking Deep Dive webinar describes distributed switches, port groups, dvFilter API and virtual appliances; the Cloud Computing Networking one focuses on large-scale virtual networks needed in IaaS clouds. You get immediate access to all three webinars (and a dozen more) with the yearly subscription.

Need help?

If you need a second opinion or a review of your data center design, check out the ExpertExpress service. You could also engage our professional services team; after all, we were the first Cisco-certified Cloud Builders in the eastern hemisphere.

19 comments:

  1. Amazing insights here Ivan.

    You mention a limit of 409x ports in the port group, though I assume that this is a limit per host/OVS? Now for sensible designs 409x per host is more than enough, let alone 409x multiplied by a max of 32 hosts in a cluster, though I can picture some instances where this may be beneficial.

  2. That's the total number of VMs you can connect to the port group (across all hosts with the same vDS). They need per-VM VLAN to create a P2P link between VM and OVS-VM, and you only have 4K VLANs (and you can't recycle them because someone could vMotion a VM to another host).

    Do you have a source for this claim? "(you can’t push more than a few Gbps through userland)." My understanding and experience has been that ESX can push as much as the OS can handle, and easily saturates 10 Gbps with things like vMotion if the physical network can handle it. Obviously, different interfaces and kernels here. I'm just wondering if perhaps you might be underestimating or downplaying the potential capabilities...

  4. In my understanding and according to your previous blog post (http://blog.ioshints.info/2011/06/test-your-vmware-networking-skills.html ) we can't reuse VLANs even across different port groups, because port groups don't provide isolation.

    I don't (yet) have a consistent theory, just anecdotal evidence and a few data points ... plus the fact that every time someone describes a VM-based networking appliance solution to me I ask "and the performance is around a few Gbps" ... and get "yeah" as an answer.

    Two data points I already wrote about:
    http://blog.ioshints.info/2011/11/junipers-virtual-gateway-virtual.html
    http://www.ipspace.net/Embrane_heleos:_scale-out_distributed_virtual_appliance

  6. ... also, please note that the "few Gbps" applies to VMs doing network-layer packet forwarding. Server VMs can easily saturate 10 Gbps uplink without consuming a whole core.

  7. Good one. Absolutely true. You can however reuse them across different vSwitches/vDS (because they are independent bridging domains).

    Summary: create a totally new vDS for Nicira's needs.

    Actually it means that to scale to more than 4K VMs you have to create several vDS. Does it also mean that you have to provision a different OVS VM per vDS on the same ESX host, or can you reuse the same VLANs across different vNIC trunks coming from different vDS to the same OVS VM?

  9. A traditional vSwitch is just as much a SPOF, right? In fact it's worse if it runs inside the VMkernel.

  10. Very Impressive break-down Ivan ;)

  11. Thanks for this clarification, it wasn't until I read this that it clicked about the VLAN usage and p2p to the OVS VM. Originally I was thinking like Kurt if this was per host. But per 32 host cluster/VDS makes sense and does scale pretty well. ~126-7 VMs per host isn't too shabby.

  12. Nicira + Open vSwitch + VMware = DOA (unfortunately)

    This was true until x86 leaders came up with new data-plane architectures. We are a proven example that you can deliver dozens of Gbps with a virtual networking appliance in userland. Also very important: the throughput is independent of the packet size (so consider the pps benchmarks!). We have delivered high-performance SDN for mobile core networks all around the world and are now ramping up in the cloud space...

  14. Sounds absolutely interesting. If you're willing to tell me more, please contact me directly:

    http://www.ipspace.net/Contact

  15. I wish I had 10GbE to the servers in my lab...this would be a dead simple test. Set up a test VM configured as a router and see what we get!

  16. VM userland > dozens Mpps with 2vCPU (L3 forwarding), dozens Gbps with 2vCPU (IPsec). Scales linearly with number of cores assigned, no crypto engine, pure software. 8-) we have a booth at MWC (Hall 2 - 2B122)

  17. Great post! Love the graphics. I labbed up GRE tunnels on a couple OpenVswitch boxes with KVM to test out some V-2-V migrations. Still trying to wrap my head around scale and op management.
    Notes from the setup for anyone needing a primer to test themselves in their environment.
    http://wp.me/p1AOVJ-2O



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.