vCloud Director Network Isolation (vCDNI) scalability

When VMware launched its vCloud Director Networking Infrastructure, Greg Ferro (of Packet Pushers Podcast fame) and I were very skeptical about its scaling capabilities, all the more because it uses MAC-in-MAC encapsulation, and bridging was never known for its scaling properties. However, Greg’s VMware contacts claimed that vCDNI scales to thousands of physical servers, and Greg wanted to do a podcast about it.

As always, we prepared a few questions in advance, including “How are broadcasts, multicasts and unknown unicasts handled in vCDNI-based private networks?” and “What happens when a single VM goes crazy?” For one reason or another, the podcast never happened. After analyzing Wireshark traces of vCDNI traffic, I think I know why.

The MAC-in-MAC encapsulation is proprietary (but we already knew that). It’s still the same encapsulation Lab Manager used years ago. However, Lab Manager was just a lab tool; VMware sells vCDNI as a scalable cloud-enabling platform, yet the only thing they changed was moving the MAC-in-MAC encapsulation from a dedicated VM into a hypervisor kernel module.

Ethernet II frame
    Destination: Akimbi_01:00:11 (00:13:f5:01:00:11)
    Source: Akimbi_01:00:21 (00:13:f5:01:00:21)
    Type: VMware Lab Manager (0x88de)
VMware Lab Manager, Portgroup: 1
    0000 0... = Unknown       : 0x00
    .... .0.. = More Fragments: Not set
    .... ..00 = Unknown       : 0x00
    Portgroup        : 1
    Address          : Vmware_90:33:6a (00:50:56:90:33:6a)
    Destination      : Vmware_90:33:6a (00:50:56:90:33:6a)
    Source           : Vmware_90:30:ab (00:50:56:90:30:ab)
    Encapsulated Type: IP (0x0800)
Internet Protocol, Src: 192.168.1.100, Dst: 192.168.1.101
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
    Total Length: 60
    Identification: 0x01c1 (449)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 128
    Protocol: ICMP (1)
    Header checksum: 0xb4e6 [correct]
    Source: 192.168.1.100 (192.168.1.100)
    Destination: 192.168.1.101 (192.168.1.101)
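
The decode above is enough to model the encapsulation in a few lines. The field widths below (one flags byte, one portgroup byte, 6-byte “address” field) are my reading of the Wireshark output, not a documented wire format; treat this as an illustrative sketch only.

```python
import struct

ETH_P_VMLAB = 0x88DE  # EtherType shown in the trace

def parse_vcdni(frame: bytes) -> dict:
    """Parse a MAC-in-MAC frame matching the decode above.

    Assumed layout (inferred from the Wireshark decode, not documented):
    outer dst (6) | outer src (6) | EtherType 0x88de (2) |
    flags (1) | portgroup (1) | address (6) | inner dst (6) |
    inner src (6) | encapsulated EtherType (2) | inner payload
    """
    outer_dst, outer_src = frame[0:6], frame[6:12]
    etype = struct.unpack("!H", frame[12:14])[0]
    if etype != ETH_P_VMLAB:
        raise ValueError("not a vCDNI/Lab Manager frame")
    flags, portgroup = frame[14], frame[15]
    more_fragments = bool(flags & 0x04)   # the ".... .0.." bit in the decode
    return {
        "outer_dst": outer_dst, "outer_src": outer_src,
        "portgroup": portgroup, "more_fragments": more_fragments,
        "address": frame[16:22],
        "inner_dst": frame[22:28], "inner_src": frame[28:34],
        "inner_etype": struct.unpack("!H", frame[34:36])[0],
        "payload": frame[36:],
    }
```

Note that the inner Ethernet header is carried intact, so the physical switches never see the VM MAC addresses — all MAC learning happens on the Akimbi-OUI outer addresses.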

Broadcasts on internal networks (“protected” with vCDNI) get translated into global broadcasts. This behavior totally destroys scalability. In VLAN-based designs, the number of hosts and VMs affected by a broadcast is limited by the VLAN configuration... unless you stretch VLANs all across the data center (but then you’re asking for trouble).

When you use vCDNI, every single ESX server connected to the same LAN is affected when a single VM goes bonkers.

Ethernet II frame
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
    Source: Akimbi_01:00:21 (00:13:f5:01:00:21)
    Type: VMware Lab Manager (0x88de)
VMware Lab Manager, Portgroup: 1
    0000 0... = Unknown       : 0x00
    .... .0.. = More Fragments: Not set
    .... ..00 = Unknown       : 0x00
    Portgroup        : 1
    Address          : Broadcast (ff:ff:ff:ff:ff:ff)
    Destination      : Broadcast (ff:ff:ff:ff:ff:ff)
    Source           : Vmware_90:30:ab (00:50:56:90:30:ab)
    Encapsulated Type: ARP (0x0806)
    Trailer: 000000000000000000000000000000000000
Address Resolution Protocol (request)
    Hardware type: Ethernet (0x0001)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (0x0001)
    [Is gratuitous: False]
    Sender MAC address: Vmware_90:30:ab (00:50:56:90:30:ab)
    Sender IP address: 192.168.1.100 (192.168.1.100)
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 192.168.1.101 (192.168.1.101)
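
Comparing the unicast and broadcast captures suggests a simple rule for how the outer destination is chosen. The sketch below encodes that inference — both the rule and the `host_lookup` table are my reading of the traces, not documented behavior:

```python
def outer_destination(inner_dst: str, host_lookup: dict) -> str:
    """Guess the outer (physical) destination MAC for a vCDNI frame.

    Inferred from the captures: a unicast inner destination maps to the
    Akimbi-OUI address of the ESX host running the target VM, while any
    group address (broadcast or multicast) is copied verbatim into the
    outer header -- so one chatty VM floods every ESX host on the LAN.
    """
    if int(inner_dst.split(":")[0], 16) & 0x01:   # I/G bit set: group address
        return inner_dst                           # flooded across the whole LAN
    return host_lookup[inner_dst]                  # unicast to the target host
```

In the first trace, the inner unicast to 00:50:56:90:33:6a rode inside an outer frame addressed to 00:13:f5:01:00:11; in the ARP trace above, ff:ff:ff:ff:ff:ff is simply copied into the outer header.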

Multicasts on internal networks are translated into global multicasts. To make matters worse, IGMP snooping can no longer help you, as the IGMP messages sent by VMs get encapsulated into the MAC-in-MAC envelope and are thus never seen by the switches.

Ethernet II frame
    Destination: IPv4mcast_00:00:01 (01:00:5e:00:00:01)
    Source: Akimbi_01:00:21 (00:13:f5:01:00:21)
    Type: VMware Lab Manager (0x88de)
VMware Lab Manager, Portgroup: 1
    0000 0... = Unknown       : 0x00
    .... .0.. = More Fragments: Not set
    .... ..00 = Unknown       : 0x00
    Portgroup        : 1
    Address          : IPv4mcast_00:00:01 (01:00:5e:00:00:01)
    Destination      : IPv4mcast_00:00:01 (01:00:5e:00:00:01)
    Source           : Vmware_90:30:ab (00:50:56:90:30:ab)
    Encapsulated Type: IP (0x0800)
Internet Protocol, Src: 192.168.1.100, Dst: 224.0.0.1
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
    Total Length: 60
    Identification: 0x01ce (462)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 128
    Protocol: ICMP (1)
    Header checksum: 0x96e5 [correct]
    Source: 192.168.1.100 (192.168.1.100)
    Destination: 224.0.0.1 (224.0.0.1)
Internet Control Message Protocol
    Type: 8 (Echo (ping) request)
    Code: 0
    Checksum: 0x325c [correct]
    Identifier: 0x0200
    Sequence number: 6400 (0x1900)
    Sequence number (LE): 25 (0x0019)
    Data (32 bytes)
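
Why IGMP snooping can’t help is easy to model: a snooping switch classifies frames by the outer EtherType, and 0x88de is opaque to it. A deliberately simplified sketch of that classification:

```python
ETH_P_IP = 0x0800      # IPv4
ETH_P_VMLAB = 0x88DE   # vCDNI / Lab Manager encapsulation
IPPROTO_IGMP = 2

def snooper_inspects(outer_ethertype: int, ip_protocol: int = None) -> bool:
    """Simplified IGMP-snooping classifier: the switch only inspects
    IGMP carried directly inside native IPv4 frames."""
    return outer_ethertype == ETH_P_IP and ip_protocol == IPPROTO_IGMP
```

A native IGMP membership report gets inspected; the same report inside a vCDNI envelope arrives as EtherType 0x88de and is flooded like any other unknown multicast.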

It’s totally insecure. Anyone connected to the same LAN as the ESX servers using vCDNI can insert packets into “protected” portgroups. Even a virtual machine connected to a portgroup using the same NIC could do it (assuming you got the VLAN setup wrong).

It’s also not hard to collect the VM MAC addresses and IP addresses from all “protected” internal networks, as you see every single ARP request (from every portgroup) as soon as you’re connected to the server LAN.
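
The reconnaissance described above needs nothing beyond passive sniffing. A sketch of harvesting VM MAC/IP pairs from captured vCDNI-encapsulated ARP requests — the header offsets follow the layout inferred from the Wireshark decodes earlier in this post and are an assumption, not a format reference:

```python
import struct

def harvest_arp(frame: bytes):
    """Return (portgroup, sender MAC, sender IP) if the sniffed frame is a
    vCDNI-encapsulated ARP packet, else None.

    Assumed offsets (inferred from the decodes above): 14-byte outer
    Ethernet header, 22-byte Lab Manager header, then the inner ARP packet.
    """
    if struct.unpack("!H", frame[12:14])[0] != 0x88DE:
        return None                                   # not vCDNI traffic
    portgroup = frame[15]
    if struct.unpack("!H", frame[34:36])[0] != 0x0806:
        return None                                   # inner payload not ARP
    arp = frame[36:]
    sender_mac = arp[8:14]    # ARP sender hardware address
    sender_ip = arp[14:18]    # ARP sender protocol address
    return portgroup, sender_mac, sender_ip
```

Run something like this against a promiscuous-mode capture on the server LAN and you get a tidy inventory of every “protected” portgroup — which is exactly the problem.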

Conclusion: while I believe vCloud Director is a great product, more so for GUI-happy, wizard-craving beginners (it does get boring after you’ve created a few objects, and I can only hope someone will write a good CLI for it), the current vCDNI implementation is no more than a proof of concept that I would never use in a large-scale network.

3 comments:

  1. Ivan brings up pretty valid points on scalability, and people who don't understand networking should not just wave them away.

    The system is automatically creating third-rate networks, that I agree with. They are good for small (<100 hosts) environments; I'd be curious to see what happens in a real production environment at even 0.001 of AMZN's scale for starters. It will go down in flames would be my guess.

  2. Nice job, Ivan. But I still don't understand why they implement cool-"looking" features when the backend behaviour they code matches a setup without layer 2 segmentation.

    I would like to host customers on a flat physical layer 2 network and virtualize a layer 2 network per customer (backed by portgroups, for example). I thought vCDNI was the right solution, but I was curious about the real mechanisms behind it and read your article. Now I'm confused: how can I get virtual layer 2 segmentation without adding new VLANs on my physical switches each time a new customer comes in? For those who are behind an edge gateway it's not really a problem... but we have customers whose VMs sit directly on the public network, and I would like to make the other VMs in the same class C /24 net invisible to them. Thanks!

  3. Thanks for the explanation. But now I'm confused: how can I build virtual layer 2 networking on top of one physical layer 2 network (that is, without adding new VLANs each time I want to add a customer; there would be around a few hundred)?

    Would portgroup backed network pools with vlan-id set + Brocade VDX/VCS switch be a solution?

    Thanks!



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.