Decouple virtual networking from the physical world

Isn’t it amazing that we can build the Internet, run the same web-based application on thousands of servers, and give millions of people access to cloud services … and yet we stumble badly every time we design virtual networks? No surprise: by trying to keep vSwitches simple (and their R&D and support costs low), the virtualization vendors violate one of the basic scalability principles: complexity belongs at the network edge.

VLAN-based solutions

The simplest possible virtual networking technology (802.1Q-based VLANs) is also the least scalable, because of its tight coupling between the virtual networking (and VMs) and the physical world.

VLAN-based virtual networking uses bridging (which doesn’t scale) and 12-bit VLAN tags (limiting you to at most 4094 virtual segments), and expects all switches to know the MAC addresses of all VMs. You’ll get localized unknown unicast flooding if a ToR switch experiences MAC address table overflow, and massive core flooding if the same thing happens to a core switch.
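The 12-bit limit falls straight out of the 802.1Q tag format. Here’s a minimal sketch (not tied to any particular switch implementation) that packs and parses a .1Q tag, showing where the 4094-segment ceiling comes from:

```python
import struct

TPID_DOT1Q = 0x8100  # 802.1Q tag protocol identifier

def build_dot1q_tag(vlan_id: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Pack a 4-byte 802.1Q tag: 16-bit TPID, then 3-bit PCP,
    1-bit DEI and a 12-bit VLAN ID in the TCI field."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (pcp << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_DOT1Q, tci)

def parse_vlan_id(tag: bytes) -> int:
    """Extract the 12-bit VLAN ID from a 4-byte 802.1Q tag."""
    _tpid, tci = struct.unpack("!HH", tag)
    return tci & 0xFFF

# VID 0 (priority-tagged frame) and VID 4095 are reserved,
# leaving 4094 usable virtual segments
usable_segments = 2**12 - 2
```

No amount of clever provisioning changes that ceiling; it’s baked into the frame format.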

In its simplest incarnation (every VLAN enabled on every server port on ToR switches), the VLAN-based virtual networking also causes massive flooding proportional to the total number of VMs in the network.

VM-aware networking scales better (depending on the number of VLANs you have and the number of VMs in each VLAN). The core switches still need to know all VM MAC addresses, but at least the dynamic VLAN changes on the server-facing ports limit the amount of flooding on the switch-to-server links; flooding becomes proportional to the number of VLANs active in a particular hypervisor host, and the number of VMs in those VLANs.
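A quick back-of-envelope model (with hypothetical VM and VLAN counts, just to illustrate the proportionality argument above) shows how much flooding a server-facing link receives in the “every VLAN on every port” design versus VM-aware networking:

```python
def flooded_frame_sources(vms_per_vlan: dict, vlans_on_link: set) -> int:
    """Number of VMs whose broadcasts/unknown unicasts can reach a
    server-facing link: the sum of VM counts over all VLANs enabled
    on that link."""
    return sum(vms_per_vlan[v] for v in vlans_on_link)

# hypothetical fabric: 100 VLANs with 40 VMs each
vms_per_vlan = {vlan: 40 for vlan in range(1, 101)}

# every VLAN trunked to every server port:
# flooding is proportional to the total number of VMs
naive = flooded_frame_sources(vms_per_vlan, set(vms_per_vlan))

# VM-aware networking: only the (say) 5 VLANs actually active
# in this hypervisor host are enabled on its uplink
vm_aware = flooded_frame_sources(vms_per_vlan, {1, 2, 3, 4, 5})
```

With these numbers the naive design exposes the link to flooding from all 4,000 VMs, while dynamic VLAN pruning cuts that to the 200 VMs in the five locally-active VLANs.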

Other bridging-based solutions

vCDNI is the first solution that decouples at least one of the aspects of the virtual networks from the physical world. It uses MAC-in-MAC encapsulation and thus hides the VM MAC addresses from the network core. vCDNI also removes VLAN limitations, but causes massive flooding due to its suboptimal implementation – the amount of flooding is yet again proportional to the total number of VMs in the vCDNI domain.
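The decoupling trick in any MAC-in-MAC scheme is the same: the core only ever sees the outer (backbone) MAC header. A simplified sketch, loosely modeled on the 802.1ah I-TAG (real PBB also carries a B-TAG, omitted here; vCDNI uses its own proprietary framing):

```python
import struct

ETYPE_ITAG = 0x88E7  # EtherType of the 802.1ah I-TAG

def mac_in_mac(outer_dst: bytes, outer_src: bytes,
               isid: int, inner_frame: bytes) -> bytes:
    """Wrap a customer frame (including its VM MAC addresses) behind a
    backbone MAC header plus a 24-bit service instance ID (I-SID).
    Simplified: flag bits in the I-TAG are left zero."""
    if len(outer_dst) != 6 or len(outer_src) != 6:
        raise ValueError("MAC addresses are 6 bytes")
    if not 0 <= isid <= 0xFFFFFF:
        raise ValueError("I-SID must fit in 24 bits")
    itag = struct.pack("!HI", ETYPE_ITAG, isid)
    return outer_dst + outer_src + itag + inner_frame
```

Core switches learn only the outer source MACs (one per edge device), while the inner VM MAC addresses ride along as opaque payload, and the 24-bit I-SID lifts the 4094-segment VLAN limit.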

Provider Backbone Bridging (PBB) or VPLS implemented in ToR switches fare better. The core network needs to know the MAC addresses (or IP loopbacks) of the ToR switches; all the other virtual networking details are hidden.

Major showstopper: dynamic provisioning of such a network is a major pain; I’m not aware of any commercial solution that would dynamically create VPLS instances (or PBB SIDs) in ToR switches based on VLAN changes in the hypervisor hosts ... and the dynamic adaptation to VLAN changes is a must if you want the network to scale.

While PBB or VPLS solves the core network address table issues, the MAC address table size in ToR switches cannot be reduced without dynamic VPLS/PBB instance creation. If you configure all VLANs on all ToR switches, the ToR switches have to store the MAC addresses of all VMs in the network (or risk unicast flooding once the MAC address table starts thrashing).

MAC-over-IP solutions

The only proper way to decouple virtual and physical networks is to treat virtual networking like yet another application (like VoIP, iSCSI or any other “infrastructure” application). Virtual switches that can encapsulate L2 or L3 payloads in UDP (VXLAN) or GRE (NVGRE/Open vSwitch) envelopes appear as IP hosts to the network; you can use the time-tested large-scale network design techniques to build truly scalable data center networks.
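To see how little the transport network has to know, here’s a minimal sketch of VXLAN encapsulation (header layout per the VXLAN spec; everything else about the deployment is hypothetical). The 8-byte VXLAN header plus inner Ethernet frame becomes the payload of an ordinary UDP datagram:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header: flags word with the I bit set
    (0x08 in the first byte), then the 24-bit VNI followed by a
    reserved byte. The result travels inside a plain UDP datagram,
    so the physical network sees nothing but IP/UDP between hosts."""
    if not 0 <= vni <= 0xFFFFFF:
        raise ValueError("VNI must fit in 24 bits")
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame

def vxlan_decap(payload: bytes) -> tuple:
    """Return (VNI, inner Ethernet frame) from a VXLAN UDP payload."""
    _flags, word2 = struct.unpack("!II", payload[:8])
    return word2 >> 8, payload[8:]
```

The 24-bit VNI gives you around 16 million virtual segments instead of 4094 VLANs, and from the transport network’s perspective the hypervisor is just another UDP-speaking IP host.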

However, MAC-over-IP encapsulation might not bring you to seventh heaven. VXLAN does not have a control plane and thus has to rely on IP multicast to perform flooding of virtual MAC frames. All hypervisor hosts using VXLAN have to join VXLAN-specific IP multicast groups, creating lots of (S,G) and (*,G) entries in the core network. The virtual network data plane is thus fully decoupled from the physical network, but the control plane isn’t.
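The control-plane coupling is easy to quantify. Assume (hypothetically; the mapping scheme is a deployment choice, not mandated by the spec) one multicast group per VXLAN segment, allocated sequentially from an administratively-scoped range:

```python
import ipaddress

def vni_to_mcast_group(vni: int, base: str = "239.1.0.0") -> str:
    """One possible VNI-to-group mapping: offset the VNI into an
    administratively-scoped (239/8) multicast range. Purely
    illustrative; real deployments pick their own scheme."""
    return str(ipaddress.IPv4Address(base) + vni)

# every segment in use becomes at least one (*,G) entry that the
# physical core's multicast routing tables have to carry
active_vnis = range(1, 5001)
core_mcast_entries = len({vni_to_mcast_group(v) for v in active_vnis})
```

Five thousand active segments means five thousand (*,G) entries in the core, plus (S,G) state per sending hypervisor. That state lives in the physical switches, which is exactly the coupling we were trying to avoid.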

A truly scalable virtual networking solution would require no involvement from the transport IP network. Hypervisor hosts would appear as simple IP hosts to the transport network, and use only unicast IP traffic to exchange virtual network payloads; such a virtual network would use the same transport mechanisms as today’s Internet-based applications and could thus run across huge transport networks. I’m positive Amazon has such a solution, and it seems Nicira’s Network Virtualization Platform is another one (but I’ll believe that when I see it).
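One way such a unicast-only design can handle flooding is head-end replication: the sending hypervisor makes one unicast copy per remote host that actually has VMs in the segment. A minimal sketch, assuming a controller-populated directory (the directory mechanism is hypothetical, not any specific vendor’s protocol):

```python
def head_end_replicate(frame: bytes, segment_id: int,
                       host_directory: dict) -> list:
    """Flood a frame over unicast only: emit one (destination-IP, frame)
    pair per remote hypervisor hosting VMs in this segment. The transport
    network sees nothing but unicast IP packets; the host_directory would
    be maintained by an out-of-band controller."""
    return [(ip, frame) for ip in host_directory.get(segment_id, [])]

# hypothetical directory: segment 42 spans three remote hypervisors
directory = {42: ["10.0.1.5", "10.0.2.7", "10.0.3.9"]}
copies = head_end_replicate(b"\xff" * 6 + b"broadcast...", 42, directory)
```

The trade-off is obvious: the sender transmits N copies instead of one multicast packet, but the core network keeps zero per-segment state, which is what makes the design scale across arbitrary transport networks.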

More information

All the diagrams in this post were taken from the Cloud Computing Networking – Under the Hood webinar (register for the live session) that focuses on virtual networking scalability.

You might also want to check Introduction to Virtual Networking and VMware Networking Deep Dive recordings ... and don’t forget to listen to the Packet Pushers Podcast #71.

Finally, if you’d like to have a second opinion on the scalability of your design (or any other similar topic), consider the ExpertExpress IaaS.

11 comments:

  1. Great post Ivan. What methods are available to provide scalable L3 isolation between the virtual L2 networks? MPLS/VPN and VPLS would seem reasonable though obviously not an available solution for vswitch at this point, not to mention cost of hardware goes up, choice of vendors goes down, and I don't believe there are any high port density 10G ToR switches available with the necessary protocol support. What other scalable options are there?

    ReplyDelete
  2. I'm actually not convinced Amazon does something like that, at least for the majority of their VMs. Remember, they don't support any VM or IP mobility. Their solution to maintenance on a host box is just to reboot or kill all the VMs running on the box. As a result, they can build an incredibly simple, aggregatable, entirely L3-routing based network. I wouldn't be surprised if the routing begins on the host itself, using proxy-ARP to get the packets.

    It's true that Virtual Private Cloud lets you bring your own addresses and has more flexibility in assigning them, but even there I don't believe they give you an L2 domain with the usual broadcast/multicast semantics -- it's just a bunch of machines with IPs from the same subnet. They probably use IP-in-IP or IP-in-MPLS to handle it; I strongly doubt ethernet headers get beyond the virtualization host.

    ReplyDelete
  3. Ivan, nice write up. I wouldn't mind hearing your thoughts and speculation on using CAPWAP between virtual switches. Cough cough Nicira. Maybe it will help with de-coupling the control plane reliance from the physical network, but not so helpful with VLAN scale?

    ReplyDelete
  4. Amazon runs IP-over-IP with L3 switch in the hypervisor (or something equivalent). Follow the link in the blog post for more details.

    ReplyDelete
  5. I don't think Nicira uses more than CAPWAP encapsulation in the Open vSwitch. They seem to be relying exclusively on P2P inter-hypervisor tunnels and use whatever encapsulation comes handy, be it GRE, CAPWAP or VXLAN.

    ReplyDelete
  6. As long as you stay in the L2 domain (be it VLANs, VXLAN-based virtual networks or anything else), the L3 isolation is automatic. Once you cross L3 boundaries and want to keep L3 path separation, MPLS/VPN is the only widely-available scalable technology (MultiVRF doesn't scale).

    ReplyDelete
  7. I am not sure I understand why using Core Multicast for L2 flooding is a bad thing, that needs to be avoided? Any modern Core IP platform is designed to handle large-scale Multicast forwarding, and majority of Enterprise datacenter Core networks are already Multicast enabled.

    Why invent something new and kludgy (like full mesh of hypervisor tunnels), when the most efficient IP-based solution has already been invented and proven in real networks?

    Not to mention, Unicast-based flooding will be inherently less efficient than Multicast-based. Think about it - a hypervisor that needs to flood a frame to 10 other hypervisors needs to send that frame to the IP Core (to the Multicast group for that L2 domain) just once, as opposed to forwarding that frame 10 times via Unicast across 10 tunnels. Multicast was invented for this.

    ReplyDelete
  8. Arista: 4000 multicast entries per linecard
    Nexus 7000: 32000 MC entries, 15000 in vPC environment
    Nexus 5548: 2000 (verified), 4000 (maximum)
    QFabric: No info. OOPS?

    Not that much if you want to have an IP MC group per VNI. A blog post coming in early January (it's already on the to-write list).

    ReplyDelete
  9. The latter example of MAC encapsulation struck me as an interesting place to apply some form of SDN. Instead of applying solutions like OpenFlow to the transport network, VM vendors could implement (ideally) open SDN APIs in the hypervisor networking, separating physical networking from virtual networking.

    Though I'm neither an SDN nor a virtualisation guy, so this is likely nothing new.

    ReplyDelete
  10. ... this is exactly how I understand Open vSwitch/Nicira controller to work. OVS already supports OpenFlow API.

    ReplyDelete
  11. Heh, I'll be exercising my caveat now.

    ReplyDelete


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.