Decouple virtual networking from the physical world

Isn’t it amazing that we can build the Internet, run the same web-based application on thousands of servers, and give millions of people access to cloud services … and yet we stumble badly every time we design virtual networks? No surprise: by trying to keep vSwitches simple (and their R&D and support costs low), the virtualization vendors violate one of the basic scalability principles: complexity belongs at the network edge.

VLAN-based solutions

The simplest possible virtual networking technology (802.1Q-based VLANs) is also the least scalable, because of its tight coupling between the virtual networking (and VMs) and the physical world.

VLAN-based virtual networking uses bridging (which doesn’t scale) and 12-bit VLAN tags (limiting you to at most 4094 virtual segments), and expects all switches to know the MAC addresses of all VMs. You’ll get localized unknown unicast flooding if a ToR switch experiences MAC address table overflow, and massive core flooding if the same thing happens to a core switch.
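The 12-bit limit falls straight out of the 802.1Q tag format. Here’s a minimal sketch (not tied to any particular switch implementation) that packs and parses a .1Q tag, showing where the 4094-segment ceiling comes from:

```python
import struct

TPID_DOT1Q = 0x8100  # 802.1Q tag protocol identifier

def build_dot1q_tag(vlan_id: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Pack a 4-byte 802.1Q tag: 16-bit TPID, then 3-bit PCP,
    1-bit DEI and a 12-bit VLAN ID in the TCI field."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (pcp << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_DOT1Q, tci)

def parse_vlan_id(tag: bytes) -> int:
    """Extract the 12-bit VLAN ID from a 4-byte 802.1Q tag."""
    _tpid, tci = struct.unpack("!HH", tag)
    return tci & 0xFFF

# VID 0 (priority-tagged frame) and VID 4095 are reserved,
# leaving 4094 usable virtual segments
usable_segments = 2**12 - 2
```

No amount of clever provisioning changes that ceiling; it’s baked into the frame format.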

In its simplest incarnation (every VLAN enabled on every server port on ToR switches), the VLAN-based virtual networking also causes massive flooding proportional to the total number of VMs in the network.

VM-aware networking scales better (depending on the number of VLANs you have and the number of VMs in each VLAN). The core switches still need to know all VM MAC addresses, but at least the dynamic VLAN changes on the server-facing ports limit the amount of flooding on the switch-to-server links; flooding becomes proportional to the number of VLANs active in a particular hypervisor host, and the number of VMs in those VLANs.
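A quick back-of-envelope model (with hypothetical VM and VLAN counts, just to illustrate the proportionality argument above) shows how much flooding a server-facing link receives in the “every VLAN on every port” design versus VM-aware networking:

```python
def flooded_frame_sources(vms_per_vlan: dict, vlans_on_link: set) -> int:
    """Number of VMs whose broadcasts/unknown unicasts can reach a
    server-facing link: the sum of VM counts over all VLANs enabled
    on that link."""
    return sum(vms_per_vlan[v] for v in vlans_on_link)

# hypothetical fabric: 100 VLANs with 40 VMs each
vms_per_vlan = {vlan: 40 for vlan in range(1, 101)}

# every VLAN trunked to every server port:
# flooding is proportional to the total number of VMs
naive = flooded_frame_sources(vms_per_vlan, set(vms_per_vlan))

# VM-aware networking: only the (say) 5 VLANs actually active
# in this hypervisor host are enabled on its uplink
vm_aware = flooded_frame_sources(vms_per_vlan, {1, 2, 3, 4, 5})
```

With these numbers the naive design exposes the link to flooding from all 4,000 VMs, while dynamic VLAN pruning cuts that to the 200 VMs in the five locally-active VLANs.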

Other bridging-based solutions

vCDNI is the first solution that decouples at least one of the aspects of the virtual networks from the physical world. It uses MAC-in-MAC encapsulation and thus hides the VM MAC addresses from the network core. vCDNI also removes VLAN limitations, but causes massive flooding due to its suboptimal implementation – the amount of flooding is yet again proportional to the total number of VMs in the vCDNI domain.
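The decoupling trick in any MAC-in-MAC scheme is the same: the core only ever sees the outer (backbone) MAC header. A simplified sketch, loosely modeled on the 802.1ah I-TAG (real PBB also carries a B-TAG, omitted here; vCDNI uses its own proprietary framing):

```python
import struct

ETYPE_ITAG = 0x88E7  # EtherType of the 802.1ah I-TAG

def mac_in_mac(outer_dst: bytes, outer_src: bytes,
               isid: int, inner_frame: bytes) -> bytes:
    """Wrap a customer frame (including its VM MAC addresses) behind a
    backbone MAC header plus a 24-bit service instance ID (I-SID).
    Simplified: flag bits in the I-TAG are left zero."""
    if len(outer_dst) != 6 or len(outer_src) != 6:
        raise ValueError("MAC addresses are 6 bytes")
    if not 0 <= isid <= 0xFFFFFF:
        raise ValueError("I-SID must fit in 24 bits")
    itag = struct.pack("!HI", ETYPE_ITAG, isid)
    return outer_dst + outer_src + itag + inner_frame
```

Core switches learn only the outer source MACs (one per edge device), while the inner VM MAC addresses ride along as opaque payload, and the 24-bit I-SID lifts the 4094-segment VLAN limit.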

Provider Backbone Bridging (PBB) or VPLS implemented in ToR switches fare better. The core network needs to know the MAC addresses (or IP loopbacks) of the ToR switches; all the other virtual networking details are hidden.

Major showstopper: dynamic provisioning of such a network is a major pain; I’m not aware of any commercial solution that would dynamically create VPLS instances (or PBB SIDs) in ToR switches based on VLAN changes in the hypervisor hosts ... and the dynamic adaptation to VLAN changes is a must if you want the network to scale.

While PBB or VPLS solves the core network address table issues, the MAC address table size in ToR switches cannot be reduced without dynamic VPLS/PBB instance creation. If you configure all VLANs on all ToR switches, the ToR switches have to store the MAC addresses of all VMs in the network (or risk unicast flooding once the MAC address table starts thrashing).

MAC-over-IP solutions

The only proper way to decouple virtual and physical networks is to treat virtual networking like yet another application (like VoIP, iSCSI or any other “infrastructure” application). Virtual switches that can encapsulate L2 or L3 payloads in UDP (VXLAN) or GRE (NVGRE/Open vSwitch) envelopes appear as IP hosts to the network; you can use the time-tested large-scale network design techniques to build truly scalable data center networks.
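To see how little the transport network has to know, here’s a minimal sketch of VXLAN encapsulation (header layout per the VXLAN spec; everything else about the deployment is hypothetical). The 8-byte VXLAN header plus inner Ethernet frame becomes the payload of an ordinary UDP datagram:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header: flags word with the I bit set
    (0x08 in the first byte), then the 24-bit VNI followed by a
    reserved byte. The result travels inside a plain UDP datagram,
    so the physical network sees nothing but IP/UDP between hosts."""
    if not 0 <= vni <= 0xFFFFFF:
        raise ValueError("VNI must fit in 24 bits")
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame

def vxlan_decap(payload: bytes) -> tuple:
    """Return (VNI, inner Ethernet frame) from a VXLAN UDP payload."""
    _flags, word2 = struct.unpack("!II", payload[:8])
    return word2 >> 8, payload[8:]
```

The 24-bit VNI gives you around 16 million virtual segments instead of 4094 VLANs, and from the transport network’s perspective the hypervisor is just another UDP-speaking IP host.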

However, MAC-over-IP encapsulation might not bring you to seventh heaven. VXLAN does not have a control plane and thus has to rely on IP multicast to perform flooding of virtual MAC frames. All hypervisor hosts using VXLAN have to join VXLAN-specific IP multicast groups, creating lots of (S,G) and (*,G) entries in the core network. The virtual network data plane is thus fully decoupled from the physical network, but the control plane isn’t.
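The control-plane coupling is easy to quantify. Assume (hypothetically; the mapping scheme is a deployment choice, not mandated by the spec) one multicast group per VXLAN segment, allocated sequentially from an administratively-scoped range:

```python
import ipaddress

def vni_to_mcast_group(vni: int, base: str = "239.1.0.0") -> str:
    """One possible VNI-to-group mapping: offset the VNI into an
    administratively-scoped (239/8) multicast range. Purely
    illustrative; real deployments pick their own scheme."""
    return str(ipaddress.IPv4Address(base) + vni)

# every segment in use becomes at least one (*,G) entry that the
# physical core's multicast routing tables have to carry
active_vnis = range(1, 5001)
core_mcast_entries = len({vni_to_mcast_group(v) for v in active_vnis})
```

Five thousand active segments means five thousand (*,G) entries in the core, plus (S,G) state per sending hypervisor. That state lives in the physical switches, which is exactly the coupling we were trying to avoid.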

A truly scalable virtual networking solution would require no involvement from the transport IP network. Hypervisor hosts would appear as simple IP hosts to the transport network, and use only unicast IP traffic to exchange virtual network payloads; such a virtual network would use the same transport mechanisms as today’s Internet-based applications and could thus run across huge transport networks. I’m positive Amazon has such a solution, and it seems Nicira’s Network Virtualization Platform is another one (but I’ll believe that when I see it).
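One way such a unicast-only design can handle flooding is head-end replication: the sending hypervisor makes one unicast copy per remote host that actually has VMs in the segment. A minimal sketch, assuming a controller-populated directory (the directory mechanism is hypothetical, not any specific vendor’s protocol):

```python
def head_end_replicate(frame: bytes, segment_id: int,
                       host_directory: dict) -> list:
    """Flood a frame over unicast only: emit one (destination-IP, frame)
    pair per remote hypervisor hosting VMs in this segment. The transport
    network sees nothing but unicast IP packets; the host_directory would
    be maintained by an out-of-band controller."""
    return [(ip, frame) for ip in host_directory.get(segment_id, [])]

# hypothetical directory: segment 42 spans three remote hypervisors
directory = {42: ["10.0.1.5", "10.0.2.7", "10.0.3.9"]}
copies = head_end_replicate(b"\xff" * 6 + b"broadcast...", 42, directory)
```

The trade-off is obvious: the sender transmits N copies instead of one multicast packet, but the core network keeps zero per-segment state, which is what makes the design scale across arbitrary transport networks.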

More information

All the diagrams in this post were taken from the Cloud Computing Networking – Under the Hood webinar (register for the live session) that focuses on virtual networking scalability.

You might also want to check Introduction to Virtual Networking and VMware Networking Deep Dive recordings ... and don’t forget to listen to the Packet Pushers Podcast #71.

Finally, if you’d like to have a second opinion on the scalability of your design (or any other similar topic), consider the ExpertExpress IaaS.

11 comments:

  1. Great post Ivan. What methods are available to provide scalable L3 isolation between the virtual L2 networks? MPLS/VPN and VPLS would seem reasonable though obviously not an available solution for vswitch at this point, not to mention cost of hardware goes up, choice of vendors goes down, and I don't believe there are any high port density 10G ToR switches available with the necessary protocol support. What other scalable options are there?

    ReplyDelete
  2. I'm actually not convinced Amazon does something like that, at least for the majority of their VMs. Remember, they don't support any VM or IP mobility. Their solution to maintenance on a host box is just to reboot or kill all the VMs running on the box. As a result, they can build an incredibly simple, aggregatable, entirely L3-routing based network. I wouldn't be surprised if the routing begins on the host itself, using proxy-ARP to get the packets.

    It's true that Virtual Private Cloud lets you bring your own addresses and has more flexibility in assigning them, but even there I don't believe they give you an L2 domain with the usual broadcast/multicast semantics -- it's just a bunch of machines with IPs from the same subnet. They probably use IP-in-IP or IP-in-MPLS to handle it; I strongly doubt ethernet headers get beyond the virtualization host.

    ReplyDelete
  3. Ivan, nice write up. I wouldn't mind hearing your thoughts and speculation on using CAPWAP between virtual switches. Cough cough Nicira. Maybe it will help with de-coupling the control plane reliance from the physical network, but not so helpful with VLAN scale?

    ReplyDelete
  4. Amazon runs IP-over-IP with L3 switch in the hypervisor (or something equivalent). Follow the link in the blog post for more details.

    ReplyDelete
  5. I don't think Nicira uses more than CAPWAP encapsulation in the Open vSwitch. They seem to be relying exclusively on P2P inter-hypervisor tunnels and use whatever encapsulation comes handy, be it GRE, CAPWAP or VXLAN.

    ReplyDelete
  6. As long as you stay in the L2 domain (be it VLANs, VXLAN-based virtual networks or anything else), the L3 isolation is automatic. Once you cross L3 boundaries and want to keep L3 path separation, MPLS/VPN is the only widely-available scalable technology (MultiVRF doesn't scale).

    ReplyDelete
  7. I am not sure I understand why using Core Multicast for L2 flooding is a bad thing, that needs to be avoided? Any modern Core IP platform is designed to handle large-scale Multicast forwarding, and majority of Enterprise datacenter Core networks are already Multicast enabled.

    Why invent something new and kludgy (like full mesh of hypervisor tunnels), when the most efficient IP-based solution has already been invented and proven in real networks?

    Not to mention, Unicast-based flooding will be inherently less efficient than Multicast-based. Think about it - a hypervisor that needs to flood a frame to 10 other hypervisors needs to send that frame to the IP Core (to the Multicast group for that L2 domain) just once, as opposed to forwarding that frame 10 times via Unicast across 10 tunnels. Multicast was invented for this.

    ReplyDelete
  8. Arista: 4000 multicast entries per linecard
    Nexus 7000: 32000 MC entries, 15000 in vPC environment
    Nexus 5548: 2000 (verified), 4000 (maximum)
    QFabric: No info. OOPS?

    Not that much if you want to have an IP MC group per VNI. A blog post coming in early January (it's already on the to-write list).

    ReplyDelete
  9. The latter example of MAC encapsulation struck me as an interesting place to apply some form of SDN. Instead of applying solutions like OpenFlow to the transport network, VM vendors could implement (ideally) open SDN APIs in the hypervisor networking, separating physical networking from virtual networking.

    Though I'm neither an SDN nor a virtualisation guy, so this is likely nothing new.

    ReplyDelete
  10. ... this is exactly how I understand Open vSwitch/Nicira controller to work. OVS already supports OpenFlow API.

    ReplyDelete
  11. Heh, I'll be exercising my caveat now.

    ReplyDelete


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.