Are Overlay Networking Tunnels a Scalability Nightmare?

Every time I mention overlay virtual networking tunnels, someone starts worrying about the scalability of this approach along the lines of “In a data center with hundreds of hosts, do I have an impossibly high number of GRE tunnels in the full mesh? Are there scaling limitations to this approach?”

Not surprisingly, some ToR switch vendors abuse this fear to the point where they look downright stupid (but I guess that’s their privilege), so let’s set the record straight.

2013-09-05: Slightly rewrote the post based on feedback by Ben Pfaff. Thank you, Ben!

What are these tunnels?

The tunnels mentioned above are point-to-point GRE (or STT or VXLAN) virtual tunnel interfaces between Linux-based hypervisors. VXLAN implementations on Cisco Nexus 1000V, VMware vCNS or (probably) VMware NSX for vSphere don’t use tunnel interfaces (or at least we can’t see them from the outside).

Why do we need the tunnel interfaces?

The P2P overlay tunnels are an artifact of the OpenFlow-based forwarding implementation in Open vSwitch. The OpenFlow forwarding model assumes point-to-point interfaces (switch-to-switch or switch-to-host links) and cannot deal with multipoint interfaces (mGRE tunnels in Cisco IOS parlance).

The OpenFlow controller (Nicira NVP) thus cannot set the transport network next hop (the VTEP in VXLAN terms) on a multi-access tunnel interface in a forwarding rule; the only feasible workaround is to create numerous P2P tunnel interfaces, associating one (or more) of them with every potential destination VTEP.

The tunnel “interfaces” are not real Linux network devices; they are just rows in the OVS Interface table.
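As an illustration, here's roughly the equivalent of what the controller provisions, expressed with the standard ovs-vsctl CLI (the bridge name, port name, and VTEP address are made-up examples; NVP does this over the OVSDB protocol, not through the CLI):

```shell
# Create a P2P GRE tunnel port toward a hypothetical VTEP at 192.0.2.10
# (br-int and gre-peer1 are illustrative names).
ovs-vsctl add-port br-int gre-peer1 -- \
  set interface gre-peer1 type=gre options:remote_ip=192.0.2.10

# The tunnel shows up as a row in the OVSDB Interface table ...
ovs-vsctl list interface gre-peer1

# ... but not as a Linux network device:
ip link show gre-peer1   # fails: no such device
```

A VXLAN or STT tunnel is created the same way, with `type=vxlan` or `type=stt`.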

Do I have to care about them?

Absolutely not. They are auto-provisioned by the ovsdb-server daemon (which uses the OVSDB management protocol to communicate with the controller(s)), exist only on Linux hosts, and add no additional state to the transport network (apart from the MAC and ARP entries for the hypervisor hosts, which the transport network needs anyway).

Will they scale?

Short summary: Yes. The real scalability bottleneck is the controller and the number of hypervisor hosts it can manage.

Every hypervisor host has only the tunnels it needs. If a hypervisor host runs 50 VMs, and every VM belongs to a different logical subnet with another 50 VMs in the same subnet (scattered across 50 other hypervisor hosts), the host needs at most 2,500 tunnel interfaces going to 2,500 destination VTEPs.

Obviously, distributed L3 forwarding makes things an order of magnitude worse (more about that in a future blog post), but since each hypervisor host has a single tunnel to any given transport network destination, a host never has more tunnels than there are physical servers in your cloud. A single NVP controller cluster doesn’t scale beyond 5,000 hypervisors at the moment, which puts an upper bound on the number of tunnel interfaces a Linux host might need.
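To put numbers on the worst case described above, here's a back-of-the-envelope sketch (the function name and figures are illustrative, matching the example in the text):

```python
def worst_case_tunnels(vms_per_host, hosts_per_subnet, controller_limit=5000):
    """Worst-case number of P2P tunnel interfaces a hypervisor host needs.

    Assumes the pathological case from the text: every VM on the host
    sits in a different logical subnet, and all of its subnet peers run
    on distinct hypervisor hosts. The count is capped by the number of
    hypervisors the controller cluster can manage, because a host never
    has more than one tunnel per destination VTEP.
    """
    return min(vms_per_host * hosts_per_subnet, controller_limit)

# 50 VMs, each with subnet peers scattered across 50 other hosts:
print(worst_case_tunnels(50, 50))    # 2500

# Even a much larger setup is capped by the controller cluster limit:
print(worst_case_tunnels(200, 100))  # 5000
```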

So what’s all the fuss then?

As I wrote in the introductory paragraph, it’s pure FUD created by hardware vendors. Now that you know what’s going on behind the scenes, lean back and enjoy every time someone mentions it (and you might want to ask a few pointed questions ;).

Before leaving, check out these links

If you’re a regular reader you know what’s coming next.

I have several webinars focused on large-scale cloud networking solutions: Cloud Computing Networking, Overlay Virtual Networking and VMware NSX Architecture (this one is sponsored by VMware and thus free – register now).

3 comments:

  1. This post seems to conflate two different uses of the word "interface". Open vSwitch has a table named Interface, each row of which represents one OpenFlow port. NVP does currently populate the Interface table with one row per tunnel. This post seems to also use the word "interface" to refer to a Linux network device. Open vSwitch does not create a Linux network device per tunnel, mostly for scale reasons (e.g. "ifconfig" with thousands of tunnels would generate voluminous output and take forever, and we found that it made other software such as XAPI unacceptably slow). Older versions of Open vSwitch did create one Open vSwitch kernel datapath module port per tunnel, but we found a better way and current versions only create a single kernel port per tunnel type (e.g. one for GRE, one for VXLAN, ...).

    Recent versions of Open vSwitch do not actually require setting up an Interface row per tunnel. Instead, one may set up a single tunnel-based Interface and handle everything in the flow table.

    I believe that Open vSwitch can actually handle multicast GRE. We don't use it because it requires the physical network to be configured correctly for multicast. In my understanding, many are not.

    The OVSDB daemon is named ovsdb-server (not ovsdb-proto).

  2. Bhargav Bhikkaji05 September, 2013 06:24

    What about NVO HW GW?

    Another concern is monitoring these tunnels using protocols like BFD. The number of messages generated is going to be huge, to say nothing of the processing capacity required on the end hosts.

    -Bhargav


You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.