Are Overlay Networking Tunnels a Scalability Nightmare?
Every time I mention overlay virtual networking tunnels, someone starts worrying about the scalability of the approach, along the lines of: “In a data center with hundreds of hosts, won’t I end up with an impossibly high number of GRE tunnels in the full mesh? Are there scaling limitations to this approach?”
Not surprisingly, some ToR switch vendors abuse this fear to the point where they look downright stupid (but I guess that’s their privilege), so let’s set the record straight.
2013-09-05: Slightly rewrote the post based on feedback by Ben Pfaff. Thank you, Ben!
What are these tunnels?
The tunnels mentioned above are point-to-point GRE (or STT or VXLAN) virtual tunnel interfaces between Linux-based hypervisors. VXLAN implementations on Cisco Nexus 1000V, VMware vCNS or (probably) VMware NSX for vSphere don’t use tunnel interfaces (or at least we can’t see them from the outside).
Why do we need the tunnel interfaces?
The P2P overlay tunnels are an artifact of the OpenFlow-based forwarding implementation in Open vSwitch. The OpenFlow forwarding model assumes point-to-point interfaces (switch-to-switch or switch-to-host links) and cannot deal with multipoint interfaces (mGRE tunnels in Cisco IOS parlance).
The OpenFlow controller (Nicira NVP) thus cannot specify the transport network next hop (the VTEP in VXLAN terms) on a multi-access tunnel interface in a forwarding rule; the only feasible workaround is to create numerous P2P tunnel interfaces, associating one (or more) of them with every potential destination VTEP.
The tunnel 'interfaces' are no longer real Linux interfaces - they are just entries in the OVS Interface table.
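To make the “one P2P tunnel per destination VTEP” idea more tangible, here’s a minimal Python sketch that generates per-VTEP tunnel definitions and prints their ovs-vsctl equivalents (the CLI counterpart of the Interface rows the controller creates). The bridge name (br-int) and VTEP addresses are made up for illustration:

```python
# Sketch: build one point-to-point GRE tunnel definition per remote VTEP.
# The VTEP list and bridge name (br-int) are hypothetical; in an NVP/OVS
# deployment the controller pushes equivalent rows into the Interface table.

remote_vteps = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]  # example transport IPs

def tunnel_commands(bridge, vteps):
    """Return the ovs-vsctl equivalents of the per-VTEP tunnel Interface rows."""
    commands = []
    for index, vtep in enumerate(vteps):
        port = f"gre{index}"
        commands.append(
            f"ovs-vsctl add-port {bridge} {port} -- "
            f"set interface {port} type=gre options:remote_ip={vtep}"
        )
    return commands

if __name__ == "__main__":
    for cmd in tunnel_commands("br-int", remote_vteps):
        print(cmd)
```

The point is simply that every remote VTEP gets its own Interface row; none of this requires any per-tunnel state in the physical network.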
Do I have to care about them?
Absolutely not. They are auto-provisioned by the ovsdb-server daemon (which uses the OVSDB management protocol to communicate with the controller(s)), exist only on the Linux hosts, and add no additional state to the transport network (apart from the MAC and ARP entries for the hypervisor hosts, which the transport network needs anyway).
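If you’re curious what has been auto-provisioned on a given hypervisor, a read-only peek at the Interface table is all it takes. A small sketch using the standard ovs-vsctl find command via subprocess (the GRE tunnel type is an assumption; adjust for STT or VXLAN):

```python
# Sketch: list the auto-provisioned tunnel interfaces on a hypervisor host.
# Read-only inspection; there is nothing to configure or maintain by hand.
import subprocess

def list_tunnels(tunnel_type="gre"):
    """Dump name/options of every Interface row of the given tunnel type."""
    return subprocess.run(
        ["ovs-vsctl", "--columns=name,options", "find", "Interface", f"type={tunnel_type}"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(list_tunnels("gre"))
```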
Will they scale?
Short summary: Yes. The real scalability bottleneck is the controller and the number of hypervisor hosts it can manage.
Every hypervisor host has only the tunnels it needs. If a hypervisor host runs 50 VMs and every VM belongs to a different logical subnet with another 50 VMs in the same subnet (scattered across 50 other hypervisor hosts), the host needs 2500 tunnel interfaces going to 2500 destination VTEPs.
Obviously, distributed L3 forwarding makes things an order of magnitude worse (more about that in a future blog post), but because each hypervisor host has a single tunnel to any given transport network destination, a host never has more tunnels than there are physical servers in your cloud. Since a single NVP controller cluster doesn’t scale beyond 5000 hypervisors at the moment, that puts an upper bound on the number of tunnel interfaces a Linux host might need.
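To put some numbers behind the back-of-the-envelope math above, here’s a tiny Python sketch of the worst-case per-host tunnel count. The 50/50 figures come from the example above, the 5000-hypervisor figure is the NVP controller cluster limit just mentioned, and the cloud size is a made-up placeholder:

```python
# Sketch: worst-case per-host tunnel count from the example above.
vms_per_host = 50          # VMs running on one hypervisor host
peers_per_subnet = 50      # other hosts carrying VMs of the same logical subnet
controller_limit = 5000    # current NVP controller cluster limit (hypervisors)
hosts_in_cloud = 3000      # hypothetical size of your cloud

# Worst case: every VM talks to a disjoint set of remote hosts.
naive_tunnels = vms_per_host * peers_per_subnet          # 2500

# A host keeps one tunnel per remote transport destination, so the real
# number is capped by the number of other hypervisors in the cloud, which
# in turn is capped by what the controller cluster can manage.
tunnel_upper_bound = min(naive_tunnels, hosts_in_cloud - 1, controller_limit)

print(f"naive per-VM full-mesh count: {naive_tunnels}")
print(f"actual per-host upper bound:  {tunnel_upper_bound}")
```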
So what’s all the fuss then?
As I wrote in the introductory paragraph, it’s pure FUD created by hardware vendors. Now that you know what’s going on behind the scenes, lean back and enjoy the show every time someone mentions it (and you might want to ask a few pointed questions ;).
Recent versions of Open vSwitch do not actually require setting up an Interface row per tunnel. Instead, one may set up a single tunnel-based Interface and handle everything in the flow table.
I believe that Open vSwitch can actually handle multicast GRE. We don't use it because it requires the physical network to be configured correctly for multicast. In my understanding, many are not.
The OVSDB daemon is named ovsdb-server (not ovsdb-proto).
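For readers wondering what the flow-based approach from the first point above looks like, here’s a hedged sketch: a single tunnel port with no fixed remote_ip, plus per-destination flows that set the tunnel destination (tun_dst) in the flow table. The MAC-to-VTEP mappings and port names are made up for illustration:

```python
# Sketch: flow-based tunneling with a single tunnel Interface.
# Instead of one Interface row per remote VTEP, one tunnel port is created
# with remote_ip=flow, and the flow table picks the destination VTEP.

# Hypothetical mapping of remote VM MAC addresses to their hosts' VTEP IPs.
mac_to_vtep = {
    "52:54:00:aa:bb:01": "192.0.2.11",
    "52:54:00:aa:bb:02": "192.0.2.12",
}

setup = (
    "ovs-vsctl add-port br-int gre0 -- "
    "set interface gre0 type=gre options:remote_ip=flow"
)

flows = [
    f"ovs-ofctl add-flow br-int "
    f"'dl_dst={mac},actions=set_field:{vtep}->tun_dst,output:gre0'"
    for mac, vtep in mac_to_vtep.items()
]

if __name__ == "__main__":
    print(setup)
    for flow in flows:
        print(flow)
```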
Another concern is monitoring these tunnels with protocols like BFD. The number of messages generated is going to be huge, to say nothing of the processing load on the end hosts.
-Bhargav