Myths That Refuse to Die: Scalability of Overlay Virtual Networking
If you watched the Network Field Day videos, you might have noticed an interesting (somewhat one-sided) argument I had with Sunay Tripathi, CTO and co-founder of Pluribus Networks (start watching at around 32:00 to get the context). Let’s try to set the record straight.
TL&DR Summary
Data plane performance of overlay virtual networks running on well-implemented virtual switches does not vary with the number of tenants, remote hypervisors (VTEPs), or virtual machines (MAC addresses or IP host routes). Hypervisor-based overlay virtual networks might have other scalability concerns, but forwarding performance is not one of them.
Basics
The crucial part of my argument was the difference between performance (how fast can you do something in software versus hardware) and scalability (how large can something grow). That nuance somehow got lost in the translation.
Early host-based VXLAN implementations actually had significant performance limitations (when someone calls 1Gbps "ludicrous speed" in 2014, you have to wonder how bad it was before that), and VMware clearly documented them in their technical white paper… but that topic will have to wait for another blog post.
Overlay Virtual Networking Scalability
In November 2014 I did a 2-hour public webinar on scalability challenges of overlay virtual networks, so you might want to watch that one (or at least the Distributed Data Plane video) first.
Disclosure: As you’ll notice in the introduction to each video, Nuage Networks sponsored the webinar, but I would never agree to work on a sponsored webinar if I didn’t believe in what I’d be telling you. It’s impossible to buy integrity once you compromise it.
The scalability aspect of overlay virtual networking we’re discussing here is the scalability of the hypervisor data plane – how many MACs, IPs and remote VTEPs can a hypervisor handle, and what’s the impact of a large-scale environment on forwarding performance.
Layer-2 forwarding in hardware or software is extremely simple:
- Extract destination MAC address from the packet;
- Look up the destination MAC address in a hash table. Hash table lookups take almost constant time when the table is sparsely populated, which is easier to achieve in software than in hardware;
- Send the packet to output port specified in the hash table entry, potentially adding tunnel encapsulation (VXLAN, PBB, VPLS…)
With proper implementation, the number of MAC addresses has absolutely no impact on the forwarding performance until the MAC hash table overflows, and the number of VTEPs doesn’t matter at all (the VTEP information needed to build the encapsulation headers is referenced from the MAC address entries).
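To make the argument concrete, here’s a minimal Python sketch of that forwarding path. The table contents, the names, and the simplified VXLAN header are illustrative assumptions, not any particular vSwitch implementation:

```python
import struct

# Illustrative per-segment MAC table: destination MAC -> (remote VTEP IP, VXLAN network ID).
# A real vSwitch keeps one such table per tenant segment; the contents here are made up.
mac_table = {
    "02:00:00:00:00:01": ("192.0.2.11", 5001),
    "02:00:00:00:00:02": ("192.0.2.12", 5001),
}

def forward_l2(frame: bytes):
    dst_mac = ":".join(f"{b:02x}" for b in frame[:6])   # step 1: extract destination MAC
    entry = mac_table.get(dst_mac)                      # step 2: single O(1) hash lookup
    if entry is None:
        return None                                     # unknown unicast: flood or drop
    vtep_ip, vni = entry
    # step 3: prepend a (simplified) VXLAN header; the outer IP/UDP headers toward
    # vtep_ip would be added by the host IP stack or the NIC.
    vxlan_header = struct.pack("!II", 0x08 << 24, vni << 8)
    return vtep_ip, vxlan_header + frame
```

The per-packet cost of the three steps stays the same whether mac_table holds a dozen or a million entries, which is exactly the point made above.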
Layer-3 forwarding is similar to layer-2 forwarding, but requires more complex data structures… or not. In the case of distributed L3 forwarding one could combine ARP entries and connected subnets into host routes (that’s what most ToR switches do these days) and do a simple hash-based lookup on destination IP addresses. Longest-prefix matches (for non-connected destinations) would still require a walk down an optimized tree structure.
It’s obvious that the number of tenants present on a hypervisor has zero impact on performance (every tenant has an independent forwarding table), the number of hosts in a tenant virtual network has almost no impact on performance (see the host routes and layer-2 forwarding discussion above), and the longest-prefix match can usually be done in two to four lookups (or more for IPv6). In many implementations the number of lookups doesn’t depend on the size of the forwarding table.
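A rough Python sketch of that two-stage layer-3 lookup, assuming host routes in a hash table with longest-prefix match as the fallback; the linear scan below stands in for the optimized tree walk mentioned above, and all table contents are made up:

```python
import ipaddress

# Illustrative per-tenant tables: exact host routes plus a tiny prefix table.
host_routes = {
    # host IP -> (remote VTEP, destination MAC) -- made-up values
    "10.1.1.10": ("192.0.2.11", "02:00:00:00:00:01"),
    "10.1.1.11": ("192.0.2.12", "02:00:00:00:00:02"),
}

prefix_table = [
    # (prefix, next hop), kept sorted from most to least specific
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.254"),
    (ipaddress.ip_network("0.0.0.0/0"), "192.0.2.1"),
]

def route(dst_ip: str):
    # Stage 1: exact-match host route -- a single hash lookup, independent of table size
    hit = host_routes.get(dst_ip)
    if hit:
        return hit
    # Stage 2: longest-prefix match for non-connected destinations.
    # Linear scan for clarity only; a real implementation walks a compressed trie,
    # so the lookup cost depends on prefix length, not on the number of prefixes.
    addr = ipaddress.ip_address(dst_ip)
    for prefix, next_hop in prefix_table:
        if addr in prefix:
            return (next_hop, None)
    return None
```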
Summary
From the forwarding performance standpoint a properly implemented virtual switch remains (almost) infinitely scalable. Suboptimal implementations might have scalability challenges, and every implementation eventually runs into controller scalability issues, which some vendors like Juniper (Contrail), Nuage (VSP) and Cisco (Nexus 1000V) solved with scale-out controller architecture.
Scalability of hypervisor-based overlay virtual networking might have been an issue in the early days of the technology. Talking about its challenges in 2015 is mostly FUD (physical-to-virtual connectivity is a different story).
Finally, the hardware table sizes (primarily the MAC and ARP table sizes) limit the scalability of hardware-based forwarding. Software-based forwarding has significantly higher limits (how many MAC addresses can you cram into 1GB of RAM?).
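A back-of-envelope answer to that rhetorical question, assuming roughly 64 bytes per MAC entry (an assumed figure covering the MAC address, VNI, VTEP IP and hash-table overhead):

```python
# Back-of-envelope: MAC table entries per 1 GB of RAM, assuming ~64 bytes per entry
# (MAC address, VNI, VTEP IP, plus hash-table overhead -- an assumed figure).
BYTES_PER_ENTRY = 64
RAM_BYTES = 1 * 1024**3

print(RAM_BYTES // BYTES_PER_ENTRY)   # 16,777,216 entries
```

That’s around 16 million entries, orders of magnitude more than the MAC tables of typical ToR switches.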
Want to know more?
- Read all the linked-to blog posts. Repeat for 2-3 levels of indirection ;)
- Watch the VMware NSX Architecture and Scaling Overlay Virtual Networks webinar;
- For even more details, watch the Overlay Virtual Networking webinar (which includes packet walks for all major hypervisor-based overlay virtual networking solutions).
More disclosure: Pluribus Networks presented @ NFD9. Presenting companies indirectly cover part of my travel expenses, but that never stopped me from expressing my own opinions.
The comparison should be made between bare metal with *no TCP offload* and a system running a vSwitch. Obviously, one may need to burn more CPUs for this.
Comparing vSwitch performance with a TCP-offloaded system doesn’t seem right, but comparing an offloaded vSwitch + TCP to a pure TCP-offloaded system probably is. Intelligent NICs have scalability limits too, so is it more beneficial to have intelligent NICs or to do this on hardware boxes? I’m not sure whether there are any pointers that take all the variables into consideration: number of CPUs, offload, vSwitch table size, cost.
Finally, performance and scalability are inversely proportional after a certain point. If one sizes the vSwitch table for 4K entries, it works without performance impact for 4K flows only (assuming hashing works perfectly); anything beyond 4K will cause performance issues. Of course, one could offset that by adding more CPUs.
On the topic of hash table sizing - most implementations resize the table once the load factor exceeds a certain limit. Resizing is admittedly hard in real-time environments, but even Wikipedia lists a few tricks you can use.
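For illustration, here’s a toy chained hash table that doubles its bucket array once the load factor exceeds a threshold. It’s purely a sketch; as mentioned above, real-time data planes would rehash incrementally rather than in one shot:

```python
class MacTable:
    """Toy chained hash table that grows its bucket array when the load factor gets too high."""

    def __init__(self, buckets=1024, max_load=0.75):
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0
        self.max_load = max_load

    def insert(self, mac, entry):
        if self.count / len(self.buckets) > self.max_load:
            self._resize(2 * len(self.buckets))
        self.buckets[hash(mac) % len(self.buckets)].append((mac, entry))
        self.count += 1

    def lookup(self, mac):
        for key, value in self.buckets[hash(mac) % len(self.buckets)]:
            if key == mac:
                return value
        return None

    def _resize(self, new_size):
        # One-shot rehash for simplicity; real-time implementations rehash incrementally
        # (a few buckets per operation) to avoid latency spikes.
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_size)]
        for bucket in old_buckets:
            for key, value in bucket:
                self.buckets[hash(key) % new_size].append((key, value))
```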
I am not sure about this, but from what I understood, Sunay was only saying that forwarding in software is not optimized for tunnels, only for plain TCP/IP/Ethernet. As a result, the third bullet in your list is actually (as currently implemented) quite expensive.
Due to this expense, they are trying to offload the third step of the list (the "adding tunnel encapsulation" part) away from traditional OS kernels, which are not optimized for this function.
These are not my opinions; this is just what I understood from reading and watching the video.
There are two separate claims here:
(A) Doing tunnel encapsulation in software is expensive;
(B) Hypervisor-based tunnels don’t scale; it’s better to terminate them on the ToR switches.
This blog post is focused on (B), tomorrow I'll cover (A) ;)
Waiting for the post about (A).