Performance of Hypervisor-Based Overlay Virtual Networking

Years ago I managed to saturate a 10GE uplink on a vSphere server I tested with a single Linux VM using less than one vCPU. On the other hand, squeezing 1 Gbps out of Open vSwitch using GRE encapsulation was called ludicrous speed not so long ago. Implementing overlay virtual networking in the hypervisor obviously carries a huge performance penalty, right? Not so fast…

TL&DR Summary

Just because the virtual switch release you chose for your deployment doesn’t perform as fast as you’d like doesn’t mean that line-rate overlay virtual networking cannot be done in hypervisor kernels (despite claims to the contrary).

Data points

It seems hypervisor-based virtual networking sucks:

  • There have been multiple reports of out-of-the-box OVS pushing around 1 Gbps with GRE encapsulation (use your Google-Fu to find them);
  • Nicira decided to use STT encapsulation to improve performance for I/O-intensive workloads;

On the other hand, hypervisor-based virtual networking might rock:

  • VMware measured only small performance loss with VXLAN encapsulation in vSphere 5.1;
  • VMware claims NSX achieves line-rate throughput on two 10GE uplinks with reasonable CPU overhead in their NET1883 VMworld presentation (not sure how you can get it online);

What’s going on?

The Root Cause

There’s a reason we get hugely disparate performance reports. It’s called TCP Offload.

Long story short: the TCP/IP stack in the Linux kernel is slow. The last time I looked, kernel-based packet-by-packet processing on a single CPU core resulted in ~3 Gbps throughput at 1500-byte MTU. Disagree? Write a comment!

An obvious way to increase performance is to bypass the Linux kernel as much as possible. NIC-based TCP offload helps regular TCP-based applications; high-performance solutions use a custom TCP/IP stack (examples: Intel DPDK, 6WIND, A10, LineRate Systems – now F5).

Want to measure the impact of TCP offload? Disable it on a Linux VM and run your favorite TCP performance test (and post the results in a comment ;).
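
If you want a ready-made starting point for that experiment, here is a minimal sketch of mine (not from any vendor documentation). It assumes a Linux guest with ethtool installed, root privileges, a hypothetical interface name (eth0) and peer address, and something on the peer already accepting and discarding TCP traffic on the chosen port.

    #!/usr/bin/env python3
    # Rough sketch: compare TCP throughput with NIC offloads on and off.
    # Assumptions (not from the article): interface name, peer address and port
    # are placeholders; ethtool and root privileges are required. Some virtual
    # NICs refuse to change certain features -- trim the list if ethtool complains.
    import socket
    import subprocess
    import time

    IFACE = "eth0"                 # hypothetical guest interface
    PEER = ("192.0.2.10", 5001)    # hypothetical sink that discards received data
    DURATION = 10                  # seconds per run
    BLOCK = b"\0" * 65536          # 64 KB application writes

    def set_offloads(enabled: bool) -> None:
        """Toggle TSO/GSO/GRO/LRO on the interface via ethtool."""
        state = "on" if enabled else "off"
        subprocess.run(["ethtool", "-K", IFACE, "tso", state, "gso", state,
                        "gro", state, "lro", state], check=True)

    def measure() -> float:
        """Send data to the peer for DURATION seconds; return Gbps."""
        sent = 0
        with socket.create_connection(PEER) as s:
            deadline = time.time() + DURATION
            while time.time() < deadline:
                sent += s.send(BLOCK)
        return sent * 8 / DURATION / 1e9

    if __name__ == "__main__":
        for enabled in (True, False):
            set_offloads(enabled)
            print(f"offloads {'on' if enabled else 'off'}: {measure():.2f} Gbps")
        set_offloads(True)         # leave the interface in its usual state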

Summary: anything that interferes with VM NIC TCP offload capabilities will kill the forwarding performance.

Keeping the TCP Offload Running

The newest generation of server NICs (Intel XL710, Emulex OneConnect OCe14000, Mellanox ConnectX-3 Pro) supports full TCP offload functionality with VXLAN encapsulation, and both vSphere and Hyper-V can use their enhanced functionality. If you use one of these NICs and still experience a significant performance drop with overlay virtual networking, it’s time to have a serious talk with whoever wrote the code for your virtual switch.

Widely deployed NICs (example: Intel 82599) cannot do full TCP offload (TCP segmentation and receive-side coalescing) in combination with tunneling protocols like VXLAN. Virtual switch implementations could either:

  • Disable TCP offload in the VM NICs and pass MTU-sized packets between the VMs and the physical NIC (seems like this might be what some versions of OVS are doing);
  • Implement TCP offload functionality in the virtual switch or device driver (a rough sketch of this idea follows the list).
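
To make the second option less abstract, here is a conceptual sketch (my own simplification, not anyone’s shipping code) of what software TCP segmentation means for a virtual switch: an oversized segment handed down by the guest gets chopped into MSS-sized pieces with adjusted sequence numbers before each piece is VXLAN-encapsulated. The TcpSegment class is a toy placeholder, not a real packet parser.

    # Conceptual sketch of software TSO in a virtual switch (simplified, hypothetical).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TcpSegment:
        seq: int            # TCP sequence number
        payload: bytes      # payload; may exceed the MSS when the guest uses TSO
        psh: bool = False   # PSH flag belongs only on the last piece

    def software_tso(segment: TcpSegment, mss: int) -> List[TcpSegment]:
        """Split one oversized segment into MSS-sized segments,
        advancing the sequence number for every piece."""
        pieces: List[TcpSegment] = []
        offset = 0
        while offset < len(segment.payload):
            chunk = segment.payload[offset:offset + mss]
            last = offset + mss >= len(segment.payload)
            pieces.append(TcpSegment(seq=segment.seq + offset,
                                     payload=chunk,
                                     psh=segment.psh and last))
            offset += mss
        return pieces

    # Each piece would still need its inner IP header, the outer VXLAN/UDP/IP
    # headers, and recomputed checksums before being handed to the physical NIC.
    big = TcpSegment(seq=1000, payload=b"x" * 4000, psh=True)
    for piece in software_tso(big, mss=1460):
        print(piece.seq, len(piece.payload), piece.psh)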

I spent hours reading the Intel 82599 datasheet, and it seems there are creative ways a device driver programmer could use the existing hardware to get TCP segmentation to work with VXLAN encapsulation (email me if you want more details), but it’s absolutely impossible to get receive-side coalescing (RSC) of TCP-over-VXLAN streams to work in hardware.

However, as VXLAN uses UDP encapsulation, processing of incoming VXLAN packets can be spread across multiple CPU cores (Receive Side Scaling – RSS), resulting in higher overall throughput, and it seems that’s exactly the trick VMware’s vSwitch is using.
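
For illustration, here is a small sketch of how a typical VXLAN encapsulator enables that trick (my assumption about common practice, not a description of VMware’s code): the outer UDP source port is derived from a hash of the inner flow, so the receiving NIC’s RSS hash steers different inner flows to different queues and cores.

    # Sketch: deriving the outer UDP source port from the inner flow (hypothetical).
    import zlib

    VXLAN_DST_PORT = 4789                        # IANA-assigned VXLAN destination port
    EPHEMERAL_MIN, EPHEMERAL_MAX = 49152, 65535  # dynamic/private port range

    def outer_udp_source_port(src_ip: str, dst_ip: str,
                              src_port: int, dst_port: int,
                              proto: int = 6) -> int:
        """Hash the inner 5-tuple into the ephemeral port range so RSS on the
        receiving NIC can spread different inner flows across cores."""
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return EPHEMERAL_MIN + zlib.crc32(key) % (EPHEMERAL_MAX - EPHEMERAL_MIN + 1)

    # Two different inner TCP flows between the same hosts usually end up with
    # different outer source ports, and therefore on different receive queues.
    print(outer_udp_source_port("10.0.0.1", "10.0.0.2", 33000, 80))
    print(outer_udp_source_port("10.0.0.1", "10.0.0.2", 33001, 80))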

My contacts within VMware tell me that the existing vSphere drivers for Intel 82599-based NICs support TCP segmentation with VXLAN encapsulation (solving the transmit-side performance issues) and that it takes a single command to enable RSS on an ESXi host.

2 comments:

  1. Also be very careful with shared storage and client-mounted storage in a hypervisor or shared environment. Depending upon the use case, latency can be the killer more so than throughput here (storage / voice). Disable TSO/GSO/LRO/GRO on guests and on the hypervisor if mounting remote storage and for non-local flows. However, watch the video below to see the 3-5 Gbps observed limits vs. 10 Gbps... profound consequences for VXLAN and any encapsulation/fragmentation, I would have thought, etc...

    === *OLDY BUT A MUST WATCH* ===

    Virtual Networking Performance: LinuxConfAU 2011 – Stephen Hemminger (Linux kernel developer, also of Vyatta)
    https://www.youtube.com/watch?v=acHXGURF070
    4m55s 15 Rules :)
    7m40s LRO not good for any routing/classifying.. only for guest to guest...

    Note: VMWare: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2055140
  2. RPS (in software) also makes quite a difference if RSS isn't available, but obviously at the cost of CPU overhead. You should try and get Tom Herbert of Google on the podcast; that would be something for those of us digging deeper into this subject.

    As per Donal, worth noting most offload features will probably cause problems if you are forwarding/routing.

    Cheers