Years ago I managed to saturate a 10GE uplink on a vSphere server I tested with a single Linux VM using less than one vCPU. On the other hand, squeezing 1 Gbps out of Open vSwitch using GRE encapsulation was called ludicrous speed not so long ago. Implementing overlay virtual networking in the hypervisor obviously carries a huge performance penalty, right? Not so fast…
Just because the release of the virtual switch you chose for your deployment isn’t as fast as you’d like doesn’t mean that line-rate overlay virtual networking cannot be done in hypervisor kernels (despite claims to the contrary).
It seems hypervisor-based virtual networking sucks:
- There have been multiple reports of out-of-box OVS pushing around 1 Gbps with GRE encapsulation (use your Google-Fu to find them);
- Nicira decided to use STT encapsulation to improve performance for I/O intensive workloads.
On the other hand, hypervisor-based virtual networking might rock:
- VMware measured only a small performance loss with VXLAN encapsulation in vSphere 5.1;
- VMware claims NSX achieves line-rate throughput on two 10GE uplinks with reasonable CPU overhead in their NET1883 VMworld presentation (not sure how you can get it online).
What’s going on?
The Root Cause
There’s a reason we get hugely disparate performance reports. It’s called TCP Offload.
Long story short: the TCP/IP stack in the Linux kernel is slow. The last time I looked, kernel-based packet-by-packet processing on a single CPU core resulted in ~3 Gbps throughput at 1500-byte MTU. Disagree? Write a comment!
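To put the ~3 Gbps figure in perspective, here is a back-of-the-envelope calculation of how many packets per second a core has to handle and how much time it gets per packet. It is only a sketch: it ignores Ethernet framing overhead, and the 64 KB offloaded segment size is an assumption used for illustration, not a measured value.

```python
def packets_per_second(throughput_bps: float, packet_bytes: int) -> float:
    """Packets per second needed to sustain a throughput with fixed-size packets."""
    return throughput_bps / (packet_bytes * 8)


def per_packet_budget_us(throughput_bps: float, packet_bytes: int) -> float:
    """Time (in microseconds) available to process each packet on one core."""
    return 1e6 / packets_per_second(throughput_bps, packet_bytes)


if __name__ == "__main__":
    # ~3 Gbps at 1500-byte MTU, the single-core figure quoted above
    print(packets_per_second(3e9, 1500), "packets/s")
    print(per_packet_budget_us(3e9, 1500), "us per packet")

    # 10GE line rate: MTU-sized packets vs. assumed 64 KB offloaded segments
    print(per_packet_budget_us(10e9, 1500), "us per MTU-sized packet")
    print(per_packet_budget_us(10e9, 65536), "us per 64 KB segment")
```

The point of the comparison: when the NIC does the segmentation, the software stack handles a few dozen times fewer units of work for the same throughput.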
An obvious way to increase performance is to bypass the Linux kernel as much as possible. NIC-based TCP offload helps regular TCP-based applications. High-performance solutions bypass the kernel with a custom packet-processing or TCP/IP stack (examples: Intel DPDK, 6Wind, A10, Linerate Systems, now F5).
Want to measure the impact of TCP offload? Disable it on a Linux VM and run your favorite TCP performance test (and post the results in a comment ;).
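If you want to try this, a small helper along these lines could toggle the offloads between test runs. It is a sketch, not a tested tool: it assumes `ethtool` is installed in the VM, that the interface is called `eth0` (a placeholder), and that the driver accepts all four feature names; some drivers will refuse to change some of them.

```python
import subprocess
from typing import List

# Offload features toggled with `ethtool -K` (segmentation and receive coalescing)
OFFLOADS = ("tso", "gso", "gro", "lro")


def ethtool_offload_cmd(interface: str, enable: bool) -> List[str]:
    """Build the ethtool command line that turns the offloads on or off."""
    state = "on" if enable else "off"
    cmd = ["ethtool", "-K", interface]
    for feature in OFFLOADS:
        cmd += [feature, state]
    return cmd


def set_offloads(interface: str, enable: bool, dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it; executing requires root."""
    cmd = ethtool_offload_cmd(interface, enable)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder interface name; dry run only prints the command
    set_offloads("eth0", enable=False)
```

Run your TCP test once with the offloads enabled and once with them disabled, and compare both throughput and CPU utilization.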
Summary: anything that interferes with VM NIC TCP offload capabilities will kill the forwarding performance.
Keeping the TCP Offload Running
The newest generation of server NICs (Intel XL710, Emulex OneConnect® OCe14000, Mellanox ConnectX-3 Pro) supports full TCP offload functionality with VXLAN encapsulation, and both vSphere and Hyper-V can use their enhanced functionality. If you use one of these NICs and still experience a significant performance drop with overlay virtual networking, it’s time to have a serious talk with whoever wrote the code for your virtual switch.
Widely deployed NICs (example: Intel 82599) cannot do full TCP offload (TCP segmentation and receive-side coalescing) in combination with tunneling protocols like VXLAN. Virtual switch implementations could either:
- Disable TCP offload in VM NICs and pass MTU-sized packets between VMs and the physical NIC (seems like this might be what some versions of OVS are doing);
- Implement TCP offload in the virtual switch or device driver.
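To make the second option more concrete, here is a minimal sketch of the transmit-side work a virtual switch would take over: splitting one large send from the VM into MSS-sized TCP segments, each of which is then encapsulated separately. This is an illustration only, not how any particular virtual switch does it; it assumes IPv4, no TCP/IP options and no VLAN tags, and the 64 KB send size is a hypothetical value.

```python
from typing import List, Tuple

# Bytes VXLAN adds to every inner IP packet (IPv4 underlay, no VLAN tags):
# inner Ethernet (14) + VXLAN (8) + outer UDP (8) + outer IPv4 (20)
VXLAN_OVERHEAD = 50
INNER_HEADERS = 40  # inner IPv4 (20) + TCP (20), no options


def segment(payload_len: int, mss: int, start_seq: int = 0) -> List[Tuple[int, int]]:
    """Split a large TCP send into (sequence number, length) segments of at most mss bytes."""
    segments = []
    offset = 0
    while offset < payload_len:
        length = min(mss, payload_len - offset)
        # TCP sequence numbers wrap around at 2**32
        segments.append(((start_seq + offset) % 2**32, length))
        offset += length
    return segments


def outer_ip_packet_size(segment_len: int) -> int:
    """Size of the outer (underlay) IP packet carrying one encapsulated segment."""
    return segment_len + INNER_HEADERS + VXLAN_OVERHEAD


if __name__ == "__main__":
    segs = segment(payload_len=65536, mss=1460)
    print(len(segs), "segments, last one", segs[-1][1], "bytes")
    print("outer IP packet for a full segment:", outer_ip_packet_size(1460), "bytes")
```

Note that a full-sized segment no longer fits into a 1500-byte underlay MTU once the encapsulation headers are added, so either the transport network MTU has to grow or the VM MTU has to shrink.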
I spent hours reading the Intel 82599 datasheet and it seems there are creative ways a device driver programmer could use the existing hardware to get TCP segmentation to work with VXLAN encapsulation (email me if you want more details), but it’s absolutely impossible to get receive-side coalescing (RSC) of TCP-over-VXLAN streams to work in hardware.
However, as VXLAN uses UDP encapsulation, it’s possible to spread processing of incoming VXLAN packets across multiple CPU cores (Receive Side Scaling, RSS), resulting in higher overall throughput, and it seems that’s exactly the trick VMware’s vSwitch is using.
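The reason RSS can work with VXLAN is that the sender is encouraged (RFC 7348) to derive the outer UDP source port from a hash of the inner flow, so different VM-to-VM flows between the same pair of hypervisors look like different UDP flows to the receiving NIC. The sketch below illustrates the idea. It is not the real algorithm: NICs typically use a Toeplitz hash, while this example substitutes CRC32 for simplicity, and it assumes the NIC is configured to include UDP ports in the RSS hash. All addresses and names are made up.

```python
import zlib

VXLAN_PORT = 4789  # IANA-assigned VXLAN destination port


def vxlan_source_port(inner_flow: tuple) -> int:
    """Pick the outer UDP source port from a hash of the inner flow (range 49152-65535)."""
    digest = zlib.crc32(repr(inner_flow).encode())
    return 49152 + digest % 16384


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int, num_queues: int) -> int:
    """Map the outer 4-tuple to a receive queue (CRC32 as a stand-in for the NIC hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    # Several inner TCP flows tunneled between the same two (hypothetical) hypervisors
    for inner_src_port in range(40000, 40008):
        inner = ("10.0.0.1", "10.0.0.2", "tcp", inner_src_port, 80)
        sport = vxlan_source_port(inner)
        queue = rss_queue("192.0.2.1", "192.0.2.2", sport, VXLAN_PORT, num_queues=4)
        print(inner_src_port, "-> outer source port", sport, "-> queue", queue)
```

All packets of one inner flow land on the same queue (and thus the same core), which preserves packet ordering within a flow while different flows are spread across cores.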
My contacts within VMware tell me that the existing vSphere drivers for Intel 82599-based NICs support TCP segmentation with VXLAN encapsulation (solving the transmit-side performance issues) and that it takes a single command to enable RSS on an ESXi host.