The first question everyone asked after Nicira had published yet another MAC-over-IP tunneling draft was probably “do we really need yet another encapsulation scheme? Aren’t VXLAN or NVGRE enough?” Bruce Davie tried to answer that question in his blog post (and provided more details in another one), and I’ll try to make the answer a bit more graphical.
The three drafts (VXLAN, NVGRE and STT) have the same goal: provide emulated layer-2 Ethernet networks over scalable IP infrastructure. The main differences between them are the encapsulation format and the approach to the control plane:
- VXLAN ignores the control plane problem and relies on flooding emulated with IP multicast;
- NVGRE authors handwave over the control plane issue (“the way to obtain [MAC-to-IP mapping] information is not covered in this document”);
- STT authors claim the draft describes just the encapsulation format.
Everything else being equal, why does STT make sense at all? The answer lies in the capabilities of modern server NICs.
TCP Segmentation Offload
Applications using TCP (for example, a web server) are not aware of the intricacies of TCP (window size, maximum segment size, retransmissions) and perceive a TCP connection as a reliable byte stream. Applications send streams of bytes to an open socket, and the operating system’s TCP/IP stack slices and dices the data into individual TCP+IP packets, prepends a MAC header (built from the ARP cache) to each IP header, and sends the L2 frames to the Network Interface Card (NIC) for transmission.
Modern NICs allow the TCP stacks to offload some of the heavy lifting to the hardware – most commonly the segmentation and reassembly (retransmissions are still performed in software). A TCP stack using a Large Segment Offload (LSO)-capable NIC would send a single oversized MAC frame to the NIC, and the NIC would slice it into properly sized TCP segments (advancing the sequence numbers and computing IP+TCP checksums while doing that).
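To make the slicing step more concrete, here is a minimal Python sketch of what an LSO-capable NIC does with one oversized send. It is a toy model, not driver or firmware code: the function name `lso_segment`, the MSS of 1460 bytes and the initial sequence number are assumptions made for illustration, and checksum computation is left out.

```python
def lso_segment(payload: bytes, mss: int, initial_seq: int):
    """Split one large send into MSS-sized (sequence_number, chunk) pairs.

    Hypothetical helper: a real NIC also copies the MAC/IP/TCP headers
    onto every segment and recomputes lengths and checksums.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers count bytes and wrap at 2^32.
        seq = (initial_seq + offset) % 2**32
        segments.append((seq, chunk))
    return segments


# One 64000-byte send handed to the "NIC" in a single call.
payload = b"x" * 64000
segments = lso_segment(payload, mss=1460, initial_seq=1000)
print(len(segments), "segments; last one carries", len(segments[-1][1]), "bytes")
```

Without offload, the operating system would run the equivalent of this loop itself and hand every resulting frame to the NIC separately.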
LSO significantly increases TCP performance. If you don’t believe me (and you shouldn’t), run iperf tests on your server with TCP offload turned on and off (and report your results in a comment).
MAC-over-IP kills TCP offload
Typical NICs can segment TCP-IP-MAC frames. They cannot segment TCP-IP-MAC-VXLAN-UDP-IP-MAC frames (or TCP-IP-MAC-NVGRE-IP-MAC frames). Sending L2 frames over VXLAN or NVGRE incapacitates TCP offload on most server NICs available today (I didn’t want to write all – if you’re aware of a NIC that could actually handle IP-over-MAC-over-GRE encapsulation, please write a comment). Does that matter? Do the tests I suggested in the previous section to figure out whether it matters to you.
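The header stacks above can be turned into a small toy model. The sketch below assumes (as a simplification, not a description of any particular NIC) that the offload engine only looks at the three outermost headers and gives up unless they are MAC, IP and TCP. Note that the lists are written from the outermost header inwards, which is the reverse of the notation used in the text; all names are hypothetical.

```python
# Header stacks, outermost header first (the order a NIC parses them in).
PLAIN_TCP = ["MAC", "IP", "TCP"]
VXLAN = ["MAC", "IP", "UDP", "VXLAN", "MAC", "IP", "TCP"]
NVGRE = ["MAC", "IP", "NVGRE", "MAC", "IP", "TCP"]


def nic_can_segment(stack):
    """Toy rule: offload works only if the outermost headers are MAC, IP, TCP."""
    return stack[:3] == ["MAC", "IP", "TCP"]


for name, stack in [("plain TCP", PLAIN_TCP), ("VXLAN", VXLAN), ("NVGRE", NVGRE)]:
    verdict = "offload works" if nic_can_segment(stack) else "no offload"
    print(f"{name}: {verdict}")
```

In both tunneled cases the TCP header the NIC would need is still there, but it is buried behind an outer UDP or GRE header that the (assumed) parser does not look past.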
STT – a clever TCP offload hack
STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform Large Segment Offload on what it thinks is a TCP segment.
The reality behind the scenes is a bit more complex: what gets handed to the NIC is an oversized TCP-IP-MAC frame (up to 64K long) with an STT-IP-MAC header in front of it. The “TCP” segments produced by the NIC are thus not actual TCP segments, but segments of the STT frame passed to the NIC.
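A rough Python sketch of that idea follows. It is only an illustration under stated assumptions: `STT_HEADER_PLACEHOLDER` is a made-up byte string and not the real STT header layout, the outer TCP-like, IP and MAC headers are not modeled at all, and the function names are hypothetical. The point is merely that the NIC slices the whole STT frame, so the STT header and the inner frame's own headers appear only in the first segment, and the receiver has to put the pieces back together.

```python
STT_HEADER_PLACEHOLDER = b"<stt-header>"  # stand-in bytes, NOT the real format


def stt_encapsulate(inner_frame: bytes) -> bytes:
    """Prepend the placeholder STT header to a complete inner L2 frame."""
    return STT_HEADER_PLACEHOLDER + inner_frame


def nic_segment(stt_frame: bytes, mss: int):
    """What the NIC believes is TCP segmentation: (offset, chunk) pairs."""
    return [(offset, stt_frame[offset:offset + mss])
            for offset in range(0, len(stt_frame), mss)]


def reassemble(segments) -> bytes:
    """Receiver side: order the chunks by offset and glue them together."""
    return b"".join(chunk for _, chunk in sorted(segments, key=lambda s: s[0]))


# An oversized inner frame: pretend headers followed by 60000 bytes of data.
inner_frame = b"[inner MAC+IP+TCP headers]" + b"d" * 60000
stt_frame = stt_encapsulate(inner_frame)
segments = nic_segment(stt_frame, mss=1460)

print(len(segments), "segments produced from one STT frame")
print("STT header present in first segment only:",
      [chunk.startswith(STT_HEADER_PLACEHOLDER) for _, chunk in segments].count(True) == 1)
```

Because every segment after the first one starts in the middle of the inner frame, none of them is meaningful on its own; only the reassembled STT frame yields the original L2 frame.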
Why do we have three different standards?
Here’s my cynical view: every single vendor launching a MAC-over-IP encapsulation scheme tried to make its life easy. Cisco already has VXLAN-capable hardware (VXLAN header format is similar to OTV and LISP), you can probably figure out who has GRE-capable hardware by going through the list of NVGRE draft authors, and Nicira focused on what they see as the most important piece of the puzzle – the performance of the servers where the VMs are running.
Randy Bush called this approach to standard development “throwing spaghetti at the wall to see what sticks”, which is definitely an amusing pastime … unless you happen to be the wall.