Do we really need Stateless Transport Tunneling (STT)?

The first question everyone asked after Nicira had published yet another MAC-over-IP tunneling draft was probably “do we really need yet another encapsulation scheme? Aren’t VXLAN or NVGRE enough?” Bruce Davie tried to answer that question in his blog post (and provided more details in another one), and I’ll try to make the answer a bit more graphical.

The three drafts (VXLAN, NVGRE and STT) have the same goal: provide emulated layer-2 Ethernet networks over scalable IP infrastructure. The main differences between them are the encapsulation format and the approach to the control plane:

  • VXLAN ignores the control plane problem and relies on flooding emulated with IP multicast;
  • NVGRE authors handwave over the control plane issue (“the way to obtain [MAC-to-IP mapping] information is not covered in this document”);
  • STT authors claim the draft describes just the encapsulation format.

Everything else being equal, why does STT make sense at all? The answer lies in the capabilities of modern server NICs.

TCP Segmentation Offload

Applications using TCP (for example, a web server) are not aware of the intricacies of TCP (window size, maximum segment size, retransmissions) and perceive a TCP connection as a reliable byte stream. Applications send streams of bytes to an open socket and the operating system’s TCP/IP stack slices and dices the data into individual TCP+IP packets, prepends a MAC header (built from the ARP cache) to each IP packet, and sends the L2 frames to the Network Interface Card (NIC) for transmission.

Modern NICs allow the TCP stacks to offload some of the heavy lifting to the hardware – most commonly the segmentation and reassembly (retransmissions are still performed in software). A TCP stack using a Large Segment Offload (LSO)-capable NIC would send a single oversized MAC frame to the NIC and the NIC would slice it into properly sized TCP segments (adjusting sequence numbers and length fields and computing IP+TCP checksums while doing that).
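
To make the slicing step concrete, here is a minimal Python sketch of what an LSO-capable NIC does with one oversized send. It is an illustration only, not a model of any particular NIC: the names Segment and lso_segment are made up for this example, headers and checksums are left out, and only the payload and the sequence number are tracked.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes  # at most MSS bytes of application data


def lso_segment(payload: bytes, initial_seq: int, mss: int) -> list[Segment]:
    """Slice one oversized send into MSS-sized segments (hypothetical helper)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers are 32-bit and wrap around.
        segments.append(Segment((initial_seq + offset) % 2**32, chunk))
    return segments


if __name__ == "__main__":
    # One 65000-byte send handed down by the TCP stack, MSS of 1460 bytes.
    segs = lso_segment(bytes(65000), initial_seq=1000, mss=1460)
    print(len(segs), "segments, last one carries", len(segs[-1].payload), "bytes")
```

Without offload, the operating system runs this loop (plus header construction and checksums) on the CPU for every segment; with LSO it hands over the whole buffer once.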

LSO significantly increases TCP performance. If you don’t believe me (and you shouldn’t), run iperf tests on your server with TCP offload turned on and off (and report your results in a comment).

MAC-over-IP kills TCP offload

Typical NICs can segment TCP-IP-MAC frames. They cannot segment TCP-IP-MAC-VXLAN-UDP-IP-MAC frames (or TCP-IP-MAC-NVGRE-IP-MAC frames). Sending L2 frames over VXLAN or NVGRE incapacitates TCP offload on most server NICs available today (I didn’t want to write all – if you’re aware of a NIC that could actually handle IP-over-MAC-over-GRE encapsulation, please write a comment). Does that matter? Do the tests I suggested in the previous section to figure out whether it matters to you.
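
The argument can be expressed as a toy model in Python. The sketch assumes a simple NIC that offers segmentation offload only when the first three headers on the wire are MAC, IP and TCP; real hardware varies, and the function names below are invented for this example. Header stacks are written innermost-first, as in the text above.

```python
def wire_order(stack: str) -> list[str]:
    """Turn an innermost-first stack like 'TCP-IP-MAC' into outermost-first order."""
    return stack.split("-")[::-1]


def simple_nic_can_segment(stack: str) -> bool:
    """Toy NIC model (an assumption): offload works only for MAC, IP, TCP up front."""
    return wire_order(stack)[:3] == ["MAC", "IP", "TCP"]


if __name__ == "__main__":
    for stack in (
        "TCP-IP-MAC",                   # plain TCP session
        "TCP-IP-MAC-VXLAN-UDP-IP-MAC",  # VM traffic over VXLAN
        "TCP-IP-MAC-NVGRE-IP-MAC",      # VM traffic over NVGRE
    ):
        print(stack, "->", simple_nic_can_segment(stack))
```

In both tunneled cases the NIC finds UDP or GRE where it expects TCP, so the real TCP header is buried too deep for the offload engine and the hypervisor has to segment in software.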

STT – a clever TCP offload hack

STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform Large Segment Offload on what it thinks is a TCP segment.

The reality behind the scenes is a bit more complex: what gets handed to the NIC is an oversized TCP-IP-MAC frame (up to 64K long) with an STT-IP-MAC header. The “TCP” segments produced by the NIC are thus not the actual TCP segments, but segments of the STT frame passed to the NIC.
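
A rough Python sketch of the idea follows. It is not the STT wire format: the fields frame_id, offset and total are hypothetical stand-ins for whatever the TCP-like header carries, and the function names are made up. The point is only that the NIC slices an opaque blob (STT header plus the complete inner frame) and the receiver has to glue the pieces back together before it can look at the inner frame.

```python
from dataclasses import dataclass


@dataclass
class SttSegment:
    frame_id: int  # hypothetical: which STT frame this piece belongs to
    offset: int    # byte offset of this piece within the STT frame
    total: int     # total length of the STT frame
    data: bytes


def nic_segment(stt_frame: bytes, frame_id: int, mss: int) -> list[SttSegment]:
    """What the NIC believes is TCP segmentation: slice the blob every MSS bytes."""
    return [
        SttSegment(frame_id, off, len(stt_frame), stt_frame[off:off + mss])
        for off in range(0, len(stt_frame), mss)
    ]


def reassemble(segments: list[SttSegment]) -> bytes:
    """Receiver side: rebuild the STT frame, whatever order the pieces arrived in."""
    frame = bytearray()
    for seg in sorted(segments, key=lambda s: s.offset):
        if seg.offset != len(frame):
            raise ValueError("missing or overlapping segment")
        frame += seg.data
    if not segments or len(frame) != segments[0].total:
        raise ValueError("incomplete STT frame")
    return bytes(frame)


if __name__ == "__main__":
    # Placeholder STT header followed by a large inner TCP-IP-MAC frame.
    stt_frame = b"<stt-header>" + bytes(range(256)) * 200
    pieces = nic_segment(stt_frame, frame_id=1, mss=1448)
    assert reassemble(pieces[::-1]) == stt_frame
    print(len(pieces), "segments for a", len(stt_frame), "byte STT frame")
```

Only the first piece contains the STT header and the inner MAC, IP and TCP headers; the remaining pieces are raw continuation bytes, which is why they are not TCP segments in any real sense.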

Why do we have three different standards?

Here’s my cynical view: every single vendor launching a MAC-over-IP encapsulation scheme tried to make its life easy. Cisco already has VXLAN-capable hardware (VXLAN header format is similar to OTV and LISP), you can probably figure out who has GRE-capable hardware by going through the list of NVGRE draft authors, and Nicira focused on what they see as the most important piece of the puzzle – the performance of the servers where the VMs are running.

Randy Bush called this approach to standard development “throwing spaghetti at the wall to see what sticks”, which is definitely an amusing pastime … unless you happen to be the wall.

5 comments:

  1. I enjoyed the "approach to standard development" PDF that you referenced, right on!

  2. Brent Salisbury, 18 March, 2012 17:53

    What I love about the STT approach and the whole business model of the truly software SDN startups, is being agnostic to the hardware underneath. Abstraction will allow for commoditization of the lower ends of the stack. TOE is on every NIC out there. Very elegant solution for what I hope is a short term problem but am starting to think tunnels will be all that is left someday. Build a tunnel to gmail to check my mail etc.

    I just hope I will see a crumb of the magical creatures from fantasy land over the next 20 years :)

    Great post as always Ivan!

    https://twitter.com/#!/CCIE11972

  3. Do you know what the firewalls and IPSs will do with your STT frames?

    I can tell you something pretty bad for them...

    Unless you flirt with your security admin guy :)

    1. If STT frames traverse firewalls and IPSs in your environment, it's high time to redesign your network and try to introduce some order into the spaghetti mess that it must be.

  4. Hi,

    My basic doubt from the draft is that the STT frame in turn contains a TCP-like header. Who fills in that header if the NIC is going to fragment the data? Does the NIC have the intelligence to fill in the required fields in the STT frame header (the metadata as well as the TCP-like header)?


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.