VXLAN and OTV: I’ve been suckered

When VXLAN came out a year ago, a lot of us looked at the packet format and wondered why Cisco and VMware decided to use UDP instead of the more commonly used GRE. One explanation was evident: UDP port numbers give you more entropy that you can use in 5-tuple-based load balancing. The other explanation looked even more promising: VXLAN and OTV use a very similar packet format, so the hardware already doing OTV encapsulation (Nexus 7000) could be used to do VXLAN termination. Boy, have we been suckered.
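To make the entropy argument concrete: between two tunnel endpoints the outer IP addresses, protocol and destination port never change, so the source port is the only 5-tuple field left to vary per flow. Here is a minimal sketch of how an encapsulating device could derive it. The function name, the CRC32 hash and the 49152-65535 port range are my assumptions for illustration, not something taken from either draft quoted here or from any shipping implementation.

```python
import zlib


def entropy_source_port(inner_frame: bytes, low: int = 49152, high: int = 65535) -> int:
    """Pick an outer UDP source port from a hash of the inner Ethernet header.

    Hypothetical helper: CRC32 and the port range are illustrative choices.
    """
    # The first 14 bytes of an untagged Ethernet frame hold the destination MAC,
    # the source MAC and the EtherType. Frames between the same pair of hosts
    # hash to the same port, so a flow stays on one path.
    digest = zlib.crc32(inner_frame[:14])
    return low + digest % (high - low + 1)


# Two frames between the same hosts map to the same outer source port.
frame = bytes.fromhex("001122334455" "66778899aabb" "0800") + b"first payload"
print(entropy_source_port(frame))
```

With GRE there is no port field at all, so a transit router hashing on the 5-tuple sees every frame between two edge devices as one and the same flow.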

It turns out nobody took the time to analyze an OTV packet trace with Wireshark; everyone believed whatever the IETF drafts were telling us. Here’s the packet format from draft-hasmit-otv-03:

 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live | Protocol = 17 |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Source-site OTV Edge Device IP Address             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Destination-site OTV Edge Device (or multicast) Address    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Source Port = xxxx       |       Dest Port = 8472        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          UDP length           |       UDP Checksum = 0        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                  Overlay ID                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Instance ID                  |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Frame in Ethernet or 802.1Q Format               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

And here’s the packet format from draft-mahalingam-dutt-dcops-vxlan. Apart from a different UDP port number, the two match perfectly.

Outer IP Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Outer Source Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Outer Destination Address                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Outer UDP Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Source Port = xxxx       |    Dest Port = VXLAN Port     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          UDP Length           |         UDP Checksum          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

VXLAN Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|R|R|I|R|R|R|                   Reserved                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        VXLAN Network Identifier (VNI)         |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Inner Ethernet Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Inner Destination MAC Address                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Inner Destination MAC Address |   Inner Source MAC Address    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Inner Source MAC Address                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Optional Ethertype = C-Tag   |  Inner.VLAN Tag Information   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Payload:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ethertype of Original Payload |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                   Original Ethernet Payload                   |
|                                                               |
| (Note that the original Ethernet Frame's FCS is not included) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
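To see how closely the two 8-byte overlay headers line up, here is a rough sketch that packs and unpacks them with the same code. It follows only the two diagrams quoted above: one flags octet with the I bit set, a 24-bit field (Overlay ID in the OTV draft, Reserved in VXLAN), a 24-bit Instance ID or VNI, and a final reserved octet. The function names are mine and purely illustrative, and this has not been checked against real traffic.

```python
import struct

I_FLAG = 0x08  # "I" is the fifth bit of the R|R|R|R|I|R|R|R flags octet


def pack_overlay_header(instance_id: int, overlay_id: int = 0) -> bytes:
    """Build the 8-byte header drawn in both drafts (hypothetical helper).

    instance_id is the OTV Instance ID or the VXLAN VNI; overlay_id stays 0
    for VXLAN, where those 24 bits are reserved.
    """
    for value in (instance_id, overlay_id):
        if not 0 <= value < 1 << 24:
            raise ValueError("field must fit in 24 bits")
    word1 = (I_FLAG << 24) | overlay_id   # flags octet + 24-bit field
    word2 = instance_id << 8              # 24-bit ID + reserved octet
    return struct.pack("!II", word1, word2)


def unpack_overlay_header(header: bytes) -> tuple[int, int, int]:
    """Return (flags, overlay_id, instance_id) from the first 8 bytes."""
    word1, word2 = struct.unpack("!II", header[:8])
    return word1 >> 24, word1 & 0xFFFFFF, word2 >> 8


print(pack_overlay_header(5000).hex())  # 0800000000138800
```

The same bytes parse as either header; in these diagrams only the UDP destination port in front of them says which protocol you are looking at.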

However, it turns out that the OTV draft four Cisco engineers published in 2011 has nothing to do with the actual implementation and encapsulation format used by the Nexus 7000. It seems Brian McGahan was the first one to actually capture and analyze OTV packets and publish his findings. He discovered that OTV is nothing other than the very familiar EoMPLSoGREoIP. No wonder the first VXLAN gateway device Cisco announced at Cisco Live is not the Nexus 7000 but a Nexus 1000V-based solution (at least that’s the way I understood this whitepaper).
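If you want to check which of the two encapsulations a capture contains, the outer IPv4 header is enough to tell them apart. The sketch below assumes the packet starts at the outer IPv4 header and relies on well-known values: IP protocol 17 for UDP, 47 for GRE, and GRE protocol type 0x8847 for MPLS unicast. Port 8472 comes from the draft quoted above. The function name and the returned labels are hypothetical, and I have not run this against an actual Nexus 7000 trace.

```python
import struct

IPPROTO_UDP = 17
IPPROTO_GRE = 47
OTV_DRAFT_UDP_PORT = 8472        # destination port shown in the OTV draft diagram
GRE_PROTO_MPLS_UNICAST = 0x8847  # EtherType carried in the GRE header for MPLS


def classify_overlay(packet: bytes) -> str:
    """Classify a packet that starts at the outer IPv4 header (illustrative)."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return "not IPv4"
    header_length = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    protocol = packet[9]
    payload = packet[header_length:]
    if len(payload) < 4:
        return "other"
    # Bytes 2-3 hold the UDP destination port or the GRE protocol type.
    (field,) = struct.unpack("!H", payload[2:4])
    if protocol == IPPROTO_UDP:
        return "UDP encapsulation (draft format)" if field == OTV_DRAFT_UDP_PORT else "other UDP"
    if protocol == IPPROTO_GRE:
        return "MPLS over GRE" if field == GRE_PROTO_MPLS_UNICAST else "other GRE"
    return "other"
```

A packet classified as "MPLS over GRE" would still need its label stack and inner Ethernet frame decoded before you could call it EoMPLSoGREoIP; this only separates the draft format from the GRE-based one.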

5 comments:

  1. ohh come on Ivan, this post is practically on its knees begging for reference to RFC1925 section 2.11 :)

  2. Piotr Jablonski, 29 August, 2012 22:06

    GRE as an encapsulation for OTV was known already. Example of links:
    http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9402/qa_c67-574969.html
    http://docwiki.cisco.com/wiki/Nexus_7000_-_OTV_-_Design_and_Configuration_Example
    Frankly, nothing new. ;)

  3. This is pretty eye-opening. Having GRE as the encapsulation is not well known. Yes, it's *lightly* documented, but Cisco has promoted OTV/VXLAN (and LISP) as having similar encapsulation formats (i.e. ..."VXLAN, OTV, and LISP frame formats share a similar-looking packet encapsulation structure..." - google it), leading those interested to believe that with a similar encapsulation structure, hardware-enabled encap/decap will be consistent across the Nexus product line once it has been integrated into the next module line (F3?). Will that be the case, or will we continue to need module 1 for OTV, module 2 for VXLAN and module 3 for LISP for scalable, hardware-based performance for cloud overlays?
    I think that's the reason that some may feel suckered, as do I. I give them the benefit of the doubt until the next round of modules come out and we see what's supported.

  4. Hey Ivan:

    So, you know, the path forward is not always as straightforward as we might like. :)

    In this case, the OTV header format proposed in draft-hasmit-otv-03 (http://tools.ietf.org/html/draft-hasmit-otv-03) is the original proposed OTV header and has clear benefits in terms of its ability to be handled by the transit network providing connectivity for an overlay. This header is the header we want to converge to for all overlay encapsulations moving forward, hence the bit-by-bit match observed with the VXLAN and LISP headers.

    However, in order for Cisco to deliver OTV in a timely manner, we released an implementation on the Nexus 7000 that used an alternate encapsulation format that could be supported by existing switching hardware. The work has been taking place at Cisco (and across the industry) to support the proposed UDP encapsulation and Cisco's newer lines of ASICs will support the UDP encapsulation, but in the intervening 2+ years customers have an option for a hardware accelerated solution they can work with.

    Since the goal of standards bodies is to achieve convergence and consensus, we elected to maintain a crisp forward-looking message and focus our IETF proposal on the UDP encapsulation. We feel the approach has paid off as the newer proposals such as VXLAN adopted the proposed header format.

    The use of an alternate encapsulation for the initial release of OTV has been openly socialized by Cisco since OTV was first released (and well ahead of the publication of the IETF draft) in forums such as Cisco Live, public Webinars and in docs like http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9402/qa_c67-574969.html .

    Our goal was for clarity and certainly not to imply that ASIC lines that precede the invention of OTV could actually support the proposed new UDP encapsulation scheme.

    Hope this helps.

    Regards,

    Omar Sultan (@omarsultan)
    Cisco

  5. And so there is no chance that since OTV lost their charter due to IPR issues that Cisco used VMware and VXLAN to try to sneak in the OTV header and try to get vendors to support OTV in hardware "by accident"?



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.