VXLAN runs over UDP – does it matter?

Scott Lowe asked a very good question in his Technology Short Take #20:

VXLAN uses UDP for its encapsulation. What about dropped packets, lack of sequencing, etc., that is possible with UDP? What impact is that going to have on the “inner protocol” that’s wrapped inside the VXLAN UDP packets? Or is this not an issue in modern networks any longer?

Short answer: No problem.

Somewhat longer one: VXLAN emulates an Ethernet broadcast domain, which is not reliable anyway. Any layer-2 device (usually called a switch, although bridge would be more accurate) can drop frames due to buffer overflows or other forwarding problems, and frames can become corrupted in transit (although drops in switches are far more common than corruption in modern networks).

UDP packet reordering is usually not a problem – packet/frame reordering is a well-known challenge and all forwarding devices take care not to reorder packets within a layer-4 (TCP or UDP) session. The only way to introduce packet reordering is to configure per-packet load balancing somewhere in the path (hint: don’t do that).
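To illustrate the difference between per-flow and per-packet load balancing, here is a minimal Python sketch. The hash function, the number of links and the addresses are assumptions made for the example; real forwarding devices use their own hardware hash, not SHA-256.

```python
import hashlib
from itertools import count

LINKS = 4  # hypothetical number of equal-cost links

def per_flow_link(src_ip, dst_ip, proto, src_port, dst_port, links=LINKS):
    """Pick a link from a hash of the 5-tuple: same flow, same link."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % links

def make_per_packet_picker(links=LINKS):
    """Return a round-robin picker: consecutive packets take different links."""
    counter = count()
    return lambda: next(counter) % links

# Hypothetical flow: one UDP session between two tunnel endpoints.
flow = ("10.0.0.1", "10.0.0.2", "udp", 49152, 4789)

# Eight packets of the same flow always hash to the same link,
# so they cannot overtake each other on parallel paths.
flow_links = {per_flow_link(*flow) for _ in range(8)}

# Eight packets sprayed round-robin end up on every link,
# which is what opens the door to reordering.
pick = make_per_packet_picker()
packet_links = {pick() for _ in range(8)}

print("links used per flow:  ", sorted(flow_links))
print("links used per packet:", sorted(packet_links))
```

The per-flow picker keeps ordering intact at the cost of less even link utilization; the per-packet picker does the opposite.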

Brocade uses very clever tricks to retain proper order of packets while doing per-packet load balancing across intra-fabric links.

Using UDP to transport Ethernet frames thus doesn’t break the expected behavior. Things might get hairy if you extended VXLAN across truly unreliable links with high error rates, but even then VXLAN-over-UDP wouldn’t perform any worse than other L2 extensions (for example, VPLS or OTV) or any other tunneling technique. None of them uses a reliable transport mechanism.
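A look at the encapsulation header makes the same point. The sketch below packs the 8-byte VXLAN header as laid out in the VXLAN specification (RFC 7348); the layout comes from that specification, not from this post, and the helper names and the VNI value are made up for the example. Note what is missing: there is no sequence number and no acknowledgement field, so recovery from loss is left to whatever runs inside the tunnel.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid value

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the VXLAN header; the result becomes the UDP payload."""
    return vxlan_header(vni) + inner_frame

hdr = vxlan_header(5000)          # 5000 is a hypothetical VNI
print(hdr.hex())
print(len(encapsulate(5000, b"\x00" * 64)))
```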


Getting academic: Running TCP over TCP (which is what you would end up with if you ran VXLAN over TCP) is a really bad idea. This paper describes some of the nitty-gritty details, or you could just google for TCP-over-TCP.

Some history: The last protocol stacks that had reliable layer-2 transport were SNA and X.25. SDLC or LAPB (for WAN links) and LLC2 (for LAN connections) were truly reliable – LLC2 peers acknowledged every L2 packet ... but even LLC2 was running across Token Ring or Ethernet bridges that were never truly reliable. We used reliable SNA-over-TCP/IP WAN transport (RSRB and later DLSw+) simply because the higher error rates experienced on WAN links (transmission errors and packet drops) caused LLC2 performance problems if we used plain source-route bridging.

And finally, a storage digression: Some people think Fibre Channel (FC) offers reliable transport. It doesn’t ... it just tries to minimize packet loss by over-provisioning every device in the path because its primary application (SCSI) lacks fast retransmission/recovery mechanisms. We use FCIP (FC-over-TCP) on WAN links to reduce the packet drop rate, not to retain end-to-end reliable transport.

Does it all matter?

Still not sure whether you should care about VXLAN? These blog posts might help you:

You’ll find more details in my webinars: Introduction to Virtual Networking and Cloud Computing Networking Under the Hood. You can buy their recordings individually or get them as part of the yearly subscription.

6 comments:

  1. Technically, the Fibre Channel standard defines multiple classes of service to be able to offer different types of transport. Class 1 offers a connection-oriented transport with frame acknowledgement, offering a completely reliable transport. But in the real world nobody implemented such classes of service, and the one used for storage 99.999% of the time is class 3, which is unreliable.

    Also, FC does not minimize packet loss by overprovisioning the network. It just uses flow control, it's that simple.
  2. In my simplistic understanding of FC, Class 1 offers reliable transport because frames are acknowledged by the final receiver, very similarly to what TCP does in IP world. Am I wrong?

    As for "over provisioning", you do need large buffers if you want to have reasonable performance with high-speed flow-controlled links, don't you?
  3. I think Infiniband gets pretty good utilization with small buffers thanks to credit-based flow control. Incast can cause congestion trees, though.
  4. Completely agree. From the end application point of view, VXLAN and transport over UDP are completely transparent. The applications will see only the internal packet (TCP or not) and expect the reliability associated with the internal protocol.
  5. You only need large amounts of buffers for long-distance links to keep the link utilized. For local links you really don't need that many. For end devices you typically allocate 16 buffers. In FC all buffers have the same size (a full frame or 2148 bytes) and you use one up whether you transmit a full-size frame or a smaller frame. Is that really that much? That, in my book, is not overprovisioning.

    Especially if you say "overprovisioning" with the same meaning as Greg Ferro seems to like so much: FC switch vendors (and the whole storage industry) are greedy and have been stealing your money over the past 15 years just because they like to, to the point of calling users "idiots" for buying into such "bullshit". And since you linked to his article where he expresses just this position, I thought you might endorse that position.

    FC switches have the memory resources they need to provide the reliability they need to provide for the criticality of the applications that run on them. They are not overprovisioned so that FC switch vendors can be rich.
  6. BTW you are correct about Class 1. And not only that. Class 1 is a connection-oriented class of service so you create, maintain and dedicate a path through the fabric for a particular data flow, and all bandwidth along that path is dedicated for that flow. Class 4 is also connection-oriented but allows for a fraction of the bandwidth to be reserved. Again, neither class 1 nor class 4 has ever been implemented by any FC switch vendor.

    Class 2 is a connectionless class of service but provides reliability by acknowledging frame delivery. It also supports end-to-end flow control by use of end-to-end credits (EE_credits) on top of the buffer-to-buffer flow control (BB_credits) on every hop. Although most (probably all) FC switch vendors have always implemented this class of service, very few end devices have ever used it. Some HBAs supported it but there were basically no storage devices that did. So if the two devices didn't support class 2, they reverted back to class 3 which ended up being the only class of service ever actually used in practice.

    Some more background: http://intranet.tataelxsi.co.in/Training_Web/Articles/SSG_Articles/Fibre_Channe_Services.PDF
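
The buffer-sizing argument in the comments above (few buffers for local links, many for long-distance links) can be checked with a rough calculation. This is a back-of-the-envelope sketch, not a vendor sizing formula: it assumes roughly 5 microseconds of propagation delay per kilometer of fiber, a hypothetical 800 MB/s payload rate, full-size 2148-byte frames as quoted in the comments, and it ignores serialization and switch latency.

```python
import math

FRAME_BYTES = 2148            # full FC frame size quoted in the comments
FIBER_DELAY_S_PER_KM = 5e-6   # assumption: ~5 microseconds per km of fiber

def credits_needed(distance_km, payload_rate_bytes_per_s,
                   frame_bytes=FRAME_BYTES):
    """Estimate buffer credits needed to keep a link busy.

    A sender can only transmit while it holds credits, and each credit
    comes back one round trip after the frame was sent, so the sender
    needs enough credits to cover one round trip worth of frames.
    """
    round_trip_s = 2 * distance_km * FIBER_DELAY_S_PER_KM
    in_flight_bytes = round_trip_s * payload_rate_bytes_per_s
    return max(1, math.ceil(in_flight_bytes / frame_bytes))

RATE = 800e6  # hypothetical payload rate in bytes per second

local = credits_needed(0.1, RATE)      # 100 m link inside a data center
long_haul = credits_needed(100, RATE)  # 100 km inter-site link

print("local link credits:    ", local)
print("long-haul link credits:", long_haul)
```

Under these assumptions the local link fits comfortably within the 16 buffers mentioned above, while the 100 km link needs a few hundred.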