The Never-Ending "My Overlay Is Better Than Yours" Saga

I published a blog post describing how complex the underlay supporting VMware NSX still has to be (because someone keeps pretending a network is just a thick yellow cable), and the tweet announcing it admittedly looked like clickbait.

[Blog] Do We Need Complex Data Center Switches for VMware NSX Underlay

Martin Casado quickly replied NO (probably before reading the whole article), starting a whole barrage of overlay-focused neteng-versus-devs fun.

While I loved the concept Nicira worked on, the execution wasn’t exactly stellar, and once it got merged with the traditional VMware approach to networking we got what I described. The end result makes me infinitely sad every time I think about its potential… but then that water has long reached the ocean.

The best response to Martin’s claim was made by Mat Jovanovic:

Depends… Are we looking at a PPT, or a “I’ve tried it on a commodity underlay” version of the answer? Something tells me it’s quite different…

In the meantime, the debate veered into “my overlay is better than your overlay”, starting with Martin's claim that:

Good news for you – there are many fast growing overlay solutions that are adopted by apps and security teams and bypass the networking teams altogether.

Martin furthermore pointed to Nebula as one of his favorites. I took a quick look at their GitHub repo, and it looks like they did things the right way and built a badly needed session layer.

However, being sick and tired of everyone claiming how great it is to build overlays on top of overlays (as if we hadn’t learned anything in the decades of building GRE and IPsec tunnels), I decided to troll a bit more:

We had a fast-growing overlay solution in 1970s. It was called TCP. I’ve heard it might still be used. Why do people insist of heaping layers upon layers instead of writing decent code?

Martin's response was almost as expected:

App developers : “I’ve created this amazing overlay solution that solves a bunch of our problems”

Networking : “TCP has been around since the 70’s, write better code”

… this is why you’re not being invited to the party ;)

Someone must have had some traumatic experiences... Anyhow, as you probably know I’m well-aware of the popularity of pointing out the state of Emperor’s wardrobe (or lack thereof), and I’m way too old for FOMO, so I don’t care what parties I get invited to.

However, what makes me truly sad is watching highly intelligent people who are ignorant of environmental limitations (see also: fallacies of distributed computing and RFC 1925 rule 4) reinvent the wheel, spend years figuring it out while repeating the mistakes we made in the past, and end up with what we already had in a different disguise (see also RFC 1925 rule 11).

For example:

Networking engineers should know better, but even they can’t resist the lure of reinventing broken wheels, for example overlays with cache-based forwarding like LISP. Unsurprisingly, such solutions quickly run into the endpoint liveness problem (and a few others).


  • I’m guessing LISP is not yet widespread enough to encounter severe cache thrashing, the kind of behavior that still triggers PTSD in anyone who was even remotely involved back when Fast Switching crashed the Internet. That rerun might be fun to watch…
  • Of course I probably messed up at least some of these examples, so please feel free to correct me in the comments.
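
To make the endpoint liveness complaint a bit more concrete, here is a minimal Python sketch of cache-based (pull) forwarding. It is a toy model, not LISP: the class names, the `ttl` parameter, and the `leaf-A`/`leaf-B` locators are all made up for illustration, and time is passed in by hand so the run is deterministic. The point it shows is that an ingress node that caches a mapping keeps sending traffic to the old location after an endpoint moves, until the entry expires or something else invalidates it.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    locator: str       # where the ingress believes the endpoint lives
    expires_at: float  # time after which the entry must be re-resolved


class MappingSystem:
    """Authoritative endpoint-to-locator database (hypothetical stand-in)."""

    def __init__(self):
        self.locations = {}

    def register(self, endpoint, locator):
        self.locations[endpoint] = locator

    def resolve(self, endpoint):
        return self.locations.get(endpoint)


class IngressNode:
    """Tunnel ingress that pulls mappings on demand and caches them."""

    def __init__(self, mapping_system, ttl):
        self.mapping_system = mapping_system
        self.ttl = ttl
        self.cache = {}

    def next_hop(self, endpoint, now):
        entry = self.cache.get(endpoint)
        # Only a missing or expired entry triggers a new lookup.
        if entry is None or now >= entry.expires_at:
            locator = self.mapping_system.resolve(endpoint)
            if locator is None:
                return None
            entry = CacheEntry(locator, now + self.ttl)
            self.cache[endpoint] = entry
        return entry.locator


def simulate():
    ms = MappingSystem()
    ms.register("vm1", "leaf-A")
    ingress = IngressNode(ms, ttl=60.0)

    results = [ingress.next_hop("vm1", now=0.0)]        # cache miss, resolved
    ms.register("vm1", "leaf-B")                        # endpoint moves
    results.append(ingress.next_hop("vm1", now=20.0))   # stale cache hit
    results.append(ingress.next_hop("vm1", now=61.0))   # expired, re-resolved
    return results


if __name__ == "__main__":
    print(simulate())
```

The second lookup is the problem case: the mapping system already knows the new location, but the ingress node has no reason to ask. Shorter TTLs, probing, or pushed invalidations narrow that window at the cost of more control-plane load, which is the tradeoff the cache-based designs keep rediscovering.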

Now here’s a crazy idea: what if we started communicating with people who understand how stuff works, learned from them, and implemented stuff in an optimal way? IT seems to be one of the few areas where we allow people to build sandcastles and ignore the tides, and then blame someone else when the water inevitably arrives.


  1. Thank you, Ivan, for this enormous effort of discussion over the years! I perceive it as a good step in technology development. :)

    We can't resist broken wheels because in that sense every protocol and solution is broken. Is LISP a cache-based mechanism? Yes, but the control-plane or controls in the app can tame it. Ok, let's use something else. OTV? The end to end loop can kill it. Just a few supported HW causes still too large xSTP domains. VXLAN? Without the control-plane, there is a flood and learn behaviour, limited multihoming. EVPN? The best if not spanning L2. Otherwise, detection of L2 data-plane loops can also cause a headache. Don't use overlays and back to spanning-tree? Still, with these limitations, I would prefer overlays for several reasons.

    Generation by generation, the next broken wheels give ubiquitous applications more options. It doesn't mean that networking is simpler. On the contrary, it is way more complicated! Today we can have three nested VXLAN overlays at the same time, like VM-based K8s Flannel over host-based NSX VXLAN over access-switch EVPN VXLAN. Why? Because three different departments can take care of their parts of the infrastructure. Is it good or bad? It depends. As always, there are tradeoffs. In my opinion, we as network architects should proactively reach out to the business, application, and developer levels to ask about their needs, to get in sync with them, to educate each other, and to try to work out a subjectively right solution. We don't need to be afraid of broken solutions. We should be afraid of living in silos. That is when time gets wasted, as developers reinvent VLANs with the same broken story we already had.
  2. Ivan, this is the quote of the year!!! "IT seems to be one of the few areas where we allow people to build sandcastles and ignore the tides, and then blame someone else when the water inevitable arrives." And it rhymes.....
  3. If only we could finally start solving problems where they really exist (poor app/system design?) instead of building tons of workarounds...
    1. Won't happen anytime soon. Whole economies are making tons of money off customer gullibility and workarounds.
    2. At least try to separate security segregation from network designs. All of a sudden, half (or more?) of the needs to build an overlay fall off the requirements table.
  4. With a blank sheet of paper, of course, the correct answer is that a work queue in an application shard is given an identity when it is created, and those application shards allowed to send to that work queue are given a capability (in the capabilities architecture of the 1970s sense) to that queue. Part of instantiating a particular application shard in a particular place is updating whatever forwarding tables are necessary.

    The concept of overlay then becomes simply one of application naming. The concept of underlay becomes one of moving payload from authorized sender-applications to application-receive-queues. And the concept of containers, VMs, OS instances, hypervisors, or anything else having any role in the data plane becomes an implementation choice inferior to simply giving the application direct access.

    Who will get this to critical mass and in what decade? Who knows? But the status quo is broken, and we all know it.
  5. That Martin is now at a VC, looking for "powerful, disruptive ideas often were once bad ideas", more than explains his reaction ...
  6. LISP is incomplete and imperfect, and it cannot work wonders. However, its latest variant is the only hybrid push/pull routing protocol that is available commercially on a large scale and also in reasonable open source implementations. Without enough flexibility you cannot build a good generic solution. Exclusively push or pull architectures have very strong limitations.
    There are some outstanding issues with LISP, such as selective subscription and cache management by external applications (so you could fit into the TCAM resources even when you have policy based routing).
    If you need "seamless" mobility, LISP is still your best friend. We could achieve below 5 ms vertical handovers and failover switching in simple scenarios. But you really have to understand what is going on under the hood for this. Large scale testing is still an open issue.
    Global mobility with BGP has no chance to come to that performance level. Even with significant money incentives, the best BGP experts could not fine tune to that handover or failover speed. A few hundred ms was their best result.
    You always have to make compromises, but if you do not have the toolsets to express your policies, then you will have no success. In LISP at least we have some chance to further improve.
    Without active probing you cannot have reasonable assurance that you will be able to forward your packets. But even probe results expire, so by the time you are actually sending, the situation might already be different. You cannot avoid blackholing perfectly...
    In most cases, we have excess bandwidth, so critical services could use simulcast on independent paths. This is how radar surveillance and air traffic control voice work. But classical best stream selection has its limits, so we have to move to packet-by-packet deduplication or combination (network coding, etc.). It is not without challenges, but it could give a reasonable improvement, since the probability of actually receiving your packet increases significantly. It has a price in resource consumption, but in some cases you are willing to pay for that...
