Repost: L2 Is Bad
Roman Pomazanov documented his thoughts on the beauties of large layer-2 domains in a LinkedIn article and allowed me to repost it on the ipSpace.net blog to ensure it doesn’t disappear.
First of all, “L2 is a single failure domain”: a problem at one point can easily spread to the entire datacenter.
The most common problems are broadcast storms and malformed broadcast frames. As a representative example of what they can lead to, recall the 2018 incident on the network of CenturyLink, a large American provider, when a problem in a single L2 domain disrupted 911 service in several states. More details can be found in an FCC document.
A single L2 domain makes scaling difficult: the larger the datacenter, the more broadcast traffic on the network.
Yes, there are storm-control mechanisms, but they drop everything indiscriminately; they have no way of telling legitimate traffic from the garbage. ARP requests stop working, and so on.
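As an illustration of how blunt that instrument is: a typical storm-control knob is nothing more than a per-port bandwidth threshold, and once the broadcast rate exceeds it, everything above the threshold is dropped, legitimate ARP included. A minimal sketch in Cisco-style switch syntax (interface name and threshold are hypothetical):

```
! hypothetical storm-control settings on a ToR access port
interface Ethernet1/10
 ! drop broadcast traffic above 0.5% of port bandwidth; the switch cannot
 ! tell a broadcast storm from a burst of legitimate ARP requests
 storm-control broadcast level 0.50
 ! optionally err-disable the port instead of silently dropping traffic
 storm-control action shutdown
```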
On top of that, “problems” on L2 are hard to troubleshoot, and once something has happened, it’s usually too late.
Another disadvantage is the lack of any spoofing protection: a MAC address can be “accidentally” set to one that is already used somewhere else, producing MAC address flapping and, as a consequence, causing modern datacenter switches to disable MAC learning in the affected VLAN.
Is Using EVPN/VXLAN the Solution?
EVPN certainly reduces the amount of broadcast traffic and has some sort of loop protection mechanism, but nevertheless L2 remains L2. In addition:
- EVPN will not protect against malformed broadcast packets;
- Implementing VXLAN/EVPN in a network operating system is much more complex than implementing plain routing. The consequence is a larger code base, and with it a larger potential number of bugs;
- When dealing with “new” network hardware vendors, it is not clear in advance what operational difficulties we will encounter;
- Even though EVPN is an open standard, all the leading vendors shy away from multi-vendor interoperability and recommend staying within the framework of a single vendor: it is not clear whom to blame when the problem is “at the junction” between two implementations. Because of this, there is lock-in to a single vendor within a single site. I’d love to hear stories about inter-vendor EVPN/VXLAN control planes ;)
- VXLAN is still “insecure”: segmenting what’s inside it is impossible at the switch level because we “can’t see” into the tunnel. We have to strip the VXLAN headers “somewhere” and filter the traffic there (if necessary).
- Ethernet wrapped in VXLAN, wrapped in UDP, wrapped in IP, wrapped in another Ethernet header is complicated to troubleshoot (see the capture sketch below). CRC errors in the original frame “fly” through the fabric, are not dropped anywhere, and reach the destination unchanged (assuming, of course, that the fabric uses cut-through switching).
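To make the last point concrete, this is roughly what a VXLAN-encapsulated packet carries on a fabric link and how you would capture it for troubleshooting; apart from the standard VXLAN UDP port 4789, the interface name and everything else here is hypothetical:

```
# Header stack of a VXLAN-encapsulated frame, outermost first:
#   outer Ethernet -> outer IP (VTEP to VTEP) -> UDP (dst port 4789)
#     -> VXLAN header (VNI) -> original Ethernet frame -> original IP packet -> payload
#
# Capturing it on a fabric uplink means decoding all of the above by hand:
tcpdump -nei uplink1 'udp port 4789'
```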
What Else Is on Offer?
Use pure L3 routing. No overlay in the fabric. All overlays should be inside the servers, in SDNs.
By needing only routing, we reduce the “complexity” of the network, and with it the number of potential problems. Fewer features means fewer places where problems can occur.
It is much easier to make devices from different vendors interoperate using plain routing than EVPN, and we are not tied to a single vendor.
The ideal design is L3 all the way to the servers where the services live. It covers fault tolerance, bandwidth growth, and scalability: the service (e.g., a web site) sits on a dummy/loopback interface, and its IP address is announced to the physical network with a dynamic routing protocol (BGP being the de-facto standard) over every uplink, with the announcements propagating up the fabric.
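Here is a minimal sketch of that design, assuming a Linux server running FRR; all interface names, addresses, and AS numbers are made up for illustration:

```
# --- on the server: put the service address on a dummy interface ---
ip link add name dummy0 type dummy
ip addr add 192.0.2.10/32 dev dummy0
ip link set dummy0 up

# --- /etc/frr/frr.conf: announce that address to both ToR switches over eBGP ---
router bgp 65101
 ! no import/export policy shown, to keep the sketch short
 no bgp ebgp-requires-policy
 ! one eBGP session per uplink (ToR-A and ToR-B)
 neighbor 10.1.1.1 remote-as 65001
 neighbor 10.1.2.1 remote-as 65001
 address-family ipv4 unicast
  ! advertise the dummy-interface prefix into BGP
  network 192.0.2.10/32
 exit-address-family
```

If an uplink fails, the corresponding BGP session goes down and the fabric simply stops using that path: no MLAG, no spanning tree, no L2 anywhere.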
Unfortunately it’s (still) impossible to implement such a simple design with network virtualization software from a major enterprise vendor.
Thanks for the re-post, I would not have seen this otherwise. :-)
Regarding "EVPN […] has some sort of loop protection mechanism", EVPN has loop prevention inside the EVPN network, but a loop outside the EVPN part of network can still affect the whole EVPN network.
Regarding the complexity of EVPN implementations, I have seen a bug inside a vendor implementation create a loop inside the EVPN network, with the usual results.
All in all I concur: large L2 domains are bad.
So I would argue that EVPN/VXLAN is another proof for RFC 1925 rule 6a.
This post echoes my viewpoint as well; you're probably well aware of it by now. Of course it won't happen, due to the massive conflict of interest between vendors and customers.
"Fewer features means fewer places where problems can occur."
This is a cold, hard fact, not even an opinion. The more parts (physical or otherwise), the bigger the probability of something failing at some point, let alone a failure that results from interaction of features, which is exponentially harder to troubleshoot.
We've been running EVPN with MLAG for some 2 years now. Already there are cryptic errors happening that no one can explain. In one case, it manifested as printers being unable to email big scan jobs (small ones went through fine). We tried lots of things, from the network to the Exchange Hub Transport and mail servers. Some 8 months later, a colleague accidentally found out there seemed to be some problem with one of the MLAG links. Rebooting the switch with that MLAG link fixed the problem. Still, no one knows why it manifested in that manner. And this is just one example. Feature creep is just bad. Stick to the basics. KISS.
Good post. Didn't we go through this "L2 is bad, so let's try to evolve (kludge) it" phase with the fabric wars of 2009-13 (FabricPath, TRILL, etc.) and then our SDN/OpenFlow pipeline dreams of 2013-15?
My cynical (you know me ;) take:
"Use pure L3 routing. No overlay in the fabric . All overlay should be inside the servers - in SDNs." Would running overlays (that emulate L2) inside virtualization hosts not have the same issues as running the overlay in the fabric? Mainly referring to NSX here (although not specified by the author), and despite the fact that you cannot use routing between TOR and NSX TEP on ESX host.
Not if you do them right (= loopback interface advertised on all uplinks with a routing protocol). Of course that's not how VMware does stuff; see https://blog.ipspace.net/2020/02/do-we-need-complex-data-center-switches.html for details.
Overlays inside virtualization hosts do not need to emulate transparent bridging (see, e.g., Azure).
The common interface between servers and the network is IP over Ethernet, thus an overlay implemented in the network for the usual virtualization use cases¹ emulates transparent bridging. This also emulates all the problems of transparent bridging and creates a single failure domain.
¹ e.g., moving a VM to a different host without changing IP address(es) of the VM
I was indeed mainly referring to the 'single failure domain' aspect, and indeed looking at the enterprise world, where VMware NSX is the dominant player in the 'overlay in the virtualization layer' space.
You are indeed correct that this aspect is avoided in an environment such as Azure, where BUM traffic is eliminated and proxy ARP is used.