We have school holidays this week, so I’m reposting wonderful comments that would otherwise be lost somewhere in the page margins. Today: Minh Ha on complexity of emulating layer-2 networks with VXLAN and EVPN.
Dmytro Shypovalov is a master networker who has a sophisticated grasp of some of the most advanced topics in networking. He doesn’t write often, but when he does, he writes exceptional content, both deep and broad. Have to say I agree with him 300% on “If an L2 network doesn’t scale, design a proper L3 network. But if people want to step on rakes, why discourage them.”
Decades ago there was a trick question on the CCIE exam exploring the intricate relationships between MAC and ARP table. I always understood the explanation for about 10 minutes and then I was back to I knew why that’s true, but now I lost it.
Fast forward 20 years, and we’re still seeing the same challenges, this time in EVPN networks using in-subnet proxy ARP. For more details, read the excellent ARP problems in EVPN article by Dmytro Shypovalov (I understood the problem after reading the article, and now it’s all a blur 🤷♂️).
One of ipSpace.net subscribers sent me this question after watching the EVPN Technical Deep Dive webinar:
Do you have a writeup that compares and contrasts the hardware resource utilization when one uses flood-and-learn or BGP EVPN in a leaf-and-spine network?
I don’t… so let’s fix that omission. In this blog post we’ll focus on pure layer-2 forwarding (aka bridging), a follow-up blog post will describe the implications of adding EVPN IP functionality.
Namex, an Italian IXP, decided to replace their existing peering fabric with a fully automated leaf-and-spine fabric using VXLAN and EVPN running on Cumulus Linux.
They documented the design, deployment process, and automation scripts they developed in an extensive blog post that’s well worth reading. Enjoy ;)
One of my readers sent me this question (probably after stumbling upon a remark I made in the AWS Networking webinar):
You had mentioned that AWS is probably not using EVPN for their overlay control-plane because it doesn’t work for their scale. Can you elaborate please? I’m going through an EVPN PoC and curious to learn more.
It’s safe to assume AWS uses some sort of overlay virtual networking (like every other sane large-scale cloud provider). We don’t know any details; AWS never felt the need to use conferences as recruitment drives, and what little they told us at re:Invent described the system mostly from the customer perspective.
I claimed that “EVPN is the control plane for layer-2 and layer-3 VPNs” in the Using VXLAN and EVPN to Build Active-Active Data Centers interview a long long while ago and got this response from one of the readers:
To me, that doesn’t compute. For layer-3 VPNs I couldn’t care less about EVPN, they have their own control planes.
Apart from EVPN, there’s a single standardized scalable control plane for layer-3 VPNs: BGP VPNv4 address family using MPLS labels. Maybe EVPN could be a better solution (opinions differ, see EVPN Technical Deep Dive webinar for more details).
Not only that - his blog post includes detailed setup instructions, and the corresponding GitHub repository contains all the source code you need to get it up and running.
I got this question about the use of AS numbers on data center leaf switches participating in an MLAG cluster:
In the Leaf-and-Spine Fabric Architectures you made the recommendation to have the same AS number on all members of an MLAG cluster and run iBGP between them. In the Autonomous Systems and AS Numbers article you discuss the option of having different AS number per leaf. Which one should I use… and do I still need the EBGP peering between the leaf pair?
As always, there’s a bit of a gap between theory and practice ;), but let’s start with a leaf-and-spine fabric diagram illustrating both concepts:
One of the attendees of our Building Next-Generation Data Center online course submitted a picture-perfect solution to scalable layer-2 fabric design challenge:
- VXLAN/EVPN based data center fabric;
- IGP within the fabric;
- EBGP with the WAN edge routers because they’re run by a totally different team and they want to have a policy enforcement point between the two;
- EVPN over IBGP within the fabric;
- EVPN over EBGP between the fabric and WAN edge routers.
The only seemingly weird decision he made: he decided to run the EVPN EBGP session between loopback interfaces of core switches (used as BGP route reflectors) and WAN edge routers.
I’d like to bring back EVPN context. The discussion is more nuanced, the common non-arguable logic here - reachability != functionality.
While running the Using VXLAN And EVPN To Build Active-Active Data Centers workshop in early December 2019 I got the usual set of questions about using BGP as the underlay routing protocol in EVPN fabrics, and the various convoluted designs like IBGP-over-EBGP or EBGP-between-loopbacks over directly-connected-EBGP that some vendors love so much.
I got a question along the same lines from one of the readers of my latest EPVN rant who described how convoluted it is to implement the design he’d like to use with the gear he has (I won’t name any vendor because hazardous chemical substances get mentioned when I do).
Aldrin wrote a well-thought-out comment to my EVPN Dilemma blog post explaining why he thinks it makes sense to use Juniper’s IBGP (EVPN) over EBGP (underlay) design. The only problem I have is that I forcefully disagree with many of his assumptions.
He started with an in-depth explanation of why EBGP over directly-connected interfaces makes little sense:
Another EVPN reader question, this time focusing on auto-RD functionality and how it works with duplicate MAC addresses:
If set to Auto, RD generated will be different for the same VNI across the EVPN switches. If the same route (MAC and/or IP) is present under different leaves of the same L2VNI, since the RD is different there is no best path selection and both will be considered. It’s a misconfiguration and shouldn’t be allowed. How will the BGP deal with this?
Got this interesting question from one of my readers:
BGP EVPN message carries both VNI and RT. In importing the route, is it enough either to have VNI ID or RT to import to the respective VRF?. When importing routes in a VRF, which is considered first, RT or the VNI ID?
A bit of terminology first (which you’d be very familiar with if you ever had to study how MPLS/VPN works):
Got an interesting set of questions from a networking engineer who got stuck with the infamous “let’s push the **** down the stack” challenge:
So I am a rather green network engineer trying to solve the typical layer two stretch problem.
I could start the usual “friends don’t let friends stretch layer-2” or “your business doesn’t really need that” windmill fight, but let’s focus on how the vendors are trying to sell him the “perfect” solution: