VXLAN-to-VXLAN Bridging in DCI Environments
Almost exactly a decade ago I wrote that VXLAN isn’t a data center interconnect technology. That’s still true, but you can make it a bit better with EVPN – at the very minimum you’ll get an ARP proxy and anycast gateway. Even this combo does not address the other requirements I listed a decade ago, but maybe I’m too demanding and good enough works well enough.
However, there is one other bit that was missing from most VXLAN implementations: LAN-to-WAN VXLAN-to-VXLAN bridging. Sounds weird? Supposedly a picture is worth a thousand words, so here we go.
Most VXLAN-with-EVPN implementations can handle a single unified bridging domain – an ingress VTEP sends traffic directly to an egress VTEP.
That works well in a data center environment but might result in two challenges when used over WAN links:
- You’re probably using ingress replication (assuming you’re not a great fan of enabling large-scale IP multicast), which means that every ingress ToR switch sends a separate copy of a flooded packet over the WAN link to every egress ToR switch in the remote data center. Not exactly what you’d like to see on your expensive WAN link, right?
- Switching ASICs support a limited number of VXLAN neighbors (usually 256) and a limited number of entries in the ingress replication list (usually 128). You might hit those limits when extending your VXLAN network across multiple sites [1].
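To put rough numbers on both bullets, here is a back-of-the-envelope sketch of a single flat bridging domain stretched across sites. The site sizes are made up for illustration, and the 128-entry limit is just the "usual" figure quoted above, not a value for any specific ASIC.

```python
# Back-of-the-envelope model of one flat VXLAN bridging domain across sites.
# All numbers are illustrative assumptions, not measurements.

REPLICATION_LIST_LIMIT = 128  # "usually 128" per the text; check your hardware


def flat_design(leafs_per_site, local_site=0):
    """Return (WAN copies per flooded packet, replication-list size)
    as seen by one ingress ToR switch in `local_site`."""
    total_vteps = sum(leafs_per_site)
    # With ingress replication, the ingress VTEP sends one copy
    # to every other VTEP in the bridging domain.
    replication_list = total_vteps - 1
    # Every copy destined for a VTEP in a remote site crosses the WAN.
    wan_copies = total_vteps - leafs_per_site[local_site]
    return wan_copies, replication_list


sites = [48, 48, 48]  # three sites with 48 ToR switches each (hypothetical)
wan_copies, replication_list = flat_design(sites)
print("copies over WAN per flooded packet:", wan_copies)
print("replication list entries:", replication_list)
print("exceeds assumed limit:", replication_list > REPLICATION_LIST_LIMIT)
```

With these assumed numbers, every flooded packet crosses the WAN 96 times, and the ingress switch needs 143 replication-list entries.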
Those challenges have a beautiful solution: VXLAN-to-VXLAN bridging between LAN and WAN bridging domains on the WAN edge switches:
- WAN edge switches act as the final VXLAN VTEPs for LAN and WAN peers. LAN peers do not need to care about VTEPs in remote sites. WAN peers do not need to care about local VTEPs.
- WAN edge switches receive a single copy of a flooded packet (from LAN or WAN side) and flood it further.
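A rough sketch of what that buys you, assuming one WAN edge switch per site (a simplification, real designs would use at least two) and hypothetical site sizes:

```python
# Simplified model of LAN/WAN bridging domains stitched on WAN edge switches.
# Assumption: exactly one WAN edge switch per site; site sizes are made up.


def stitched_design(leafs_per_site, local_site=0):
    """Return (VXLAN peers per leaf, VXLAN peers per WAN edge,
    WAN copies per flooded packet) for `local_site`."""
    local_leafs = leafs_per_site[local_site]
    remote_sites = len(leafs_per_site) - 1
    # A leaf sees only the other local leafs and the local WAN edge.
    leaf_peers = (local_leafs - 1) + 1
    # A WAN edge sees the local leafs and the WAN edges of the remote sites.
    edge_peers = local_leafs + remote_sites
    # The WAN edge receives one copy from the LAN side and sends
    # one copy to each remote WAN edge.
    wan_copies = remote_sites
    return leaf_peers, edge_peers, wan_copies


leaf_peers, edge_peers, wan_copies = stitched_design([48, 48, 48])
print("peers per leaf:", leaf_peers)
print("peers per WAN edge:", edge_peers)
print("copies over WAN per flooded packet:", wan_copies)
```

In this toy model a flooded packet crosses the WAN twice (once per remote site), and no device needs more than 50 VXLAN peers.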
For more details, watch the excellent Using VXLAN and EVPN in Multi-Pod and Multi-Site Fabrics presentation by Lukas Krattiger, or read the Multi-Domain EVPN VXLAN document on Arista’s web site (warning: regwall).
There’s just a tiny little problem – the switching ASIC on the WAN edge devices has to implement VXLAN-to-VXLAN bridging which includes:
- Split-horizon forwarding: whatever is received from LAN peers should not be sent back to LAN peers, and whatever is received from WAN peers should not be sent back to WAN peers.
- Split-horizon flooding: whatever is received from LAN peers must be flooded to WAN peers and vice versa.
- No cheating with VXLAN VNI – identification of LAN and WAN peers must be done based on source IP addresses, not based on different VNIs
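The flooding rule and the source-IP-based peer classification can be sketched in a few lines. This is a toy software model of the decision, not how any ASIC implements it; the peer addresses and the `flood_targets` helper are hypothetical.

```python
from ipaddress import ip_address

# Hypothetical VTEP addresses known to one WAN edge switch.
LAN_PEERS = {ip_address("10.0.1.1"), ip_address("10.0.1.2")}
WAN_PEERS = {ip_address("192.0.2.1"), ip_address("192.0.2.2")}


def flood_targets(source_vtep):
    """Return the remote VTEPs a flooded packet should be replicated to,
    based only on the outer source IP address (the VNI is not consulted)."""
    src = ip_address(source_vtep)
    if src in LAN_PEERS:
        # Received from the LAN side: flood to WAN peers only.
        targets = WAN_PEERS
    elif src in WAN_PEERS:
        # Received from the WAN side: flood to LAN peers only.
        targets = LAN_PEERS
    else:
        # Unknown source VTEP: do not flood to any VXLAN peer.
        targets = set()
    return [str(peer) for peer in sorted(targets)]


print(flood_targets("10.0.1.1"))
print(flood_targets("192.0.2.2"))
```

Note that both peer groups use the same VNI in this model; the only thing that tells them apart is the source address of the encapsulated packet.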
For years, it looked like the only ASIC capable of doing VXLAN-to-VXLAN bridging was Cisco’s Cloud Scale ASIC… until Arista decided that’s a problem worth solving and figured out how to do it with the Broadcom Jericho chipset. According to the 2022 EANTC test report, VXLAN-to-VXLAN stitching also works on Juniper QFX10K and Nokia 7750 SR-1.
More details
- Lukas Krattiger and I talked about multi-pod and multi-site fabrics in the Leaf-and-Spine Fabric Architectures webinar.
- I mentioned VXLAN as a potential layer-2 DCI transport technology in the Data Center Interconnects webinar.
- We discussed the use of VXLAN and EVPN as DCI technologies in the June 2022 design clinic.
Thank You
Remi Locherer sent me a nice email after the June 2022 design clinic saying “your information is a bit outdated” and included links to the 2022 EANTC test report and Arista documentation. I solemnly promise to augment those videos with “I was wrong” callouts once I get them back from the editor.
[1] Should that be the case, I’m hoping you’re not designing your network based on generic blog posts. I’m trying to be less biased than vendor white papers, but if you have such a large network you’re deep in the It Depends territory and need a proper network design.
First diagram is missing a link between WAN edge switch and LAN peer on the right. I find it more accurate to draw two WAN edge switches per data center for the sake of redundancy (resulting in at least two WAN connections) as there are already two LAN peers per data center in the diagram.
I've never seen a multi-site EVPN VXLAN implementation in production. IMO it's too complex and cumbersome. I have only seen multi-pod implementations spanning multiple data centers, even in larger deployments, and they ran just fine.
"I've never seen a multi-site EVPN VXLAN implementation in production. IMO it's too complex and cumbersome."
If your deployment doesn't exceed the hardware limitations, and the amount of flooding is reasonable, then I totally agree with you.
OTOH, ingress replication can become a nice amplification mechanism chewing up WAN bandwidth... of course only after you get into a broadcast storm.
Hi Ivan,
If I understand correctly, Cisco multi-site EVPN uses draft-Sharma-bess-multisite, which Lukas Krattiger co-authored, while others (Juniper, Nokia, Arista) use RFC 9014. According to draft-Sharma it is compatible with RFC 9014, but I haven't tried it myself. Do you know whether it's compatible? I am also not familiar with the differences between the two.