Remember how Arista promoted VXLAN coupled with deep-buffer switches as the perfect DCI solution a few years ago? Someone took Arista’s marketing too literally, ran with the idea, and combined VXLAN-based DCI with a traditional MLAG+STP data center fabric.
While I love that they wrote a blog post documenting their experience (if only more people would do that), it doesn’t change the fact that the design contains the worst of both worlds.
Here are just a few things that went wrong:
I’ve seen tons of STP- or MLAG-induced data center meltdowns. The first thing I would want to do in a new data center design would be to get rid of MLAG as much as possible. Most hypervisors work just fine without MLAG, and bare-metal Linux or Windows servers need MLAG only if you want to fully utilize all server uplinks. WAN edge routers should use routing with the fabric, and in some cases you can use the same trick with network services appliances.
End result: you MIGHT need MLAG to connect network services boxes that use static routing. Connect all of them to a single pair of ToR switches and get rid of MLAG everywhere else.
Even worse, an MLAG-based design limits scalability. Most data center switching vendors support at most two switches in an MLAG cluster, limiting an MLAG+STP fabric to two spine switches.
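To put rough numbers on the two-spine limit, here is a minimal sketch. It assumes every leaf has exactly one uplink to each spine and that traffic is spread evenly across all spines; the function name and the 100 Gbps uplink speed are made up for illustration and don't come from the design discussed above.

```python
from fractions import Fraction


def fabric_summary(spines: int, uplink_gbps: int) -> dict:
    """Back-of-the-envelope numbers for one leaf in a leaf-and-spine fabric.

    Assumptions (illustrative only): one uplink from the leaf to every
    spine, and traffic spread evenly across all spines.
    """
    if spines < 1:
        raise ValueError("a fabric needs at least one spine")
    return {
        # Total uplink bandwidth available to a single leaf
        "uplink_gbps_per_leaf": spines * uplink_gbps,
        # Share of that bandwidth that disappears when one spine fails
        "capacity_lost_on_spine_failure": Fraction(1, spines),
    }


if __name__ == "__main__":
    for spines in (2, 4, 8):
        summary = fabric_summary(spines, uplink_gbps=100)
        print(spines, summary["uplink_gbps_per_leaf"],
              summary["capacity_lost_on_spine_failure"])
```

Under these assumptions, a two-spine fabric loses half of its leaf-to-spine capacity when a single spine fails, while a routed fabric with more spines loses proportionally less.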
Regardless of how you implement them, large layer-2 fabrics are a disaster waiting to happen. With a VXLAN-over-IP fabric you at least have a stable L3-only transport fabric, and keep the crazy bits at the network edge - the way the Internet has worked for ages.
When interconnecting fabrics, you should connect leaf switches, not spines. I described the challenge in detail in the Multi-Pod and Multi-Site Fabrics part of the Leaf-and-Spine Fabric Architectures webinar and might write a blog post on the topic; in the meantime the proof is left as an exercise for the reader.
Deep buffers are not a panacea. When Arista started promoting deep-buffer switches (because they were the first vendor deploying the Jericho chipset - now you can buy them from Cisco as well), I asked a number of people familiar with real-life data center designs, ASIC internals, and TCP behavior whether you really need deep-buffer switches in data centers.
While the absolutely correct answer is always “it depends”, in this particular case we got to “mostly NO”. You need deep buffers when going from a low-latency/high-bandwidth environment to a high-latency/low-bandwidth one (the data center WAN edge); in the core of a data center fabric they do more harm than good. That’s another reason to connect DCI links to the fabric edge.
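One common way to get a feel for the difference is the bandwidth-delay product (BDP), a classic rule of thumb for how much data a single long-lived TCP flow keeps in flight. The sketch below is my own illustration, not something from the discussions mentioned above, and the link speeds and round-trip times are hypothetical.

```python
def bdp_bytes(bandwidth_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes.

    bandwidth_bps: link speed in bits per second
    rtt_us:        round-trip time in microseconds
    Integer arithmetic keeps the result exact.
    """
    bits_in_flight = bandwidth_bps * rtt_us // 1_000_000
    return bits_in_flight // 8


if __name__ == "__main__":
    # Hypothetical intra-fabric path: 10 Gbps, 100 microseconds RTT
    print("fabric:", bdp_bytes(10_000_000_000, 100), "bytes")
    # Hypothetical DCI/WAN path: 1 Gbps, 20 milliseconds RTT
    print("DCI:   ", bdp_bytes(1_000_000_000, 20_000), "bytes")
```

With these made-up numbers the slower, longer DCI path has a BDP twenty times larger than the fast intra-fabric path, which is roughly why buffering matters at the WAN edge and much less inside the fabric.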
What Should They Have Done?
The blog post I quoted at the beginning of this article is a few years old, and it’s possible that Arista didn’t have VXLAN-capable low-cost ToR switches at that time, but here’s what I would do today:
- Build two layer-3 leaf-and-spine fabrics;
- Deploy VXLAN with EVPN or static ingress replication on top of them;
- Connect the DCI link to two deep-buffer leaf switches.
Need more details?
- Start with Leaf-and-Spine Fabric Architectures
- Explore individual technologies with VXLAN Technical Deep Dive and EVPN Technical Deep Dive
- Data center interconnects are covered in yet another webinar
- JR Rivers did a great job discussing switch buffer sizes and the importance of drops versus delays
All ipSpace.net webinars are included with the standard ipSpace.net subscription. For even more details, check out the Building Next Generation Data Centers online course, available with the Expert ipSpace.net Subscription.