Q&A: Building a Layer-2 Data Center Fabric in 2016
One of my readers, who is designing a new data center fabric that has to provide L2 transport across the data center, sent me this observation:
While we don’t have plans to seek an open solution in our DC we are considering ACI or VXLAN with EVPN. Our systems integrator partner expressed a view that VXLAN is still very new. Would you share that view?
Assuming he wants to stay with Cisco, what are the other options?
Hardware: Nexus 9000 or Nexus 7x00/5x00? Honestly, I wouldn’t buy anything N7K-based these days, and assuming the Nexus 9000 feature set fits my needs (they even have FCoE these days, if you still care), the only consideration when choosing between N5K and N9K would be price-per-port.
Features: There are a few things missing on N9Ks, like OTV or LISP. Maybe you don’t need them. I still don’t know why I’d need LISP, and EVPN is not much worse than OTV (though it does lack OTV’s broadcast-domain isolation features). Assuming you need OTV or LISP for whatever reason, it might be cheaper to buy an extra ASR than a Nexus 7K.
Stability: While I wouldn’t necessarily deploy ACI, I haven’t heard anything bad about N9K with VXLAN recently.
And now for the elephant in the room: L2 fabrics.
If you want to build a Cisco-based L2 fabric these days you have four design options (see also: standards):
- STP + MLAG (vPC). When was the last time you checked your calendar?
- FabricPath. While it’s elegant, it’s also clearly a dead-end technology. Every data center switching vendor (apart from Avaya) is rushing to board the VXLAN+EVPN train. Brocade, Cisco and Juniper have shipping implementations; Arista is supposedly talking about one. TRILL and SPBM are dying (in the data center), as are proprietary L2 fabrics (it was about time). I wouldn’t invest in any of those in 2016.
- ACI. Maybe not. There’s a lot of hidden complexity in there, particularly if all you need is a-bit-more-stable VLANs.
What’s left? VXLAN, in one of its three incarnations (there’s a rough configuration sketch after the list):
- Multicast-based. Why should you introduce IP multicast in your data center network just because someone tried to shift the problem around?
- Static ingress node replication. Perfect for small or fully-automated networks that need nothing more than L2 connectivity.
- EVPN. Ideal for people who believe virtual networking (including L2+L3 fabrics) should be done on ToR switches and not in the hypervisors.
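For illustration only, here’s a minimal sketch of how the three flavors differ at the NVE interface level, assuming NX-OS-style syntax on a Nexus 9000. The VLAN, VNI, multicast group and peer addresses are made-up examples, and the exact keywords vary between releases.

```
! Map a VLAN to a VNI and pick a BUM replication method (sketch only)
feature nv overlay
feature vn-segment-vlan-based

vlan 100
  vn-segment 10100                    ! VLAN 100 rides on VNI 10100

interface nve1
  no shutdown
  source-interface loopback1

  ! Option 1 - multicast-based flood-and-learn
  member vni 10100
    mcast-group 239.1.1.1

  ! Option 2 - static ingress-node replication (instead of option 1)
  ! member vni 10100
  !   ingress-replication protocol static
  !     peer-ip 192.0.2.11

  ! Option 3 - EVPN control plane (also needs "nv overlay evpn" and
  ! BGP l2vpn evpn sessions)
  ! host-reachability protocol bgp
  ! member vni 10100
  !   ingress-replication protocol bgp
```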
So please don’t tell me not to go with VXLAN (particularly if you claim you need an L2 fabric). There’s no real alternative.
Want to know more?
- Building Next-Generation Data Center online course is an intensive interactive deep dive into data center design challenges.
- Leaf-and-Spine Fabric Designs webinar covers common fabric designs, including L2, L3 and mixed L2+L3 fabrics.
- Data Center Fabrics webinar documents what the vendors are actually shipping (as opposed to promising).
- You can also ask me for a second opinion (well, not before early 2017).
This is a fair standard and should be available on most platforms, from almost any vendor.
They also have no license requirement for any feature supported on the box.
Disclaimer: I've used many of them, but never configured MPLS/VPLS. From the docs, it works and is supported.
As for L2 transport, as long as virtual switches keep pretending the earth is flat (I'm looking at VMware ;), we'll be asked to provide it.
The learning curve is no different than anything else - if you haven't done EVPN, you'd have to learn that too; it's just that with ACI it can be admittedly frustrating, since you're trying to figure out a GUI that should be "simple" instead of a new "complex" technology. (FWIW I think it's a decent - not great - GUI, but I'm used to it at this point.)
I also think ACI is just like anything else -- there are no rails put up to keep you from doing stupid shit! You can do lots of dumb things with PBR and VACLs and the like, and ACI is no different. On the whole, though, I think (especially from 1.2+) it's a very solid platform that, if nothing else, handles firmware management and fault reporting quite nicely (I think it's great for lots more than that, but at a minimum those are two nice aspects).
VXLAN/EVPN gives you a thousand manual config points, each of which can crash your network in wondrous and unpredictable ways. You'll get really good at troubleshooting VXLAN/EVPN, and will master things like "symmetric IRB", "bud nodes" and BGP type-2 advertisements. And you will learn the hard way how not to engineer and secure the underlay, how not to do code upgrades, what not to do with vPC, etc. (Cisco highly recommends you buy the Nexus Fabric Manager software to help smooth over some of that complexity.)
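To put that in perspective, this is roughly what one tenant's worth of symmetric IRB plumbing looks like on every single leaf (an NX-OS-style sketch; the VRF name, VLAN/VNI numbers and addresses are made-up examples):

```
! Per-tenant symmetric IRB building blocks, repeated on every leaf (sketch)
vlan 2000
  vn-segment 50000                    ! L3 VNI for the tenant VRF

vrf context TENANT-A
  vni 50000
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn

interface Vlan100                     ! anycast gateway SVI for an L2 segment
  vrf member TENANT-A
  ip address 10.1.100.1/24
  fabric forwarding mode anycast-gateway

interface Vlan2000                    ! SVI for the L3 VNI, routing only
  vrf member TENANT-A
  ip forward

interface nve1
  member vni 50000 associate-vrf      ! attach the L3 VNI to the VTEP

fabric forwarding anycast-gateway-mac 0000.2222.3333
```

Multiply that by the number of tenants and leaves, and it's easy to see where the "thousand manual config points" come from.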
In contrast, ACI is the "easy button": an optimized L2/L3 fabric for dozens or hundreds of leaf switches, all centrally managed, usable as fast as you can rack and cable the leaves. For L2 VLANs and L3 SVIs, ACI is goofproof. vCenter integration is easy, free, and immediately useful. It's only when you get into service graphs and contracts that ACI gives you enough rope to hang yourself.
ACI has a learning curve, but it's at a high level, it encourages consistency, and it yields immediate value in terms of DC-wide visibility. VXLAN/EVPN's learning curve is all about the nuts and bolts of the forwarding technology -- MAC, ARP, VNI-to-VLAN mapping, uplink versus downlink ports, BGP RRs, etc. With VXLAN/EVPN, it's easy for an undisciplined engineer to get lost, lose sight of the high level, and build an unsupportable monster.
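As an example of those nuts and bolts, here's roughly what the EVPN route-reflector part looks like on a spine, assuming NX-OS-style syntax and iBGP (the ASN, loopback addresses and peer-template name are made-up examples):

```
! Spine as BGP route reflector for the EVPN address family (sketch only)
router bgp 65000
  router-id 10.0.0.1
  template peer VTEP-LEAF
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
      route-reflector-client
  neighbor 10.0.0.11
    inherit peer VTEP-LEAF
  neighbor 10.0.0.12
    inherit peer VTEP-LEAF
```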
People have been running Layer-2-only DCs for ages; some implementations were broken, others were done a little more cleanly.
We should also remember that even the cleanest solution, whether on paper or in practice, sometimes fails at some point, so there are always known unknowns and unknown unknowns :)
Last week I encountered an OTV bug that caused a data centre meltdown.
Another important question is how you migrate your current DC to the new fabric with minimal or no disruption.
Here is my take on some of the considerations when it comes to DC fabrics, and I don't think I mentioned VXLAN or L2 vs. L3 fabrics anywhere :)
http://deepakarora1984.blogspot.in/2016/12/data-centre-fabric-design-considerations.html
HTH...
Evil CCIE
I'm talking about four meshed core switches (10G/40G), two per DC, with private redundant fiber between the DCs.
ESX hosts will be directly connected, and there will be a few L2-only switches for physical hosts. Those connections should be redundant toward both core switches in each DC.
As this is based on Juniper gear, what would be the advantage of using VXLAN + EVPN vs. MC-LAG? I think using BGP requires an extra license, and it seems a little more complicated to set up and operate than going with MC-LAG between the two core switches in each DC. Also note there is no NSX license available, so an overlay would have to be terminated on the core switches.
In our system we considered low-diameter topologies to reduce power consumption, which require per-packet non-minimal adaptive routing. This is not supported by any of the four "commodity" alternatives that you describe, so we proposed an extension to OpenFlow switches to be able to react in microsecond time (as you argue in another post, apart from being dead, OpenFlow does not react quickly). If you have any interest, we have a more detailed discussion in the paper.
Thanks for the comment. I'm positive there are tons of niche environments with special requirements (yours is obviously one of them); I'm just continuously upset that we get "need large L2 domains" by default because someone didn't even consider the impact of what they're asking for.
What you're proposing definitely makes sense in your particular environment, but I don't expect it to be implemented anytime soon (not sure whether you could do it yourself using Switch Light OS as the baseline). It might be better to work with a more traditional Linux-based switch and control the forwarding rules (MAC / TCAM tables) from your own userspace agent.