Q&A: Building a Layer-2 Data Center Fabric in 2016

One of my readers designing a new data center fabric that has to provide L2 transport across the data center sent me this observation:

While we don’t have plans to seek an open solution in our DC we are considering ACI or VXLAN with EVPN. Our systems integrator partner expressed a view that VXLAN is still very new. Would you share that view?

Assuming he wants to stay with Cisco, what are the other options?

Hardware: Nexus 9000 or Nexus 7x00/5x00? Honestly, I wouldn’t buy anything N7K-based these days, and assuming the Nexus 9000 feature set fits my needs (they even have FCoE these days, if you still care), the only consideration when choosing between N5K and N9K would be price-per-port.

Features: There are a few things missing on N9Ks, like OTV or LISP. Maybe you don’t need them: I still don’t know why I’d need LISP, and EVPN is not much worse than OTV (although it does lack OTV’s broadcast domain isolation features). Assuming you need OTV or LISP for whatever reason, it might be cheaper to buy an extra ASR than a Nexus 7K.

Stability: While I wouldn’t necessarily deploy ACI, I haven’t heard anything bad about N9K with VXLAN recently.

And now for the elephant in the room: L2 fabrics.

If you want to build a Cisco-based L2 fabric these days you have four design options (see also: standards), the first three being:

  • STP + MLAG (vPC). When was the last time you checked your calendar?
  • FabricPath. While it’s elegant, it’s also clearly a dead-end technology. Every data center switching vendor (apart from Avaya) is rushing to board the VXLAN+EVPN train. Brocade, Cisco and Juniper have shipping implementations, and Arista is supposedly talking about one. TRILL and SPBM are dying (in the data center), as are proprietary L2 fabrics (it was about time). I wouldn’t invest in one of those in 2016.
  • ACI. Maybe not. There’s a lot of hidden complexity in there, particularly if you need nothing more than a-bit-more-stable VLANs.

What’s left? VXLAN, in one of its three incarnations:

  • Multicast-based. Why should you introduce IP multicast in your data center network just because someone tried to shift the problem around?
  • Static ingress node replication. Perfect for small or fully-automated networks that need nothing more than L2 connectivity (see the sketch after this list).
  • EVPN. Ideal for people who believe virtual networking (including L2+L3 fabrics) should be done on ToR switches and not in the hypervisors.
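
To make the static ingress node replication option a bit more tangible, here’s a minimal Python sketch (illustrative only, not how any particular switch implements it) of what the ingress VTEP does with broadcast/unknown-unicast traffic: prepend the 8-byte VXLAN header defined in RFC 7348 and send one unicast copy to every remote VTEP in a statically configured flood list over UDP port 4789. The REMOTE_VTEPS list and all addresses are made-up examples.

```python
# Static ingress replication sketch: the ingress VTEP wraps a BUM frame in a
# VXLAN header (RFC 7348) and unicasts one copy per statically configured
# remote VTEP -- no IP multicast, no EVPN control plane.
import socket
import struct

VXLAN_PORT = 4789                              # IANA-assigned VXLAN UDP port
REMOTE_VTEPS = ["192.0.2.11", "192.0.2.12"]    # hypothetical static flood list

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header: I flag set, 24-bit VNI, rest reserved."""
    flags_word = 0x08 << 24        # flags byte 0x08, followed by 3 reserved bytes
    vni_word = vni << 8            # 24-bit VNI, followed by 1 reserved byte
    return struct.pack("!II", flags_word, vni_word) + inner_frame

def flood(inner_frame: bytes, vni: int) -> None:
    """Replicate a broadcast/unknown-unicast frame to every remote VTEP."""
    packet = vxlan_encap(inner_frame, vni)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for vtep in REMOTE_VTEPS:
            sock.sendto(packet, (vtep, VXLAN_PORT))

# Example: flood a fake broadcast ARP frame into VNI 10100
flood(b"\xff" * 6 + b"\x00\x00\x5e\x00\x53\x01" + b"\x08\x06" + b"\x00" * 28, 10100)
```

The point of the sketch: the flood list is static configuration (or the output of your automation). The moment you want the fabric to build and update that list for you, you need a control plane, which is where EVPN comes in.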

So please don’t tell me not to go with VXLAN (particularly if you claim you need L2 fabric). There’s no real alternative.

Want to know more?

18 comments:

  1. Just for completeness: if one can accept a little more complexity (though not that much compared with VXLAN/EVPN), you can use VPLS/VPWS over MPLS.
    These are well-established standards and should be available on most platforms from almost any vendor.
    Replies
    1. The only major data center switching vendor with decent MPLS support on reasonably-priced switches is Juniper. Arista has no real MPLS control plane, Cisco has MPLS on Nexus 7000...
    2. Actually, the HPE 5900 also supports it, and it seems to be very cheap.

      They also have no license requirements for any feature that is supported on the box.

      Disclaimer: I've used many of them, but never configured MPLS/VPLS. From the docs, it works and is supported.
  2. "Our systems integrator partner expressed a view that VXLAN is still very new" - my personal advice: change your systems integrator immediately. VXLAN has been around for 4-5 years; it is not "very new". Ivan wrote his first post on VXLAN in August 2011! But I have a more philosophical question: why would you need L2 transport? Isn't it much better to build an L3-only data center, using well-known and well-proven standards, and get rid of all the terrible L2 stuff? OpenContrail operating in L3-only mode is my dream, but it needs better support; unfortunately, with the current level of support it is not realistic to run it in production networks!
    Replies
    1. OMG, we're getting old... I hadn't realized it's been THAT long.

      As for L2 transport, as long as virtual switches keep pretending the earth is flat (I'm looking at you, VMware ;), we'll be asked to provide it.
  3. You're more than right, Ivan, but just think about how much easier the world would be without MAC addresses and L2 switches (no ARP, no broadcast storms, no STP, and so forth)! Networks would be much simpler to design and operate (but our friends at Juniper and Cisco would probably not be so happy...).
  4. The HPE 5940 now supports VXLAN/EVPN too ;)
  5. My $0.02 on ACI:

    The learning curve is no different than with anything else - if you haven't done EVPN, you'd have to learn that too; it's just that with ACI it can be admittedly frustrating, since you're trying to figure out a GUI that should be "simple" instead of a new "complex" technology. (FWIW I think it's a decent - not great - GUI, but I'm used to it at this point.)

    I also think ACI is just like anything else -- there are no rails put up to keep you from doing stupid shit! You can do lots of dumb things with PBR and VACLs and the like, and ACI is no different. On the whole, though, I think (especially from 1.2+) it's a very solid platform that, if nothing else, handles firmware management and fault reporting quite nicely (I think it's great for lots more than that, but at a minimum those are two nice aspects).
    Replies
    1. I tell people that Cisco VXLAN/EVPN and ACI fabrics cost and perform virtually the same. It's the same HW after all.

      VXLAN/EVPN gives you a thousand manual config points, each of which can crash your network in wondrous and unpredictable ways. You'll get really good at troubleshooting VXLAN/EVPN, and will master things like "symmetric IRB" and "bud nodes" and EVPN type-2 advertisements. And you will learn the hard way how not to engineer and secure the underlay, how not to do code upgrades, what not to do with vPC, etc. (Cisco highly recommends you buy the Nexus Fabric Manager software to help smooth over some of that complexity.)

      In contrast, ACI is the "easy button": an optimized L2/L3 fabric for dozens or hundreds of leaf switches, all centrally managed, usable as fast as you can rack and cable the leaves. For L2 VLANs and L3 SVIs, ACI is goofproof. vCenter integration is easy, free, and immediately useful. It's only when you get into service graphs and contracts that ACI gives you enough rope to hang yourself.

      ACI has a learning curve, but it's at a high level, it encourages consistency, and it yields immediate value in terms of DC-wide visibility. VXLAN/EVPN's learning curve is all about the nuts and bolts of the forwarding technology -- MAC, ARP, VNI-to-VLAN mappings, uplink versus downlink ports, BGP RRs, etc. (see the sketch below for what a single EVPN type-2 route carries). With VXLAN/EVPN, it's easy for an undisciplined engineer to get lost, lose sight of the high level, and build an unsupportable monster.
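
      To illustrate the "nuts and bolts" mentioned above, here is a small, purely illustrative Python sketch of the fields a VTEP advertises in an EVPN route type 2 (the MAC/IP Advertisement route from RFC 7432); with VXLAN encapsulation the two labels carry the L2 VNI and, for symmetric IRB, the L3 VNI. The class and all values are made up for illustration and are not any vendor's API.

      ```python
      # Illustrative model of an EVPN route type 2 (MAC/IP Advertisement,
      # RFC 7432). One of these is generated per learned host MAC (and
      # optionally IP), which is where much of the EVPN "nuts and bolts" lives.
      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class EvpnType2Route:
          rd: str                # route distinguisher, e.g. "10.0.0.1:10100"
          esi: str               # Ethernet segment ID, all-zero unless multihomed
          eth_tag: int           # Ethernet tag ID, usually 0 for VLAN-based service
          mac: str               # host MAC address being advertised
          ip: Optional[str]      # optional host IP (enables ARP suppression/routing)
          l2_vni: int            # label 1: L2 VNI the MAC belongs to
          l3_vni: Optional[int]  # label 2: L3 VNI when symmetric IRB is used
          next_hop: str          # BGP next hop = egress VTEP address

      # A single host produces something like this (all values made up):
      route = EvpnType2Route(
          rd="10.0.0.1:10100", esi="00:00:00:00:00:00:00:00:00:00", eth_tag=0,
          mac="00:50:56:ab:cd:ef", ip="172.16.10.20",
          l2_vni=10100, l3_vni=50001, next_hop="10.0.0.1",
      )
      ```

      Multiply that by every host MAC in the fabric and it becomes obvious why the comment above talks about getting "really good at troubleshooting VXLAN/EVPN".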
  6. Well, as a network engineer or architect, I believe the right question should be how L2 and L3 fabrics differ and what the pros and cons of each approach are.

    People have been running layer-2-only DCs for ages; some implementations were broken and others were done a little more cleanly.

    We should also remember that even the cleanest solution on paper (or technically) sometimes fails at some point, so there are always known unknowns and unknown unknowns :)

    Last week I encountered an OTV bug that caused a data centre meltdown.

    Another important question would be how you migrate your current fat DC to the new DC fabric with minimal or no disruption.

    Here is my take on some of the considerations when it comes to DC fabrics; I don't think I mentioned VXLAN or L2 vs. L3 fabrics anywhere :)

    http://deepakarora1984.blogspot.in/2016/12/data-centre-fabric-design-considerations.html

    HTH...
    Evil CCIE
    Replies
    1. While I totally agree with your take on the subject, many engineers unfortunately don't have the luxury of starting the discussion at that point, and once you're faced with a "build us an L2 fabric" decision made without even involving the networking team in the process, you have to find a solution that will do the least damage ;)
  7. As Ivan wrote, MLAG seems dated; however, what is your opinion for a very small deployment?
    I'm talking about 4 meshed core switches (10G/40G), 2 per DC, with private redundant fiber between the DCs.
    ESX hosts will be directly connected, and there will be a few L2-only switches for physical hosts. Those connections should be redundant towards both of the core switches in each DC.

    As this is based on Juniper gear, what would be the advantage of using VXLAN + EVPN vs. MC-LAG? I think an extra license is required for BGP, and it seems a little more complicated to set up and operate than going with MC-LAG between the two core switches in each DC. Also note that there is no NSX license available, so an overlay would have to be terminated on the core switches.
    Replies
    1. I did a design almost exactly like this with a customer a few months ago. In the end we decided to go for VXLAN between data centers.
    2. Thanks for the quick response Ivan! What is the advantage of using VXLAN vs LACP between DCs in that scenario?
    3. You don't pretend two boxes are a single device (and thus are less likely to hit "interesting" bugs).
  8. As part of a research project we studied the use of Ethernet in large HPC systems. In addition to the typical VM mobility, we found a couple of requirements for large L2 domains in such environments: transport protocols that don't run on top of IP (such as RoCEv1 or Open-MX) and service announcement mechanisms using L2 broadcast (again, such as Open-MX). I guess you would label them as "wrongly designed stacks", but the thing is that they are used.

    In our system we considered low-diameter topologies to reduce power consumption, which require per-packet non-minimal adaptive routing. This is not supported by any of the four "commodity" alternatives that you detail, so we proposed an extension to OpenFlow switches to be able to react in microsecond time (as you argue in another post, apart from being dead, OpenFlow does not react quickly). If you have any interest, there is a more detailed discussion in the paper.
    Replies
    1. Hi Enrique,

      Thanks for the comment. I'm positive there are tons of niche environments with special requirements (yours is obviously one of them); I'm just continuously upset that we get "we need large L2 domains" by default because someone didn't even consider the impact of what they're asking for.

      What you're proposing definitely makes sense in your particular environment, but I don't expect it to be implemented anytime soon (not sure whether you could do it yourself using Switch Light OS as the baseline). It might be better to work with a more traditional Linux-based switch and control the forwarding rules (MAC / TCAM tables) from your own userspace agent (see the rough sketch after this thread).
    2. That's a nice idea; depending on its response time, it could be feasible. We will explore it, thanks!
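
      A minimal sketch of the userspace-agent idea suggested above, assuming a Linux-based switch whose forwarding table is the standard Linux bridge FDB managed with the iproute2 "bridge fdb" command; the interface names and MAC addresses are hypothetical, and a real agent would obviously need its own decision logic and error handling.

      ```python
      # Userspace-agent sketch: push static MAC entries into the Linux bridge
      # FDB via the iproute2 "bridge fdb" command. Which MAC belongs on which
      # port is decided by the (hypothetical) agent logic, not shown here.
      import subprocess

      def set_fdb_entry(mac: str, bridge_port: str) -> None:
          """Install or update a static FDB entry pointing a MAC at a bridge port."""
          subprocess.run(
              ["bridge", "fdb", "replace", mac, "dev", bridge_port, "master", "static"],
              check=True,
          )

      def del_fdb_entry(mac: str, bridge_port: str) -> None:
          """Remove a previously installed static FDB entry."""
          subprocess.run(
              ["bridge", "fdb", "del", mac, "dev", bridge_port, "master"],
              check=True,
          )

      # Example: the agent decides host 00:00:5e:00:53:01 is reachable via swp3
      set_fdb_entry("00:00:5e:00:53:01", "swp3")
      ```

      Whether this reacts fast enough for per-packet adaptive routing is exactly the response-time question raised above; programming hardware TCAM entries directly would need the switch vendor's SDK instead.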