Implications of Valley-Free Routing in Data Center Fabrics

As I explained in a previous blog post, most leaf-and-spine best practices (as in: what to do if you have no clue) use BGP as the IGP routing protocol (regardless of whether it’s needed), with the same AS number shared across all spine switches to implement valley-free routing.

This design has an interesting consequence: when a link between a leaf and a spine switch fails, the two switches at the ends of that link can no longer communicate with each other.

For example, when the link between L1 and C1 in the following diagram fails, there’s no connectivity between L1 and C1 as there’s no valley-free path between them.

Link failure results in connectivity loss due to lack of valley-free path
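
To make this concrete, here’s a minimal Python sketch of a hypothetical two-spine, four-leaf fabric (node names match the diagram; the leaf count is an assumption) that enumerates valley-free paths, leaf to spine to leaf at most, and shows what the L1-C1 link failure leaves behind:

```python
# Minimal sketch: valley-free reachability in a hypothetical 2-spine / 4-leaf fabric.

leaves = {"L1", "L2", "L3", "L4"}
spines = {"C1", "C2"}

# Every leaf connects to every spine; links are undirected.
all_links = {frozenset((leaf, spine)) for leaf in leaves for spine in spines}

def valley_free_paths(links, src, dst):
    """Valley-free in a two-tier fabric: either a direct leaf-spine link,
    or leaf -> spine -> leaf. Never spine-to-spine, never through another leaf."""
    paths = []
    if frozenset((src, dst)) in links:
        paths.append([src, dst])
    if src in leaves and dst in leaves:
        for spine in spines:
            if frozenset((src, spine)) in links and frozenset((spine, dst)) in links:
                paths.append([src, spine, dst])
    return paths

working = all_links
broken = all_links - {frozenset(("L1", "C1"))}   # the L1-C1 link fails

print(valley_free_paths(working, "L1", "C1"))    # [['L1', 'C1']]
print(valley_free_paths(broken, "L1", "C1"))     # [] ... no valley-free path left
print(valley_free_paths(broken, "L1", "L2"))     # [['L1', 'C2', 'L2']] ... leaf-to-leaf still works
```

Note how leaf-to-leaf traffic still has a path through the other spine; it’s only the failed leaf-spine pair that loses mutual reachability.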

No big deal, right? After all, we built the data center fabric to exchange traffic between external devices attached to leaf nodes. Well, I hope you haven’t connected a firewall or load balancer straight to the spine switches using MLAG or a similar trick.

Lesson learned: connect all external devices (including network services devices) to leaf switches. Spine switches should provide nothing more than intra-fabric connectivity.

ipSpace.net subscribers can find more details in the leaf-and-spine fabrics webinar.

There’s another interesting consequence. As you might know, some vendors love designs that use IBGP or EBGP EVPN sessions between loopback interfaces of leaf and spine switches on top of an EBGP underlay.

Guess what happens after the L1-C1 link failure: the EVPN session between the loopback interfaces (whether it’s an IBGP or EBGP session) is lost, because L1 cannot reach C1’s loopback (and vice versa).

EVPN session failure following a link loss
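
Why can’t L1 reach C1’s loopback through C2? Plain EBGP loop prevention: with all spines sharing one AS number, any path to C1’s loopback that another leaf re-advertises toward C2 already carries the spine AS, so C2 drops it. A minimal Python sketch (hypothetical AS numbers) of that check:

```python
# Minimal sketch (hypothetical AS numbers): why C2 never offers L1 a path
# to C1's loopback when all spines share the same AS.

SPINE_AS = 65000                      # shared by C1 and C2
LEAF_AS = {"L1": 65001, "L2": 65002}  # unique AS per leaf

def ebgp_accept(local_as, as_path):
    """Standard EBGP loop prevention: drop any update whose AS path
    already contains the receiver's own AS number."""
    return local_as not in as_path

# C1 originates its loopback and sends it to L2: AS path = [SPINE_AS]
path_at_l2 = [SPINE_AS]
print(ebgp_accept(LEAF_AS["L2"], path_at_l2))   # True  ... L2 accepts the route

# L2 re-advertises it toward C2, prepending its own AS
path_at_c2 = [LEAF_AS["L2"]] + path_at_l2       # [65002, 65000]
print(ebgp_accept(SPINE_AS, path_at_c2))        # False ... C2 drops it (own AS in path)
```

In other words, there’s no routing trick that brings C1’s loopback back to L1 over C2 without breaking the valley-free property.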

Inevitable conclusion: all the grand posturing explaining how EVPN sessions between loopback interfaces running on top of underlay EBGP are so much better than EVPN running as an additional address family on the directly-connected EBGP session in a typical data center leaf-and-spine fabric is plain merde (pardon my French).

Even worse, I’ve seen a vendor-produced design that used:

  • EBGP in a small fabric that would work well enough with OSPF for the foreseeable future;
  • IBGP EVPN sessions between loopback interfaces of leaf and spine switches;
  • Different AS numbers on spine switches to make it all work, turning underlay EBGP into a TCP version of RIPv2 that uses AS path length as the hop count (see the sketch right after this list).
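
In case the “TCP version of RIPv2” quip needs unpacking: with unique AS numbers everywhere and no real routing policy, best-path selection in such a fabric effectively degenerates into “shortest AS path wins”, which is hop count by another name. A minimal sketch with hypothetical AS numbers and paths:

```python
# Minimal sketch (hypothetical AS paths): with unique AS numbers everywhere
# and no real routing policy, best-path selection boils down to "shortest
# AS path wins", which is hop count, which is RIP semantics over TCP.

candidate_paths = {
    # next hop -> AS path toward some leaf prefix, as seen by an ingress leaf
    "via C1": [65100, 65201],                    # spine C1, then the egress leaf
    "via C2": [65101, 65201],                    # spine C2, then the egress leaf
    "detour": [65100, 65202, 65101, 65201],      # C1 -> another leaf -> C2 -> egress leaf
}

def best_paths(candidates):
    """Keep the paths with the shortest AS path (RIP-style hop count);
    real BGP has more tie-breakers, but they rarely kick in in this design."""
    shortest = min(len(path) for path in candidates.values())
    return {nh: path for nh, path in candidates.items() if len(path) == shortest}

print(best_paths(candidate_paths))
# {'via C1': [65100, 65201], 'via C2': [65101, 65201]}  (two-way ECMP)
```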

As I said years ago: the road to broken design is paved with great recipes.

Lesson learned: whenever evaluating a design, consider all possible failure scenarios.

Don’t get me wrong. There might be valid reasons to use IBGP EVPN sessions on top of EBGP underlay. There are valid reasons to use IBGP route reflectors implemented as VNF appliances for scalability… but the designs promoted by most networking vendors these days make little sense once you figure out how routing really works.

For the Few People Interested in the Red Pill

If you want to know more about leaf-and-spine fabrics (and be able to figure out where exactly the vendor marketers cross the line between unicorn-colored reality and plain bullshit), start with the Leaf-and-Spine Fabric Architectures and EVPN Technical Deep Dive webinars (both are part of Standard ipSpace.net subscription).

Finally, when you want to be able to design more than just data center fabrics, check out the Building Next-Generation Data Center online course.


14 comments:

  1. If you lose reachability to C1 from L1 just because a single link fails, then let's hope L1 isn't the switch where your management workstation/Nagios host/etc. is connected. Because then you won't be able to ssh to C1, or talk SNMP to C1, to see what's wrong with it...
    Replies
    1. "If you lose reachability to C1 from L1 just because a single link fails, then let's hope L1 isn't the switch where your management workstation/Nagios host/et.c is connected." << Correct. In most data center fabrics you'd have an out-of-band management network (oftentimes configured as isolated management VRF by default).
  2. I don't understand your concerns about EVPN underlay. Normally you would have ECMP for the loopbacks. So in the case of a link failure the reconvergence time depends on:
    1) How fast the device detects the failure
    2) How long does it take to recompute a backup path
    3) How long does it take to propagate the changes to the neighbors
    4) Time to install the backup path in hardware

    So with directly connected links and backup paths already installed in hardware, I don't see a big problem, or am I overlooking something?
    Replies
    1. This time it's not about convergence, it's about reachability. If L1 cannot reach C1 after a link failure, having a BGP session between directly-connected interfaces or between loopbacks makes absolutely no difference (apart from one being more complex than the other).
    2. Why should L1 reach C1? C1 is only in the data plane, not in the control plane for the EBGP underlay, isn't it?
    3. Underlay = physical transport network. Every device is in the control plane for the underlay routing protocol (which in this particular case is assumed to be EBGP because of RFC 7938).
    4. So your ideal solution has:

      - An EBGP session interface-to-interface
      - Use this session to advertise IP connectivity (i.e. act as the IGP)
      - Use this session to advertise EVPN routes

      In this case, what do you expect the BGP next hop for the EVPN routes to be? I think it's got to be the loopback address so that the other leaf nodes can reach it via ECMP across C1 and C2. Is the reason that some vendors want to run a separate EBGP session loopback-to-loopback for EVPN because they can't advertise the EVPN routes with a next hop different from that of the IP routes?
    5. I knew there was another gotcha ;))

      Yes, the EVPN next hop must always remain the VTEP (loopback) IP address of the egress leaf. You just gave me another reason why some vendors insist so adamantly on using BGP sessions between loopbacks...
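
      Here's a minimal Python sketch (hypothetical addresses, not any particular vendor's implementation) of the recursive resolution you're describing: the EVPN route keeps the egress VTEP loopback as its next hop, and the ingress leaf resolves that loopback through whatever ECMP underlay paths it currently has:

      ```python
      # Minimal sketch (hypothetical addresses): the EVPN next hop stays the
      # egress VTEP loopback and is resolved recursively over underlay ECMP.

      # Underlay RIB on ingress leaf L1: the egress leaf loopback is reachable
      # via both spines (or via a single spine after a link failure).
      underlay_rib = {"10.0.0.2/32": ["uplink to C1", "uplink to C2"]}

      # EVPN MAC/IP route learned from the egress leaf: next hop is its VTEP
      # loopback, not the address of whichever spine relayed the update.
      evpn_route = {"mac": "00:11:22:33:44:55", "next_hop": "10.0.0.2"}

      def resolve(evpn, rib):
          """Recursive next-hop resolution: VXLAN-encapsulate toward the VTEP
          loopback over every underlay path the ingress leaf currently has."""
          paths = rib.get(evpn["next_hop"] + "/32", [])
          return [f"VXLAN to {evpn['next_hop']} via {p}" for p in paths]

      print(resolve(evpn_route, underlay_rib))
      # ['VXLAN to 10.0.0.2 via uplink to C1', 'VXLAN to 10.0.0.2 via uplink to C2']
      ```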
  3. It pains me to see all the discussion about the details of using BGP in the underlay (fabric). Why so complex? 20-25 years ago we already established that the best routing design for networks is: an IGP in the underlay, to discover the topology, and BGP in the overlay to carry reachability for large numbers of destinations.

    I would think this is still the simplest and most elegant solution. Use IS-IS in the fabric, and EVPN in the overlay. Why isn't this the most popular design still? Is IS-IS not good enough? Are IS-IS implementations not good enough these days?

    If that is the case, improve IS-IS. Don't misuse BGP. Using BGP as an IGP is like using assembly to build a website. It works, but it is ugly. Fix the shortcomings in the IS-IS protocol. Fix IS-IS implementations. Is the industry really unable to do this? RIFT, LSVR, OpenFabric, they all seem overkill to me. Just fix what's broken, don't re-invent the wheel.
    Replies
    1. You're absolutely right. See also

      https://blog.ipspace.net/2018/05/is-ospf-or-is-is-good-enough-for-my.html
      https://blog.ipspace.net/2018/08/is-bgp-good-enough-with-dinesh-dutt-on.html


      Unfortunately this industry is brimming with lemmings... and what was good enough for a Microsoft data center must be good enough for my two switches, right? RIGHT? RIGHT???
  4. A couple of comments:

    1) One can put the spine switches in different ASNs - it's a bit of extra I/O but does the trick
    2) IS-IS in the underlay would be a better solution as it can be deployed with existing implementations -
    a bit of mesh-group config on the leafs and you can control the flooding explosion of vanilla IS-IS implementations - finally, UUNet's top scaling feature is being used again after almost 20 years of hibernation ;-)
  5. “Amen.”
    Everything could be made to work with infinite effort turning all the possible knobs one could imagine, OR we could make eBGP behave like iBGP?!? Why go so far if the solution is that near?
  6. At the risk of heresy, given that this is a well-known leaf-and-spine configuration within a data center, wouldn't it be far simpler to avoid the protocols entirely and simply calculate the forwarding tables of the underlying ASICs directly, a la an InfiniBand solver?

    Yes, I understand the link and switch failure/repair cases need to be dealt with, and no I'm not talking about the theoretical SDN case so popular a decade ago.

    This is a 10,000 line of code problem at the switch level, not a 10,000,000 line of code problem. Just saying.
    Replies
    1. When you've spent three decades developing a hammer, every sandwich (and tin can) looks like a nail :D

      However, you have to solve fundamental challenges like failure detection, path checks, fast rerouting/convergence, minimization of FIB rewrites... so in the end it's not exactly a 10K-line-of-code problem either, as some (religiously correct) SDN vendors discovered a while ago.