Is OSPF or IS-IS Good Enough for My Data Center?

Our good friend Mr. Anonymous has too many buzzwords and opinions in his repertoire, at least based on this comment he left on my Using 4-byte AS Numbers with EVPN blog post:

But IGPs don't scale well (as you might have heard) except for RIFT and Openfabric. The others are trying to do ECMP based on BGP.

Should you be worried about OSPF or IS-IS scalability when building your data center fabric? Short answer: most probably not. Before diving into a lengthy explanation, let's give our dear friend some homework.

TL&DR summary: OSPF or IS-IS is most probably good enough for your data center… and if it isn’t, I sincerely hope you have an architecture/design team in place and don’t design your data center fabrics based on free information floating around the ’net.

What Are the Real Limits of IGPs?

Now that our Anonymous friend is (hopefully) busy, let’s try to put the IGPs don’t scale well claim in perspective:

  • There are service providers running several thousand routers in a single IS-IS area. IS-IS traditionally scaled a bit better than OSPF because it was exposed to more abuse, but it shouldn’t be hard to push OSPF (should you prefer it) to several hundred devices in a single area. I’ve heard of networks with 300+ routers in an OSPF area in times when CPUs were an order of magnitude slower than they are today;
  • We tried to scope the problem with Dr. Tony Przygienda during our Data Center Routing with RIFT discussion, and while he pointed out that the real challenge OSPF and IS-IS are facing in leaf-and-spine fabrics is not topology database size but the amount of redundant flooding, he put a comfortable limit of what OSPF or IS-IS could handle today at ~100 switches (the back-of-the-envelope sketch below the list illustrates his point).
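Here’s that back-of-the-envelope sketch. The topology (96 leaves with four uplinks each, four spines) is my own illustrative assumption, not something Tony specified:

```python
# Illustrative flooding math for a ~100-switch leaf-and-spine fabric
# (96 leaves with 4 uplinks each, 4 spines) -- assumed numbers, not a benchmark
leaves, spines, uplinks = 96, 4, 4

links = leaves * uplinks            # 384 point-to-point adjacencies
lsdb_entries = leaves + spines      # ~100 LSPs/LSAs -- a tiny database by any standard

# When a single leaf's LSP changes, that leaf floods it to its 4 spines,
# and every spine re-floods it to its other 95 leaf neighbors, so every
# leaf ends up receiving ~4 copies of the same update (one per spine).
copies_flooded = uplinks + spines * (leaves - 1)   # ~384 copies network-wide
redundant_copies_per_leaf = spines                 # 4 copies per receiving leaf

print(links, lsdb_entries, copies_flooded, redundant_copies_per_leaf)
```

The database stays tiny; it’s the repetitive flooding work that grows with the number of (redundant) links.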

RIFT and OpenFabric were designed to perform better in larger environments where you might hit the scaling limitations of traditional OSPF and IS-IS flooding, but we don’t know whether that’s true yet – as of mid-May 2018, you could get RIFT as experimental code running on Junos, and OpenFabric was still in very early stages the last time I chatted with Russ White.

What Can We Build with 100 Switches?

Let’s assume for a moment that we’d like to stick with an IGP and are therefore limited to ~100 switches in a single data center fabric. Is that good enough?

Assuming you’re building your leaf-and-spine fabric with the most common switch models, you’d have:

  • 48-port switches (10/25GE) with four uplinks (40/100GE) at the leaf layer;
  • 32-port higher-speed (40GE or 100GE) switches at the spine layer.

The largest fabric you can build with these devices without going into breakout cables or superspines is a 32-leaf fabric with a total of 1536 ports.

If you use larger spine switches with 64 ports like Arista’s 7260CX3-64 or Cisco’s 9364C, you could get to 3072 ports with 68 switches (64 leaves, 4 spines).
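In case you want to check the arithmetic, here’s a quick sketch assuming every leaf uses all four uplinks and connects to every spine exactly once:

```python
# Two-tier leaf-and-spine sizing with the switch models mentioned above
leaf_ports, leaf_uplinks = 48, 4

def two_tier(spine_ports):
    leaves = spine_ports        # every spine port connects one leaf
    spines = leaf_uplinks       # every leaf needs one uplink per spine
    return leaves * leaf_ports, leaves + spines

print(two_tier(32))   # (1536, 36) -- 32-port spines
print(two_tier(64))   # (3072, 68) -- 64-port spines (7260CX3-64, 9364C)
```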

Quick Detour into Even Larger Fabrics

Finally, if you need an even larger fabric, you could use modular switches at the spine layer, or build a superspine layer (we covered both options in the Physical Fabric Design section of the Leaf-and-Spine Fabric Architectures webinar).

A superspine architecture with 176 switches (using 32-port switches at the spine layer) gives you 6144 ports, so it might be cheaper to go with breakout cables in a leaf-and-spine fabric (144 switches). Both of them are at the high end of what you might consider comfortable, but still somewhat within reasonable bounds for a single-area IGP.
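Here’s a quick sketch of the math behind the 176- and 144-switch numbers (the same reasoning appears in one of the comments below). It assumes non-blocking spines (16 ports down, 16 up) in the superspine case, and 4×25GE breakouts on every leaf uplink and spine port in the breakout case:

```python
# Rough sizing of the two larger options described above
leaf_ports, leaf_uplinks, spine_ports = 48, 4, 32

# (a) Superspine layer: every spine uses 16 ports down, 16 ports up
spines      = 32
leaves      = spines // leaf_uplinks * (spine_ports // 2)  # 8 groups x 16 leaves = 128
superspines = spines * (spine_ports // 2) // spine_ports   # 512 uplinks / 32 ports = 16
print(leaves * leaf_ports, leaves + spines + superspines)  # 6144 ports, 176 switches

# (b) Breakout cables: every 100GE uplink becomes 4 x 25GE -> 16 uplinks per leaf
uplinks  = leaf_uplinks * 4      # 16 uplinks per leaf
spines_b = uplinks               # one uplink per spine
leaves_b = spine_ports * 4       # 128 breakout ports per spine (32 x 4)
print(leaves_b * leaf_ports, leaves_b + spines_b)          # 6144 ports, 144 switches
```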

The detailed designs are left as an exercise for the reader. You’ll find all the information you need to make them work in the Leaf-and-Spine Fabric Architectures webinar.

Back to Reality

The largest data center fabric we could build without investing anything into understanding how things really work has around 70 switches and around 3000 edge-facing ports, and there’s no reason to feel limited by IGP scalability at this size.

Assuming you want redundant server connectivity, that’s around 1,500 bare-metal servers. Assuming you didn’t buy them in a junkyard sale, you could easily put 30-50 reasonably-sized VMs on each one of them, for a total of around 50,000 (application) servers.
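The arithmetic, in case you want to plug in your own VM densities:

```python
# Back-of-the-envelope server/VM math for the 3072-port fabric
edge_ports = 3072
bare_metal = edge_ports // 2             # dual-homed servers -> 1536 boxes
vms_low, vms_high = 30, 50
print(bare_metal, bare_metal * vms_low, bare_metal * vms_high)
# 1536, 46080, 76800 -> roughly 50,000 VMs
```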

Is that good enough? It definitely is for most enterprises as well as for smaller cloud providers… and if your data center network is larger than that, please don’t listen to whatever is being said (overly generalized) on the Internet – you need a proper design done by someone who understands why he’s doing what he’s doing. Buzzwords and opinions won’t cut it.

Why Is Everyone So Focused on BGP Then?

Dinesh Dutt outlined a few technical reasons why you might consider replacing OSPF or IS-IS with BGP in his part of the Layer-3 Fabrics section of the leaf-and-spine fabrics webinar (you need at least a free ipSpace.net subscription to watch those videos).

Here are a few more cynical ones:

  • We’re telling you BGP is good for you because Petr (and RFC 7938) said so. I’ve seen vendor SEs doing exactly that;
  • I always wanted to play with BGP and now I have an excuse to do so;
  • I want my network to be as cool as Microsoft’s (that’s where Petr started using BGP as a better IGP);
  • I need to pad my resume.

While you might decide to replace OSPF or IS-IS with BGP for any one of these reasons, IGP scalability limitations are most probably not on the very top of the list of potential challenges you might be facing in your data center fabric design.


14 comments:

  1. hi,

    just yesterday randy bush at RIPE76:
    https://ripe76.ripe.net/wp-content/uploads/presentations/30-180514.ripe-clos.pdf

    thank you
    --
    antonio
    Replies
    1. I know. Daniel Dib sent me a tweet before Randy even started ;) Randy was talking about an order of magnitude (or more) larger data centers, and set the IGP scalability limit to ~500 nodes (so I'm way more conservative than he is ;).
  2. I just finished my homework so I'm not busy anymore. OSPF or IS-IS in the fabric is something that old grumpy greyed half-baldies use because it worked for centuries (and to change something means a lot of work). If you really want to be on the cutting edge you program your switch asics (FPGA) yourself with a programming language (probably P4). Also you would develop your own directory service for reachability information distribution (and you also have to invent your own UDP encapsulation but that's the easiest part). So with your own solution you would have a huge competitive advantage.
    Replies
    1. Yep, I was just like you when I was your age. Then reality intervened ;)
  3. Some additional thoughts:
    * ISIS/OSPF scales actually to something more like 3K in very good implementations (on a sparse mesh) but other problems than scalability become relevant most of the time before this number is hit
    * The limiting IGP scalability factor IME is not really "switches"; the limiting factor is how much flooding and how many links you have to flood out & process flooding on, so the #switches is an easily understood but not so meaningful number
    * some mobility/container architectures I see being talked about exceed your 30-50 numbers ;-) and the lifetimes talked about seem to put the assumption of "not much ever changes" in question a bit
    * Generally, I end up in many more discussions about ZTP than "scalability" when discussing the IP fabric problems with the relevant parties. ZTP is boring of course but seems operationally much more of a pain-point though talking scalability is much cooler of course ;-)

    And ultimately, one flavor does not fit all tastes, especially in networking ...


    Replies
    1. Agree. I just wanted to point out that if you're dealing with nails, sometimes a small hammer is good enough (no need to invest in a fancy multi-tool ;).
    2. An interesting discussion is also whether the IP control plane will extend "down" to the servers, which seriously stretches any current scalability assumptions. There are lots of pluses to that IMO if you can manage the scale and need IP multi-homing and/or simpler hypervisor integration. Because it's currently not being done often/considered feasible/being solved over MC-LAG does not mean it can't be done ;-)
    3. "An interesting discussion is also whether the IP control plane will extend "down" to the servers" << in case I'd be running BGP no matter what, as I believe servers and networking gear shouldn't be in the same trust (and flooding) domain.

      Whether I'd do it in combination with IGP (assuming the fabric is small enough for that) or go for BGP-only fabric would depend only on what's easiest to do on the gear the customer wants to use.
  4. I thought the reason to choose BGP over OSPF/ISIS is not just for scalability but for easier, more flexible/extensible policy control, easier integration with central controller -- whether you need those functionalities is a different story.
    Replies
    1. Right, and I don't understand what's wrong with EBGP everywhere (overlay/underlay) - lowering default timers and/or BFD solves convergence concerns, there certainly aren't scale issues (hello Internet), as you mention above the flexible policy control, and one protocol to know/troubleshoot is certainly better than many so why are people (Ivan) holding on to IGP's at all?
    2. Explore my blog a bit more, and you might find that I have nothing against using BGP in the data center.

      However, once vendors start promoting overlay IBGP or EBGP (between loopbacks) over underlay EBGP (between directly-connected interfaces) to implement EVPN, it's time to say "if you can't fix your stuff so I'm able to use a simple design like EBGP between directly-connected interfaces for underlay and overlay, then maybe I should move back to IBGP-over-IGP"

      Also, the point of this blog post was "don't say stupid things like IGP can't scale in an average data center, because it can" not "don't use BGP"
    3. ""if you can't fix your stuff so I'm able to use a simple design like EBGP between directly-connected interfaces for underlay and overlay, then maybe I should move back to IBGP-over-IGP"" <<<
      Can you point to any specific vendor that is "promoting overlay IBGP or EBGP (between loopbacks) over underlay EBGP (between directly-connected interfaces) to implement EVPN" and can't do a "simple design like EBGP between directly-connected interfaces for underlay and overlay"?
      Maybe they promote this more complex design simply because they really believe in its value? But no, they should be blamed, because they simply provide you with more options.
      On the other hand, I can point to vendors that can't implement these more complex designs yet, and therefore promote their "simple and perfect" solution (and politely do not mention its restrictions).
  5. So here is the math for the 6144 ports mystery: We want to achieve a non-blocking spine layer, so each spine has 16 downstream and 16 upstream ports. So 32 spines get connected to 16 superspines. Every leaf has 4 uplinks and every leaf gets connected to 4 spines. So a group of 4 spines can take 16 leaves which makes 8 * 16 = 128 leaves. Now the calculation: 128 leaves x 48 ports = 6144 ports.

    And here is the math for the 144 switches with breakout cables (how nice is that) mystery: With breakout cables every leaf now has 4 times the uplinks so 16 uplinks per leaf. Each leaf gets connected to 16 spines. Each spine has 128 breakout ports (32 x 4). So 128 leaves x 16 uplinks = 16 spines x 128 downstream ports

    But remember ECMP isn't the same as load balancing. ECMP is just some sort of load distribution (based on a hash function). As my namesake rightly said you have to program the asics in your switches with forwarding information to do proper load balancing. Otherwise your fabric doesn't perform well and your oversubscription gets even worse.
  6. one benefit to is-is that i see is single topology ipv6. there are certainly pros and cons there, but for our use case shared fate isn't a bad thing. however, i chose bgp for my datacenter because of rfc 5549. my internal links are all ipv6 only, but they transport the ipv4 from the edge servers fine with no tunnelling needed.