Is OSPF or IS-IS Good Enough for My Data Center?
Our good friend Mr. Anonymous has too many buzzwords and opinions in his repertoire, at least based on this comment he left on my Using 4-byte AS Numbers with EVPN blog post:
But IGPs don't scale well (as you might have heard) except for RIFT and Openfabric. The others are trying to do ECMP based on BGP.
Should you be worried about OSPF or IS-IS scalability when building your data center fabric? Short answer: most probably not. Before diving into a lengthy explanation let's give our dear friend some homework.
TL&DR summary: OSPF or IS-IS is most probably good enough for your data center… and if it isn’t, I sincerely hope you have an architecture/design team in place and don’t design your data center fabrics based on free information floating around the ’net.
What Are the Real Limits of IGPs?
Now that our Anonymous friend is (hopefully) busy, let’s try to put the “IGPs don’t scale well” claim in perspective:
- There are service providers having several thousand routers in a single IS-IS area. IS-IS traditionally scaled a bit better than OSPF because it was exposed to more abuse, but it shouldn’t be hard to push OSPF (should you prefer it) to several hundred devices in a single area. I’ve heard of networks having 300+ routers in an OSPF area in times when CPUs were an order of magnitude slower than they are today;
- We tried to scope the problem with Dr. Tony Przygienda during our Data Center Routing with RIFT discussion, and while he pointed out that the real challenge OSPF and IS-IS are facing in leaf-and-spine fabric is not topology database size but the amount of redundant flooding, he put a comfortable limit of what OSPF or IS-IS could handle today at ~100 switches.
RIFT and OpenFabric were designed to perform better in larger environments where you might hit the scaling limitations of traditional OSPF and IS-IS flooding, but we don’t know whether that’s true yet – as of mid-May 2018, you could get RIFT as experimental code running on Junos, and OpenFabric was still in very early stages the last time I chatted with Russ White.
What Can We Build with 100 Switches?
Let’s assume for a moment that we’d like to stick with an IGP and are therefore limited to ~100 switches in a single data center fabric. Is that good enough?
Assuming you’re building your leaf-and-spine fabric with the most common switch models, you’d have:
- 48-port switches (10/25GE) with four uplinks (40/100GE) at the leaf layer;
- 32-port higher-speed (40GE or 100GE) switches at the spine layer.
The largest fabric you can build with these devices without going into breakout cables or superspines is a 32-leaf fabric with a total of 1536 ports.
If you use larger spine switches with 64 ports like Arista’s 7260CX3-64 or Cisco’s 9364C, you could get to 3072 ports with 68 switches (64 leaves, 4 spines).
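The port counts above are easy to reproduce. Here is a minimal sketch, assuming every leaf has exactly one uplink to every spine and every spine port connects to a leaf; the function name `two_tier_fabric` and its parameters are made up for this illustration:

```python
def two_tier_fabric(leaf_ports: int, leaf_uplinks: int, spine_ports: int) -> dict:
    """Size a leaf-and-spine fabric (assumption: one uplink from each leaf to each spine)."""
    spines = leaf_uplinks   # each leaf uplink lands on a different spine
    leaves = spine_ports    # each spine port connects a different leaf
    return {
        "leaves": leaves,
        "spines": spines,
        "switches": leaves + spines,
        "edge_ports": leaves * leaf_ports,
    }


# 48-port leaves with four uplinks, 32-port spines
small = two_tier_fabric(leaf_ports=48, leaf_uplinks=4, spine_ports=32)
# Same leaves, 64-port spines
large = two_tier_fabric(leaf_ports=48, leaf_uplinks=4, spine_ports=64)

print(small)  # 32 leaves, 4 spines, 36 switches, 1536 edge ports
print(large)  # 64 leaves, 4 spines, 68 switches, 3072 edge ports
```

Note that the number of spine ports caps the number of leaves, while the number of leaf uplinks caps the number of spines.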
Quick Detour into Even Larger Fabrics
Finally, if you need an even larger fabric, you could use modular switches at the spine layer, or build a superspine layer (we covered both options in the Physical Fabric Design section of the Leaf-and-Spine Fabric Architectures webinar).
A superspine architecture with 176 switches (using 32-port switches at the spine layer) gives you 6144 ports, so it might be cheaper to go with breakout cables in a leaf-and-spine fabric (144 switches). Both of them are at the high end of what you might consider comfortable, but still somewhat within reasonable bounds for a single-area IGP.
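One possible way to arrive at those switch counts is sketched below. It assumes 4-way breakout on both leaf uplinks and spine ports for the breakout design, and spine switches that split their ports evenly between leaves and superspines for the three-tier design. Both function names and the even-split assumption are mine, not a description of any particular reference design:

```python
def breakout_fabric(leaf_ports: int, leaf_uplinks: int, spine_ports: int, breakout: int = 4) -> dict:
    """Two-tier fabric where every high-speed port is split into `breakout` links."""
    spines = leaf_uplinks * breakout   # 4 uplinks x 4 = 16 links, one per spine
    leaves = spine_ports * breakout    # 32 ports x 4 = 128 links, one per leaf
    return {"leaves": leaves, "spines": spines, "superspines": 0,
            "switches": leaves + spines, "edge_ports": leaves * leaf_ports}


def superspine_fabric(leaves: int, leaf_ports: int, leaf_uplinks: int, switch_ports: int) -> dict:
    """Three-tier fabric (assumption: spines use half their ports down, half up)."""
    down = switch_ports // 2
    up = switch_ports - down
    spines = leaves * leaf_uplinks // down       # spines needed to terminate all leaf uplinks
    superspines = spines * up // switch_ports    # superspines needed to terminate all spine uplinks
    return {"leaves": leaves, "spines": spines, "superspines": superspines,
            "switches": leaves + spines + superspines, "edge_ports": leaves * leaf_ports}


with_breakout = breakout_fabric(leaf_ports=48, leaf_uplinks=4, spine_ports=32)
with_superspines = superspine_fabric(leaves=128, leaf_ports=48, leaf_uplinks=4, switch_ports=32)

print(with_breakout)     # 128 leaves + 16 spines = 144 switches, 6144 edge ports
print(with_superspines)  # 128 leaves + 32 spines + 16 superspines = 176 switches, 6144 edge ports
```

Both designs connect the same 128 leaves; the superspine variant simply needs 32 more switches to do it.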
The detailed designs are left as an exercise for the reader. You’ll find all the information you need to make them work in the Leaf-and-Spine Fabric Architectures webinar.
Back to Reality
The largest data center fabric we could build without investing anything into understanding how things really work has around 70 switches and around 3000 edge-facing ports, and there’s no reason to feel limited by IGP scalability at this size.
Assuming you want redundant server connectivity, that’s 1500 bare-metal servers. Assuming you didn’t buy them in a junkyard sale, you could easily put 30-50 reasonably-sized VMs on each one of them, for a total of around 50,000 (application) servers.
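For anyone who wants to plug in their own numbers, the back-of-the-envelope estimate looks like this. It is a rough sketch assuming two fabric ports per server and the 3072-port fabric from the earlier example; the variable names are illustrative:

```python
edge_ports = 3072         # 64 leaves x 48 ports
links_per_server = 2      # redundant server connectivity
vms_per_server = (30, 50)

servers = edge_ports // links_per_server
vm_low = servers * vms_per_server[0]
vm_high = servers * vms_per_server[1]

print(servers)            # 1536 bare-metal servers
print(vm_low, vm_high)    # 46080 76800
```

The "around 50,000" figure sits at the conservative end of that range.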
Is that good enough? It definitely is for most enterprises as well as for smaller cloud providers… and if your data center network is larger than that, please don’t listen to whatever is being said (overly generalized) on the Internet – you need a proper design done by someone who understands why he’s doing what he’s doing. Buzzwords and opinions won’t cut it.
Why Is Everyone So Focused on BGP Then?
Dinesh Dutt outlined a few technical reasons why you might consider replacing OSPF or IS-IS with BGP in his part of the Layer-3 Fabrics section of the leaf-and-spine fabrics webinar (you need at least a free ipSpace.net subscription to watch those videos).
Here are a few more cynical ones:
- We’re telling you BGP is good for you because Petr (and RFC 7938) said so. I’ve seen vendor SEs doing exactly that;
- I always wanted to play with BGP and now I have an excuse to do so;
- I want my network to be as cool as Microsoft’s (that’s where Petr started using BGP as a better IGP);
- I need to pad my resume.
While you might decide to replace OSPF or IS-IS with BGP for any one of these reasons, IGP scalability limitations are most probably not on the very top of the list of potential challenges you might be facing in your data center fabric design.
Master Data Center Fabric Designs
- Want to learn the basics of data center fabrics and figure out what individual vendors are doing? Check out the Data Center Fabrics webinar.
- Want to learn how to design leaf-and-spine fabrics? Go for the Leaf-and-Spine Fabric Architectures webinar.
- Looking for a guided and mentored tour with plenty of peer and instructor support? You probably need the Building Next-Generation Data Center online course.
- Want to know more about EVPN technology? Watch the EVPN Technical Deep Dive webinar.
Just yesterday, Randy Bush at RIPE76:
https://ripe76.ripe.net/wp-content/uploads/presentations/30-180514.ripe-clos.pdf
thank you
--
antonio
* ISIS/OSPF scales actually to something more like 3K in very good implementations (on a sparse mesh) but other problems than scalability become relevant most of the time before this number is hit
* The limiting IGP scalability factor IME is not really "switches"; it is how much you have to flood and on how many links you have to send and process that flooding, so the number of switches is an easily understood but not so meaningful number
* Some mobility/container architectures I see talked about exceed your 30-50 numbers ;-) and the lifetimes talked about seem to put the assumption of "not much ever changes" in question a bit
* Generally, I end up in many more discussions about ZTP than "scalability" when discussing the IP fabric problems with the relevant parties. ZTP is boring of course but seems operationally much more of a pain-point though talking scalability is much cooler of course ;-)
And ultimately, one flavor does not fit all tastes, especially in networking ...
Whether I'd do it in combination with IGP (assuming the fabric is small enough for that) or go for BGP-only fabric would depend only on what's easiest to do on the gear the customer wants to use.
However, once vendors start promoting overlay IBGP or EBGP (between loopbacks) over underlay EBGP (between directly-connected interfaces) to implement EVPN, it's time to say "if you can't fix your stuff so I'm able to use a simple design like EBGP between directly-connected interfaces for underlay and overlay, then maybe I should move back to IBGP-over-IGP"
Also, the point of this blog post was "don't say stupid things like IGP can't scale in an average data center, because it can" not "don't use BGP"
Can you point to any specific vendor that is "promoting overlay IBGP or EBGP (between loopbacks) over underlay EBGP (between directly-connected interfaces) to implement EVPN" and can't do a "simple design like EBGP between directly-connected interfaces for underlay and overlay"?
Maybe they promote this more complex design simply because they really believe in its value? But no, they should be blamed, because they simply provide you with more options.
On the other hand, I can point to vendors that can't implement these more complex designs yet, and therefore promote their "simple and perfect" solution (and politely do not mention its restrictions).
And here is the math for the 144 switches with breakout cables (how nice is that) mystery: With breakout cables every leaf now has 4 times the uplinks so 16 uplinks per leaf. Each leaf gets connected to 16 spines. Each spine has 128 breakout ports (32 x 4). So 128 leaves x 16 uplinks = 16 spines x 128 downstream ports
But remember, ECMP isn't the same as load balancing. ECMP is just some sort of load distribution (based on a hash function). As my namesake rightly said, you have to program the ASICs in your switches with forwarding information to do proper load balancing. Otherwise your fabric doesn't perform well and your oversubscription gets even worse.
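To illustrate the difference between hash-based distribution and load balancing, here is a toy sketch. It assumes four uplinks, uses CRC32 as a stand-in for whatever hash a real switch uses, and invents a traffic mix of a few large flows among many small ones; none of it models a specific ASIC:

```python
import random
import zlib
from collections import Counter

UPLINKS = 4  # four uplinks per leaf, as in the fabric discussed in the post


def pick_uplink(flow: tuple) -> int:
    """Map a flow's 5-tuple to an uplink (CRC32 is only a stand-in for a hardware hash)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % UPLINKS


rng = random.Random(42)  # fixed seed so the run is repeatable
flows = []
for i in range(1000):
    flow = (f"10.0.0.{rng.randint(1, 254)}", f"10.0.1.{rng.randint(1, 254)}",
            rng.randint(1024, 65535), 443, "tcp")
    # Hypothetical traffic mix: five elephant flows, the rest are mice
    size = 10_000_000 if i < 5 else 10_000
    flows.append((flow, size))

flows_per_link = Counter()
bytes_per_link = Counter()
for flow, size in flows:
    link = pick_uplink(flow)
    flows_per_link[link] += 1
    bytes_per_link[link] += size

for link in range(UPLINKS):
    print(f"uplink {link}: {flows_per_link[link]} flows, {bytes_per_link[link]} bytes")
```

The hash knows nothing about flow sizes, so the number of flows per uplink tends to even out while the byte counts depend entirely on where the few large flows happen to land.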