Leaf-and-Spine Fabrics Between Theory and Reality
I’m always envious of how easy networking challenges seem when you’re solving them in PowerPoint; for example, here’s an innovation specialist explaining in a LinkedIn comment how scalability works in leaf-and-spine fabrics:
One of the main benefits of a CLOS folded spine topology is the scale out spine where you can scale out the number of spine nodes increasing your leaf-spine n-way ECMP as well as minimizing the blast radius with the more spine nodes the more redundancy and resiliency.
Isn’t that wonderful? If you need more bandwidth, sprinkle the magic spine powder on your fabric, add water, and voila! Problem solved. Also, it looks like adding spine switches reduces the blast radius. Who would have known?
In reality:
- It doesn’t matter whether you have two or sixteen spines – the blast radius is the same. It is true that you’re pretty low on redundancy if you have just two spines and one of them explodes, so make that three.
- Adding a spine switch often results in rewiring the physical fabric; the only exception would be going from three to four spines when you’re using leaf switches with four uplinks (and similarly for switches with eight uplinks). Next step: an exciting configuration exercise unless you’ve decided to use unnumbered leaf-to-spine links when deploying the fabric.
- The number of uplink ports on the leaf switches limits the maximum number of spine switches. Most leaf switches used to have four uplinks; these days, many switches come with six or eight uplinks, making it easier to build fabrics with more spines and thus a lower oversubscription ratio (see the back-of-the-envelope sketch after this list). The maximum fabric size is still limited by the number of ports on the spine switches, though.
- Obviously, you could also buy switches with high-speed ports (example: 100GE) and use some of those as four lower-speed ports (example: 25GE) with breakout cables. That makes your design completely flexible regarding the number of uplinks and the oversubscription ratio, but the breakout cables could get messy, although not as messy as the next option.
- You could build much larger fabrics if you split leaf switch uplinks into individual lanes (100GE ports into four 25GE lanes), but you don’t want to know how messy the cabling gets with the octopus cables or complex behind-the-scenes wiring between patch panels.
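If you want to play with these numbers, here’s a minimal back-of-the-envelope sketch in Python. The function name and the example port counts are mine and purely illustrative (not tied to any particular switch model); it just turns leaf uplink/downlink counts and spine port counts into the maximum fabric size and the oversubscription ratio.

```python
# Back-of-the-envelope sizing for a two-tier leaf-and-spine fabric.
# All port counts and speeds are illustrative, not tied to any switch model.

def fabric_size(leaf_downlinks, downlink_gbps, leaf_uplinks, uplink_gbps, spine_ports):
    max_spines = leaf_uplinks          # one uplink from every leaf to every spine
    max_leaves = spine_ports           # every leaf needs a port on every spine
    server_ports = max_leaves * leaf_downlinks
    oversubscription = (leaf_downlinks * downlink_gbps) / (leaf_uplinks * uplink_gbps)
    return {
        "max_spines": max_spines,
        "max_leaves": max_leaves,
        "server_ports": server_ports,
        "oversubscription": round(oversubscription, 2),
    }

# Example: 48 x 25GE downlinks, 6 x 100GE uplinks, 32-port spine switches
print(fabric_size(48, 25, 6, 100, 32))
# {'max_spines': 6, 'max_leaves': 32, 'server_ports': 1536, 'oversubscription': 2.0}
```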
Another dose of reality: most of the above doesn’t matter. It’s easy to get a spine switch with 32 100GE or 400GE ports; some vendors are shipping spine switches with 64 ports. Sixty-four leaf switches connected to those ports give you over 3000 server-facing ports – probably good enough for 95% of the data centers out there.
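For the record, here’s the arithmetic behind the “over 3000 server-facing ports” claim, assuming a typical 48-port leaf switch (my assumption; the exact number obviously depends on the leaf model you pick).

```python
# The numbers behind the "over 3000 server-facing ports" claim
spine_ports = 64              # ports per spine switch
server_ports_per_leaf = 48    # typical 1RU leaf switch (my assumption)

max_leaves = spine_ports      # every leaf needs one port on every spine
print(max_leaves * server_ports_per_leaf)   # 3072
```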
Considering all that, what should we do with generic opinions like the one above? Charity Majors answered this thorny question in a recent tweet:
I can opine all I want on your architecture or ours, but if I’m not carrying a pager for you, you should probably just smile politely and move along. People with skin in the game are the people you should listen to.
And also:
The antipattern I see in so many places with devs and architects is the same fucking problem they have with devs and ops. “No time to be on call, too busy writing important software” ~turns into~ “No time to write code, too busy telling other people how to write code.”
FWIW, you should read the whole thread (assuming Twitter still works when you’re reading this) and the resulting blog post, and continue with Martin Fowler’s take on Who Needs an Architect.
What Happened to Switches with Four Uplinks?
The original version of this blog post (see revision history below) talked about leaf switches with four uplinks. I quickly got corrected – many modern leaf switches have six or even eight uplinks. What happened? We’ll explore that in the next blog post in this series.
Revision History
- 2023-03-14
- Sander Steffann pointed out that there are more switches with six or even eight uplinks than I expected. Also added the ’local breakout cables’ option.
- 2023-03-15
- Another dose of reality: Erik Auerswald pointed out that many switches using Trident3 or Trident4 ASICs have eight uplinks. More details in a follow-up blog post.
- 2023-03-16
- The number of uplinks a switch has doesn’t matter (apart from the oversubscription ratio). The maximum fabric size is still limited by the number of ports on the spine switches.
I think it's unfair to say "Most leaf switches have four uplinks (some Cisco switches have six)" when Juniper has, for example, the QFX5120 and QFX5200 switches and Arista has the 7280SR3, all of which have 6 or 8 uplink ports.
Cisco is not unique enough to deserve special mention here ;)
Thank you! Updated.
Although I disagree with mentioning QFX5120 in the same sentence -- it's just that they cannot use breakout cables on more than 24 100GE links ;) I also can't figure out how you get from 64 ports to 8 uplinks + 24 ports with breakout cables, but that's just me ;)
Anyway, I added an extra bullet describing the breakout cables option. Thanks a million for pointing me in that direction!
Ivan
All the vendors using Broadcom ASICs have "leaf" switches with 6 or 8 100G uplink ports and 48 x 10G or 25G downlink ports.
Have to go through the datasheets (one of the most boring things I've ever done in my life) of my three "favorite" switch vendors to get a complete picture, but a quick sampling indicates you're right: many switches using Trident3 or Trident4 have eight uplinks. Looks like the ASICs are too fast and the vendors don't use all the lanes.
Example: Trident3 has 128 25GE lanes. Arista 7050SX3-48YC8 has 48 x 25GE and 8 x 100GE ports for a total of 80 25GE lanes (leaving 48 lanes unused) and a 1.5:1 oversubscription factor. We truly live in crazy times ;)
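Here's the same math in a few lines of Python, in case anyone wants to check it (assuming 25G SerDes lanes, with every 100GE port consuming four of them):

```python
# Quick check of the lane math above (assuming 25G SerDes lanes,
# with every 100GE port consuming four of them)
asic_lanes = 128                     # Trident3: 3.2 Tbps / 25 Gbps per lane

downlinks, downlink_gbps = 48, 25    # 48 x 25GE server-facing ports, one lane each
uplinks, uplink_gbps = 8, 100        # 8 x 100GE uplinks, four lanes each

lanes_used = downlinks + uplinks * 4
oversubscription = (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

print(lanes_used, asic_lanes - lanes_used)   # 80 lanes used, 48 left idle
print(oversubscription)                      # 1.5 -> the 1.5:1 factor
```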
Nokia 7220-IXR-D2(L) is Trident3 based and offers 48x25GE plus 8x100GE ports (similar to the Arista box you mention).
Nokia 7220-IXR-D3(L) is also Trident3 based and offers 32x100GE ports, fully utilizing all 128 lanes. The oversubscription factor depends on the fabric design: you could do 1:1 by using 16 ports for servers and the rest as uplinks (for example), or go up to 30:2 (15:1) by using only 2 uplinks.
So it's not really the case that "the ASICs are too fast", the same ASIC is used in different designs offering different design trade-offs (and price points)
I mean... unless this post is targeted at managers or people who have zero network knowledge, am I wrong to say that you just "discovered the hot water," as we say in Italy? (I.e., you are simply stating the obvious.)
Of course you're absolutely right -- if you've built (or even better, upgraded) at least one leaf-and-spine fabric, then nothing I wrote in this blog post sounds new or exciting... but that's true for everything I do ;)
Not everyone is at that level, or there wouldn't be Rule 4 in RFC 1925, and common sense seems to be unevenly distributed ;)
All the best, Ivan
A non-snarky version of this is that you should design your network in its ultimate fully-upgraded form (e.g., 32 racks with 8 spines) and then build only the parts you currently need. This also includes leaving rack space and power for all the spines, etc. You can sleep easy knowing that your expansion is already designed and will only require adding equipment, with no recabling.
Also, if you have so little networking experience that you're looking for information on LinkedIn, you should probably just let Apstra design the network for you.
While I like the "prepare rack space and power for the ultimate fully-upgraded form" approach, whenever you have to expand from 2N spines to 2(N+1) spines (or something in between), there will be heavy rewiring (but not recabling if you were smart) unless you're OK with starting with higher oversubscription.
And yeah, I agree with your take on LinkedIn as the source of networking knowledge, but I find some comments too good to ignore ;)
IMO the only reason to add spines is oversubscription, so you should not need to rewire anything. You should start with enough spines to be "resilient enough" (the exact number depends on your risk tolerance), and thus adding more spines only adds bandwidth.