Leaf-and-Spine Fabric Myths (Part 1)
Apart from the “they have no clue what they’re talking about” observation, Evil CCIE left a comment on one of my blog posts with a long list of leaf-and-spine fabric myths he encountered in the wild. He started with:
Clos fabric (aka Leaf And Spine fabric) is a non-blocking fabric
That was obviously true in the days when Mr. Clos designed the voice switching solution that still bears his name. In the original Clos network every voice call would get a dedicated path across the fabric, and the number of voice calls supported by the fabric equaled the number of alternate end-to-end paths.
In packet switching networks we have (at best) statistically non-blocking behavior – as long as no output port is congested and the ECMP algorithm running on the ingress switch does a perfect traffic distribution. Fat chance… for more details read at least the TL&DR version of the CONGA article (HT: Boris Hassanov).
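To see why ECMP placement is only statistically balanced, here is a toy simulation (an illustrative sketch, not any vendor’s actual hash algorithm – the flow sizes and CRC32 stand-in hash are made-up assumptions): flows are pinned to uplinks by a hash of a flow identifier, so a few large (“elephant”) flows can easily land on the same uplink and congest it while other uplinks sit half-idle.

```python
import zlib

UPLINKS = 4

def pick_uplink(flow_id: str) -> int:
    # Stand-in for a hardware hash over the packet's 5-tuple
    return zlib.crc32(flow_id.encode()) % UPLINKS

# 100 small ("mouse") flows of 1 unit each, plus 4 "elephant" flows
# of 100 units each -- sizes are purely illustrative.
flows = [(f"mouse-{i}", 1) for i in range(100)] + \
        [(f"elephant-{i}", 100) for i in range(4)]

# Each flow is pinned to exactly one uplink; per-flow sizes are
# never considered by the hash, so load can end up very skewed.
load = [0] * UPLINKS
for flow_id, size in flows:
    load[pick_uplink(flow_id)] += size

print(load)  # per-uplink load; rarely anywhere near an even split
```

Per-packet hashing would balance better but reorder packets within a flow, which is why real switches hash per flow and live with the imbalance.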
What we do have today are non-blocking switches… but even that means nothing more than that the internal switching bandwidth equals the sum of the external-facing bandwidth across all ports. As soon as an output port is congested, the switch is no longer non-blocking.
But wait, there are the details that silicon vendors don’t want you to know (and thus they only show you their hardware documentation after you sign an NDA in blood):
- Most switching silicon has 40GE or 100GE connections that are then split out into 10/25/50 GE front-panel ports. It seems that at least some chipsets have head-of-line blocking challenges across 25GE lanes of a single 100GE port.
- Internal fabric bandwidth is just one of the parameters. Packet forwarding performance is another one… and not all silicon can do line-rate forwarding of small packets.
- Every single packet has metadata attached to it while traversing the internal (intra-switch) fabric, as JR Rivers explained in the Networks, Buffers and Drops webinar (available with free ipSpace.net subscription). Some chipsets might struggle with the amount of bandwidth needed to transport both packet contents and metadata across the internal fabric.
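A back-of-the-envelope illustration of the last point (the 16-byte metadata size is a made-up assumption – real values are chipset-specific and usually hidden behind an NDA): the smaller the packets, the larger the share of internal fabric bandwidth the metadata eats.

```python
# Fraction of extra internal fabric bandwidth consumed by per-packet
# metadata. The 16-byte default is an illustrative assumption; real
# chipsets differ (and rarely document the value publicly).
def fabric_overhead(packet_bytes: int, metadata_bytes: int = 16) -> float:
    """Return metadata bandwidth as a fraction of packet bandwidth."""
    return metadata_bytes / packet_bytes

for size in (64, 512, 1500):
    print(f"{size}-byte packets: {fabric_overhead(size):.1%} overhead")
# 64-byte packets need 25% extra internal bandwidth under this
# assumption -- one reason small-packet line rate is hard.
```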
Finally, we usually build oversubscribed leaf-and-spine fabrics. The total amount of leaf-to-spine bandwidth is usually one third of the edge bandwidth. Leaf-and-spine fabrics are thus almost never non-blocking, but they do provide equidistant bandwidth.
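For example, a leaf switch with 48 × 25GE server-facing ports and 4 × 100GE uplinks (a common configuration, used here purely as an illustration) gives exactly that 3:1 ratio:

```python
# Oversubscription ratio of a typical (assumed) leaf switch:
# 48 x 25GE edge ports vs. 4 x 100GE spine-facing uplinks.
edge_gbps = 48 * 25     # 1200 Gbps toward the servers
uplink_gbps = 4 * 100   # 400 Gbps toward the spines
ratio = edge_gbps / uplink_gbps
print(f"{ratio:.0f}:1 oversubscription")  # prints "3:1 oversubscription"
```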
Next steps
If you want to know more about leaf-and-spine fabrics (and be able to figure out where exactly the vendor marketers cross the line between unicorn-colored reality and plain bullshit), start with the Leaf-and-Spine Fabric Architectures webinar (part of Standard ipSpace.net subscription).
You can also take one step further and enroll in the Designing and Building Data Center Fabrics online course which includes three design assignments reviewed by a member of ipSpace.net ExpertExpress team.
Finally, when you want to be able to design more than just the data center fabrics, check out the Building Next-Generation Data Center online course.
Thank you, Ivan (and Mr. Evil CCIE), for pointing this out. It has always annoyed the heck out of me that many people blindly assume that all characteristics of a *circuit-switched* Clos network (where individual circuits are carefully groomed) automatically apply to a *packet-switched* network with the same topology (where individual flows are subject to the semi-random placement whims of ECMP hashing).
Another pet peeve is that people simply assume that a Clos fabric of small switches is exactly the same as a "disassembled" chassis switch (leaf switches = line cards and spine switches = fabric cards). Some high-end chassis switches provide lots of extra functionality (e.g. virtual output queues to avoid head-of-line blocking, breaking packets up into cells across the backplane fabric for near-perfect ECMP, etc.)
Just to be clear: I am NOT saying you should go and buy an expensive chassis switch instead of building a Clos fabric out of cheap commodity switches. I am just saying that you should be aware of the difference.
And, since I am in rant mode :-), please folks, it's Clos (after Mr Clos) and not CLOS (it's not an acronym).
a) The "most people build oversubscribed IP fabrics" claim is a bit of a myth itself;
my (non-representative ;-) sample indicated that many build 1:1 fabrics
to avoid the corresponding oversubscription problems
b) An exploded fabric (that's what I often call an IP Clos fabric when talking
about chassis replacement ;-) is _not_ an equivalent, as Bruno says; "you get
what you pay for" strictly applies here. "Proper" chassis fabrics
i) are cell-based (for reasons obvious to anyone who has run packet-based backplanes)
ii) do multicast very efficiently
iii) as stated, offer tons of funky HOL blocking logic, QoS queuing & so on
iv) are completely self-configuring and never try to do zig-zags (by now this should
remind you of a certain routing protocol we pursue ;-)
c) I still don't know whether Clos was Belgian or French ;-) ?