Odd Number of Spines in Leaf-and-Spine Fabrics
In the market overview section of the introductory part of the data center fabric architectures webinar, I recommended using a larger number of fixed-configuration spine switches instead of two chassis-based spines when building a medium-sized leaf-and-spine fabric, and explained the reasoning behind that recommendation: increased availability and reduced impact of a spine failure.
One of the attendees wondered about the “right” number of spine switches – does it have to be four, or could you have three or five spines? In his words:
Assuming that one can sufficiently cover the throughput/oversubscription plus resiliency/blast-radius requirements, is it fine to use an odd number of spines or would it be better to stick to an even number?
Equal-cost multipathing (ECMP) is usually implemented with a fixed number of output buckets with multiple buckets mapped to the same next hop (I wrote about Cisco CEF implementation in 2006). Some fields that are assumed to be a good representation of flow entropy are then extracted from each incoming packet, a hash value is computed from those fields, and that value selects the output bucket (and the next hop).
Some early ECMP implementations used just the destination IP address. A bit later, source IP addresses were added to the mix to spread traffic toward the same host across multiple links. Today most implementations use the full 5-tuple (source/destination addresses and ports plus protocol ID), and some vendors allow you to mix in additional fields like the IPv6 flow label to support ideas like FlowBender.
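Here's a toy model of the bucket-based mechanism described above. Everything in it is an illustrative simplification: real ASICs use CRC-style hardware hash functions (not MD5), and the field names and round-robin bucket striping are assumptions made for readability.

```python
import hashlib

def select_next_hop(pkt, next_hops, num_buckets=256):
    """Toy model of hash-based ECMP: hash the 5-tuple into one of
    num_buckets output buckets, each bucket mapped to a next hop."""
    # Extract the fields assumed to represent flow entropy
    five_tuple = (pkt["src_ip"], pkt["dst_ip"],
                  pkt["src_port"], pkt["dst_port"], pkt["proto"])
    # Compute a hash over those fields (real hardware uses CRC/XOR
    # folding; MD5 is used here only because it's readily available)
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    # Buckets are striped across next hops round-robin, so multiple
    # buckets map to the same next hop
    return next_hops[bucket % len(next_hops)]

spines = ["spine1", "spine2", "spine3"]
pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.9",
       "src_port": 49152, "dst_port": 443, "proto": 6}
print(select_next_hop(pkt, spines))
```

Because the hash is computed over the 5-tuple only, every packet of a given flow lands in the same bucket and takes the same path, which is the whole point: load is spread across next hops without reordering packets within a flow.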
If the software uses (approximately) the same number of buckets for each next hop, you get equal-cost multipathing; if it allocates more buckets to one next hop, you get unequal-cost multipathing (like what we had with EIGRP variance or MPLS-TE across multiple tunnels).
Some ECMP implementations can use any number of buckets… but that complicates the bucket selection (an arbitrary modulo operation instead of simply masking a few low-order bits of the hash value), so it’s common to see hardware implementations use powers of 2.
In the good old days when you could do ECMP across 8 or 16 output buckets it made sense to make the number of spines a power of two.
Today’s silicon has between 256 and 1024 ECMP buckets, so it shouldn’t matter anymore. Do all vendors implement that capability correctly? I have no idea… If you know more, please write a comment – I always love to hear about juicy real-life details like IPv6 support in some EVPN implementations.
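A quick back-of-the-envelope check of that claim, assuming buckets are striped round-robin across next hops:

```python
def bucket_counts(num_buckets, num_next_hops):
    """How many buckets each next hop gets when buckets are
    assigned to next hops in round-robin fashion."""
    counts = [0] * num_next_hops
    for b in range(num_buckets):
        counts[b % num_next_hops] += 1
    return counts

# 8 buckets, 3 spines: one spine gets 3/8 (37.5%) of the flows
# instead of the ideal 33.3%
print(bucket_counts(8, 3))

# 256 buckets, 3 spines: 86/256 (33.6%) vs 85/256 (33.2%) --
# the imbalance shrinks to noise
print(bucket_counts(256, 3))
```

With 8 buckets an odd spine count produces a measurable imbalance; with 256 or more buckets the leftover buckets are a rounding error, which is why the power-of-2 bucket count shouldn’t matter on modern silicon.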
Remember also that if an uneven distribution like that is bad for you, then you are in trouble if you have four spines and one of them fails...
Also, a good implementation of ECMP will mix a local identifier (something that uniquely identifies the switch/router itself, e.g. its own MAC address or serial number) into the hash. Otherwise, an identical switch downstream would compute the same hash for every packet it receives and utilize only one of *its* ECMP outputs. I believe at least the Broadcom Trident II and Tomahawk chips do this. Each leaf switch will then get a different distribution across its odd number of uplinks, and the total load on the spines might even out.
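A minimal sketch of that idea, mixing a hypothetical per-switch seed into the same toy hash (the seed strings are made up for illustration; real ASICs typically expose this as a configurable hash seed):

```python
import hashlib

def hash_bucket(five_tuple, switch_id, num_buckets=256):
    """Mix a per-switch identifier into the flow hash so two tiers of
    identical switches don't make identical bucket choices for the
    same flow (the so-called hash polarization problem)."""
    data = repr((switch_id,) + five_tuple).encode()
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

flow = ("10.0.0.1", "10.0.1.9", 49152, 443, 6)
# The same flow (almost certainly) lands in different buckets on
# different switches, so a downstream switch still spreads the
# traffic it receives across all of its own uplinks.
print(hash_bucket(flow, "leaf1-serial"))
print(hash_bucket(flow, "spine1-serial"))
```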
I find that the number of spines is more a function of the physical environment: the number of available leaf uplink ports, power/cable diversity, red/blue color coding, familiarity for the operations folks, etc.
Everything I'm designing right now that doesn't need crazy amounts of bandwidth revolves around triplets. Why?
- smallest # of boxes that will get you true N+1 redundancy (not 2N). I find 3x to be the smallest you can get & still call it "leaf/spine fabric"
- law of diminishing returns (50/33/25/20/etc); moving from 50% to 33% gives the most incremental gain, by a factor of >2
- there are many more ToRs out there with 6 uplink ports than with 8, & they're cheaper.
- getting budget for the 3rd box is hard enough; the 4th one is truly a luxury in many orgs
- a triangle is the smallest redundant full mesh, with the simplest rules: "connect this one to the other two". None of the links are optional if you want any redundancy whatsoever.
Anyone care to add others?
We're working on a new DC fabric in the near future, and while two spines would serve us well for bandwidth and resiliency, I'm contemplating forcing a minimum of three spines just so no vendor or engineer will waste our time calling an MC-LAG pair of core switches a leaf-and-spine fabric.