Odd Number of Spines in Leaf-and-Spine Fabrics

In the market overview section of the introductory part of the Data Center Fabric Architectures webinar, I recommended using a larger number of fixed-configuration spine switches instead of two chassis-based spines when building a medium-sized leaf-and-spine fabric, and explained the reasoning behind that recommendation (increased availability, reduced impact of a spine failure).

One of the attendees wondered about the “right” number of spine switches – does it have to be four, or could you have three or five spines? In his words:

Assuming that one can sufficiently cover the throughput/oversubscription plus resiliency/blast-radius requirements, is it fine to use an odd number of spines or would it be better to stick to an even number?

Equal-cost multipathing (ECMP) is usually implemented with a fixed number of output buckets, with multiple buckets mapped to the same next hop (I wrote about the Cisco CEF implementation in 2006). Fields that are assumed to be a good representation of flow entropy are then extracted from each incoming packet, a hash value is computed from those fields, and that value selects the output bucket (and thus the next hop).
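Here's a minimal Python sketch of that mechanism, with made-up spine names and bucket-table size; real ASICs do all of this in hardware with vendor-specific hash functions:

```python
# Hypothetical bucket-based multipathing: a fixed-size bucket table is filled
# with next hops, and the per-packet hash value selects a bucket.

NUM_BUCKETS = 16                    # fixed by the (imaginary) hardware
next_hops = ["spine-1", "spine-2", "spine-3", "spine-4"]

# Equal-cost case: every next hop gets (approximately) the same number of buckets
bucket_table = [next_hops[i % len(next_hops)] for i in range(NUM_BUCKETS)]

def select_next_hop(flow_hash: int) -> str:
    """Map a precomputed per-packet hash value onto a bucket (and thus a next hop)."""
    return bucket_table[flow_hash % NUM_BUCKETS]

print(select_next_hop(0x1234abcd))  # -> 'spine-2'
```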

Early ECMP implementations used just the destination IP address; a bit later, the source IP address was added to the mix to spread traffic toward the same host across multiple links. Today, most implementations use the full 5-tuple (source/destination addresses and ports plus protocol ID), and some vendors allow you to add extra fields like the IPv6 flow label to support ideas like FlowBender.
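Continuing the sketch, computing the per-packet hash from the 5-tuple could look something like this; the CRC32 hash and the sample flow are just placeholders for whatever a real forwarding pipeline uses:

```python
import zlib

NUM_BUCKETS = 16    # same hypothetical bucket table size as in the previous sketch

def flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int, protocol: int) -> int:
    """Illustrative 5-tuple hash; real ASICs use their own (usually undisclosed) functions and seeds."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key)

# Every packet of this (made-up) TCP session produces the same hash value,
# lands in the same bucket, and therefore takes the same spine.
print(flow_hash("10.1.1.10", "10.2.2.20", 49152, 443, 6) % NUM_BUCKETS)
```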

If the software uses (approximately) the same number of buckets for each next hop, you get ECMP; if it allocates more buckets to some next hops, you get unequal-cost multipathing (like what we had with EIGRP variance or MPLS-TE over multiple tunnels).
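Sticking with the same toy model, unequal-cost multipathing just means filling the bucket table proportionally to (made-up) next-hop weights; rounding leftovers are ignored here for simplicity:

```python
# Unequal-cost multipathing: allocate buckets proportionally to next-hop weights
weights = {"spine-1": 2, "spine-2": 1, "spine-3": 1}    # hypothetical 2:1:1 traffic split
NUM_BUCKETS = 16

ucmp_table = []
for next_hop, weight in weights.items():
    ucmp_table += [next_hop] * (NUM_BUCKETS * weight // sum(weights.values()))

# spine-1 now owns 8 buckets, spine-2 and spine-3 own 4 each
```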

Some ECMP implementations can use any number of buckets… but that complicates the hash function, so it’s common to see hardware implementations use powers of 2.

In the good old days, when you could do ECMP across just 8 or 16 output buckets, it made sense to make the number of spines a power of two.
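A quick back-of-the-envelope comparison (reusing the round-robin fill from the sketches above, with bucket-table depths picked purely for illustration) shows how much the table depth matters when the number of spines doesn't divide it evenly:

```python
from collections import Counter

def bucket_share(num_buckets: int, num_spines: int) -> Counter:
    """Count how many buckets each spine owns with a simple round-robin fill."""
    return Counter(f"spine-{i % num_spines + 1}" for i in range(num_buckets))

print(bucket_share(8, 3))    # Counter({'spine-1': 3, 'spine-2': 3, 'spine-3': 2})
                             # -> 37.5% / 37.5% / 25%: a noticeable skew
print(bucket_share(512, 3))  # Counter({'spine-1': 171, 'spine-2': 171, 'spine-3': 170})
                             # -> 33.4% / 33.4% / 33.2%: the leftover buckets are noise
```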

Today’s silicon has between 256 and 1024 ECMP buckets, so whether you have an odd or even number of spines shouldn’t matter anymore. Do all vendors implement that capability correctly? I have no idea… If you know more, please write a comment – I always love to hear about juicy real-life details like IPv6 support in some EVPN implementations.

7 comments:

  1. Ivan, I retired a couple of years ago, so I no longer get NDA briefings on ASICs. The arms race between the ASIC providers on hash algorithms (i.e., ECMP) is at a much higher level of complexity than indicated above; the hyperscalers have probably "helped" with even more requirements on ECMP hashing since I retired, and all of this is under NDAs so deep that no one is likely to answer the question central to this blog post.
  2. There's of course also the question of whether it *matters* that the distribution is uneven. If you happen to get a 40%-40%-20% load toward your spines from a certain leaf instead of the ideal 33%-34%-33%, you will still be better off than if you have two spines getting 50%-50%.

    Remember also that if an uneven distribution like that is bad for you, then you are in trouble if you have four spines and one of them fails...

    Also, a good implementation of ECMP will mix in a local identifier, something that uniquely identifies the switch/router itself (e.g. its own MAC address or serial number). Otherwise, an identical switch downstream would compute the same hash values for the packets it receives and only utilize one of *its* ECMP outputs. I believe at least the Broadcom Trident II and Tomahawk chips do this. Each leaf switch will then get a different distribution across its odd number of uplinks, and the total load on the spines might even out.
  3. Today's ECMP hash algorithms and bucket-to-link mappings are non-uniform, even for an even (binary) number of links. E.g., given two spines and a perfectly hashable mix of packets, you might find a 52%/48% distribution. This is all great, and no one cares about hash polarization anymore. On the flip side, fragmented packets are handled worse today than ever before -- something that hurts poorly-deployed overlays and crypto.

    I find that the number of spines is more a function of the physical environment, e.g., the number of available leaf uplink ports, power/cable diversity, red/blue color coding, familiarity for the operations folks, etc.
  4. Thanks for posting this, Ivan.

    Everything I'm designing right now that doesn't need crazy amounts of bandwidth revolves around triplets. Why?

    - smallest # of boxes that will get you true N+1 redundancy (not 2N). I find 3x to be the smallest you can get & still call it "leaf/spine fabric"
    - law of diminishing returns (50/33/25/20/etc); moving from 50% to 33% gives the most incremental gain, by a factor of >2
    - there are many more ToRs out there with 6 uplink ports than with 8, & they're cheaper.
    - getting budget for the 3rd box is hard enough; the 4th one is truly a luxury in many orgs
    - a triangle is the smallest redundant full mesh, with the simplest rules: "connect this one to the other two". None of the links are optional if you want any redundancy whatsoever.

    Anyone care to add others?
  5. Somewhat tangential...

    We'll be working on a new DC fabric in the near future, and while two spines would serve us well for bandwidth and resiliency, I'm contemplating forcing a minimum of three spines just so no vendor/engineer will waste our time calling MC-LAG between two core switches a leaf-and-spine fabric.
  6. For IPv6 flow label, some vendors seem to have a kind of annoying ~bug: https://www.youtube.com/watch?v=b0CRjOpnT7w
    Replies
    1. It would be great to know which Broadcom hardware supports hashing based on the flow label, whether it's enabled by default, and if not, which vendor enabled it by default (I know Arista EOS started using the flow label in the ECMP hash a while ago, but I always understood it had to be configured).