Sharada Yeluri (Senior Director of Engineering at Juniper Networks) wrote a long article describing the connectivity requirements of AI workloads and new approaches to Ethernet fabrics. Definitely worth reading if you’re interested in these topics.
Sebastian described an interesting Cisco ACI quirk they had the privilege of chasing around:
We’ve encountered VM connectivity issues after VM movements from one vPC leaf pair to a different vPC leaf pair with ACI. The issue did not occur immediately (due to ACI’s bounce entries) and only sometimes, which made it very difficult to reproduce synthetically, but due to DRS and a large number of VMs it occurred frequently enough, that it was a serious problem for us.
Here’s what they figured out:
The last time I spent days poring over vendor datasheets collecting information for the overview part of Data Center Fabrics webinar a lot of 1RU data center leaf switches came in two form factors:
- 48 low-speed server-facing ports and 4 high-speed uplinks
- 32 high-speed ports that you could break out into four times as many low-speed ports (but not all of them)
I expected the ratios to stay the same when the industry moved from 10/40 GE to 25/100 GE switches. I was wrong – most 1RU leaf data center switches based on recent Broadcom silicon (Trident-3 or Trident-4) have between eight and twelve uplinks.
A networking engineer attending the Building Next-Generation Data Center online course asked this question:
What is the best practice to connect DC fabric to outside world assuming there are 2 spine switches in the fabric and EVPN VXLAN is used as overlay? Is it a good idea to introduce edge (border) switches, or it is better to connect outside world directly to the spine?
As always, the answer is “it depends,” this time based on:
I’m always envious of how easy networking challenges seem when you’re solving them in PowerPoint, for example, when an innovation specialist explains how scalability works in leaf-and-spine fabrics in a LinkedIn comment:
One of the main benefits of a CLOS folded spine topology is the scale out spine where you can scale out the number of spine nodes increasing your leaf-spine n-way ECMP as well as minimizing the blast radius with the more spine nodes the more redundancy and resiliency.
Isn’t that wonderful? If you need more bandwidth, sprinkle the magic spine powder on your fabric, add water, and voila! Problem solved. Also, it looks like adding spine switches reduces the blast radius. Who would have known?
One of my readers sent me a question along these lines:
How do we determine the number of spines needed in a leaf-and-spine fabric? It’s easy to calculate the number of leaf nodes from the required number of server ports, and two spines give you the redundancy. Does it make sense to have more spines if two are good enough from the capacity perspective?
There are at least two factors to consider:
The simplest way to implement layer-3 forwarding in a network fabric is to offload it to an external device1, be it a WAN edge router, a firewall, a load balancer, or any other network appliance.
Imagine you built a layer-2 fabric with tons of VLANs stretched all over the place. Now the users want to exchange traffic between those VLANs, and the obvious question is: which devices should do layer-2 forwarding (bridging) and which ones should do layer-3 forwarding (routing)?
There are four typical designs you can use to solve that challenge:
- Exchange traffic between VLANs outside of the fabric (edge routing)
- Route on core switches (centralized routing)
- Route on ingress (asymmetric IRB)
- Route on ingress and egress (symmetric IRB)
This blog post is an overview of the design models; we’ll cover each design in a separate blog post.
Are FabricPath, TRILL or SPB still alive, or has everyone moved to VXLAN? Are they worth studying?
TL&DR: Barely. Yes. No.
Layer-2 Fabric craziness exploded in 2010 with vendors playing the usual misinformation games that eventually resulted in totally fragmented market full of partial- or proprietary solutions. At one point in time, some HP data center switches supported only TRILL, and other data center switches from the same company supported only SPB.
Now for individual technologies:
One of my subscribers has to build a small data center fabric that’s just a tad too big for two switch design.
For my datacenter I would need two 48 ports 10GBASE-T switches and two 48 port 10/25G fibber switches. So I was watching the Small Fabrics and Lower-Speed Interfaces part of Physical Fabric Design to make up my mind. There you talk about the possibility to do a leaf and spine with 4 switches and connect servers to the spine.
A picture is worth a thousand words, so here’s the diagram of what I had in mind:
Namex, an Italian IXP, decided to replace their existing peering fabric with a fully automated leaf-and-spine fabric using VXLAN and EVPN running on Cumulus Linux.
They documented the design, deployment process, and automation scripts they developed in an extensive blog post that’s well worth reading. Enjoy ;)
Every now and then I stumble upon an article or a comment explaining how Network Function Virtualization (NFV) introduces new data center fabric buffering requirements. Here’s a recent example:
For Telco/carrier Cloud environments, where NFVs (which are much slower than hardware SGW) get used a lot, latency is higher with a lot of jitter due to the nature of software and the varying link speeds, so DC-level near-zero buffer is not applicable.
It seems to me we’re dealing with another myth. Starting with the basics:
Scott submitted an interesting the comment to my Does Unequal-Cost Multipath (UCMP) Make Sense blog post:
How about even Large CLOS networks with the same interface capacity, but accounting for things to fail; fabric cards, links or nodes in disaggregated units. You can either UCMP or drain large parts of your network to get the most out of ECMP.
Before I managed to write a reply (sometimes it takes months while an idea is simmering somewhere in my subconscious) Jeff Tantsura pointed me to an excellent article by Erico Vanini that describes the types of asymmetries you might encounter in a leaf-and-spine fabric: an ideal starting point for this discussion.
A long-time reader sent me a series of questions about the impact of WAN partitioning in case of an SDN-based network spanning multiple locations after watching the Architectures part of Data Center Fabrics webinar. He therefore focused on the specific case of centralized control plane (read: an equivalent of a stackable switch) with distributed controller cluster (read: switch stack spread across multiple locations).
Arista published a blog post describing the details of forwarding table sizes on 7050QX-series switches. The description includes the base mode (fixed tables), unified forwarding tables and even the IPv6 LPM details, and dives deep into what happens when the switch runs out of forwarding table entries.