data center « ipSpace.net blog

Wednesday, June 8, 2022 06:16 UTC

Select the Best Switching ASIC For the Job

Last week I described some of the data center switching ASIC design tradeoffs and the ASIC families Broadcom created to fit somewhere in that multi-dimensional space.

Next step: how could you design your data center fabric to make the most out of them? To keep things simple, we’ll build a typical leaf-and-spine fabric with a WAN edge layer (sometimes called border leaf switches).

Data Center Switching ASICs Tradeoffs

A brief mention of Broadcom ASIC families in the Networking Hardware/Software Disaggregation in 2022 blog post triggered an interesting discussion of ASIC features and where one should use different ASIC families.

Like so many things in life, ASIC design is all about tradeoffs. Usually you’re faced with a decision to either implement X (whatever X happens to be), or have high-performance product, or have a reasonably-priced product. It’s very hard to get two out of three, and getting all three is beyond Mission Impossible.

What Happened to FabricPath and Its Friends?

Continuing the what happened to old technologies saga, here’s another question by Enrique Vallejo:

Are FabricPath, TRILL or SPB still alive, or has everyone moved to VXLAN? Are they worth studying?

TL&DR: Barely. Yes. No.

Layer-2 Fabric craziness exploded in 2010 with vendors playing the usual misinformation games that eventually resulted in totally fragmented market full of partial- or proprietary solutions. At one point in time, some HP data center switches supported only TRILL, and other data center switches from the same company supported only SPB.

Now for individual technologies:

ICMP Redirects Considered Harmful

One of my readers sent me an intriguing challenge based on the following design:

He has a data center with two core switches (C1 and C2) and two Cisco Nexus edge switches (E1 and E2).
He’s using static default routing from core to edge switches with HSRP on the edge switches.
E1 is the active HSRP gateway connected to the primary WAN link.

The following picture shows the simplified network diagram:

Running BGP between Virtual Machines and Data Center Fabric

Got this question from one of my readers:

When adopting the BGP on the VM model (say, a Kubernetes worker node on top of vSphere or KVM or Openstack), how do you deal with VM migration to another host (same data center, of course) for maintenance purposes? Do you keep peering with the old ToR even after the migration, or do you use some BGP trickery to allow the VM to peer with whatever ToR it’s closest to?

Short answer: you don’t.

Kubernetes was designed in a way that made worker nodes expendable. The Kubernetes cluster (and all properly designed applications) should recover automatically after a worker node restart. From the purely academic perspective, there’s no reason to migrate VMs running Kubernetes.

Building a Small Data Center Fabric with Four Switches

One of my subscribers has to build a small data center fabric that’s just a tad too big for two switch design.

For my datacenter I would need two 48 ports 10GBASE-T switches and two 48 port 10/25G fibber switches. So I was watching the Small Fabrics and Lower-Speed Interfaces part of Physical Fabric Design to make up my mind. There you talk about the possibility to do a leaf and spine with 4 switches and connect servers to the spine.

A picture is worth a thousand words, so here’s the diagram of what I had in mind:

Comparing Forwarding Performance of Data Center Switches

One of my subscribers is trying to decide whether to buy an -EX or an -FX version of a Cisco Nexus data center switch:

I was comparing Cisco Nexus 93180YC-FX and Nexus 93180YC-EX. They have the same port distribution (48x 10/25G + 6x40/100G), 3.6 Tbps switching capacity, but the -FX version has just 1200 Mpps forwarding rate while EX version goes up to 2600 Mpps. What could be the reason for the difference in forwarding performance?

Both switches are single-ASIC switches. They have the same total switching bandwidth, thus it must take longer for the FX switch to forward a packet, resulting in reduced packet-per-seconds figure. It looks like the ASIC in the -FX switch is configured in more complex way: more functionality results in more complexity which results in either reduced performance or higher cost.

Stretched VLANs: What Problem Are You Trying to Solve?

One of ipSpace.net subscribers sent me this interesting question:

I am the network administrator of a small data center network that spans 2 buildings. The main building has a pair of L2/L3 10G core switches. The second building has a stack of access switches connected to the main building with 10G uplinks. This secondary datacenter has got some ESX hosts and NAS for remote backup and some VM for development and testing, but all the Internet connection, firewall and server are in the main building.

There is no routing in the secondary building and most of the VLANs are stretched. Do you think I must change that (bringing routing to the secondary datacenter), or keep it simple like it is now?

As always, it depends, this time on what problem are you trying to solve?

Comparing EVPN with Flood-and-Learn Fabrics

One of ipSpace.net subscribers sent me this question after watching the EVPN Technical Deep Dive webinar:

Do you have a writeup that compares and contrasts the hardware resource utilization when one uses flood-and-learn or BGP EVPN in a leaf-and-spine network?

I don’t… so let’s fix that omission. In this blog post we’ll focus on pure layer-2 forwarding (aka bridging), a follow-up blog post will describe the implications of adding EVPN IP functionality.

Questions about BGP in the Data Center (with a Whiff of SRv6)

Henk Smit left numerous questions in a comment referring to the Rethinking BGP in the Data Center presentation by Russ White:

In Russ White’s presentation, he listed a few requirements to compare BGP, IS-IS and OSPF. Prefix distribution, filtering, TE, tagging, vendor-support, autoconfig and topology visibility. The one thing I was missing was: scalability.

I noticed the same thing. We kept hearing how BGP scales better than link-state protocols (no doubt about that) and how you couldn’t possibly build a large data center fabric with a link-state protocol… and yet this aspect wasn’t even mentioned.

Worth Reading: Switching to IP fabrics

Namex, an Italian IXP, decided to replace their existing peering fabric with a fully automated leaf-and-spine fabric using VXLAN and EVPN running on Cumulus Linux.

They documented the design, deployment process, and automation scripts they developed in an extensive blog post that’s well worth reading. Enjoy ;)

add comment

Sunday, May 30, 2021 08:12 UTC

Worth Reading: Azure Datacenter Switch Failures

Microsoft engineers published an analysis of switch failures in 130 Azure regions (review of the article, The Next Platform summary):

A data center switch has a 2% chance of failing in 3 months (= less than 10% per year);
~60% of the failures are caused by hardware faults or power failures, another 17% are software bugs;
50% of failures lasted less than 6 minutes (obviously crashes or power glitches followed by a reboot).
Switches running SONiC had lower failure rate than switches running vendor NOS on the same hardware. Looks like bloatware results in more bugs, and taking months to fix bugs results in more crashes. Who would have thought…

see 4 comments

Saturday, May 29, 2021 06:34 UTC

Worth Reading: Running BGP in Large-Scale Data Centers

Here’s one of the major differences between Facebook and Google: one of them publishes research papers with helpful and actionable information, the other uses publications as recruitment drive full of we’re so awesome but you have to trust us – we’re not sharing the crucial details.

Recent data point: Facebook published an interesting paper describing their data center BGP design. Absolutely worth reading.

Just in case you haven’t realized: Petr Lapukhov of the RFC 7938 fame moved from Microsoft to Facebook a few years ago. Coincidence? I think not.

see 5 comments

Thursday, May 27, 2021 06:55 UTC

Local TCP Anycast Is Really Hard

Pete Lumbis and Network Ninja mentioned an interesting Unequal-Cost Multipathing (UCMP) data center use case in their comments to my UCMP-related blog posts: anycast servers.

Here’s a typical scenario they mentioned: a bunch of servers, randomly connected to multiple leaf switches, is offering a service on the same IP address (that’s where anycast comes from).

Mythbusting: NFV Data Center Fabric Buffering Requirements

Every now and then I stumble upon an article or a comment explaining how Network Function Virtualization (NFV) introduces new data center fabric buffering requirements. Here’s a recent example:

For Telco/carrier Cloud environments, where NFVs (which are much slower than hardware SGW) get used a lot, latency is higher with a lot of jitter due to the nature of software and the varying link speeds, so DC-level near-zero buffer is not applicable.

It seems to me we’re dealing with another myth. Starting with the basics:

Category: data center