What Happens When a Data Center Fabric Switch Fails?
I got into an interesting discussion with a fellow networking engineer trying to understand the impact of a switch failure in an L2/L3 data center fabric (anything from Avaya’s fabric or Brocade’s VCS Fabric to Cisco’s FabricPath, ACI or Juniper’s QFabric) on MAC and ARP tables.
Here’s my take on the problem – have I missed anything?
2015-10-01: Updated a bit based on offline discussion with Roger Lapuh (Avaya).
Core switch failure
Assuming we believe the promises made by vendor marketing (and product documentation), the impact should be minimal. All the fabrics I mentioned use some sort of overlay encapsulation (PBB, TRILL, VXLAN or MPLS) and map customer MAC addresses to edge switch IDs.
The mapping of customer MAC addresses to edge switch IDs is not affected by a core switch failure. After the fabric routing protocol (IS-IS in most cases) finds alternate paths across the fabric core, the traffic should continue to flow.
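To make the argument above concrete, here is a minimal Python sketch of a toy fabric. Everything in it is an assumption made for illustration: the four-switch topology, the switch names, the documentation-range MAC addresses, and the use of a breadth-first search as a stand-in for the SPF run of a real fabric routing protocol. It only shows that losing a core switch changes the path between edge switches, not the MAC-to-edge-switch mapping.

```python
from collections import deque

# Hypothetical toy fabric: two edge (leaf) and two core (spine) switches.
links = {
    "edge1": {"core1", "core2"},
    "edge2": {"core1", "core2"},
    "core1": {"edge1", "edge2"},
    "core2": {"edge1", "edge2"},
}

# Customer MAC address -> edge switch ID (illustrative addresses only).
mac_to_edge = {
    "00:00:5e:00:53:01": "edge1",
    "00:00:5e:00:53:02": "edge2",
}


def shortest_path(topology, src, dst):
    """Breadth-first search, standing in for the fabric's SPF calculation."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # Sorted iteration keeps the result deterministic.
        for neighbor in sorted(topology.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # destination unreachable


def fail_switch(topology, failed):
    """Return a copy of the topology without the failed switch and its links."""
    return {
        node: {n for n in neighbors if n != failed}
        for node, neighbors in topology.items()
        if node != failed
    }


dst_mac = "00:00:5e:00:53:02"
mapping_before = dict(mac_to_edge)

path_before = shortest_path(links, "edge1", mac_to_edge[dst_mac])
path_after = shortest_path(fail_switch(links, "core1"), "edge1", mac_to_edge[dst_mac])

print(path_before)                     # path through core1
print(path_after)                      # path through core2
print(mac_to_edge == mapping_before)   # mapping was never touched
```

The edge switch only has to know which edge switch ID a destination MAC address sits behind; how to reach that edge switch is the routing protocol's problem, which is why the two tables can change independently.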
Edge switch failure
An edge switch failure will result in a link failure on all attached servers. Link failure will have zero impact on servers that were using the link as a standby link in active/standby link bonding configurations, and significant impact on servers that were actually using the failed link.
We have to assume servers (or hypervisor hosts) dual-homed to two ToR switches without a port channel to make the challenge even remotely interesting.
The bare-metal servers will activate the standby link and send a gratuitous ARP reply for their own IP address over that link. That packet will trigger MAC address learning (in case the standby link was totally silent) and populate the ARP cache of the first-hop L3 switch. Maximum impact: a few dozen ARP replies flooded across the network.
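For readers who want to see what such a packet looks like on the wire, the following Python sketch builds the bytes of a gratuitous ARP reply as an Ethernet broadcast frame. It is an illustration under stated assumptions, not what any particular operating system emits: the MAC and IP addresses come from documentation ranges, and the target hardware address is set to broadcast here although implementations differ on that field.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def gratuitous_arp_reply(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP reply for (mac, ip)."""
    broadcast = b"\xff" * 6
    sender_mac = mac_to_bytes(mac)
    sender_ip = socket.inet_aton(ip)

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)

    # Sender and target IP are identical: the host announces its own address.
    # Target hardware address is an assumption (broadcast); stacks vary here.
    arp += sender_mac + sender_ip + broadcast + sender_ip
    return ethernet + arp


frame = gratuitous_arp_reply("00:00:5e:00:53:01", "192.0.2.10")
print(len(frame), frame.hex())
```

The source MAC address in the Ethernet header is what drives MAC learning on the switches, while the sender MAC/IP pair inside the ARP payload is what refreshes the ARP cache of the first-hop L3 switch.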
Hypervisor hosts will either activate the standby link or reassign the VMs using the now-failed link to an alternate link. In both cases, the hypervisor hosts generate RARP packets (vSphere) or gratuitous ARP replies (other hypervisors), populating the MAC tables on all switches across the fabric and (in the non-vSphere case) the ARP caches.
Hosts connected to the ToR switches with a port channel won’t even experience a failure of the logical link, as one member of the port channel remains active. Traffic forwarding across the fabric will be adjusted based on SPF calculations done by the fabric routing protocol, see below.
In the absolute worst-case scenario, the fabric is using simple anycast IP forwarding, and the ARP cache on the standby switch is totally empty because no server or VM was sending traffic to it. That switch will have to generate loads of ARP requests in a very short time to populate its ARP cache… not just for the adjacent IP hosts (servers or VMs), but also for other IP hosts communicating with the adjacent ones.
Fabrics that use host routing and perform a full L3 lookup at both the ingress and egress nodes (like Cisco’s DFA or ACI) fare much better – the standby switch doesn’t need the ARP entries for non-adjacent IP hosts.
Cleanup on Edge Switch Failure
All edge switches in the network have to change their MAC-to-switch mappings after an edge switch failure.
In the non-LAG scenario, the gratuitous ARPs (or RARPs) trigger dynamic MAC learning on all fabric edge switches – no big deal assuming the switches manage to deal with a few hundred MAC changes in a short timeframe.
When servers use LAG to connect to the switches, they don’t react to a physical link failure (the LAG is still operational) and thus don’t send broadcast packets that would trigger dynamic MAC learning. The fabric edge switches must rely on the fabric routing protocol and purge their MAC tables after it reports the loss of a switch ID.
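The two cleanup paths described above can be contrasted in a short Python sketch. The `FabricMacTable` class, its method names, the switch IDs and the MAC addresses are all hypothetical, and no vendor implements its MAC table this way; the point is only the difference between an entry being overwritten by data-plane learning and entries being purged on a control-plane event. What happens to traffic for purged addresses afterwards depends on the fabric implementation and is not modeled here.

```python
class FabricMacTable:
    """Toy MAC table of a fabric edge switch: customer MAC -> edge switch ID."""

    def __init__(self):
        self.entries = {}

    def learn(self, mac, edge_switch_id):
        """Data-plane learning, e.g. triggered by a gratuitous ARP or RARP.

        A MAC address that moved to another edge switch simply overwrites
        the old mapping.
        """
        self.entries[mac] = edge_switch_id

    def purge_edge_switch(self, failed_switch_id):
        """Control-plane cleanup after the routing protocol loses a switch ID.

        Removes every MAC address mapped to the failed edge switch and
        returns the list of purged addresses.
        """
        stale = [mac for mac, sw in self.entries.items() if sw == failed_switch_id]
        for mac in stale:
            del self.entries[mac]
        return stale


table = FabricMacTable()
table.learn("00:00:5e:00:53:01", "edge1")  # non-LAG server, active link on edge1
table.learn("00:00:5e:00:53:02", "edge1")  # LAG-attached server, learned via edge1
table.learn("00:00:5e:00:53:03", "edge2")  # unaffected server on edge2

# edge1 fails. The non-LAG server activates its standby link and announces itself:
table.learn("00:00:5e:00:53:01", "edge2")

# The LAG-attached server stays silent, so the stale entry has to be purged
# when the routing protocol reports that edge1 is gone:
purged = table.purge_edge_switch("edge1")

print(purged)
print(table.entries)
```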
Want to know more?
Explore the Data Center Fabric Architectures and Leaf-and-Spine Fabrics webinars to learn more about data center fabrics, or vSphere 6 Networking Deep Dive if you need to know how vSphere handles link failures. I also covered redundant server-to-network connectivity in one of the ExpertExpress case studies.
Had it happen last week in a FabricPath environment.
Switches actually stopped learning new MACs - even though nothing new was being plugged in. BFD counters went through the roof, loop got generated ...