Real-Life Data Center Meltdown
A good friend of mine who prefers to stay A. Nonymous for obvious reasons sent me his “how I lost my data center to a broadcast storm” story. Enjoy!
Small-ish data center with several hundred racks. Each row of racks supported by an end-of-row switch stack. Each stack with two L2 EtherChannels, one EC to each of the two core switches. The inter-switch link details don’t matter other than to highlight “sprawling L2 domains.”
VLAN pruning was used to limit L2 scope, but a few VLANs went everywhere, including the management VLAN.
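To make the design concrete, here’s a minimal Cisco IOS-style sketch of what such an end-of-row uplink might have looked like; the interface names, VLAN numbers, and port-channel IDs are invented for illustration and are not taken from the actual site. Note how the management VLAN (100 in this sketch) rides on every trunk even though the other VLANs are pruned:

    ! Uplink from an end-of-row stack to one core switch (all names/numbers illustrative)
    interface Port-channel1
     description Uplink to core switch 1
     switchport mode trunk
     ! prune the trunk down to the VLANs this row actually needs ...
     switchport trunk allowed vlan 10,20,30
     ! ... but the management VLAN is allowed on every trunk in the data center
     switchport trunk allowed vlan add 100
    !
    interface range TenGigabitEthernet1/0/49 - 50
     description Member links of Port-channel1
     switchport mode trunk
     channel-group 1 mode active

The second EtherChannel toward the other core switch would look the same; the key point is that at least one VLAN, the management one, still spanned the entire facility.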
On the fateful day, a switch crashed. The crash condition resulted in a repeated sequence of frames being sent at full wire speed. The repeated frames included broadcast traffic in the management VLAN, so every control-plane CPU had to process them.
Network infrastructure CPUs at 100% all over the data center, including the core switches; routing adjacencies down, etc. The entire facility could not process traffic for ~3.5 hours. No stretched L2, so the damage was contained to a single site.
This was a reasonably well-managed site, but it had some dumb design choices. Highly bridged networks don’t tolerate dumb design choices.
I don’t remember what “official” story we issued to customers; standard operating procedure was to spin the truth. The point being that there are doubtless DC-down stories we never hear, and when we do hear them, the truth is obscured by spin. Probably similar to how most security breaches are handled.
This story is perhaps too old to be relevant… but as I reflect on it, the technology we were using then was fairly simple. Carefully managed rapid spanning tree. EtherChannels. Diligent VLAN pruning. What have we got today in L2 that would update such an environment? Myriad stacking & MLAG options. TRILL & SPB (which almost no one bought). BGP EVPN (not really L2 anymore).
Can any of the stacking / MLAG technologies be LESS prone to failure, considering their complexity? I remember kicking early Cisco 6500 VSS out the door because it was hopelessly unstable. Nexus vPCs seemed stable in my experience, but they had the benefit of being limited in scope compared to shared-control-plane stacking technologies.
Coincidentally, I’ve been on hand for my share of L3-related outages, but in almost every case the issues were caused by human error, bad design, or a combination of both, where an unintentional (human screw-up) or unforeseen (topology change, circuit down) change to the network toppled a house of cards. I don’t see that as a problem inherent to L3, because in every case a design change could resolve the problem (i.e., replace the house of cards with a proper house). L2 issues… not so much.
I tend to think of large bridging domains as inherently risky in a way that is not possible to engineer away. Storm control and fancy STP add-ons help when properly applied, but they have limits; when those limits are exceeded, it’s sort of like the containment shell of a nuclear facility failing during a runaway reaction.
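For what “properly applied” storm control typically means in practice, here’s a hedged IOS-style sketch of an access-port configuration, with made-up interface names, VLAN numbers, and thresholds:

    ! Access port hardening (all names/numbers illustrative)
    interface GigabitEthernet1/0/1
     switchport mode access
     switchport access vlan 10
     ! rate-limit broadcast and multicast traffic to a small fraction of link bandwidth
     storm-control broadcast level 1.00
     storm-control multicast level 2.00
     ! err-disable the port instead of merely dropping the excess traffic
     storm-control action shutdown
     ! edge port: start forwarding immediately, but shut down if a BPDU ever shows up
     spanning-tree portfast
     spanning-tree bpduguard enable

Even with all of that in place, a management VLAN that reaches every control-plane CPU remains one shared failure domain, which is exactly the limit the story illustrates.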
Want to know the real difference between routing and bridging? It will be one of the first topics I’ll cover in the upcoming How Networks Really Work webinar. On a more practical note, you might want to explore various DCI design options with the Data Center Interconnects webinar, or figure out how to build scalable multi-data-center solutions with the Designing Active-Active and Disaster Recovery Data Centers webinar.
Finally, there’s also the Building Next-Generation Data Center online course – hundreds of data center architects and designers found it highly relevant and useful.