How Common Are Data Center Meltdowns?
We all know about catastrophic headline-generating failures like the AWS US-East-1 region falling apart or a major provider being down for a day or two. Then there are failures known only to those who care, like losing a major exchange point. However, I’m becoming more and more certain that the known failures are not even the tip of the iceberg – they’re more like the climber standing on its summit.
I occasionally run on-site design workshops, and although I don’t keep track, my guess is that at least 25% of the companies I run them for experienced a more-or-less catastrophic bridging-caused meltdown in the not-so-distant past. Sometimes it stays within a single data center and impacts the performance of all hosts attached to the affected VLAN; sometimes they manage to bring down two data centers (hooray for Stretched VLANs).
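Why does a single bridging loop take out every host in a VLAN? Because a looped L2 topology with no (working) spanning tree re-floods every broadcast frame forever, and with more than two redundant links per switch the copies multiply on every hop. Here is a toy model of that amplification – the topology and numbers are illustrative, not taken from any customer environment:

```python
# Toy model of a broadcast storm in a looped L2 fabric with no spanning tree.
# Every frame arriving on an inter-switch link is flooded out all OTHER
# inter-switch links, so with more than two uplinks per switch the number
# of in-flight copies grows geometrically on every hop around the loop.

def broadcast_storm(uplinks_per_switch=3, hops=10):
    """In-flight copies of one broadcast frame after each flooding hop."""
    copies = 1                             # a single host sends one broadcast
    history = []
    for _ in range(hops):
        copies *= uplinks_per_switch - 1   # re-flooded out the other uplinks
        history.append(copies)
    return history

# Full mesh of four switches (three uplinks each): copies double every hop.
print(broadcast_storm(uplinks_per_switch=3, hops=10)[-1])  # 1024 copies
```

Note the degenerate case: in a simple ring (two uplinks per switch) the copy count stays constant, but the frame still circulates at line rate forever – a slower-burning but equally real meltdown, and one reason stretched VLANs can drag a second data center into the blast radius.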
It might be selection bias. Customers engaging me usually run complex environments that are almost by definition prone to weird failures. On the other hand, at least some of them run well-managed environments, and they got a bridging loop even though they did all the right things.
It might be confirmation bias. I keep telling people how dangerous large L2 environments are, so I might remember those workshops where they told me “yeah, that happened recently” better than others.
Or it could be that the vendors are truly peddling broken technology (of course only because the customers ask for it, right?) and we’re paying the price of CIOs or high-level architects making decisions based on glitzy PowerPoints and “impartial” advice from $vendor consultants.
I simply don’t have enough data points to know better, and it seems we made no real progress in the last few years – all I could find was anecdata. Your feedback would be highly appreciated.
The biggest problem in data centre meltdowns is poor facilities management. People don't do the work right: https://www.iotforall.com/doing-data-center-work-right-checklist/
Ivan's perspective on layer-2 networking is dead on. Layer-2 fabric tech is for suckers who don't know how to tell developers their ideas are dumb.
However, human and automated responses to these events and states differ greatly, and are very dependent on the corporate and engineering culture.
In places that are quick to assign blame and look for culprits, this leads to design paralysis: always doing what vendors, consultants, and architects propose, because that’s where the blame will inevitably end up.
In places that understand that if humans do work, humans will err, and there will be outages, the situation is different.
So, what does this have to do with meltdowns? Pretty much everything.
Large networks, operated by humans or by human-designed automation, will melt down. The real question to ask is not how common these meltdowns are (they are common), but how common a repeated, visible meltdown is.
In the case of the first environment I described, I’d be willing to bet it was common.
In the second, I am willing to bet it would be close to zero.
Obviously Type-A organizations will gladly continue failing and blame everyone else... but do you think that Type-B organizations eventually evolve toward a sane applications + infrastructure stack, or do they stay stuck in some well-managed local minimum like "yeah, we have to do long-distance VLANs, but at least they work reasonably well"?
If a Type-B organization determines that the root cause is an “unreasonably large L2 domain”, it will address it.
The problem there, again, is really not the L2 domain but the reason it’s in place, and let’s be frank: it’s VMware, and its historically not-quite-optimal idea of how networks work.
Three years ago, I’d have sat and argued that’s an unsolvable problem. These days, with things like Kubernetes, hybrid and on-premises clouds, and (dare I beat our own drum) Anthos, the reasons to rely on ancient concepts like vMotion and VMs are few and far between.
The fact that people are doing something doesn’t mean it’s a reasonable thing to do. After all: people smoke.