How Common Are Data Center Meltdowns?

We all know about catastrophic, headline-generating failures like the AWS US-East-1 region falling apart or a major provider being down for a day or two. Then there are failures known only to those who care, like losing a major exchange point. However, I’m becoming more and more certain that the known failures are not even the tip of the iceberg – they’re more like the climber standing at the iceberg’s summit.

I occasionally run on-site design workshops, and although I don’t keep track, my guess is that at least 25% of the companies I’ve run workshops for experienced a more-or-less catastrophic bridging-caused meltdown in the not-so-distant past. Sometimes the meltdown stays within a single data center and impacts the performance of all hosts attached to the affected VLAN; sometimes they manage to bring down two data centers at once (hooray for Stretched VLANs).

It might be selection bias. Customers who engage me usually run complex environments, which are almost by definition prone to weird failures. On the other hand, at least some of them run well-managed environments, and they still got a bridging loop even though they did all the right things.

It might be confirmation bias. I keep telling people how dangerous large L2 environments are, so I might remember those workshops where they told me “yeah, that happened recently” better than others.

Or it could be that the vendors are truly peddling broken technology (of course only because the customers ask for it, right?) and we’re paying the price for CIOs or high-level architects making decisions based on glitzy PowerPoints and “impartial” advice from $vendor consultants.

I simply don’t have enough data points to know better, and it seems we’ve made no real progress in the last few years – all I could find was anecdata. Your feedback would be highly appreciated.
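
In case you’re wondering why a single bridging loop takes down a whole VLAN (or two data centers connected with a stretched VLAN): Ethernet frames carry no TTL, so once flooding starts looping, the copies multiply on every pass and never age out. Here’s a toy Python sketch of that effect; the four-switch full mesh with spanning tree disabled is a purely hypothetical example topology, not a description of any environment mentioned above.

```python
# Toy model of flooding in a looped L2 topology (hypothetical four-switch full
# mesh with spanning tree disabled). Each switch re-floods a received frame out
# of every inter-switch link except the one it arrived on; since Ethernet frames
# have no TTL, the copies multiply forever.
from collections import defaultdict

links = {
    "SW1": ["SW2", "SW3", "SW4"],
    "SW2": ["SW1", "SW3", "SW4"],
    "SW3": ["SW1", "SW2", "SW4"],
    "SW4": ["SW1", "SW2", "SW3"],
}

def flood(passes: int) -> None:
    """Track the copies of a single broadcast frame that enters the fabric at SW1."""
    in_flight = [("SW1", None)]            # (switch holding a copy, switch it came from)
    for n in range(1, passes + 1):
        next_round = []
        per_switch = defaultdict(int)
        for switch, came_from in in_flight:
            for neighbor in links[switch]:
                if neighbor != came_from:  # flood out every link except the ingress one
                    next_round.append((neighbor, switch))
                    per_switch[neighbor] += 1
        in_flight = next_round
        print(f"pass {n}: {len(in_flight)} copies in flight {dict(per_switch)}")

flood(passes=8)   # 3, 6, 12, 24, ... copies from a single broadcast
```

Every one of those copies is also flooded out of every access port in the VLAN, which is why one loop saturates uplinks and host CPUs across the entire broadcast domain – and why a stretched VLAN happily exports the damage to the second data center.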


10 comments:

  1. I think you have an infatuation with layer 2. The problem is that noobs hear this and then disable spanning tree. Of course it is going to crash. If you disable spanning tree, you still need a path protection protocol.
    The biggest problem in data centre meltdowns is poor facilities management. People don't do the work right: https://www.iotforall.com/doing-data-center-work-right-checklist/
    Replies
    1. "I think you have an infatuation with layer 2" << LOL. You must be pretty new to my blog ;)
    2. Hahahaha - no, not at all. The joke's on you, mate. Look closely at the logo and tell me what it looks like.
    3. Facilities failures are so 2005. Nowadays, organizations are building redundancy models at the datacenter level.

      Ivan's perspective on layer-2 networking is dead on. Layer-2 fabric tech is for suckers who don't know how to tell developers their ideas are dumb.
    4. I sincerely hope you are being sarcastic, otherwise you are in for the surprise of your life when it happens :-)
  2. When a network grows beyond a certain size (I've always maintained that the threshold is somewhere between 500 and 750 nodes), something is broken somewhere all the time. That is normal.

    However, human and automated responses to these events and states differ greatly, and are very dependent on the corporate and engineering culture.

    In places that are quick to assign blame and look for culprits, this leads to design paralysis: always doing what vendors, consultants, and architects propose, because that’s where the blame will inevitably end up.

    In places that understand that if humans do the work, humans will err and there will be outages, the situation is different.

    So, what does this have to do with meltdowns? Pretty much everything.

    Large networks, operated by humans or by human-designed automation, will melt down. The real question is not how common meltdowns are (they are common), but how common repeated, visible meltdowns are.

    In the case of the first environment I described, I’d be willing to bet it is common.

    In the second, I am willing to bet it would be close to zero.
    Replies
    1. Thank you. Now for a more interesting one...

      Obviously Type-A organizations will gladly continue failing and blame everyone else... but do you think that Type-B organizations eventually evolve toward a sane applications + infrastructure stack, or do they stay stuck in some well-managed local minimum like "yeah, we have to do long-distance VLANs, but at least they work reasonably well"?
    2. That question answers itself, doesn’t it?

      If B determines that the root cause is an “unreasonably large L2 domain”, they will address it.

      The problem there, again, is really not the L2 domain but the reason it’s in place. And let’s be frank: it’s VMware, and its historically not-quite-optimal idea of how networks work.

      Three years ago, I would have argued that’s an unsolvable problem. These days, with things like Kubernetes, hybrid and on-premises clouds, and (dare I beat our own drum) Anthos, reasons to rely on ancient concepts of vMotion and VMs are few and far between.
    3. Marko - "reasons to rely on ancient concepts of vMotion and VMs are few and far between" - you'd be surprised... this is the norm in the "modern" enterprise. Actually, migrating from L2 to an L3 leaf-spine fabric with an NSX-T overlay is considered the state of the art...
    4. Oh, I am perfectly aware of that, and see Ivan's comment above: "evolve toward a sane applications + infrastructure". That architecture now exists, invalidating the need for the "insanity" of globe-spanning L2 domains.

      The fact that people are doing something doesn't mean it's a reasonable thing to do. After all, people smoke.