A while ago a large airline had a bad-hair day, claiming it was caused by a faulty power supply. Not surprisingly, I got a question along the lines of “is that feasible?”
Short answer: Yes. However, someone should be really worried if that wasn’t made up.
There are companies out there that learn from their mistakes. Even better, they publicly admit what went wrong, what they learned, and how they improved their processes to ensure that particular SNAFU won’t happen again. I even found a nice list of public post-mortem reports and a list of AWS outages. Not surprisingly, airlines and legacy financial institutions are nowhere to be found.
Sometimes something really stupid goes wrong. For example, you’re upgrading a component and its redundant pair fails. Or you thought you had redundancy, but it wasn’t configured correctly (a missing HSRP neighbor comes to mind). Or a failure caused a resource shortage, triggering cascading failures, as Amazon discovered a while ago.
Organizations seriously investing in service uptime talk about Site Reliability Engineers [1]. They also trigger unexpected failures, manually or automatically. AWS even made a cloud service out of that idea.
Organizations that love to talk about redundancy to get a tick-in-the-box from their auditors move the active database instance once a year, under tightly controlled conditions and at the time of minimum load (or even during a scheduled maintenance window).
Long story short: full redundancy doesn’t prevent failures. When done correctly, it reduces the probability of a total failure. When done incorrectly, a redundant solution becomes less robust than a non-redundant one due to the increased complexity... and you don’t know which one you’re facing until you stress-test your solution.
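The “reduces the probability of a total failure” claim is just basic probability. Here’s a back-of-the-envelope sketch (my own toy model, not from any vendor datasheet): two independent components in an active/standby pair, with the optimistic assumption of perfect, instant failover.

```python
# Toy availability model (my assumptions): two independent
# components, each available 99.9% of the time, with perfect
# failover. The pair is down only when BOTH components are down.
a = 0.999                    # availability of a single component
pair = 1 - (1 - a) ** 2      # 1 - P(both components down)

print(f"single component: {a:.6f}")   # 0.999000
print(f"redundant pair:   {pair:.6f}")  # 0.999999
```

Note what the model quietly assumes: independent failures and flawless failover. Break either assumption (shared power feed, misconfigured HSRP, a failover script nobody ever tested) and the real-life numbers are nowhere near that nice.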
There’s also this crazy thing called statistics. It turns out that adding redundant components can decrease availability under gray (or Byzantine) failures. The authors of the original article handwaved their way to that conclusion, but I checked the math, and the results are what they claim they are (or my knowledge of statistics is even worse than I assume).
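To see why the statistics turn against you, consider a deliberately simplified gray-failure model (my illustration, not the original article’s math): each replica independently goes “gray” — still passing health checks but serving slow or corrupted responses — and one gray replica poisons the whole service because the load balancer keeps sending it traffic. The service is then healthy only when no replica is gray, so adding replicas makes things worse:

```python
# Toy gray-failure model (my simplification): p is the probability
# that a single replica is in a gray-failed state. A clean failure
# would be detected and routed around; a gray failure is not, so a
# single gray replica degrades the whole service. The service is
# fully healthy only when ALL n replicas are gray-free.
p = 0.001  # probability a replica is gray-failed at any moment

for n in (1, 2, 3, 5):
    healthy = (1 - p) ** n   # all n replicas must avoid gray failure
    print(f"{n} replica(s): service healthy {healthy:.4%} of the time")
```

With clean (fail-stop) failures, more replicas mean higher availability; with undetected gray failures, every replica you add is one more chance for the service to be quietly poisoned.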
Finally, do keep in mind that what I was talking about so far are ideal circumstances. In reality, we keep heaping layers of leaky abstractions and ever-more-convoluted kludges on top of each other until the whole thing comes crashing down resulting in days of downtime. Or as Vint Cerf said in a recent article: we’re facing a brittle and fragile future.
Need more details? You might find some useful ideas in my Designing Active-Active and Disaster Recovery Data Centers webinar… or you might go read a vendor whitepaper. The choice is yours.
- Added links to a list of AWS outages and AWS Fault Injection Simulator.
[1] So does everyone else – SRE is becoming another meaningless buzzword