A while ago a large airline had a bad-hair day, claiming it was caused by a faulty power supply. Not surprisingly, I got a question along the lines of “is that feasible?”
Short answer: Yes. However, someone should be really worried if that wasn’t made up.
There are companies out there that learn from their mistakes. Even better, they publicly admit what went wrong, what they learned, and how they improved their processes to ensure this particular SNAFU won’t happen again. I even found a nice list of public post-mortem reports. Not surprisingly, airlines and legacy financial institutions are nowhere to be found on it.
Sometimes something really stupid goes wrong. For example, you’re upgrading a component, and its redundant pair fails. Or you thought you had redundancy, but it wasn’t configured correctly (HSRP, anyone?). Or a failure resulted in a resource shortage, which triggered cascading failures, as Amazon discovered a while ago.
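The cascading-failure scenario is easy to sketch with a toy model (all numbers below are made up for illustration): identical servers share the load evenly, a server pushed past its capacity fails, and its share spills onto the survivors.

```python
# Toy cascade model: n servers share total_load evenly; a server whose
# share exceeds its capacity fails, dumping its load on the survivors.
def survivors(n, capacity, load):
    """Return how many servers are still up once the cascade settles."""
    while n and load / n > capacity:
        n -= 1  # the overloaded server tips over; the rest inherit its load
    return n

# Ten servers at 70% utilization: one fails, the other nine cope.
print(survivors(9, 100, 700))  # -> 9
# Ten servers at 95% utilization: one failure cascades into total collapse.
print(survivors(9, 100, 950))  # -> 0
```

The moral of the toy model: redundancy without spare capacity just determines the order in which things fail.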
Organizations that seriously invest in service uptime talk about Site Reliability Engineers (so does everyone else – SRE is becoming another meaningless buzzword). They also trigger unexpected failures – manually or automatically. Organizations that love to talk about redundancy just to get a tick-in-the-box from their auditors move the active database instance once a year under tightly controlled conditions and at the time of minimum load (or even during a scheduled maintenance window).
For more details on doing failover tests correctly, read at least the failovers part of the Small Batches Principle article by Tom Limoncelli (and I strongly recommend you read the whole article).
Long story short: Full redundancy doesn’t prevent failures. When done correctly, it reduces the probability of a total failure. When done incorrectly, redundant solutions become less robust than non-redundant ones due to the increased complexity… and you don’t know which one you’re facing until you stress-test your solution.
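A back-of-the-envelope sketch of the “done correctly versus done incorrectly” claim, with all numbers invented for illustration: a single instance is up with probability A, failover actually works with probability F, and the redundancy machinery itself misfires (split-brain, broken health check) with probability S, taking the whole service down.

```python
A = 0.99  # probability a single instance is up (made-up number)

def redundant_availability(A, F, S):
    """Active/standby pair: up if the redundancy machinery doesn't misfire
    AND (the primary is up, or failover succeeds onto a healthy standby)."""
    return (1 - S) * (A + (1 - A) * F * A)

print(f"single instance:  {A:.4f}")                                     # 0.9900
print(f"well-built pair:  {redundant_availability(A, 0.999, 0.0001):.4f}")  # 0.9998
print(f"badly-built pair: {redundant_availability(A, 0.80, 0.02):.4f}")     # 0.9780
```

With near-perfect failover and a rarely misfiring control plane, the pair beats the single instance by an order of magnitude; with flaky failover and occasional split-brain, the “redundant” design is measurably worse than no redundancy at all.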
There’s also this crazy thing called statistics. It turns out that adding redundant components can decrease availability under gray (or Byzantine) failures. The authors of the original article hand-waved their way to that conclusion, but I did check the math, and the results are what they claim they are (or my knowledge of statistics is even worse than I assume).
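My reading of that argument, in a few lines of code (this is my interpretation, not the original math, and the per-component probability G is made up): a hard failure gets detected and the dead component is taken out of service, but a gray failure is not detected, so a single gray member – think a flaky link in a LAG, or a replica that silently corrupts some answers – degrades the whole system. Add more components and you only increase the odds that at least one of them is gray.

```python
G = 0.01  # made-up per-component probability of an undetected gray failure

def p_system_degraded(n, g=G):
    """Probability that at least one of n components is gray-failing
    (any single undetected gray component degrades the whole system)."""
    return 1 - (1 - g) ** n

for n in (1, 2, 4, 8):
    print(f"{n} components: P(degraded) = {p_system_degraded(n):.4f}")
# -> 0.0100, 0.0199, 0.0394, 0.0773
```

Under this model, redundancy monotonically hurts you: the more components you add, the more likely the system is quietly misbehaving somewhere.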
Finally, do keep in mind that everything I’ve described so far assumes ideal circumstances. In reality, we keep heaping layers of leaky abstractions and ever-more-convoluted kludges on top of each other until the whole thing comes crashing down, resulting in days of downtime. Or, as Vint Cerf said in a recent article: we’re facing a brittle and fragile future.
Need more details? You might find some useful ideas in my Designing Active-Active and Disaster Recovery Data Centers webinar… or you might go read a vendor whitepaper. The choice is yours.