Redundancy Does Not Result in Resiliency
A while ago a large airline had a bad-hair day and claimed it was caused by a faulty power supply. Not surprisingly, I got a question along the lines of “is that feasible?”
Short answer: Yes. However, someone should be really worried if that wasn’t made up.
There are companies out there that learn from their mistakes. Even better, they publicly admit what went wrong, what they learned, and how they improved their processes to ensure that particular SNAFU won’t happen again. I even found a nice list of public post-mortem reports and a list of AWS outages. Not surprisingly, airlines and legacy financial institutions are nowhere to be found.
Sometimes something really stupid goes wrong. For example, you’re upgrading a component and its redundant pair fails. Or you thought you had redundancy, but it wasn’t configured correctly (a missing HSRP neighbor comes to mind). Or a failure causes a resource shortage that triggers cascading failures, as Amazon found out a while ago.
Organizations seriously investing in service uptime talk about Site Reliability Engineers¹. They also trigger unexpected failures, manually or automatically. AWS even made a cloud service out of that idea.
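If you’re wondering what “trigger unexpected failures” looks like in practice, here’s a minimal conceptual sketch of a chaos test. It is not the AWS Fault Injection Simulator API; every name in it is made up for illustration:

```python
import random

# Conceptual chaos test (all names are hypothetical): deliberately kill a
# random instance behind a service and check whether the service -- and the
# redundancy you think you have -- survives.

def chaos_test(instances, terminate, health_check) -> bool:
    victim = random.choice(instances)   # pick a random victim
    terminate(victim)                   # inject the failure on purpose
    return health_check()               # did failover actually work?

# Toy usage with stubbed-out infrastructure:
if __name__ == "__main__":
    pool = ["web-1", "web-2", "web-3"]
    survived = chaos_test(
        pool,
        terminate=lambda name: pool.remove(name),
        health_check=lambda: len(pool) > 0,
    )
    print("service survived" if survived else "total outage")
```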
Organizations that love to talk about redundancy to get a tick-in-the-box from their auditors move the active database instance once a year, under tightly controlled conditions and at the time of minimum load (or even during a scheduled maintenance window).
Long story short: full redundancy doesn't prevent failures. When done correctly, it reduces the probability of a total failure. When done incorrectly, redundant solutions end up less robust than non-redundant ones due to the increased complexity... and you don’t know which one you’re facing until you stress-test your solution.
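To put some numbers on “reduces the probability of a total failure”, here’s a back-of-the-envelope sketch. The 99% per-component availability, the independence of failures, and the perfect failover are all assumptions made purely for illustration:

```python
# Availability of a redundant pair, assuming independent failures and perfect
# failover -- two assumptions that rarely hold in real networks.

single = 0.99                         # hypothetical single-component availability
redundant = 1 - (1 - single) ** 2     # total failure only if both fail at once

print(f"single component: {single:.2%}")     # 99.00%
print(f"redundant pair:   {redundant:.2%}")  # 99.99%
```

The moment the failures stop being independent (shared power, shared software bugs, shared configuration mistakes) or the failover stops being perfect, the second number melts away.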
There’s also this crazy thing called statistics. It turns out that adding redundant components results in decreased availability under gray (or Byzantine) failures. The authors of the original article handwaved their way to that conclusion, but I did check the math, and the results are what they claim they are (or my knowledge of statistics is even worse than I assume).
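Here’s a toy model of that effect. It’s my own simplification, not the math from the article: each component either hard-fails (the failure is detected and failover works) or gray-fails (it stays “active” and quietly poisons the service), and the probabilities are made up for illustration:

```python
# Toy gray-failure model (hypothetical probabilities): the service is down if
# every component hard-failed, or if any component is gray-failing, because a
# gray failure is never detected and never triggers a failover.

P_HARD, P_GRAY = 0.001, 0.005         # per-component failure probabilities

def availability(n: int) -> float:
    no_gray = (1 - P_GRAY) ** n       # nobody is silently misbehaving
    all_hard = P_HARD ** n            # every component is visibly dead
    return no_gray - all_hard

for n in (1, 2, 3, 4):
    print(f"{n} component(s): {availability(n):.2%}")
# With these numbers availability *drops* as components are added:
# roughly 99.40%, 99.00%, 98.51%, 98.01%
```

Adding components keeps shrinking the already-tiny chance that everything hard-fails at once, but it steadily increases the chance that at least one of them is gray-failing, so the net effect is negative.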
Finally, do keep in mind that everything I described so far assumes ideal circumstances. In reality, we keep heaping layers of leaky abstractions and ever-more-convoluted kludges on top of each other until the whole thing comes crashing down, resulting in days of downtime. Or, as Vint Cerf put it in a recent article, we’re facing a brittle and fragile future.
Need more details? You might find some useful ideas in my Designing Active-Active and Disaster Recovery Data Centers webinar… or you might go read a vendor whitepaper. The choice is yours.
Revision history
- 2021-11-02
  - Added links to a list of AWS outages and AWS Fault Injection Simulator.
1. So does everyone else – SRE is becoming another meaningless buzzword ↩︎
Comments

Agreed with what’s stated: yes, we do maintain different RIB/FIB states at various tiers based on physical presence.
Redundancy is not equal to resiliency, but redundancy does improve fault tolerance if it’s configured with proper policies and traffic engineering.
Traffic bursts, the rates of the various traffic classes, and the utilization of the redundant links also matter.
Resiliency always implies dedicated resiliency, yet we have shared infrastructure on the actual fabric; at the cluster level we have both dedicated and shared resources.
Dedicated resiliency across clusters is always possible in certain layers.
There is no free lunch, but solutions are possible. The question is: who wants to pay for it?
So we are back to imperfect risk analysis... at least we can document due care and keep the accountable persons out of jail... :-)
Our business is partly about fashion, partly about alibis... real technology expertise is rarely the focus... :-)
For me, a general discussion about redundancy makes no sense. The topic is so deep and complex (I have been working in this business for more than 10 years) that it cannot be settled with general statements.