Redundancy Does Not Result in Resiliency

A while ago, a large airline had a bad-hair day and claimed it was caused by a faulty power supply. Not surprisingly, I got a question along the lines of “is that feasible?”

Short answer: yes, it’s feasible. However, someone should be really worried if that explanation wasn’t made up.

There are companies out there that learn from their mistakes. Even better, they publicly admit what went wrong, what they learned, and how they improved their processes to ensure that particular SNAFU won’t happen again. I even found a nice list of public post-mortem reports and a list of AWS outages. Not surprisingly, airlines and legacy financial institutions are nowhere to be found on either list.

Sometimes something really stupid goes wrong. For example, you’re upgrading a component and its redundant pair fails. Or you thought you had redundancy, but it wasn’t configured correctly (a missing HSRP neighbor comes to mind). Or a failure caused a resource shortage that triggered cascading failures, as Amazon found out a while ago.
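To get a feel for how quickly a resource shortage snowballs, here’s a minimal Python sketch of the cascading-failure pattern. The model and all the numbers are invented for illustration; real systems fail in far messier ways:

```python
# Toy cascading-failure model: servers share the total load, and when a
# server dies its share is redistributed across the survivors. Once the
# per-server load exceeds the usable capacity, the next server dies too.
def surviving_servers(alive: int, capacity: float, total_load: float) -> int:
    while alive > 0 and total_load / alive > capacity:
        alive -= 1  # overloaded server crashes, load redistributes again
    return alive

# Ten servers at 70% load with 90% usable capacity: losing one or two
# servers is survivable, losing a third melts down the whole tier.
for failed in range(4):
    left = surviving_servers(10 - failed, capacity=0.9, total_load=7.0)
    print(f"{failed} initial failure(s) -> {left} servers still up")
```

Note how abrupt the transition is: the tier goes from “perfectly fine” to “everything is down” with a single extra failure, which is why steady-state utilization numbers tell you so little about how close to the edge you’re running.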

Organizations seriously investing in service uptime talk about Site Reliability Engineers¹. They also trigger unexpected failures, either manually or automatically; AWS even made a cloud service (AWS Fault Injection Simulator) out of that idea.
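You don’t have to buy a cloud service to start playing with that idea. The following toy Python sketch (nothing like a real chaos-engineering tool, and in no way related to the actual AWS Fault Injection Simulator API) randomly kills workers in a local pool so you can watch how the survivors cope:

```python
import random
import time
from multiprocessing import Process

def worker(worker_id: int) -> None:
    # Stand-in for a real service instance doing useful work.
    while True:
        time.sleep(1)

if __name__ == "__main__":
    # Start a small pool of "redundant" workers.
    workers = [Process(target=worker, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()

    # Chaos loop: kill one random live worker every few seconds and
    # check how many of them are left to carry the load.
    for _ in range(3):
        time.sleep(3)
        victim = random.choice([w for w in workers if w.is_alive()])
        print(f"killing worker pid={victim.pid}")
        victim.terminate()
        victim.join()
        print(f"{sum(w.is_alive() for w in workers)} worker(s) still alive")

    # Clean up whatever survived the experiment.
    for w in workers:
        if w.is_alive():
            w.terminate()
            w.join()
```

In a real service you’d obviously kill instances behind a load balancer and measure the impact on user-visible traffic, but even a toy like this makes the point: if you’ve never killed anything on purpose, you have no idea what happens when something dies on its own.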

Organizations that love to talk about redundancy to get a tick-in-the-box from their auditors move the active database instance once a year, under tightly controlled conditions and at the time of minimum load (or even during a scheduled maintenance window).

For more details on doing failover tests correctly, read at least the failovers part of the Small Batches Principle article by Tom Limoncelli (and I strongly recommend you read the whole article).

Long story short: full redundancy doesn’t prevent failures. When done correctly, it reduces the probability of a total failure. When done incorrectly, redundant solutions end up less robust than non-redundant ones due to the increased complexity... and you don’t know which one you’re facing until you stress-test your solution.

There’s also this crazy thing called statistics. It turns out that adding redundant components decreases availability under gray (or Byzantine) failures. The authors of the original article hand-waved their way to that conclusion, but I did check the math, and the results are what they claim they are (or my knowledge of statistics is even worse than I assume).
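If you want to convince yourself without digging through the original article, here’s my back-of-the-envelope Python version of the argument. The model and all the probabilities are my own simplification, not the formulas from the article: each instance suffers a hard failure (which redundancy masks) with probability f, and a gray failure (which looks healthy but poisons the whole service) with probability g:

```python
f = 0.01   # per-instance hard-failure probability (invented number)
g = 0.001  # per-instance gray-failure probability (invented number)

# The service is up when no instance gray-failed (a single gray failure
# anywhere poisons everything) AND at least one instance avoided a hard
# failure (the part redundancy is supposed to fix).
for n in range(1, 7):
    availability = (1 - g) ** n * (1 - f ** n)
    print(f"{n} instance(s): availability = {availability:.6f}")
```

With these numbers, availability peaks at two instances and declines from there: every extra box adds another chance of a gray failure while contributing almost nothing against hard failures that two instances already cover quite well.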

Finally, do keep in mind that everything I described so far assumes ideal circumstances. In reality, we keep heaping layers of leaky abstractions and ever-more-convoluted kludges on top of each other until the whole thing comes crashing down, resulting in days of downtime. Or, as Vint Cerf put it in a recent article: we’re facing a brittle and fragile future.

Need more details? You might find some useful ideas in my Designing Active-Active and Disaster Recovery Data Centers webinar… or you might go read a vendor whitepaper. The choice is yours.

Revision history

2021-11-02
Added links to a list of AWS outages and AWS Fault Injection Simulator.

  1. So does everyone else – SRE is becoming another meaningless buzzword

6 comments:

  1. Your relating gray failures to the Byzantine generals problem was intriguing. Indeed, there is some relation (agreement on state?), but it seems there is no intent of communication between the apps and the observer. Food for thought, as someone used to say...
    Replies
    1. The relation is simpler: the more probable gray failures are, the more likely we’ll end up in a weird state where the client and the server can’t agree on what needs to be done.
  2. With respect to redundancy - a big share of long-haul aircraft today are twin-engine ones, as opposed to the four-engine ones that dominated the market until the arrival of the 777.
  3. Redundancy gives us more bandwidth, but proper policy engineering and traffic engineering of aggregate and VIP prefixes will help reroute traffic from affected links across different paths.
    Agreed on state: yes, we do maintain different RIB/FIB states at various tiers based on physical presence.
    Redundancy is not equal to resiliency, but redundancy improves fault tolerance if configured with proper policies and traffic engineering.

    Traffic bursts and the rates of the various traffic classes also matter for redundant link utilization.
    Resiliency ideally means dedicated resources, but we have shared infrastructure on the actual fabric; at the cluster level we do have both dedicated and shared resources.

    Dedicated resiliency across clusters is always possible in certain layers.
  4. True resiliency costs a lot of resources and degrades performance. Even if you do things independently, with all kinds of diversity, and then vote on the results, you have to introduce extra delays to conclude the voting.

    There is no free lunch, but solutions are possible. But who wants to pay for it?
    So we are back to imperfect risk analysis... So at least we could document due care and keep the accountable persons out of jail... :-)

    Our business is partly about fashion, partly about alibis... Real technology expertise is rarely the focus... :-)
    Replies
    1. It depends on the business type. I work in the public-safety communications business, where redundancy is crucial. We spend a lot of effort on designing systems with very fast recovery times. Here, customers want us to build such networks and are willing to pay for them.

      For me, a general discussion about redundancy makes no sense. The topic is so deep and complex (I have been working in this business for more than 10 years) that it cannot be settled with general statements.