Disaster Recovery
ChatGPT explaining disaster recovery in simple terms
Before going into the details, let’s warm up with a few introductory blog posts:
- Long-Distance vMotion, Stretched HA Clusters and Business Needs
- Simplify Your Disaster Recovery with Virtual Appliances
- Design Challenge: Multiple Data Centers Connected with Slow Links
- Designing Active-Active and Disaster Recovery Data Centers
- Unexpected Recovery Might Kill Your Data Center
Vendors Love Clueless Customers
Even ChatGPT got the gist of disaster recovery right:
- Figure out how you’re going to recover from a disaster (plan)
- Recover lost resources and data
- Restore normal operation
Unfortunately, that’s a lot of hard work, and people who believe in fairy tales have always tried to avoid it. Welcome to the “infrastructure will save the day” la-la land of vendor marketing.
- Sooner or Later, Someone Will Pay for the Complexity of the Kludges You Use
- VMware VSAN Can Stretch – Should It?
- Can You Afford to Reformat Your Data Center?
- Revisited: The Need for Stretched VLANs
- Stretched VLANs and Failing Firewall Clusters
- Disaster Recovery: a Vendor Marketing Tale
- Repost: VMware Fault Tolerance Woes
Stretched VLANs
One of the most common “solutions” promoted by virtualization and networking vendors (and consultants drinking their Kool-Aid) is the idea of stretching VLANs across multiple data centers “to automate the recovery and avoid renumbering resources”. Needless to say, both claims are totally bogus.
- You Don't Need IP Renumbering for Disaster Recovery
- Busting Layer-2 Data Center Interconnect Myths
- IP Renumbering in Disaster Avoidance Data Center Designs
- Stretched Layer-2 Subnets – The Server Engineer Perspective
- Layer-2 Extension (OTV) Use Cases
- Stretched VLANs: What Problem Are You Trying to Solve?
Stretched Failure Domains
Stretched VLANs have a major drawback: they turn multiple independent data centers into a single failure domain. Here are a few real-life examples of what happens afterwards:
- STP loops strike again
- Another Spectacular Layer-2 Failure
- VLANs and Failure Domains Revisited
- Large Layer-2 Domains Strike Again…
- How Common Are Data Center Meltdowns?
- Real-Life Data Center Meltdown
- Disaster Recovery and Failure Domains
- Bridging Loops in Disaster Recovery Designs
Is there anything we can do to make things a bit better? Maybe:
- Whose Failure Domain Is It?
- Cisco ACI – a Stretched Fabric That Actually Works
- Availability Zones in Overlay Virtual Networks
- Stretched ACI Fabric Is Sometimes the Least Horrible Solution
- Are VXLAN-Based Large Layer-2 Domains Safer?
Faking Disaster Recovery Tests
Building an infrastructure that turns multiple locations into a single failure domain is bad. Faking disaster recovery tests on such infrastructure is even worse.
- Disaster Recovery Test Faking: Another Use Case for Stretched VLANs
- Disaster Recovery Faking, Take Two
Disaster Avoidance
Could something be worse than the stretched VLAN fairy tales? You bet. Vendors like Cisco, EMC and VMware were happily promoting disaster avoidance: the idea that you’d migrate your workload out of a data center that’s about to experience a disaster (example: hurricane). Not surprisingly, this idea works best in PowerPoint.
- Long-distance vMotion for Disaster Avoidance? Do the Math First
- Follow-the-Sun Workload Mobility? Get Lost!
- Long-Distance Workload Mobility in Perspective
- Workload Mobility and Reality: Bandwidth Constraints
- Latency: the Killer of Spread-Out Application Stack Ideas
- Before Talking about vMotion across Continents, Read This
- Is Anyone Using Long-Distance VM Mobility in Production?
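To see why "do the math first" matters, here is a minimal back-of-the-envelope sketch of how long it would take to evacuate a data center over a WAN link. Everything in it is an assumption made for illustration: the function name `evacuation_hours`, the 70% link efficiency, and the sample numbers (500 VMs with 16 GB of RAM each, a 10 Gbps link). It also counts only a single copy of VM memory and ignores re-copying dirty pages, storage replication, and latency, so real numbers would likely be worse.

```python
def evacuation_hours(total_data_gb: float, link_gbps: float,
                     efficiency: float = 0.7) -> float:
    """Estimate hours needed to push total_data_gb across a link.

    total_data_gb: data to move, in decimal gigabytes (assumed: VM memory only)
    link_gbps:     nominal link speed in gigabits per second
    efficiency:    assumed usable fraction of the link (protocol overhead,
                   other traffic); 0.7 is a guess, not a measured value
    """
    if total_data_gb < 0:
        raise ValueError("total_data_gb must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    bits_to_move = total_data_gb * 8e9            # gigabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency     # usable bits per second
    return bits_to_move / usable_bps / 3600       # seconds -> hours


if __name__ == "__main__":
    # Hypothetical data center: 500 VMs, 16 GB of RAM each, 10 Gbps DCI link
    vm_count, ram_gb_per_vm = 500, 16
    hours = evacuation_hours(vm_count * ram_gb_per_vm, link_gbps=10)
    print(f"Moving {vm_count} VMs takes at least {hours:.1f} hours")
```

With these made-up numbers the answer is about two and a half hours of a fully dedicated 10 Gbps link, before a single byte of disk data has moved. Plug in your own VM count and link speed before believing a disaster avoidance slide deck.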
Real-Life Lessons
I’ve had to deal with several (non-networking) disasters over the years. Here are a few lessons I learned:
- Disasters Happen ... It’s the Recovery that Matters
- Disaster Recovery: Lessons Learned
- Disasters and Recoveries, Part 1
- Disasters and Recoveries, Part 2
- Keep Your Failure Domains Small
- All Operations Engineers Should Have Firefighting Training
I’m Not Alone ;)
For years, I’ve been one of the very few vocal opponents of the “industry wisdom”. Fortunately, I’m no longer alone. This is what others had to say on the topic: