Disaster Recovery Test Faking: Another Use Case for Stretched VLANs

The March 2019 Packet Pushers Virtual Design Clinic had to deal with an interesting question:

Our server team is nervous about full-scale DR testing. So they have asked us to stretch L2 between sites. Is this a good idea?

The design clinic participants were a bit more diplomatic (watch the video) than my TL&DR answer, which would be: **** NO!

Let’s step back and try to understand what’s really going on:

In an ideal world, you’d just shut down a data center and see what happens. Companies that have realized you cannot fake it forever or pretend to solve application availability with infrastructure tricks have no qualms about doing exactly that. They know everything eventually fails, and they focus on making their services as failure-resilient as possible. Once you embrace that mindset, killing parts of your infrastructure to see what happens becomes business-as-usual.
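To make that mindset concrete, here’s a minimal failure-injection sketch in Python. Everything in it is an assumption for illustration: the host inventory, the SSH-based power-off action, and the dry-run default are hypothetical and would have to be adapted to your environment.

    import random
    import subprocess

    # Hypothetical inventory of components you're willing to lose in a test.
    # Replace with whatever your environment actually provides.
    BLAST_RADIUS = [
        "web-01.dc1.example.com",
        "web-02.dc1.example.com",
        "db-01.dc1.example.com",
    ]

    def kill_host(host: str, dry_run: bool = True) -> None:
        """Simulate a component failure by powering the host off over SSH."""
        if dry_run:
            print(f"[dry-run] would power off {host}")
            return
        # Assumes key-based SSH access and sudo rights on the target host.
        subprocess.run(["ssh", host, "sudo", "poweroff"], check=True)

    if __name__ == "__main__":
        victim = random.choice(BLAST_RADIUS)
        print(f"Game day: injecting a failure into {victim}")
        kill_host(victim, dry_run=True)
        # The real test starts here: watch what your users experience.

The script itself is trivial; the point is that the failure is injected for real and the recovery is observed from the users’ perspective, not choreographed in advance.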

Meanwhile, on Planet Enterprise, operations teams have to deal with applications that were never tested beyond the loopback interface, or single-instance applications tested with a database residing in the same LAN segment, and they try to increase the availability of these impending disasters with infrastructure tricks.

At the same time, management and auditors require disaster recovery functionality… and some server teams, knowing that they would never be allowed to do the right thing, try to fake disaster recovery tests with a charade that would make Potemkin proud:

  • Move a few running VMs to another site;
  • Ignore all dependencies of the application stack (connectivity to common services, prerequisite network and security services);
  • Perform the test during a maintenance window;
  • Leave the VMs running at the other site for a few minutes, carefully bring them back, and declare success.

Mission accomplished. See you next year!

Have they proven that the disaster recovery procedure works? Absolutely not: their carefully planned choreography has nothing to do with what they’d have to face when the primary data center dies. In the meantime, they need stretched VLANs to make the fake test work, turning both data centers into a ticking bomb that will eventually explode (or, as Terry Slattery pointed out in the Design Clinic discussion, “do you want to have one failure domain or two?”).
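To put some (purely illustrative) numbers behind Terry’s question, here’s a back-of-the-envelope Python sketch; the failure probabilities are made-up assumptions, not measurements from any real environment:

    # Rough comparison of "one failure domain or two" -- all numbers are
    # illustrative assumptions, not measured failure rates.
    p_dc_failure = 0.01    # chance a single data center fails outright
    p_l2_meltdown = 0.02   # chance the stretched L2 segment melts down
                           # (broadcast storm, STP loop, ...) as a whole

    # Two independent failure domains: you lose everything only if both fail.
    p_outage_independent = p_dc_failure ** 2

    # One stretched failure domain: an L2 meltdown is a common-mode failure
    # that takes out both sites at once.
    p_outage_stretched = 1 - (1 - p_dc_failure ** 2) * (1 - p_l2_meltdown)

    print(f"Two failure domains : {p_outage_independent:.4%} total-outage probability")
    print(f"One stretched domain: {p_outage_stretched:.4%} total-outage probability")

With these assumed numbers the stretched design is roughly two hundred times more likely to take both data centers down at once, and the dominant term is the common-mode L2 failure that was introduced just to make the fake test work.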

Summary: Disaster recovery is a tough challenge. Proper disaster recovery testing is hard, and as long as you’re trying to fake it with careful VM moves, you’re bound to end up with the well-known three-step approach to disaster recovery planning:

  • Wait for disaster to happen;
  • Improvise to recover;
  • Plan for the next disaster.

Needless to say, the next time you experience a disaster, your post-factum disaster recovery plan will be useless, because IT infrastructure and application stacks tend to change over time.

What Others Have to Say on the Topic

I asked Daniel Dib for his opinion on the topic. Here’s what he sent me:

This is the first time I’ve heard of people stretching L2 to their DR site, though. The entire purpose of the DR site is not to have any dependencies on the main DCs, which is why you put them in different geographical areas and so on.

I doubt many organisations have a proper DR plan including RPO, RTO and testing instructions. As you both know, distributed data is always challenging, especially when you try to solve the problem in the network layer.

The ironic thing is that people don’t understand L2 is bad until things explode. It could take some time, and some get lucky and never experience it, but that doesn’t mean you had a good design.

More Information

The webinars covering the sane and less-sane ways of building active-active and disaster recovery data centers, as well as all recent variants of data center interconnects, are available with a standard ipSpace.net subscription.


5 comments:

  1. Once I was tasked with doing a DR test before handing the solution over to the customer. To simulate the loss of a data center I suggested physically shutting down all core switches in the active data center. No sooner said than done. Spanning tree converged pretty fast and the stretched VLANs were all functional at the DR site, but there were split-brain scenarios on firewalls all over the place. Even worse, the (iSCSI) storage didn't survive because it couldn't establish quorum, so it wasn't accessible at the DR site. After a few minutes no machines (including the virtual load balancers) were reachable. vCenter wasn't reachable anymore either, so they couldn't do anything. After bringing the core switches back online it took them hours to recover from the mess.
    They never fixed the broken architecture (firewall/load balancer cluster and iSCSI storage).
    On the second attempt they faked the disaster recovery test just like you described: they carefully moved some machines to the DR site but did not fail over the firewalls and load balancers.
    In the internal review discussion they had to admit that the network failed over as expected and in a reasonable amount of time.
    Almost everyone these days relies on 100% network availability. It's a fallacy, and sometimes they have to learn that the hard way.
    Replies
    1. Firewall and storage techies have problems understanding many basic networking concepts. Strange...
  2. While I agree with the rest of your article and have faced similar issues myself, I disagree with the notion that doing your DR testing during a maintenance window is a bad thing. If your goal is to learn what systems break and you're fairly certain that SOME things will break, why risk corporate reputation or revenue during the test?

    Maybe eventually, once you've identified what you think are all the issues and you're ready to do a full live production DR test, you should do it in the middle of the day, but I'd argue against that initially and whenever significant infrastructure changes are made.

    Thoughts? Did I just misunderstand your angle on this?
    Replies
    1. I have nothing against doing tests within maintenance windows as long as everyone is aware that there’s room for improvement if we’re not confident enough to pull the plug no matter when... because the disaster won’t wait for the next scheduled maintenance window.
  3. Most people cover up any DR failings. Azure failed, and much was said about the data center cooling and power, but not a word about why the geographic redundancy did not work.