Unexpected Recovery Might Kill Your Data Center

Here’s an interesting story I got from one of my friends:

  • A large organization used a disaster recovery strategy based on stretched IP subnets and restarting workloads with unchanged IP addresses in a secondary data center;
  • At one point they experienced a WAN connectivity failure in the primary data center, and their disaster recovery plan kicked in.

However… while they were busy restarting the workloads in the secondary data center, and managed to get most of them up and running, the DCI link unexpectedly came back to life.

That wouldn’t have been a problem if they had re-addressed the servers, or at least done a proper shutdown in the primary data center, but they were so busy recovering from the failure that they forgot to turn off the servers and storage in the primary data center.

End result: tons of duplicate IP addresses, cluster failures and data corruption (because the data on replicated disks diverged in the meantime).

Moral of the story: disaster recovery starts with a proper shutdown of the failed data center. The extra half an hour you need to do that is much better than being offline for days while restoring data from tape backups… and if your Recovery Time Objective (RTO) is less than a few hours you probably need an application-level active/active solution anyway.
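
The ordering in that moral can be sketched as a guard in a recovery script. This is only an illustration in Python, not what the organization in the story used: `recover`, `FencingError`, the check names and the workload names are all hypothetical, and the lambdas stand in for whatever really confirms that the DCI ports are disabled, the servers are powered off and replication is stopped.

```python
from typing import Callable, Dict, Iterable, List


class FencingError(RuntimeError):
    """The failed site could not be confirmed as shut down or isolated."""


def recover(
    fencing_checks: Dict[str, Callable[[], bool]],
    workloads: Iterable[str],
    start_workload: Callable[[str], None],
) -> List[str]:
    """Restart workloads in the secondary site only after every fencing check passes."""
    # Step 1: confirm the failed site is fenced. Each check is a hypothetical
    # callable that returns True once its part of the primary site is down.
    not_fenced = [name for name, is_fenced in fencing_checks.items() if not is_fenced()]
    if not_fenced:
        raise FencingError("primary site not fenced: " + ", ".join(sorted(not_fenced)))

    # Step 2: only now start workloads that keep their original IP addresses.
    started = []
    for name in workloads:
        start_workload(name)
        started.append(name)
    return started


if __name__ == "__main__":
    # Placeholder checks; one of them still reports the primary site as live.
    checks = {
        "dci-ports-disabled": lambda: True,
        "servers-powered-off": lambda: True,
        "storage-replication-stopped": lambda: False,
    }
    try:
        recover(checks, ["db01", "app01"], start_workload=print)
    except FencingError as err:
        print(f"recovery blocked: {err}")
```

With the placeholder checks above, no workload is started because one check still fails. The point of the design is that an unexpected recovery of the link cannot hurt you if nothing in the failed site is left running to come back with it.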

Or as my friend Boštjan Šuštar (who did a great job replacing me at Interop) wrote in a recent email:

Disaster Recovery Procedure: Don’t panic and always carry a towel.

Don’t think such a disaster could ever happen in real life? Then you definitely don’t need my Building Next-Generation Data Center online course.

4 comments:

  1. "stretched IP subnets and restarting workloads"

    And this isn't a recipe for trouble, is it? How did this get past due diligence? Didn't someone sit there and say "Guys... link flap... we need the fail back to be a manual process, once we are sure that the interconnects are stable"?

    1. You wouldn't believe how often this stupidity gets past due diligence... and how forcefully some people defend it. Read the comments on this blog post for more details:

      http://blog.ipspace.net/2016/04/some-people-dont-get-it-it-will.html

      As for "fail back being a manual process", the real problem was that they never shut down the primary DC properly.

  2. Yes, one of the first steps of the recovery plan should be a manual shutdown at the edge of the DC being recovered, even physically pulling the cables if power is not available. Even if IP addresses don't conflict, plenty of clustering/replication solutions will break if not resynchronized very carefully.

  3. I find another gotcha is when you have a DR server in "hot standby" mode (turned on and accessible in another DC/subnet) where you may have forgotten to disable certain services. The next time the DR server is rebooted, the services set to "automatic" start running when they should not. In our case an enterprise job scheduler accidentally started tasks against our banking systems that had already been run hours earlier, causing duplicate entries in the banking database. I guess you know where the rest of the story leads: accounting staff and customers not being very happy.
