Disasters happen ... it’s the recovery that matters

My recent vacation included a few perfect lessons in disaster recovery. Fortunately, the disasters were handled by total pros. It all started when we were already packed and driving: my travel agent called to tell me that someone had mixed up the dates and shifted them by two months; we were expected to arrive in late August. Not good when you have small kids in the car, all excited about going to the seaside.

The poor lady who had the “privilege” of telling me the news was obviously a nervous wreck, but she never even tried to shift the blame (although I suspect it wasn’t her fault at all) and instead focused exclusively on finding us alternative accommodation.

Lesson#1 – don’t engage in blame shifting. Regardless of what happened and whose fault it might have been, pushing the blame around is even less useful than rearranging the deck chairs on the Titanic. Focus on solving the problem, but preserve enough information for a thorough post-mortem analysis.

Lesson#2 – work on the solution. You might experience a mental breakdown ... or you might shut out the rest of the world, concentrate and try to do your best to solve the problem. Practice makes perfect, so use any opportunity you get to work under time pressure. CCIE lab exams (and practice exams) are a perfect tool.

Years ago I would have started yelling at my travel agent, but fortunately I once received a very tough lesson in the uselessness of yelling, so I managed to stay calm.

Lesson#3 – yelling and pushing doesn’t help. Either you’re working with pros doing their best to solve the problem (in which case leaving them alone to work on it is the best approach), or you have a different problem (not having the right people working on the disaster) that also won’t be solved by yelling and pushing people around.

Sometimes travel agents work miracles – ours found alternative accommodation very close to the original one within two hours. The only gotcha was the dates: we’d planned to arrive on Thursday, but the new place was only available from Saturday. Obviously not an option when you have a car full of excited kids. Time to get flexible – we told the travel agent to find anything close to our final accommodation for the extra two days. She must have fantastic contacts: 15 minutes later we were all set and ready to go. Total time to fix our vacation disaster: two hours (not to mention that the alternative was significantly cheaper due to last-minute pricing and the extra discounts we got).

A hint for all three Slovenian readers of my blog: this travel agency rocks!

Lesson#4 – be flexible. Sometimes it’s impossible to recover to the original state. It’s important to understand what the mandatory requirements are and to be flexible wherever you can be. Insisting on an immediate, perfect restoration of the previous state is often counter-productive.

Lesson#5 – create goodwill. Lots of people are affected by every major disaster or get involved in the recovery process. Even though you didn’t cause the disaster, it never hurts to create goodwill after the situation returns to normal. Buying a few beers for the whole team is the easiest thing to do (remember: if you’re the team leader, it’s your job to do it).

More information

Ethan Banks and Bob Plankers wrote a few blog posts on similar topics.

9 comments:

  1. "A hint for both Slovenian readers of my blog: this travel agency rocks"
    increment the number of blog readers from SI :)

  2. Ivan Pepelnjak, 18 July 2011, 16:27

    Fixed :-P

  3. Lesson #6 - recheck the plan one last time.

    I’ve been burned one too many times: assuming things were OK (regardless of who was responsible for them – could be you, could be someone else), trusting someone’s last confirmation sent three days ago because it never failed me before, and the list goes on.

    A very good example: mountain biking... how many times have I rushed into the hills assuming that my Camelbak had everything I needed for that morning... even the backup tire tube that I punctured on the last ride and completely forgot to patch/replace. :)

  4. "Lesson#2 – work on the solution."

    Sub-lesson: let people work on the solution. The technical guys need to concentrate on fixing the problem; the director or the sales manager calling every other minute to ask where things stand won’t help. Of course, to do their jobs, they also need to be informed of what’s going on.

    The best setup is to have one team working on the actual problem, another doing disaster management (calling customers, answering calls...), and only one or two people passing information between the teams. They can also keep a log of events for later.

    And yeah, the meal/beer for everyone involved after the rush is important. :)

  5. Ivan Pepelnjak, 20 July 2011, 07:48

    Well, this should be Lesson#0. I got hit by this one years ago; ever since, I always check the travel documents. This time the glitch happened deeper within the system: all my papers had the proper dates on them, so I had no way of detecting it.

  6. Hope that you come back alive :)

  7. Derick Winkworth, 21 July 2011, 00:30

    I spent a year at Sungard assisting folks with the network portion of their recovery. A couple of things to consider:

    (1) Don't rely on recovery techniques that completely bomb after a single second of network downtime.
    (2) Don't assume your DR vendor/provider knows how to build your network in the recovery center.
    (3) Napkins do not count as documentation.
    (4) Have a coherent telecom plan. It's so important here. That guy at Vz or ATT who takes calls between naps at 1 AM does not care about your circuit. Have all circuit IDs, account numbers, the names and cell numbers of everyone on your account team, and every escalation number you have ever been given. If you can swing it, get the names of the COs the circuits are supposed to recover at. You can never have enough info for these companies when a real disaster strikes.
    (5) Ditto for Cisco, Juniper, Citrix, Checkpoint, etc. Make sure you have licenses and arrangements that allow for recovery on devices whose serial numbers are not on your support contract. Have all the numbers, account team members' names/cell numbers, etc. You're going to need them.
    (6) Active/active DR scenarios mean you should not be operating above 50/50 capacity. If you exceed this and there is a disaster, you have already failed. There's no way around it.

  8. Markku Leiniö, 23 July 2011, 22:43

    Great that everything got sorted out for you! BTW, I'd never use a travel agency named mokai.si, because in Finnish "mokaisi" roughly means "if they were to screw up" :-D

  9. Sally Frederick Tudor, 10 August 2011, 07:54

    Good post. Recovery needs to focus on what the problem is and what the solution is – or what our choices are. Cool.


You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.