Disasters Happen ... It’s the Recovery that Matters
My recent vacation included a few perfect lessons in disaster recovery. Fortunately, the disasters were handled perfectly by total pros. It all started when we were already packed and driving – my travel agent called to tell me someone had mixed up the dates and shifted them by two months; we were expected to arrive in late August. Not good when you have small kids, all excited about going to the seaside, sitting in the car.
The poor lady who had the “privilege” of breaking the news to me was obviously a total nervous wreck, but she never even tried to shift the blame (although I suspect it was not her fault at all) and instead focused exclusively on trying to find us alternative accommodation.
Lesson#1 – don’t engage in blame shifting. Regardless of what happened and whose fault it might have been, pushing the blame around is even less useful than rearranging the deck chairs on the Titanic. Focus on solving the problem, but preserve enough information for a thorough post-mortem analysis.
Lesson#2 – work on the solution. You might experience a mental breakdown ... or you might shut out the rest of the world, concentrate and try to do your best to solve the problem. Practice makes perfect, so use any opportunity you get to work under time pressure. CCIE lab exams (and practice exams) are a perfect tool.
Years ago I would have started yelling at my travel agent, but fortunately I once received a very tough lesson in the uselessness of yelling, so I managed to stay calm.
Lesson#3 – yelling and pushing don’t help. Either you’re working with pros doing their best to solve the problem (in which case leaving them alone to work on it is the best approach), or you have a different problem (not having the right people working on the disaster) that also won’t be solved by yelling and pushing people around.
Sometimes travel agents work miracles – ours found alternative accommodation very close to the original one within two hours. The only gotcha was the dates – we’d planned to arrive on Thursday, but the new place was only available from Saturday. Obviously not an option when you have a car full of excited kids. Time to get flexible – we told the travel agent to find anything close to our final accommodation for the extra two days. She must have fantastic contacts: 15 minutes later we were all set and ready to go. Total time to fix our vacation disaster: two hours (not to mention that the alternative was significantly cheaper due to last-minute pricing and the extra discounts we got).
A hint for both (make that all three) Slovenian readers of my blog: this travel agency rocks!
Lesson#4 – be flexible. Sometimes it’s impossible to recover to the original state. It’s important to understand what the mandatory requirements are and to be flexible wherever you can be. Insisting on an immediate, perfect restoration of the previous state is often counterproductive.
Lesson#5 – create goodwill. Lots of people are affected by every major disaster or get involved in the recovery process. Even if you didn’t cause the disaster, it never hurts to create goodwill after the situation returns to normal. Buying a few beers for the whole team is the easiest thing to do (remember: if you’re the team leader, it’s your job to do it).
More information
Ethan Banks and Bob Plankers wrote a few blog posts on similar topics:
Increment the number of blog readers from SI :)
Been burned one too many times: assuming things were OK (regardless of who’s responsible for them - could be you, could be someone else), trusting someone’s last confirmation sent three days ago, “it never failed me before,” and the list goes on.
Very good example: mountain biking... how many times have I rushed into the hills assuming my CamelBak had everything I needed for that morning... even the spare tire tube I’d punctured on the last ride and completely forgotten to patch/replace. :)
Sub-lesson: let people work on the solution. The technical guys need to concentrate on fixing the problem; the director or the sales rep calling every other minute to ask where things stand won’t help. Of course, to do their jobs, they also need to be kept informed of what’s going on.
The best approach is to have one team working on the actual problem, another handling disaster management (calling customers, answering calls...), and only one or two people passing information between the teams. They can also keep track of events for later.
And yeah, the meal/beer for everyone involved after the rush is important. :)
(1) Don't rely on recovery techniques that completely fall apart after a single second of network downtime.
(2) Don't think your DR vendor/provider knows how to build your network in the recovery center.
(3) Napkins do not count as documentation.
(4) Have a coherent telecom plan. So important here. That guy at Vz or ATT at 1 AM who takes calls in between naps does not care about your circuit. Have all circuit IDs, account numbers, the names of everyone on your account team and their cell numbers, and every escalation number you have ever been given. If you can swing it, get the names of the central offices (COs) your circuits are supposed to recover at. You can never have enough info for these companies when a real disaster strikes.
(5) Ditto for Cisco, Juniper, Citrix, Checkpoint, etc. Make sure you have licenses and arrangements that allow for recovery on devices whose serial numbers are not on your support contract. Have all the numbers, account team members' names/cell numbers, etc... You're going to need them.
(6) Active/Active DR scenarios mean you should not be operating above 50/50 capacity. If each site runs above 50% and one of them fails, the survivor simply cannot absorb the other site's load – you have already failed before the disaster even starts, and there's no way around it.
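To make the active/active capacity math concrete, here's a minimal back-of-the-envelope sketch (my own illustration, not from the comment above); the function name and the equal-capacity, full-failover assumptions are mine:

```python
# Illustration only (not from the original post): why active/active sites must stay
# at or below 50% utilization to survive a single-site failure.
# Assumptions: two sites of equal capacity; all traffic from a failed site lands on
# the surviving site.

def surviving_site_load(site_a_util: float, site_b_util: float) -> float:
    """Utilization the surviving site must absorb when its peer fails."""
    return site_a_util + site_b_util

for util in (0.40, 0.50, 0.60):
    after_failover = surviving_site_load(util, util)
    status = "OK" if after_failover <= 1.0 else "OVERLOADED"
    print(f"each site at {util:.0%} -> survivor must carry {after_failover:.0%} ({status})")
```

Running both sites at 60% looks comfortable day to day, but the survivor would need 120% of its capacity the moment one site goes down – exactly the "you have already failed" scenario described above.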