You probably know the three steps to a disaster recovery plan: Disaster. Recovery. Plan. It’s amazing how true that joke is, and how unprepared we tend to be for infrequent outages.
After heavy snows on Thursday we got a hefty dose of sleet on Friday. The trees almost immediately got a wonderful ice coating … and we lost electricity a few hours afterwards.
Photo taken by Armin Agič (thanks for sharing!)
Things that look great might actually do more harm than good. Technologies that look great in PowerPoint might bring down your network.
Recovery ... the first steps
We live in a small village in the Slovenian hills. Heavy snows are not unusual, and we were warned that prolonged power outages aren’t exactly impossible, so I implemented all sorts of redundancy measures.
A typical winter day ... when the weather is cooperating
Being in the middle of a winter storm, I was most worried about us staying warm. Time to fire up our ceramic wood oven. Half an hour later the fire was happily burning.
Just in case you don't know what I'm talking about - you can buy them here.
Simple technologies work best. Find the simplest possible technology that will meet your recovery time objectives and stick with it.
Go for easy wins that solve the most pressing problems. Establish connectivity, get the critical services up and running. Take a deep breath, relax, and continue.
However, it takes incredibly long for the warmth to propagate through the layers of bricks and ceramic (hint: 3-4 hours).
Recovery isn’t instantaneous. Even if you manage to get a backup data center up and running, it might take hours to recover all storage volumes, databases, servers … If you need faster recovery, you need a better plan (and no, live VM migration won’t help if you’re dealing with a data center failure).
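To make the "it might take hours" point concrete, here is a minimal sketch that adds up the durations of serial recovery steps and compares the total with a recovery time objective. The step names and durations are made-up placeholders, not measurements; replace them with numbers from your own recovery tests.

```python
from datetime import timedelta

# Hypothetical recovery steps with made-up durations (assumptions for
# illustration only); the steps are assumed to run one after another.
RECOVERY_STEPS = [
    ("restore storage volumes", timedelta(hours=2)),
    ("recover databases", timedelta(hours=1, minutes=30)),
    ("boot and verify servers", timedelta(minutes=45)),
]


def total_recovery_time(steps):
    """Sum the durations of recovery steps executed serially."""
    return sum((duration for _, duration in steps), timedelta())


def meets_rto(steps, rto):
    """Return True if the serial recovery time fits within the RTO."""
    return total_recovery_time(steps) <= rto


if __name__ == "__main__":
    total = total_recovery_time(RECOVERY_STEPS)
    print("Estimated recovery time:", total)
    print("Meets a 4-hour RTO:", meets_rto(RECOVERY_STEPS, timedelta(hours=4)))
```

If the estimate exceeds the objective, no amount of optimism will fix it during the outage; the plan itself has to change.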
As the oven was getting warm the seams between the ceramic tiles started leaking smoke (a regular annoyance when you fire up this type of oven after it has been sitting unused for a while).
Recovery procedures that haven’t been used or tested for a long time might have a few glitches. You might discover out-of-date configurations and missing firewall/load balancer rules. There’s a good reason you should start the diesel generators every few weeks.
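One way to catch missing rules before you need them is to compare the rule sets of the primary and backup sites on a regular basis. The following is a minimal sketch, assuming you can already export the rules from both sites as simple (protocol, port, destination) tuples; the function name, the tuple format, and the sample rules are hypothetical and not tied to any firewall product.

```python
def find_rule_drift(primary_rules, backup_rules):
    """Compare two rule sets and report the differences.

    Each rule is a hypothetical (protocol, port, destination) tuple.
    Returns rules missing from the backup site and rules that exist
    only on the backup site (possibly stale leftovers).
    """
    primary = set(primary_rules)
    backup = set(backup_rules)
    return {
        "missing_on_backup": sorted(primary - backup),
        "only_on_backup": sorted(backup - primary),
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only
    primary = [("tcp", 443, "web-vip"), ("tcp", 5432, "db-cluster")]
    backup = [("tcp", 443, "web-vip"), ("tcp", 8080, "old-app")]
    drift = find_rule_drift(primary, backup)
    for kind, rules in drift.items():
        for rule in rules:
            print(kind, rule)
```

A check like this does not replace an actual failover test, but running it every few weeks is cheaper than discovering the differences in the middle of an outage.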
Unfortunately that wasn't all
We got power after ~6 hours, and our house has better insulation than I expected ;), but I was still bound to get a few more hard lessons. As I'm updating this post, 10% of Slovenia has no power and it might take days before it's restored.
The true heroes
Every emergency has its true heroes - in this case the servicemen from the power distribution companies who have been working day and night to restore the power, and the firefighters (most of them volunteers) who are still removing all the trees blocking the local roads. Thank you, guys - you're my heroes!