MUST READ: Fast and Simple Disaster Recovery Solution
More than a year ago I was enjoying a cool beer with my friend Nicola Modena who started explaining how he solved the “you don’t need IP address renumbering for disaster recovery” conundrum with production and standby VRFs. All it takes to flip the two is a few changes in import/export route targets.
I asked Nicola to write about his design, but he’s too busy doing useful stuff. Fortunately he’s not the only one using common sense approach to disaster recovery designs (as opposed to flat earth vendor marketectures). Adrian Giacometti used a very similar design with one of his customers and documented it in a blog post.
TL&DR Summary:
- Layer-3 DCI
- No stretched VLANs
- Simple storage replication between sites
- Recovery site is a ready-to-go hot standby: storage is ready, networking is ready, all it needs are the virtual machines
- Production and Recovery VRFs use the same IP addressing internally. They are never connected directly.
- He complicated the design a bit with NAT and probing-based DNS. I’m positive it would be possible to get rid of these requirements following Nicola’s approach.
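For readers who haven't seen the route-target trick before, here's a minimal IOS-style sketch of what "flipping the two VRFs" could look like at the WAN/DCI edge. The VRF names, ASN, and route-target values are hypothetical; the details obviously depend on your existing L3VPN design:

```
! Steady state: the shared/DCI VRF imports production routes
vrf definition SHARED
 rd 65000:100
 address-family ipv4
  route-target export 65000:100
  route-target import 65000:1     ! production VRF routes
!
! Disaster declared: repoint the import to the recovery VRF
vrf definition SHARED
 address-family ipv4
  no route-target import 65000:1
  route-target import 65000:2     ! recovery VRF routes
```

Because the production and recovery VRFs use the same internal addressing and are never connected directly, swapping a single import statement is all it takes to make the recovery site reachable under the same IP addresses.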
He concluded his blog post with three rhetorical questions that I couldn’t resist answering:
Is technology evolving, or is it just a pile of old stuff rebranded by new developers?
It’s just a pile of old stuff (see also RFC 1925 rule 11). Unfortunately most developers don’t care about history and thus repeat its mistakes ad nauseam.
Why haven’t I seen these scenarios before? I know it might sound weird, but it is completely viable as a basic case study.
Because these scenarios don’t fulfill two requirements:
- You can’t fake disaster recovery testing with them, because it’s impossible to move a single VM to the other site, claim “mission accomplished”, and go home.
- These designs don’t make any money for the networking or virtualization vendors – they work well on any gear that supports VRFs, and are not complex enough to justify selling new gear.
How far away were we from having an active-active scenario?
Very, very far. More in an upcoming blog post.
If I understand Adrian's solution correctly, he would also have VRF Prod and VRF DRP at the DRP site (see the last picture in his blog post). Both VRFs (Prod and DRP) at the DRP site are connected via so-called pivot servers for management purposes; those pivot servers probably have one leg in VRF Prod (10.1.0.0/16) and one leg in VRF DRP (10.0.0.0/16). For customer-facing services he relied on NAT and on the ISP rerouting their public IP space. The latter is suboptimal, as you depend on someone else (the ISP in this case). It's also questionable how often the VMs and databases (storage) are replicated to the DRP site.

Adrian's solution is not bad, but it's convoluted. Nicola's solution looks more elegant to me, as he only needs to reprogram route-target imports/exports. I'm bothered by having the same IP addresses on both the production and the DR site, as it's not really needed.

Most applications need some form of high availability, which results in distributed systems, and in the end it's all about the CAP theorem: you can choose two of the three guarantees, never all three. The most important thing is your data. Maybe you can live with eventual consistency; if so, spend more time on proper, intelligent load balancing (or failover). If you depend on consistency of your data, then your application has to deal with failures.
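The probing-based failover mentioned above can be sketched in a few lines of Python (the endpoint addresses and port below are made up, and a real health check would be far more thorough than a single TCP connect):

```python
import socket

def pick_endpoint(primary, standby, probe_port=443, timeout=2.0):
    """Return the primary endpoint if a TCP probe to it succeeds,
    otherwise fall back to the standby endpoint."""
    try:
        # Attempt a TCP connection to the primary service port
        with socket.create_connection((primary, probe_port), timeout=timeout):
            return primary
    except OSError:
        # Connection refused or timed out: fail over to the standby site
        return standby
```

The same predicate could drive a dynamic DNS update or a load-balancer pool change; the point is simply that the decision logic is trivial once the recovery site is a hot standby with identical addressing behind NAT.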
As always, make it a network problem
The VRFs are connected through a NATing device. The pivot servers sit inside the DRP VRF and are accessed from the outside like any other service in the DRP VRF, through NAT. We only NAT the main service IPs (like the load-balancer VIP) and the pivot server (so admins can enter the DRP and manage it locally), and nothing else. Internal users come in over the internal MPLS network, while Internet users come from the Internet, and there we had no choice but to ask the ISP to reroute the segment. As I mentioned, it's not easy to reconfigure hundreds of business partners' VPNs.

Nicola's solution is good, but it requires human intervention and BGP route updates across the whole region; that shouldn't be a big deal, and it could even be automated. This solution is different since it doesn't require any BGP changes and is always online, but of course there's a trade-off: the NAT. I should write a part 2 post; I oversimplified the first one.