You Don't Need IP Renumbering for Disaster Recovery
This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:
Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it against the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IPs, outdated IPAMs, firewall rules that need updating, etc.
Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.
What could be the simplest solution to get the same subnets and IP addresses you had in the now-destroyed data center reawakened in the new site? Obviously the IP addresses will start popping up as soon as the virtual machines are restarted, but what about the networking infrastructure?
Here are just a few simple ideas:
When using data center switches as first-hop routers: Pre-provision all VLANs and shut down SVI interfaces. Enable SVI interfaces as part of the disaster recovery process.
When using firewalls as first-hop routers: Pre-provision all VLANs and firewall contexts, and shut down the unused contexts (or interfaces in those contexts). Enable the firewall as part of the disaster recovery process.
When using virtual firewall appliances: Pre-provision all VLANs and everything else happens auto-magically once the firewall VMs are restarted in the disaster recovery site.
When you know how to spell MPLS/VPN: Pre-provision the whole infrastructure in another VRF (that you can also use for DR testing) and enable it by changing import route targets on WAN edge routers.
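To make the first idea more tangible, here’s roughly what the pre-provisioning could look like on a data center switch. This is a minimal sketch assuming Cisco IOS-like syntax; the VLAN number, name, and subnet are made up for illustration:

```
! Pre-provisioned at the DR site: VLAN and SVI exist, but the
! first-hop gateway stays dormant until disaster strikes.
vlan 110
 name app-front-end
!
interface Vlan110
 ip address 10.1.10.1 255.255.255.0
 shutdown
!
! Disaster recovery runbook step: activate the first-hop gateway
!   interface Vlan110
!    no shutdown
```

The same subnet and gateway address come alive the moment the SVI is enabled, so the restarted VMs keep their old IP addresses without a stretched VLAN ever existing between the sites.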
I’m positive you can quickly find a few others. However, all of these ideas have a series of “shortcomings”:
- They cannot be used for disaster recovery test faking (that often fails anyway);
- They require the networking team to be involved in the disaster recovery process (OMG, what a weird idea!);
- They require continuous synchronization of configuration changes between the primary and disaster recovery infrastructure. Not a big deal if you’ve automated configuration changes, use infrastructure-as-code principles, or use something as simple as Oxidized… and obviously a total deal-breaker if you’re in the habit of randomly clicking various GUI options on a Friday evening trying to fix a botched deployment.
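The synchronization requirement can start as something as simple as a periodic drift check between the two sites’ configurations. Here’s a minimal Python sketch using only the standard library; the sample configs and the list of legitimately site-specific line prefixes are hypothetical:

```python
import difflib

def config_drift(primary: str, dr: str,
                 ignore_prefixes=("hostname",)) -> list:
    """Return unified-diff lines between primary and DR configs,
    skipping lines that are legitimately site-specific."""
    def relevant(text):
        return [line for line in text.splitlines()
                if not line.strip().startswith(ignore_prefixes)]
    return list(difflib.unified_diff(relevant(primary), relevant(dr),
                                     fromfile="primary", tofile="dr",
                                     lineterm=""))

# Hypothetical configs: the DR copy should differ only in the shut-down SVI
primary_cfg = """hostname dc1-sw1
vlan 110
interface Vlan110
 ip address 10.1.10.1 255.255.255.0
"""
dr_cfg = """hostname dc2-sw1
vlan 110
interface Vlan110
 ip address 10.1.10.1 255.255.255.0
 shutdown
"""
drift = config_drift(primary_cfg, dr_cfg)
print("\n".join(drift))
```

Anything in the diff beyond the expected “shutdown” lines means someone changed the primary site without updating the DR site, and your recovery plan just quietly rotted.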
Long story short: PLEASE don’t ever tell me you NEED stretched VLANs for disaster recovery. There is absolutely no technical need for them.
Your organization might decide to go down the stretched VLAN path because consultants told them to do so, because you have broken processes, because the virtualization team and the networking team cannot stand each other, or because the application or virtualization teams fake DR tests to get a tick-in-the-box during the annual audit.
In any case, stretched VLANs are the wrong tool for building disaster recovery infrastructure. By implementing them you create a permanent ticking bomb just to solve someone else’s problem, and you’ll be the one blamed when it goes off. Good job.
Fortunately, even though almost everyone else is selling you VXLAN/EVPN-based stretched VLANs as the latest miracle cure, VMware finally realized that you should recover networking infrastructure as the first step of overall workload recovery. Their disaster recovery approach to multi-site NSX-T deployments makes a lot of sense (active-active multi-site NSX-T deployments are still as bad in release 2.5 as they were before).
More Information
- You might want to watch Building Active-Active and Disaster Recovery Data Centers webinar;
- Multi-site fabrics are covered in the Leaf-and-Spine Fabric Architectures webinar. The underlying technologies are described in EVPN and VXLAN webinars;
- I described NSX-T multi-site deployment as the last topic in November 2019 NSX-T update sessions.
All these webinars are part of Standard ipSpace.net Subscription. For even more goodies check out the Building Next-Generation Data Center online course (part of Expert ipSpace.net Subscription).
This is all great if your disaster is the entire data center. You are ignoring the fact that sometimes a single storage system, database, or app cluster fails and needs to be recovered in another location. You shouldn't fail the entire data center for one app.
If a storage system fails (be it a storage array or a database), then you have to restart everything attached to it in another location anyway, or suffer the consequences of increased RTT. With a decent design, that can still be solved by moving the whole IP subnet to another DC.
Obviously, it would be much better to renumber VMs and use DNS, but if the app developers wouldn't hard-code IP addresses in source code, networking and virtualization vendors wouldn't be able to sell unicorn farts anyway.