You Don't Need IP Renumbering for Disaster Recovery

This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:

Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it, with the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IP’s, outdated IPAM’s, Firewall rules that need updating, etc.

Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.

On a related topic, don’t ever start the disaster recovery process until the primary site is gone for good. I’ve heard horror stories of people managing to reconnect a newly-activated DR site back to the still-operational (but previously disconnected) primary site. It wasn’t pretty…

What could be the simplest solution to get the same subnets and IP addresses you had in the now-destroyed data center reawakened in the new site? Obviously the IP addresses will start popping up as soon as the virtual machines are restarted, but what about the networking infrastructure?

Here are just a few simple ideas:

When using data center switches as first-hop routers: Pre-provision all VLANs and shut down SVI interfaces. Enable SVI interfaces as part of the disaster recovery process.

When using firewalls as first-hop routers: Pre-provision all VLANs and firewall contexts, and shut down the unused contexts (or interfaces in those contexts). Enable those contexts (or interfaces) as part of the disaster recovery process.

When using virtual firewall appliances: Pre-provision all VLANs and everything else happens auto-magically once the firewall VMs are restarted in the disaster recovery site.

When you know how to spell MPLS/VPN: Pre-provision the whole infrastructure in another VRF (that you can also use for DR testing) and enable it by changing import route targets on WAN edge routers.
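The first idea (pre-provisioned SVIs that stay shut down until disaster strikes) is easy to script. Here's a minimal sketch, not a tested runbook: the `Svi` class, the `svi_commands` function, and the VLAN numbers are all hypothetical, and the generated lines assume Cisco IOS-style CLI syntax. Getting those lines onto the switches is deliberately left out, because that depends on whatever automation tooling you already have.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Svi:
    """A pre-provisioned first-hop interface at the DR site (hypothetical model)."""
    vlan_id: int
    description: str


def svi_commands(svis: list[Svi], enable: bool) -> list[str]:
    """Return IOS-style CLI lines that enable or shut down each SVI.

    Assumption: the SVIs already exist in the DR switch configuration,
    so the only thing that changes is their administrative state.
    """
    state = " no shutdown" if enable else " shutdown"
    lines: list[str] = []
    for svi in sorted(svis, key=lambda s: s.vlan_id):
        if not 1 <= svi.vlan_id <= 4094:
            raise ValueError(f"invalid VLAN ID: {svi.vlan_id}")
        lines.append(f"interface Vlan{svi.vlan_id}")
        lines.append(state)
    return lines


if __name__ == "__main__":
    # Made-up example data; a real list would come from your source of truth.
    dr_svis = [Svi(20, "app tier"), Svi(10, "web tier")]
    print("\n".join(svi_commands(dr_svis, enable=True)))
```

Calling the same function with `enable=False` gives you the commands for the steady state (and for cleaning up after a DR test), which keeps activation and deactivation symmetric.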

I’m positive you can quickly find a few others. However, all of these ideas have a series of “shortcomings”:

  • They cannot be used for disaster recovery test faking (which often fails anyway);
  • They require the networking team to be involved in the disaster recovery process (OMG, what a weird idea!);
  • They require continuous synchronization of configuration changes between the primary and the disaster recovery infrastructure. Not a big deal if you’ve automated configuration changes, use infrastructure-as-code principles, or use something as simple as Oxidized… and obviously a total deal-breaker if you’re in the habit of randomly clicking various GUI options on a Friday evening trying to fix a botched deployment.
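To make the synchronization point a bit more concrete, here's a rough sketch of a drift check between two saved device configurations. It uses nothing but Python's standard `difflib` module (it is not how Oxidized works, just a stand-in for the idea). The function names, the `IGNORED` set, and the sample configurations are my assumptions; in particular, I'm assuming the only intentional difference between the two sites is the `shutdown` state of pre-provisioned interfaces.

```python
from __future__ import annotations

import difflib

# Assumption: these lines are expected to differ between the primary and
# the DR site (pre-provisioned interfaces are shut down), so ignore them.
IGNORED = frozenset({"shutdown", "no shutdown"})


def normalize(config: str) -> list[str]:
    """Drop blank lines, '!' comment lines, and intentionally different lines."""
    lines: list[str] = []
    for raw in config.splitlines():
        line = raw.rstrip()
        stripped = line.strip()
        if not stripped or stripped.startswith("!") or stripped in IGNORED:
            continue
        lines.append(line)
    return lines


def config_drift(primary: str, dr: str) -> list[str]:
    """Return unified-diff lines; an empty list means no unexpected drift."""
    return list(
        difflib.unified_diff(
            normalize(primary),
            normalize(dr),
            fromfile="primary",
            tofile="dr",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Made-up sample configurations: VLAN 30 was never added to the DR site.
    primary = "interface Vlan10\n no shutdown\ninterface Vlan30\n no shutdown\n"
    dr = "interface Vlan10\n shutdown\n"
    print("\n".join(config_drift(primary, dr)))
```

Run something like this after every change window and you'll find the VLAN someone forgot to add to the DR site long before you need it.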

Long story short: PLEASE don’t ever tell me you NEED stretched VLANs for disaster recovery. There is absolutely no technical need for them.

Your organization might decide to go down the stretched VLAN path because consultants told them to do so, because you have broken processes, because the virtualization team and the networking team cannot stand each other, or because the application or virtualization teams fake DR tests to get a tick-in-the-box during the annual audit.

In any case, stretched VLANs are the wrong tool to build disaster recovery infrastructure. By implementing them just to solve someone else’s problem, you create a permanent ticking bomb that you’ll be blamed for when it goes off. Good job.

Fortunately, even though almost everyone else is selling you VXLAN/EVPN-based stretched VLANs as the latest miracle cure, VMware finally realized that you should recover the networking infrastructure as the first step of overall workload recovery. Their disaster recovery approach to multi-site NSX-T deployments makes a lot of sense (active-active multi-site NSX-T deployments are still as bad in release 2.5 as they were before).

You might be asked to implement stretched VLANs for other reasons. Most of them are equally bogus.

