Disaster Recovery and Failure Domains

One of the responses to my Disaster Recovery Faking blog post focused on failure domains:

What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?

I hope you agree that a single broadcast domain is a single failure domain. If not, let's agree to disagree and move on - my life is too short to argue about obvious stuff.

Having a VLAN stretched across multiple pods destroys the idea of having pods in the first place (unless you need them for physical structure and/or scaling reasons) - you merged two potential availability zones into one.

Doing the same across multiple data centers has the same effect: you destroyed all the benefits of the large investment you made when building a second site, unless you use the second site solely as a cold/warm backup for the primary site. I don’t know many organizations where the CIO or the bean counters would agree to that approach.

There’s also a minor technical detail: WAN links fail more often than data center infrastructure (see also: fiber finder in its natural habitat). Stretching a VLAN across two data centers to build an active-active architecture introduces a weak link and effectively reduces the availability of that architecture.
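To put rough numbers on the weak-link argument, here is a back-of-the-envelope sketch. It assumes the stretched design behaves like a series system (a failure of either fabric or of the WAN link disrupts the shared broadcast domain) and that failures are independent. The availability figures and the helper names are made up for illustration; plug in your own measurements.

```python
def series_availability(*components: float) -> float:
    """Availability of a system that needs every listed component to be up.

    Assumes independent failures, which is a simplification.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability) * 365 * 24


# Hypothetical figures, not measurements.
DC_FABRIC = 0.9999  # assumed availability of one data center fabric
WAN_LINK = 0.999    # assumed availability of the inter-site link

single_site = series_availability(DC_FABRIC)
stretched = series_availability(DC_FABRIC, WAN_LINK, DC_FABRIC)

print(f"single site:    {single_site:.6f}, "
      f"{annual_downtime_hours(single_site):.2f} h/year down")
print(f"stretched VLAN: {stretched:.6f}, "
      f"{annual_downtime_hours(stretched):.2f} h/year down")
```

Under these assumptions the stretched design can never be more available than its weakest component, which in this sketch is the WAN link.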

Fortunately, you can’t (easily) make the same mistake if you use a public cloud as your backup site - most public cloud providers are sane enough to work exclusively on layer 3 and offer direct L3 (routed) links or IPsec-based VPN connectivity as the only means of building hybrid clouds (even when the whole thing looks like stretched VLANs to untrained eyes). Not surprisingly, some enterprise networking or virtualization vendors offer all sorts of crazy schemes on top of IP transport to stretch VLANs to places where they don’t belong.

