Disaster Recovery and Failure Domains

One of the responses to my Disaster Recovery Faking blog post focused on failure domains:

What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?

I hope you agree that a single broadcast domain is a single failure domain. If not, let's agree to disagree and move on - my life is too short to argue about obvious stuff.

Having a VLAN stretched across multiple pods destroys the idea of having pods in the first place (unless you need them for physical structure and/or scaling reasons) - you merged two potential availability zones into one.

Doing the same across multiple data centers has the same effect: you destroyed all the benefits of the large investment you made when building a second site, unless you use the second site solely as a cold/warm backup for the primary site. I don’t know many organizations where the CIO or the bean counters would agree to that approach.

There’s also a minor technical detail: WAN links fail more often than data center infrastructure (see also: fiber finder in its natural habitat). Stretching a VLAN across two data centers to build an active-active architecture introduces a weak link and effectively reduces the availability of that architecture.
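To put rough numbers on the weak-link argument, here is a back-of-the-envelope sketch. The availability figures are made up for illustration, and the model is deliberately simplistic: it assumes that an active-active design on a stretched VLAN needs both sites and the WAN link to work (components in series), while two independent sites keep the service running as long as either site is up.

```python
# Hypothetical availability figures, chosen only for illustration.
SITE = 0.9999   # assumed availability of one data center
WAN = 0.999     # assumed availability of the inter-site link

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(*components: float) -> float:
    """Availability of a group that works if at least one member is up."""
    all_down = 1.0
    for availability in components:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_minutes(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Stretched VLAN: one failure domain spanning both sites and the WAN link.
stretched = series_availability(SITE, SITE, WAN)

# Two independent failure domains: either site can carry the service.
independent = parallel_availability(SITE, SITE)

if __name__ == "__main__":
    for name, value in (("stretched", stretched), ("independent", independent)):
        print(f"{name}: {value:.6f} ({downtime_minutes(value):.1f} min/year)")
```

Under these assumptions, the stretched design can never be more available than its weakest component, which is the WAN link. Real designs are messier than a series/parallel model, but the direction of the result stays the same.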

Fortunately, you can’t (easily) make the same mistake if you use a public cloud as your backup site - most public cloud providers are sane enough to work exclusively on layer 3 and offer direct L3 (routed) links or IPsec-based VPN connectivity as the only means of building hybrid clouds (even when the whole thing looks like stretched VLANs to untrained eyes). Not surprisingly, some enterprise networking or virtualization vendors offer all sorts of crazy schemes on top of IP transport to stretch VLANs to places where they don’t belong.

More Information

Latest blog posts in Disaster Recovery series
