Disaster Recovery: a Vendor Marketing Tale
Several engineers who used to work for a large virtualization vendor were pretty upset with me when I claimed that virtualization consultants promote “disaster recovery using stretched VLANs” designs instead of alternatives that properly separate failure domains.
Guess what… it’s even worse than I thought.
Here’s a sequence of comments I received after reposting one of my “disaster recovery doesn’t need stretched VLANs” blog posts on LinkedIn sometime in late 2019:
“It’s just simpler to stretch layer 2 and failover edge routing for most customers I work with. There are different ways to skin a cat, but stretching layer 2 is a great way to solve DR.”
Stretching layer 2 is also a great way to bring down two data centers, and I’ve seen more people get that result than a working disaster recovery solution (because in many cases the solution proposed by the consultants doesn’t work anyway).
“You don’t NEED stretched layer 2, but it is the right solution for some. Particularly in the sector where I work. Customers want DR, but don’t have the processes/operators to do anything more involved.”
OK, I understand that some customers want a tick-in-the-audit fake disaster recovery that’s best implemented using stretched VLANs, but that was not the case here…
“They generally are using SRM or some other sort of VM replication to re-instantiate then following a runbook process to bring other systems back online, which includes migration of edge. But we’re talking about orgs who are small enough to accept RTO of 24-48 hours. Recovery time is a little more flexible when you’re not losing money each moment you’re down!”
Now read the above paragraph twice, brew a coffee, and allow it to sink in.
- They are using VMware Site Recovery Manager (a decent recovery orchestration tool) that has the capability to add external hooks. It would be trivial to enable networking in the recovery site as the first step of the SRM recovery process (see this blog post for a few ideas);
- The Recovery Time Objective (RTO) is 24-48 hours, and yet they’d risk having a ticking bomb instead of a stable network just so the server/virtualization team wouldn’t have to work with the networking team during the recovery process?
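To make the first point a bit more tangible, here’s a minimal sketch of a recovery runbook in which enabling the recovery-site network is the first step and everything else waits for it. It’s plain Python, not the SRM API: `run_recovery_plan`, the step names, and the three placeholder functions are all hypothetical, and a real hook would call whatever automation your networking team uses.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_recovery_plan(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # never start workloads on a network that isn't ready
        completed.append(name)
    return completed


# Hypothetical placeholders: each would call real automation in practice.
def enable_recovery_site_networking() -> bool:
    # Assumption: activates the application subnets in the recovery site
    # and starts advertising them from there.
    return True


def power_on_replicated_vms() -> bool:
    # Assumption: hands control back to the recovery orchestration tool.
    return True


def verify_application_health() -> bool:
    # Assumption: runs basic end-to-end checks against the recovered stack.
    return True


RECOVERY_PLAN: List[Step] = [
    ("enable recovery-site networking", enable_recovery_site_networking),
    ("power on replicated VMs", power_on_replicated_vms),
    ("verify application health", verify_application_health),
]

if __name__ == "__main__":
    for step_name in run_recovery_plan(RECOVERY_PLAN):
        print("done:", step_name)
```

The only design choice worth noting is the ordering: the network step runs first and a failure stops the plan, so no stretched VLAN has to exist before the disaster.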
I’m positive the engineer writing those comments had the best intentions and did his best to help his customers. It’s just that he based his design on white papers from a major virtualization vendor and a major networking vendor… both of them having a vested interest in peddling ever more complex solutions instead of robust, time-tested designs (that could be implemented with boxes from any networking vendor).
Long story short: If you base your design on vendor white papers, you get the results you deserve. Step back, figure out what you really need, and find the simplest way (across the whole application stack) to implement it instead of pushing the problems down the stack.
"There are different ways to skin a cat, but stretching layer 2 is a great way to cause DR."
"You don’t NEED stretched layer 2." (period)
"Recovery time is not losing money each moment you’re down!"