Disaster Recovery Faking, Take Two

An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.

Here’s what he did:

Once I was tasked with doing a DR test before handing over the solution to the customer. To simulate the loss of a data center, I suggested physically shutting down all core switches in the active data center.

That’s the right first step: simulate the simplest scenario (total DC failure) in a way that is easy to recover from (power the switches back on). It worked…

No sooner said than done. Spanning tree converged pretty fast. The stretched VLANs were all functional at the DR site.

… but it also exposed all the shortcomings of PowerPoint-based designs promoted by $vendor engineers and copycat consultants:

But there were split-brain scenarios on firewalls all over the place.

Also, nobody ever thinks about storage when talking about automated disaster recovery using stretched VLANs because “we do synchronous replication so what could possibly go wrong”. Well, as expected:

Even worse, the storage (iSCSI) didn’t survive because it couldn’t establish quorum, so the storage wasn’t accessible at the DR site.

Final result: total failure.

After a few minutes no machines (including the virtual load balancers) were reachable. vCenter wasn’t reachable anymore either, so they couldn’t do anything. After bringing the core switches back online it took them hours to recover from the mess.
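
The quorum failure deserves a second look, because it’s the same math every time: a majority-vote cluster stretched across two sites cannot survive the loss of whichever site holds the tie-breaking vote. Here’s a back-of-the-envelope sketch (made-up vote counts, not the actual iSCSI configuration from the comment):

```python
# Back-of-the-envelope majority-quorum check (made-up vote counts,
# not the actual iSCSI configuration from the comment above).

def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    """A site keeps quorum only with a strict majority of all configured votes."""
    return surviving_votes > total_votes / 2

# Two storage nodes per data center, tie-breaking witness vote placed in DC-A
votes = {"DC-A": 2 + 1, "DC-B": 2}
total = sum(votes.values())

# Simulate losing the active data center (DC-A): only DC-B's votes remain
print(has_quorum(votes["DC-B"], total))  # False -> storage freezes at the DR site
```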

Let me recap: someone proposed a disaster recovery architecture (stretched VLANs) that is not only totally broken from a resilience perspective and a potential ticking time bomb, it doesn’t even work when needed.

Just in case you think that’s an isolated incident: major networking and virtualization vendors are selling the same fairy tale to their customers every day without ever mentioning the drawbacks or caveats.

How do you think the story continued?

They never fixed the broken architecture (firewall/load balancer cluster and iSCSI storage).

On the second attempt they faked the disaster recovery test like you described: they carefully moved some machines to the DR site, but did not fail over the firewalls and load balancers.

In the internal review discussion they had to admit that the network failed over as expected and in a reasonable amount of time.

As expected, cognitive dissonance kicked in. It was easier to pretend there was no problem and fake the results while doing the minimum amount of work. Who cares that the house of cards would collapse the first time it’s really needed.

He concluded with:

Almost everyone these days is relying on 100% network availability. It’s a fallacy. Sometimes they have to learn it the hard way.

The only problem is that some people never learn… sometimes because they don’t want to, sometimes because they can’t grasp the idea of external "experts" misleading them, sometimes because they don’t understand the basics of reliability theory.
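
If you need a quick illustration of the reliability theory part: the availability of components in series is the product of their individual availabilities, so a service that only works when the stretched VLAN, the storage quorum, and the firewall cluster all work is less available than any one of them. A back-of-the-envelope calculation (made-up availability figures, purely illustrative):

```python
# Serial availability sketch (made-up availability figures, purely illustrative):
# when every component has to work for the service to work, availabilities
# multiply, and the result is always worse than the weakest link.

from functools import reduce

components = {
    "stretched VLAN / DCI link": 0.999,
    "storage quorum":            0.9995,
    "firewall cluster":          0.999,
    "load balancer cluster":     0.999,
}

availability = reduce(lambda acc, a: acc * a, components.values(), 1.0)
downtime_hours = (1 - availability) * 365 * 24

print(f"end-to-end availability: {availability:.4%}")         # roughly 99.65%
print(f"expected downtime: {downtime_hours:.1f} hours/year")  # roughly 30 hours
```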

Don’t be like that. If you want to discover how networks really work, including the impact of the fallacies of distributed computing, the basics of reliability theory, or how to build disaster-recovery or active-active data centers, we have you covered.


11 comments:

  1. Let me share a system vendor's perspective & experience: from this perspective, failure & recovery (disaster recovery) should be sold like any other feature (it should be a part of acceptance tests).
    The vendor does not want to avoid responsibility, but it also does not want to be blamed for not performing miracles.
    The crucial thing is to define what kinds of failures should be covered (handled) and what the expected system response to these failures is (the connectivity gap, etc.).
    So the whole problem is not only about the technology, which is neither good nor bad (cheap solutions still win as long as the consequences for business continuity are clearly shown and accepted).
    The problem is when the customer expects 'miracles' where the technology cannot deliver them.
    ---
    So define the requirements for system survivability (what kinds of failures are supposed to be handled, do we support more than ONE failure?, what about the gap between subsequent failures - sometimes systems cannot survive two failures when the gap between them is too short to recover, recovery is NOT only about the networking part - we have servers as well, improve the weakest part first, etc.)
    Then test them during the system acceptance testing.
    Of course real life is much more complex and you will be hit by something unexpected. But be specific about what you sell/advertise. The customer tends to assume that there is 'magic' under the hood. So ask how this magic works.
    Don't expect the magic to "fix" the laws of physics. An honest vendor should talk to the customer about shortcomings to avoid being blamed in the future.

    Again, this is vendor's perspective.
    Replies
    1. "Again, this is vendor's perspective" << and I totally agree with everything you wrote, and I would love to have a vendor delivering and taking responsibility for an end-to-end solution.

      The crux of the problem is somewhere else - most IT vendors supply spare parts for DIY kits, tell their customers how wonderful the creations based on those DIY kits can be, but never take any responsibility for the end result. Some go as far as blaming everyone else, and telling people how to misconfigure boxes connected to their miraculous creations to make the whole thing work.

      Now we can say "but it's the customers claiming that every network is different", and that's absolutely right... but I doubt that it's in any vendors' interest to change that mentality.
  2. Yes. You are right. When you deliver the SYSTEM as the PRODUCT and want to build it out of 'DIY Kits' you need EXTRA developer support from the vendor (= they customize/develop the features you need to achieve what you need). This is the sad side of the IT 'DIY Kits' business: whenever you want to do something more advanced, the vendor offers extra support for extra money. In my quite long professional history I have seen it several times.

    This proves that 'DIY Kits' are often useless. That's why programmable devices (when you have your own development force) are sometimes the way to go. You don't need to spend time and money on working with the vendor's development team to teach them what the use case is.

    You touched on the important part: the vendors of the 'DIY Kits' often do not address real-life use cases - they are somewhat alienated from real life - but that's another story.
  3. Ivan, I support your mission to avoid stretched VLANs, but this is not a fairy tale of virtualization consultants. :)
    The virtualization layer in DR does not require stretched layer-2. As described in this case, a firewall cluster had a split brain. So the question is: how many firewall vendors support L3 clustering (not over VXLAN)? Checkpoint? Anyone else? Even without this feature, L2 is still not needed for a DR scenario. Active-active and active-standby scenarios are a different story. L2 is overused, but with overlays it can at least be limited here. To have fully L3-separated domains in the active-active model, a customer would need to rewrite his applications. Stateless architecture is the answer. Then the state of the applications won't be stored in the front/app layer, which frequently requires the L2 extension.
    Replies
    1. Let me start with a real-life story: https://blog.ipspace.net/2013/01/long-distance-vmotion-stretched-ha.html - I know for a fact what color the background on those slides was.

      Also, let's be realistic. Which vendor promotes these in their marketing materials as a DR solution:

      * stretched compute clusters with affinity rules to prevent VMs from escaping into the wrong data center (don't get me started on this one...);
      * stretched distributed file system instead of storage replication;
      * long-distance VM mobility for "disaster avoidance"

      ... and what industry segment is that vendor most known for?

      Disaster recovery done right does NOT require stretched anything, but in that case the teams in an organization have to TALK to each other and SYNCHRONIZE their actions. OMG, what a niche to exploit ;) Remind me again, which vendor was the first one telling their customers "you don't need to talk to anyone else, just ask for stretched VLAN and we can do the rest" story?
    2. Ivan, similarly we could say that EVPN is not recommended now because in 2013 there was no route type 5 support in BGP. vMotion has not required L2 since 2015. At that time, every vendor promoted stretched VLANs by offering solutions requiring L2 - firewalls, load balancers, MS clusters, storage clusters. All those fabrics based on VPLS, FP, TRILL, OTV were solutions to do the same thing - extend L2. At that time, everyone was a promoter. But saying today that stretched VLANs are a fairy tale of virtualization consultants is not valid. Remind me again, which component failed after powering off the physical switches? :)
    3. You conveniently ignored "even worse, the storage didn't survive" ;) ... oh, and don't blame Microsoft. They've had DNS-based clustering for ages.

      In any case, while there are no innocents in this mess, we are both old enough to know how the stretched VLAN story started, and no amount of lipstick will make this pig look any better.

      I'm stopping my end of the discussion.
    4. We discussed virtualization consultants and their fairy tales. Who is the vendor of this iSCSI storage? If it's VMware then, of course, it contributed to the failure. ...and a good DNS option does not mean that customers are always using it; unfortunately, the mcast-based solution was quite common.

      Today, there is a new promoter of L2 extension - containers. In my opinion, we should try to evangelize developers and business decision-makers more than vendors. ;)
  4. Thank you for dedicating a blog post to what I experienced. As so often happens, after the total failure the system administrators blamed the network. I had to prove that there was no problem at layer 2 with the help of MAC address table, interface, spanning tree, and LACP events. Finally they stopped blaming the network and wrote the problem off because in their eyes a DC failure would be a rare case. Notice that they had sold a "geo-redundant high availability solution" to the customer.

    Also, with the same employer I had a second experience (with our own DCs) that wasn't nearly as much fun. They still have SLAs with huge financial impact that would be triggered if connectivity is lost for seconds. To "meet" the SLAs they spanned VLANs for a lot of applications, but not for all of them (they sold the non-redundant applications as "highly available" regardless). Their assumption was that everything except the non-redundant applications would survive a DC failure. I wasn't sure about their assumption, so I suggested another shutdown of the core switches. I went to change management to find a maintenance window and prepare a change. The answer was "Don't you dare! If you do that, everybody can start looking for a new job". So, as you said, it's a ticking bomb.

    In all the years I've been working in the networking field I have always seen spanned VLAN deployments fail miserably. A spanned VLAN is still a single point of failure, no matter what. Most highly available (failover) solutions out there will fail in the event of a DC failure because they are untested. I don't give up: every time there's the possibility of a DR test I will do a core switch shutdown and enjoy the mess.
    Replies
    1. Agreed that sooner or later an architecture based on stretched VLANs fails. That's why I like the decoupled architecture promoted by AWS. L2 cannot be extended between Availability Zones there. Even if someone does not like the public cloud, this approach can be replicated in local data centers. :)
  5. I have always respected the Chaos Monkey at Netflix, which randomly shuts down services to test resilience and recovery plans, after having seen so many perfect-in-PPT-only architectures. I think the problem is not limited to stretched VLANs but extends to other things as well.

    Just as Bogdan Golab said, a perfect solution may not be good, since sometimes vendors need something to be broken so they can sell more.