Stretched VLANs and Failing Firewall Clusters

Tuesday, November 12, 2019 08:14 +0100

After publishing the Disaster Recovery Faking, Take Two blog post (you might want to read that one before proceeding), I was severely reprimanded by several people with ties to virtualization vendors for blaming virtualization consultants when it was obvious that the firewall clusters stretched across two data centers caused the total data center meltdown.

Let’s chase that elephant out of the room first. When you drive too fast on an icy road and crash into a tree, who do you blame?

The person who told you it’s perfectly OK to do so;
The tire manufacturer who advertised how safe their tires were;
The tires for failing to ignore the laws of physics;
Yourself for listening to bad advice?

For whatever reason some people love to blame the tires ;)

Now for stretched firewall clusters. Building a reliable clustering solution is hard - according to some people it’s so hard that non-clustered solutions have higher uptime than clustered ones, because everything else in the system has a lower failure rate than the clustering software.

Even if you ignore that bit of wisdom, building a 2-node cluster is the worst thing you can do. Getting a majority in a 2-vote system is a bit hard, and most clustering solutions get around that limitation by adding a witness node - a fake node that does nothing but help the voting algorithm.
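To make the voting problem concrete, here is a minimal Python sketch of a strict-majority check. The function name and the one-vote-per-node model are my assumptions for illustration; real clustering software is considerably more involved.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may stay active only if it holds a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


# Two-node cluster after the nodes lose contact: each side sees only its own vote,
# so neither side can claim a majority.
two_node = [has_quorum(1, 2), has_quorum(1, 2)]

# Two nodes plus a witness (three votes): the side that still reaches the witness
# holds 2 of 3 votes, the isolated side holds 1 of 3.
with_witness = [has_quorum(2, 3), has_quorum(1, 3)]

print(two_node)      # [False, False]
print(with_witness)  # [True, False]
```

With two votes, a partition leaves both sides at exactly half, which is why a third vote (real or fake) is needed to break the tie.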

Why don’t we build a 3-node firewall cluster? Because nobody wants to pay for the extra node just to have sane clustering behavior.

In compute clusters the disks usually serve as the witness nodes. In firewall clusters we could use virtual machines (or something else), but that would make the firewall cluster dependent on some other infrastructure, and security engineers wouldn’t appreciate that.

Relying on clustered storage as a witness node for compute clusters is “turtles all the way down” and people doing that sometimes get what they deserve: after a split-brain event, both storage arrays become active, both cluster nodes think they have majority, and do fun stuff with the data. Recovery from that scenario usually involves restores from backup tapes.

The firewall vendors “solved” the witness node problem by using the communication paths between the cluster nodes as the tie-breaker. There are at least three independent connections between the members of a firewall cluster: inside link(s), outside link(s) and cluster link. If a cluster node cannot reach its peer through any of these links, it’s safe to assume the peer is dead and take over, right?

Well, that line of reasoning was perfectly valid as long as the three links were independent. When stretching a firewall cluster across multiple sites the three links between the firewalls usually get turned into three VLANs running across shared DCI infrastructure, effectively turning what was supposed to be three independent links into a shared-risk link group. No wonder the solution doesn’t work as advertised.
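The shared-risk problem can be sketched in a few lines of Python. This is a simplified model under my own assumptions (the `Link` class, the `carried_by` labels and the "peer is dead when unreachable over every link" rule are illustrative, not any vendor's actual algorithm).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    carried_by: str  # hypothetical label for the underlying physical path


def peer_presumed_dead(links, failed_paths):
    """Tie-breaker sketch: declare the peer dead only if it is unreachable over every link."""
    return all(link.carried_by in failed_paths for link in links)


# Adjacent firewalls: three links on three separate cables.
local = [Link("inside", "cable-1"), Link("outside", "cable-2"), Link("cluster", "cable-3")]

# Stretched cluster: the same three links are now VLANs on one shared DCI.
stretched = [Link("inside", "dci"), Link("outside", "dci"), Link("cluster", "dci")]

# Losing a single cable does not convince a node that its peer is gone...
local_split = peer_presumed_dead(local, {"cable-3"})

# ...but losing the DCI takes out all three "independent" links at once.
stretched_split = peer_presumed_dead(stretched, {"dci"})

print(local_split, stretched_split)  # False True
```

In the stretched case both nodes run the same check, both conclude the peer is dead, and both go active.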

Could we make it better? Sure, change your design. Oh, without that? Sure, make sure the three links are really independent. Oh, without paying for extra links? Wish you luck - I’m out of here…

Then there’s the “minor” detail of failure probability. In a cluster made from adjacent devices connected with point-to-point links, the links themselves are least likely to fail, and if they fail, they usually fail in a predictable way, so you can assume they are reliable and move on. When stretching the cluster nodes across multiple sites, the links between them become the least reliable component and might exhibit all sorts of gray failures. Do you seriously think the firewall vendors can simulate all those failure scenarios?

Somewhat related: data center fabric vendors faced the same problems (example: Juniper VCS, HP IRF) but made a sane choice: one node in the cluster has two votes, and the minority nodes shut down in a split-brain scenario. For whatever reason that was not found acceptable by the firewall vendors.
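The two-vote idea is easy to sketch as weighted voting. The Python below is a simplified model with hypothetical node names and a helper function of my own; it is not how any particular fabric implements split detection.

```python
def surviving_partition(votes, partitions):
    """Return the partition holding a strict majority of all votes, or None if there is none."""
    total = sum(votes.values())
    for members in partitions:
        if 2 * sum(votes[node] for node in members) > total:
            return members
    return None


# Hypothetical two-node cluster in which node-a carries two votes.
votes = {"node-a": 2, "node-b": 1}

# Split-brain: the nodes cannot see each other. node-a holds 2 of 3 votes and stays up,
# node-b is in the minority and shuts down.
winner = surviving_partition(votes, [{"node-a"}, {"node-b"}])

# The price of the design in this simplified model: if node-a really dies,
# node-b alone never reaches a majority.
after_node_a_failure = surviving_partition(votes, [{"node-b"}])

print(winner)                # {'node-a'}
print(after_node_a_failure)  # None
```

The outcome of a split is deterministic, which is the whole point: at most one side can ever be active.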

That brings me to the final part of this sad story. Any reliability engineer analyzing the whole thing should come to the same conclusion: Don’t ever do it… but why are the firewall vendors still promoting such solutions? I guess it’s always a 737MAX-like story (minus the loss of life): while at least some development engineers know it’s not safe to do things the way they are advertised, the product managers and the sales team are doing whatever it takes to sell the gear, and nobody listens to the engineers anyway.

So what can you do? You claim to be an engineer, right? So learn the fundamentals, figure out how things (probably) work and go from there. Some “solutions” are inherently unsafe - it’s your job to identify them (don’t ever rely on vendors doing that job for you - the FAA learned that the hard way), and work with your peers to build a solution that satisfies the true business needs of your company.

Failing that (for example, when you don’t have the budget for a proper solution), decide how badly you want to fail.

10 comments:

Unknown 12 November 2019 17:10

Hi Ivan, I've always looked at designs with independent firewall nodes connected via routing, but how do you synchronize the connection tables? Is the firewall sandwich design (firewalls between load balancers) an alternative solution?

Ivan Pepelnjak 12 November 2019 20:48

Why would you need to synchronize the connection tables? Does it matter? How often do you get WAN link failures and traffic failover to the other side? What applications can't recover from session loss?

Unknown 13 November 2019 09:49

I agree, the failure scenario does not happen often, but I've seen several applications (not web-based) that have problems recovering from a connection failure. More often it is the customer who does not accept the idea of not having connection table synchronization... until they face a split-brain scenario, of course :)

HairyBear 13 November 2019 04:10

How long have you been saying now that this is a bad idea?

I work for a firewall vendor. RFP requirements have to be met for us to sell product. RFPs always have it as a requirement (even if the customer isn't going to use it). If it's not an RFP, the customer still tells us we have to support it if we want to sell firewalls to them. If the vendor engineers (like me) try to tell them it's a bad idea, they just think it's because we are trying to sell them more stuff. But the firewalls are a small part of the whole design anyway, so why would they change it based on my advice?

Ivan Pepelnjak 13 November 2019 06:48

"How long have you been saying now that this is a bad idea?" << too long. And yes, I feel like that famous gentleman who had issues with windmills.

I also understand your perspective (and potential frustrations), but that doesn't make it right.

Sven Nilsen 14 November 2019 20:57

Funny thing you would mention this today, Ivan. I had to replace an undisclosed firewall vendor's subsidiary node last week when the node went active for no apparent reason. The short story: "the unit was replaced". The sync process scares me a little more now. BTW, I blame the tree for getting in the way.

Ronald Bartels 20 November 2019 06:02

What is an icy road??? Over here we usually hit the elephant...

Francis 15 November 2019 15:33

When you drive too fast on an icy road and crash into a tree who do you blame? Seems pretty obvious: the tree, for being there in the first place.

Jarechiga 02 December 2019 07:58

I blame the tree removal company. They are obviously not doing their part.

Unknown 09 December 2019 13:16

Thanks for this article (and the others too) - balanced, realistic and well reasoned. My impression is that nowadays we often forget about fundamentals. Or maybe we don't even forget them, but fail to find enough courage and energy to check the claims made by others (vendors?).
