Do I Need Redundant Firewalls?

Wednesday, October 12, 2016 10:44 +0200

One of my readers sent me this question:

I often see designs involving several more than 2 DCs spread over different locations. I was actually wondering if that makes sense to bring high availability inside the DC while there's redundancy in place between the DCs. For example, is there a good reason to put a cluster of firewalls in a DC, when it is possible to quickly fail over to another available DC, as a redundant cluster increases costs, licenses and complexity.

Rule #1 of good engineering: Know Your Problem ;) In this particular case:

What’s acceptable loss of service?
What’s your RTO (Recovery Time Objective)?
What’s your acceptable unit of loss?
What’s your fallback/recovery approach?

Decades ago when we used carrier pigeons to transport data between terminals and mainframes (not really, but 2400 bps modems weren’t much faster), losing a terminal session involved loss of data, cussing, yelling, and plenty of wasted time.

Today, losing an HTTP(S) session results in minor annoyance. Also, your mobile users will lose signal orders of magnitude more often than you’ll lose a firewall (at least I hope so), so why bother with a state-sharing cluster? Maybe a fast failover to a secondary unit is good enough.

I haven’t seen any hard data, but intuition suggests that, apart from hardware failures, a standalone firewall might be more stable than a state-sharing firewall cluster. If you have a pointer to something more tangible, please write a comment!

However, what does matter to the spoiled users of today is the recovery time. If they want to buy something from your web site and cannot do it NOW, they’ll walk away and complain loudly on Twitter and Facebook. From this perspective, it makes sense to have an approach that would bring your application back to life ASAP (for whatever value of ASAP). Maybe it’s good enough to have a cluster that shares the public IP address but no session state, resulting in session loss and recovery within a few seconds.

If you think you can survive a longer outage every now and then, maybe it’s good enough to run a firewall as a VM and have it restarted after the crash.
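
The trade-off between these options can be sketched as a back-of-the-envelope check against your RTO. The Python snippet below is only a minimal sketch: the option labels paraphrase the approaches mentioned above, while the recovery times and the failure rate are made-up placeholders (not measurements), and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    recovery_seconds: float  # ASSUMED time to restore service after a firewall failure
    keeps_sessions: bool     # True if established sessions survive the failover


# Placeholder values for illustration only; measure your own environment.
OPTIONS = [
    RecoveryOption("state-sharing cluster", 1.0, True),
    RecoveryOption("cluster sharing only the public IP address", 5.0, False),
    RecoveryOption("firewall VM restarted after a crash", 180.0, False),
]


def acceptable_options(options, rto_seconds, sessions_must_survive):
    """Return the options that recover within the RTO and, if required, keep sessions."""
    return [
        o for o in options
        if o.recovery_seconds <= rto_seconds
        and (o.keeps_sessions or not sessions_must_survive)
    ]


def expected_downtime_per_year(option, failures_per_year):
    """Rough yearly downtime in seconds: recovery time multiplied by failure count."""
    return option.recovery_seconds * failures_per_year


if __name__ == "__main__":
    # Example: a 10-second RTO where losing sessions is acceptable.
    for option in acceptable_options(OPTIONS, rto_seconds=10, sessions_must_survive=False):
        print(option.name, expected_downtime_per_year(option, failures_per_year=2))
```

With real numbers for your environment, the simplest option that still meets the RTO is usually the one worth a closer look.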

Finally, does it make sense to declare a data center offline just because its firewall crashed? No, not even to Amazon or Google. You could bypass the failed firewall with routing tricks on the DCI link, but it would be way cheaper and less complex to have some firewall redundancy in place within the data center.

Last but definitely not least, there’s the divide-and-conquer approach. Don’t put all your eggs in one basket (or protect all your applications with a single firewall instance).

Want to Know More?

You MUST read Scalability Rules: Principles for Scaling Web Sites
I haven’t read the whole Site Reliability Engineering: How Google Runs Production Systems book yet, but even the first chapters are good.
I’ve discussed networking aspects of multi-data-center designs in the Designing Active-Active and Disaster Recovery Data Centers webinar.
The same topic is also one of the sections of the Building Next-Generation Data Center online course.

2 comments:

Anonymous 12 October 2016 14:57

Of course, budget and RTO are kings here.

I think that if "designs involving several more than 2 DCs spread over different locations" involves something like well-timed "anycast", for example, it can be a pretty good solution for mostly static web server content (like free WordPress blogs!). On the other hand, for a web app or mobile app server that people rely on for their work, it would be disastrous: the sessions have to be retained as long as possible, which means:

-highly redundant hardware (multiple PSUs, high quality DC, high quality components)
-highly redundant hardware architecture (multiple clustered FWs with intelligent clustering techniques/protocols)
-highly redundant uplinks
-etc

Only THEN can you start thinking about a disaster recovery plan, which is meant, as you know, for real disasters, not for "FW uplink port failure" scenarios.

So as always, it depends. But people mix things up if they think a disaster recovery plan can be used in case of a "SPOF" failure.

Unknown 15 October 2016 11:19

State sharing imho is not worth the complexity it adds to the system. This complexity is what causes the software to become way more sensitive and increases the risk with a whole new can of problems. Catching non-cluster software faults with a twin brother that runs the same software doesn't make sense either.
Running a secondary unit from maybe a different SW train or another vendor AND making sure that the config definition is maintained on an external system that configures a change on the secondary only if it's been running successfully on the first for X time is way better imho.

Also: who needs a stinking firewall:)
