Stretched Firewalls across Layer-3 DCI? Will the Madness Ever Stop?

I got this question from one of my readers (and based on these comments he’s not the only one facing this challenge):

I was wondering if you can do a blog post on Cisco's new ASA 5585-X clustering. My company recently purchased a few of these with the intent to run their cross data center active/active firewalls but found out we cannot do this without OTV or a layer 2 DCI.

A while ago I expressed my opinion about these ideas, but it seems some people still don’t get it. However, a picture is worth a thousand words, so maybe this will work:

On a more serious note…

Whenever someone proposes a stupidity like “let’s turn our L3 DCI into L2 DCI so we can run a stretched firewall cluster on top of it”, politely ask “and what happens when (not if) the DCI link fails?” because asking “what were you smoking” might sound offensive.

Fortunately for everyone who has to work with real-life networks, Cisco engineers (even those working in marketing) tend to be pretty honest when it comes to how things really work, so it was really easy to answer that question by reading the documentation, design guides, and the ASA Clustering Deep Dive Cisco Live session:

  • A failure in communication between different members of the cluster will result in ejection of that firewall from the cluster;
  • CCL (Cluster Communication Link) loss forces the member out of the cluster;
  • CCL link loss causes unit to shut down all data interfaces and disable clustering. Clustering must be re-enabled manually after such an event.

For those who still don’t get it: if you lose the communication between cluster members (which would happen after DCI link failure), the firewalls in one data center shut down and cut that data center off the net.

Do keep in mind that if you have two data centers with L3 DCI between them, they could work independently after DCI link failure (apart from the potential need to synchronize data between them). Building a firewall cluster on top of L3 DCI is thus a huge step back in terms of failure resiliency.
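For readers who prefer code to pictures, here is a toy model of the two designs. It is only a sketch of the behavior quoted above, not of how the ASA actually elects survivors: the `Firewall` class, the `dci_failure` function, and the idea that you can name the surviving site up front are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    name: str
    site: str
    data_interfaces_up: bool = True
    clustering_enabled: bool = True


def dci_failure(clusters, surviving_site):
    """Apply a DCI link failure to a list of clusters (each a list of Firewalls).

    Assumption: in a stretched cluster the CCL runs over the DCI, and the
    members outside `surviving_site` are the ones that get ejected.
    """
    for cluster in clusters:
        sites = {fw.site for fw in cluster}
        if len(sites) < 2:
            continue  # CCL stays inside one site, so DCI loss does not touch it
        for fw in cluster:
            if fw.site != surviving_site:
                fw.data_interfaces_up = False   # ejected member stops forwarding
                fw.clustering_enabled = False   # stays off until re-enabled manually


def reachable_sites(clusters):
    """Sites that still have at least one firewall forwarding traffic."""
    return sorted({fw.site for c in clusters for fw in c if fw.data_interfaces_up})


# One cluster stretched across both data centers
stretched = [[Firewall("fw1", "DC1"), Firewall("fw2", "DC2")]]
dci_failure(stretched, surviving_site="DC1")
print(reachable_sites(stretched))   # ['DC1']

# Two independent clusters, one per data center
separate = [[Firewall("fw1", "DC1")], [Firewall("fw2", "DC2")]]
dci_failure(separate, surviving_site="DC1")
print(reachable_sites(separate))    # ['DC1', 'DC2']
```

Adding more members to the stretched cluster does not change the first result: every member on the wrong side of the failed link is still ejected.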

Finally, here’s my message to the vendor sales engineers promoting such stupidities:

21 comments:

  1. This post just made my day, I wish I had such images when I had to redesign/fix a dumb firewall ASA clustering design across L2-DCI, so I could show these images to them. Your sense of humor is just awesome Ivan!

    Keep it up. Cheers!
  2. I wish I could print up all of your anti-stretched-DC blog posts as a "propaganda" flyer and drop thousands of copies on the engineering campuses of all the major equipment vendors. Perhaps that would be enough of a clue bat to make them cease and desist the insanity.
  3. Ivan, you act like Don Quixote but I'm your Sancho Panza...
  4. Ivan, just curious - Does your opinion change if you have dark fiber with two different physical paths and diverse entrances at your data centers? Or is your opinion that it is just ALWAYS a bad idea?
    Replies
    1. If you're willing to pay for that why not add the extra two firewalls to remove the problem?
    2. What about two firewalls at each data center, thereby creating a four node cluster? Please understand, I get the logic, I just want to see if there is EVER a time where this could be an acceptable solution or if it is ALWAYS a bad idea.
    3. Talking about scope creep and lack of infinite budgets - let's get dedicated dark fiber (2x10GE would run about $15k/month). Mind you, it was suggested by a vendor: you've got no L2 DCI? No problem, just get dedicated dark fiber. Or even better, get OTV (throw in a couple of expensive routers and associated licenses) to help achieve this clustering.
      MattN: It's what we are aiming for, but we will leave it at 2 firewall cluster local to each DC minus the need to cluster across DCI.
    4. MattN, single cluster = single failure domain. Adding more firewalls to the cluster will not make it more resilient to DCI/CCL link failures. The only way forward is to build two clusters (one in each data center).
    5. OK - So, consensus appears to be that it is ALWAYS a bad idea, even under "ideal" conditions. This is because a SPOF would be created by the cluster.
    6. Do you even need two clusters? One firewall per DC is enough in most cases.
  5. Hehehhe, I like the article. It is humorous, but in the end true. Marketing people (no offense) will a lot of the time sell anything without checking if it makes sense (or is possible) from a technical point of view.
    Replies
    1. And sadly, a big part of customers won't check too ;)
  6. Ivan,
    I get where you're coming from. You're an old school, disciplined networking leader that architects networks based on rock-solid, time-tested designs. But it seems that the prevailing fashion in network design and availability goes against your traditional design principles: inter-site firewall clustering, inter-site vMotion, DCI, etc.

    This isn't the first time that readers have asked you about these technologies, and it won't be the last. Vendors will continue to market them despite their shortcomings, and customers will continue to eat them up. I'd like to think that vendors will also continue to work out the kinks and over time the technology will become rock solid and time-tested.

    I sincerely ask this out of curiosity and respect, though I will word it very directly: Are you too stuck on past, traditional designs and not being open to new ways of building IT? I get that IT is very cyclical, and these new trends may die in the future...or thrive, and the customers may either fail...or succeed.

    For the record, I see the same mindset reflected on the blog posts at netcraftsmen.net, another site comprised of old-school, disciplined professionals. Maybe there's a common theme here...but that won't stop me from asking :-)

    Thanks
    Replies
    1. Well, if you think people are "stuck in the past" because they actually look left & right before crossing the street and don't believe their "App" or the traffic light, then: Go ahead, prove yourself wrong, nobody will prevent you from hurting yourself.

      In other words: It's not about what you think, it's about how the technology was / is designed to work and how much risk you are willing to take. If you use a tool for a task it's not supposed to work for, it might work for some time, but it will fail at some point...
  7. In similar discussions I used to give the example of AWS and their idea of regions and zones. An example from a big player that is not a vendor helps to stop crazy ideas from IT/app teams.
  8. I've seen it done only once, and that was with a Telco that had 2 buildings 100 metres apart with a tunnel full of fibre between them.

    Basically, to do it you will need rock-solid, highly available, high capacity, low latency layer2 between the sites. Not just for the sync between the 2 firewalls, but also for pretty much all of the VLANs that the firewalls connect to, and for everything else around the firewalls as well - think perimeter routers, load balancers, etc. Unless you've got your own tunnel between the buildings, it's not worth the cost and pain of what happens if there's a break.

    Replies
    1. Worth noting that the CCL carries data in case of asymmetric traffic - not just cluster synchronization data.
  9. There are DCI designs which can protect from having one DC cut off from the Internet: https://drive.google.com/file/d/0B6XxNd5c3zV_SW9kTVF4SkdiYWM/view?usp=sharing
    You can also make your DCI link redundant.
    Finally, there is an updated version of the ASA Clustering Deep Dive session PDF: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=83709&backBtn=true
    Replies
    1. Regardless of how complex a design you make, it cannot survive a DCI link failure (unless you build a backup tunnel over the outside WAN).

      As for "making the link redundant" - it's just a question of how much money you're willing to throw at the problem to reduce the risk of failure, but you can never eliminate that risk.

      As I wrote several times (note that the first blog post is from 2011), in the end you have to decide how badly you want to fail.

      http://blog.ipspace.net/2011/04/distributed-firewalls-how-badly-do-you.html
      http://blog.ipspace.net/2015/10/sometimes-you-have-to-decide-how-badly.html
  10. Strange how inter-DC clustering failure is considered a certainty in this blog... I am running two 6 ASA node clusters (tier1/2) across 3 data centers. It has worked really well and has lightened the management burden for the network team to support 3 DCs. I do agree that your setup should favor success before attempting... i.e. heavy VM presence, redundant physical servers at each DC, reliable/resilient L2 DCI, backend L3 routing inside and outside the firewall, dynamic routing protocol... I did have to fight through a few bugs early on, but I fear not the challenge of new technology.

    Mark Baker
    www.hi-technetworks.net
    Replies
    1. Response here:

      http://blog.ipspace.net/2016/04/some-people-dont-get-it-it-will.html