Sometimes You Have to Decide How You Want to Fail
Another week, another ExpertExpress session, focusing (as is often the case) on two data centers with stretched VLANs spanning both of them. However, this one was particularly irksome, as the customer ran a firewall cluster stretched across two locations.
I gave the customer engineers my usual recommendations:
- Try to get rid of stretched VLANs, or at least minimize the number of VLANs that span both data centers;
- Try to identify the real business needs and solve them with the right tool(s) for the job (example: VMware SRM instead of stretched HA cluster);
- Split applications (and subnets) across both data centers so that every application (and subnet) is active only in one data center, having the second one as a backup;
- Consider the storage failover: if the storage failover is not automatic, it doesn’t make sense to even consider a stretched HA cluster (the proof is left as an exercise for the reader ;)
- Sometimes good enough is good enough – you can create VLANs and subnets in the alternate location with automation scripts before starting the data center recovery.
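As a rough illustration of the last point, pre-provisioning VLANs and subnets can be as small as a script that renders configuration from an inventory. This is a minimal sketch, not a production tool: the `VLANS` inventory, the `render_vlan_config` helper, the generic IOS-style syntax, and the choice of the first usable address as the gateway are all assumptions made for the example.

```python
import ipaddress

# Hypothetical inventory: VLAN ID -> (name, subnet). Values are made up.
VLANS = {
    110: ("app-web", "10.1.10.0/24"),
    120: ("app-db", "10.1.20.0/24"),
}


def render_vlan_config(vlans):
    """Return generic IOS-style config lines for each VLAN and its gateway."""
    lines = []
    for vlan_id, (name, subnet) in sorted(vlans.items()):
        net = ipaddress.ip_network(subnet)
        # Assumption: the first usable host address is the default gateway.
        gateway = next(iter(net.hosts()))
        lines += [
            f"vlan {vlan_id}",
            f" name {name}",
            f"interface Vlan{vlan_id}",
            f" ip address {gateway} {net.netmask}",
            " no shutdown",
        ]
    return lines


if __name__ == "__main__":
    print("\n".join(render_vlan_config(VLANS)))
```

The rendered lines would still have to be pushed to the devices in the alternate location by whatever tooling is already in place; the point is only that the VLAN and subnet definitions can live in one inventory and be replayed before recovery starts.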
They understood all the arguments and agreed with most of them, but it was obvious every single one of them was going to be an uphill battle for them.
In the end, we were left with the question of the stretched firewall cluster. As we had already agreed to try to keep (active parts of the) subnets in a single data center, it would be easy to split the cluster into two independent firewalls, each one of them serving a subset of IP subnets, but then they’d lose the redundancy features.
The obvious solution was to deploy two independent firewall clusters, but they didn’t have the budget to buy additional boxes, so the only advice I could give them was: you have to decide how badly you want to fail (and when):
- You could keep the stretched firewall cluster, and risk shutting down one data center (if one firewall in the cluster shuts down after losing connectivity with the other one) or having an interesting split-brain failure when the DCI link fails;
- You could run two non-redundant firewalls and risk losing external access to a data center when one of them crashes.
Summary: when faced with a lose-lose decision, the only thing you can do is (A) evaluate all possible failure modes, (B) identify all the risks associated with individual options, (C) decide which one(s) you want to accept, (D) document the risks and the decision, and have your boss sign it off.
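Steps (A) through (D) can be captured in something as simple as a small risk register. The sketch below is only an illustration: the `Option` class, the `decision_record` function, and the approver name are hypothetical, and the listed risks are just the ones mentioned for the two options above.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    name: str
    # Each risk is a (trigger, consequence) pair.
    risks: list = field(default_factory=list)


OPTIONS = [
    Option("Stretched firewall cluster", [
        ("the DCI link fails", "split-brain failure"),
        ("a cluster member loses its peer and shuts down",
         "one data center goes down"),
    ]),
    Option("Two non-redundant firewalls", [
        ("one firewall crashes",
         "external access to that data center is lost"),
    ]),
]


def decision_record(options, accepted, approver):
    """Render a plain-text record of all options, their risks, and the choice."""
    if accepted not in [o.name for o in options]:
        raise ValueError(f"unknown option: {accepted}")
    lines = []
    for o in options:
        marker = "ACCEPTED" if o.name == accepted else "rejected"
        lines.append(f"{o.name} [{marker}]")
        for trigger, consequence in o.risks:
            lines.append(f"  - if {trigger}: {consequence}")
    lines.append(f"Signed off by: {approver}")
    return "\n".join(lines)
```

Calling `decision_record(OPTIONS, "Two non-redundant firewalls", "IT manager")` yields a short text that lists both options with their risks and names whoever accepted them, which is the artifact you want on file when the failure eventually happens.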
But FW clusters are a solution to the wrong problem. It is time for the firewall to die: 100k lines of configuration in an enterprise worsen security rather than improve it.
Palo Alto comes with a solution which doesn't need L2 between the data interfaces. It is described in their Design Guide, paragraph 2.3 ( https://live.paloaltonetworks.com/t5/Integration-Articles/Designing-Networks-with-Palo-Alto-Networks-Firewalls/ta-p/60868 ).
The design guide describes an active/active scenario, which is not what I want. But with proper routing you can run this in an active/passive way (based on traffic flow).
Maybe the cluster links have to be L2. But this could be solved with non-switched point-to-point links like L2TP/MPLS-XConnect.
Of course, this doesn't solve issues with DCI links that are too slow or have too much delay. Furthermore, only the data path is somewhat separate; the management isn't (wrong configuration = whole cluster is down).
It seems Palo Alto is the only vendor supporting this (= describing it in the documentation). Somewhat surprising. Am I overlooking something?
In the end, it doesn't matter how both sides are connected and over which protocol. What really matters is how you make each DC aware of the state of the other.
Automatic failover is so unpredictable to me that a human-based switch will be more reliable in 99% of cases.
You can invent a very complicated system with a lot of switches and tests to run before triggering the failover. But you might miss a step (humans do miss steps, quite often actually) or miss a bug, and then an unnecessary failover or a split brain happens and you have created the disaster by trying to avoid it. (I personally think that if my VMs are suddenly running in two separate DCs, with the network flapping or any other happy event in between, it will be more disastrous than a complete main-site failure; remember that you have lost the state of the data, which is actually one of the most important points.)
Or, simply, the system that is doing the failover fails too, so the failover is not going to work either.
Or you implement a human switch and deal with the fact that you cannot be faster than 1 hour. Customers know that because they signed the SLA. Accordingly, you document a procedure to be followed in case of DC failure, step by step:
- Make sure your DC has really failed
- Switch the network (simple scripts or just portions of configuration that have to be activated, like OSPF or BGP)
- Go to the DR tool, switch the DC (hosts and storage), bring up a test VM and test connectivity, if OK bring up all VMs, etc.
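To make the "human switch" idea concrete, here is a minimal sketch of such a procedure as a checklist that refuses to move on until a person confirms each step. The step wording follows the list above; the `STEPS` list and the `ask_operator` and `run_runbook` names are made up for illustration, and nothing here touches a real network or DR tool.

```python
# Steps taken from the procedure above; split so each can be confirmed.
STEPS = [
    "Confirm the primary data center has really failed",
    "Switch the network (activate prepared OSPF/BGP configuration)",
    "Switch hosts and storage in the DR tool",
    "Bring up a test VM and test connectivity",
    "Bring up all remaining VMs",
]


def ask_operator(step):
    """Ask the human on call to confirm that a step was completed."""
    return input(f"{step} - done? [y/N] ").strip().lower() == "y"


def run_runbook(steps, confirm=ask_operator):
    """Walk the steps in order and stop at the first unconfirmed one.

    Returns the list of steps that were confirmed as completed.
    """
    completed = []
    for step in steps:
        if not confirm(step):
            break
        completed.append(step)
    return completed
```

Running `run_runbook(STEPS)` interactively prompts for each step in order. Passing a different `confirm` callable makes it possible to rehearse the procedure without a terminal, which is about as much automation as this approach calls for.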
That being said, I just want to point out that before you even think about designing a backup site with DR, make sure you have a well-designed main site; that would prevent half of the DRs you might otherwise have to invoke later.
DR is really for disasters, meaning your DC fails entirely. It costs less to opt for a good main DC than to have 2 DR sites with all the hardware and license costs they generate.
So now, I don't even understand why you would ever stretch or span something.
In my life, the people who decide listen more to VMware people than to me (it seems to be the same in Ivan's ExpertExpress sessions). So, L2 DCI is just a requirement.
Now I have two possibilities:
- Look for another employer that doesn't use L2-DCI
- Keep trying to convince people that "logically connected DCs will fail" and try to separate my part of the DC (network/firewall) as well as possible within the L2-DCI requirement and the available budget.
In Ivan's blog post above, it seems they already run a stretched VLAN design and I assume there is an L2-DCI requirement because of vMotion. So the question is: can we improve the mentioned firewall design and still support the required L2-DCI?
His advice is "... or at least minimize the number of VLANs that span both data centers...". But to get any benefit from local VLANs, you have to make sure the firewall cluster doesn't rely on stretched VLANs. Together with the mentioned "there isn't enough money to buy two clusters", this means you have to look for a stretched firewall cluster which doesn't need stretched VLANs.
With the design I mentioned above, you don't (from the firewall's point of view) care about a split brain or a bridging loop. They just don't hurt the firewall. If you have a mix of local and stretched VLANs, the local VLANs will always work if the remote DC fails (even if just a part of one DC fails), as long as your routing for local VLANs still works.
And maybe, in x years, if you can prove your local VLANs are far more stable than the stretched ones, the people who decide will listen a little bit more to you.
But it is never as simple as it seems:
With the design mentioned above, you have to run your own MPLS network with L3-VPN (or any other L3 virtualisation). If you don't already have a running network like this, this design is likely to be too complex (adding another layer of complexity just to remove a few stretched VLANs isn't a good tradeoff). But if you already have L3 virtualisation in your network, this design could be a first step to further separate your DCs.
Of course, it also doesn't make sense if all of your VLANs have to be stretched.
So, to sum up... The mentioned design isn't the perfect solution but it could be better than a stretched VLAN firewall cluster.
But since my knowledge about this is just theory and there is just one vendor mentioning this, I am still looking for the big issue with this design... compared to the stretched VLAN design all vendors mention.