Reliability of Clustered Solutions: Another Data Point

A while ago I wrote:

I haven’t seen any hard data, but intuition suggests that, apart from hardware failures, a standalone firewall might be more stable than a state-sharing firewall cluster.

Guillaume Sachot (working for a web hosting company) sent me his first-hand experience on this topic:

I don't have statistics to prove it, but as a hosting provider I can confirm that I've seen high-availability appliances fail more often than non-clustered ones. And it's not limited to firewalls that crash together due to a bug in session sharing; I've noticed it with almost anything that does HA: DRBD instances, Pacemaker, shared filesystems...

It fails, fails and fails. There are bugs, but also configuration issues and bad synchronization, and network flapping that breaks replication and puts all instances in standby/secondary mode...

The same goes for VMware clusters, which cannot handle partial storage loss or performance issues without falling apart completely, disconnecting from vCenter because their daemon blocks indefinitely on refreshing storage status.

In the end, a non-redundant VM on a non-clustered host seems to be the service that lasts the longest.

For most people who do not require very low recovery times, a good compromise seems to be non-redundant VM(s) running on a cluster to cover host failures (hardware or software), and maybe "passive" VM(s) on another cluster that are activated manually.

And if customers need very high availability, they should/must design their app to take that into account and host it in multiple locations with independent providers.
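To make that last point a bit more concrete, here is a minimal sketch of what application-level failover across independent providers could look like. The provider URLs are made-up placeholders, not anything from Guillaume's environment:

```python
import urllib.error
import urllib.request

# Hypothetical base URLs at two independent hosting providers.
PROVIDERS = [
    "https://app.provider-a.example",
    "https://app.provider-b.example",
]

def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each provider in turn; the short timeout keeps a dead
    provider from stalling the client before the next one is tried."""
    last_error = None
    for base in PROVIDERS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:  # URLError and timeouts both derive from OSError
            last_error = err
    raise RuntimeError(f"all providers failed, last error: {last_error}")

# Example: fetch_with_failover("/api/health")
```

The point is not these few lines of code; it is that the failover logic lives in the application, where it can be tested, instead of inside a state-sharing black box.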

Not that many people would want to listen to anything along these lines. Believing in Santa Claus and magical High Availability solutions is more comforting.


6 comments:

  1. Proper deployment design and operations can make the difference. The trick with clusters is preventive maintenance: you have to exercise rebooting cluster members regularly. This can reduce your risk significantly.

    The measure of success is not the uptime of a server! :-) It is the reliability of the actual services.
    As an auditor, if I see a clustered server with a long uptime, I immediately suspect that the organization has bad operational practices. :-)

    Of course, if you have basic bugs in the cluster software, nothing can help...
    Replies
    1. Totally agree about reboot tests. What Netflix does by injecting random failures is even better (a minimal sketch of such a drill follows this thread). But this can be a real challenge with databases and any software that needs manual resync/downtime for recovery.

      The issue is that most people don't get that it costs something and aren't ready to pay for regular tests.
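
      A minimal, hypothetical sketch of such a drill in Python (the host names and the SSH-based reboot are placeholder assumptions; a real drill would be gated on maintenance windows and monitoring):

      ```python
      import random
      import subprocess

      # Placeholder inventory -- replace with your actual cluster members.
      CLUSTER_MEMBERS = ["node-1.example.net", "node-2.example.net"]

      def reboot_drill() -> str:
          """Reboot one randomly chosen cluster member, so failover gets
          exercised on your schedule instead of during a real outage."""
          victim = random.choice(CLUSTER_MEMBERS)
          # Assumes passwordless SSH and sudo; the connection usually
          # drops mid-reboot, so the exit status is not meaningful here.
          subprocess.run(["ssh", victim, "sudo reboot"])
          return victim

      if __name__ == "__main__":
          print(f"rebooted {reboot_drill()}; now check whether the service noticed")
      ```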
  2. As far as firewalls go, my experience is that a cluster is no less stable once I account for the customers' environments.

    The networks behind a single firewall are mostly less complex. And a good failover setup can in fact keep a firewall cluster alive where a single firewall would have an outage due to the reboot required to fix issues.

    My main experience is with Check Point firewalls, but there I found that a good firewall cluster will save the day.

    As always, poor management is something a cluster won't protect you against.
  3. In my opinion there is some truth in the foregoing statement that clustered systems will fail more often than standalone ones. This is based on the fact that complexity is one factor that raises the probability of failure.

    But we have to differentiate a bit, as always when our statements are based more on feelings than on measurement.

    In the past we had to deal with a series of buggy software releases, in switch and router platforms as well as in firewall systems. And especially on firewall systems, the ability to do an upgrade at runtime was a good thing, because we did not have to schedule the work for off-hours.
  4. Not sure which version of vSphere is being referred to, but with APD and PDL handling those problems seem to be a thing of the past. With APD handling we fast-fail I/Os, so you won't hit that issue any longer; I think that was introduced as early as vSphere 5.1.
    Replies
    1. It's still true in the latest version (6). Sure, the ESXi host itself no longer crashes on storage issues, as of release 5.

      But the web service that lets vCenter talk to the hypervisor will freeze until the failing paths are removed (that is, until there is no longer a partial failure, with input/output errors for instance).

      With the web service unresponsive, you lose the ability to migrate your VMs. And if you forgot to enable SSH, your host won't be reachable until you solve the storage issue. If you stop a VM, or a VM crashes, you won't be able to restart it even if its data is not on a datastore affected by the issue.

      Even the tech guys from VMware admit there are still a lot of improvements to be made in this area.
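
      As an illustration of why this failure mode is nasty: a host in this state still answers ping, so a meaningful health check has to probe the management service itself. A minimal sketch, assuming a placeholder URL (ESXi typically serves a self-signed certificate, which is why the SSL context is unverified):

      ```python
      import ssl
      import urllib.error
      import urllib.request

      # Placeholder management URL -- point this at your actual host.
      MGMT_URL = "https://esxi-host.example.net/"
      CTX = ssl._create_unverified_context()  # self-signed cert on ESXi

      def mgmt_responsive(url: str = MGMT_URL, timeout: float = 5.0) -> bool:
          """True only if the management web service answers within the
          timeout; a host that answers ping but hangs on storage refresh
          fails this check and can be flagged before you need to migrate."""
          try:
              with urllib.request.urlopen(url, timeout=timeout, context=CTX):
                  return True
          except urllib.error.HTTPError:
              return True   # an HTTP error code still proves the service answered
          except OSError:   # connection errors, SSL failures, timeouts
              return False
      ```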