High Availability Service Clusters
Clusters of servers offering the same service in an active/standby or active/active setup are a common high-availability solution. They work very well as long as:
- The clustering software can reliably detect partitions – there should always be an odd number of cluster members (or extra witness nodes), so that at most one partition can ever hold a majority;
- The cluster is not stretched across unreliable infrastructure.
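To illustrate why an odd number of members (or a witness node) matters, here's a minimal majority-quorum sketch (the function name and interface are mine, not taken from any particular clustering product): a node should stay active only when its partition can reach a strict majority of all voting members.

```python
def has_quorum(reachable_members: int, total_members: int) -> bool:
    """Strict majority: a partition keeps quorum only if it can reach
    more than half of all voting members."""
    return reachable_members > total_members / 2

# Three-node cluster split 2+1: exactly one partition keeps quorum.
assert has_quorum(2, 3)
assert not has_quorum(1, 3)

# Two-node cluster split 1+1: neither side has a majority. A naive
# "stay active if the peer is unreachable" rule then produces split brain,
# which is why two-node clusters need an external witness.
assert not has_quorum(1, 2)
```

With an even number of members and no witness, a clean 50/50 partition leaves no side with a majority; the witness node's vote is what breaks the tie.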
You should also understand the limitations of these solutions:
- Node or service failure in an active/standby setup might cause significant downtime while the service is restarted on another cluster member.
- No clustering solution will protect you against operator mistakes.
Once you grasp these fundamentals, it’s perfectly possible to design and deploy well-functioning clusters, including network services clusters:
- BGP as High Availability Protocol
- Using BGP for Firewall High Availability: Design and Software Upgrades
- Tuning BGP Convergence in High-Availability Firewall Cluster Design
Not surprisingly, solutions created by networking vendors (including multi-chassis link aggregation clusters) ignore all of the above. This is what I had to say about the sad state of affairs:
- Stretched Clusters: Almost as Good as Heptagonal Wheels
- Long-distance IRF Fabric: Works Best in PowerPoint
- Controller Cluster Is a Single Failure Domain
- VMware VSAN Can Stretch – Should It?
- Reliability of Clustered Solutions: Another Data Point
- Never Take Two Chronometers to Sea
SDN controllers are no exception:
- Is Controller-Based Networking More Reliable than Traditional Networking?
- To Centralize or not to Centralize, That’s the Question
- Impact of Controller Failures in Software-Defined Networks
- State Consistency in Distributed SDN Controller Clusters
One of the worst examples of service clusters is the firewall cluster: firewall clusters are almost always implemented with two nodes and no witness, and are often stretched across multiple data centers – violating both of the requirements listed above.