Never Take Two Chronometers to Sea
One of the quotes I found in The Mythical Man-Month came from the pre-GPS days: “never go to sea with two chronometers, take one or three”, and it’s amazing that the networking industry (and a few others) never got the message.
Wait, What?
If you’re not a naval history buff, you probably have no idea what I’m talking about, so here’s the TL;DR version:
What is a chronometer? The mechanical version of a stratum 0 NTP time source ;) – a very precise clock.
Why did they need it? It’s relatively easy to measure latitude while on open seas. It’s really hard to measure longitude, and the marine chronometers were the best (although expensive) solution.
Why should you take three? Two things can go wrong with a chronometer: it can fail or it can be imprecise. If you have two chronometers, it’s impossible to figure out which one is imprecise and should be disregarded. You might decide to use one as primary and the other one as backup based on whatever criteria, but then you’re acting as the third party (witness) in this protocol.
If you have three, you take the average time of the two that are closer together.
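The three-chronometer rule is easy to sketch in Python. This is only an illustration: the function name `best_estimate` and the sample readings are made up, and real navigators obviously had more to worry about than a single outlier.

```python
from itertools import combinations

def best_estimate(readings):
    """Average the two readings that agree most closely and ignore the rest."""
    if len(readings) < 3:
        # With two readings there is no way to tell which one drifted.
        raise ValueError("need at least three readings to spot the bad clock")
    a, b = min(combinations(readings, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2

# Hypothetical readings: seconds past some reference instant.
print(best_estimate([100.0, 101.0, 130.0]))  # 100.5
```

With only two readings the function refuses to answer, which is exactly the point of the quote.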
Long story short: it’s impossible to get a reliable high-availability solution with two components (or any even number of components, because a clean split leaves neither half with a majority).
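Here is a minimal sketch of why that is, assuming the cluster uses simple majority voting (one vote per node, strict majority wins). The helper names `quorum` and `surviving_side` are mine, not taken from any product.

```python
def quorum(n):
    """Smallest number of votes that is a strict majority of n nodes."""
    return n // 2 + 1

def surviving_side(n, left):
    """After a split into `left` and `n - left` nodes, which side keeps quorum?"""
    right = n - left
    if left >= quorum(n):
        return "left"
    if right >= quorum(n):
        return "right"
    return None  # nobody has a majority: split brain or total outage

print(surviving_side(2, 1))  # None
print(surviving_side(3, 1))  # right
print(surviving_side(4, 2))  # None

# Node failures each cluster size can survive while still holding quorum.
print({n: n - quorum(n) for n in (2, 3, 4, 5)})  # {2: 0, 3: 1, 4: 1, 5: 2}
```

Under this assumption a fourth node buys no extra failure tolerance over three, and it adds the possibility of a 2/2 split.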
Why Is This Relevant?
Have you ever deployed redundant firewalls or load balancers? How many nodes are in a typical cluster? Got my point?
How about data center switches implementing MLAG? Or stackable switches like HP IRF or Juniper Virtual Chassis that support at most 4 or 10 nodes (depending on the model)?
There’s a good reason server clustering solutions with two nodes use a disk as a witness. The networking industry obviously never got the memo, the obvious exceptions being the VMware NSX controller (because it was designed by people who actually understood voting protocols) and the Cisco ACI controller.
Meh, You’re Just Spreading FUD
Sure. I’ve seen enough real-life failures to believe in simpler solutions, but of course you shouldn’t trust anything you read in a blog post. For a long list of split-brain failures from production environments, read this ACM Queue article. Enjoy!
Do you disagree with the only statement I made? That the error of a measurement would be halved?
As for a two-node cluster, you are golden if the node can detect its own failure and remove itself from the cluster, or if the second node can detect that the first has failed and take over the cluster by itself.
I understand Marko's comment, but for practical purposes I would still rather have a 2-node cluster and prepare to respond to the problems that might eventually emerge from it than have a single node. I wouldn't put "hot spare" as a solution to single node; that is a 2-node cluster in primary/standby mode and the same caveats apply (detect errors for switchover, could incorrectly become master with split brain, etc.).
Regarding the comments about stackables: I'm trying to understand your thinking here, as I don't see this as really the same thing. In the case of the chronometers, this is a multi-master model where both are processing data (time) at the same time and are capable of spitting out different outputs. That is more like a Cisco Nexus vPC or Arista MLAG scenario. In the case of Cisco VSS, Juniper Virtual Chassis, or HPE IRF, there's one master in the group that has full control and all the other units are simply subscribed (more or less) to the master's view of the world. The failover will happen regardless of whether there are 2 or more units, because the next box in priority is going to become the master and life goes on.
Not to say there aren't other issues with borg scenarios, as I know you're well aware, but I just don't know if this particular comparison is fair.
Thoughts?
@netmanchris
While an even number of cluster members is always a challenge, it might not be as bad if you have 4 or 6. You might be able to fake it by giving the master another half vote ;) Have to think a bit more about it... or someone could send me a link to the result (which would be highly appreciated).
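A rough sketch of the half-vote idea, purely as an assumption about how such a scheme could work (the node names, weights, and `winner` helper are all hypothetical and not any vendor's implementation):

```python
def winner(weights, partitions):
    """Return the partition holding a strict majority of all vote weight, or None.

    Failed nodes appear in no partition, but their weight still counts toward
    the total, so a dead master cannot be outvoted by pretending it never existed.
    """
    total = sum(weights.values())
    for part in partitions:
        if sum(weights[node] for node in part) > total / 2:
            return sorted(part)
    return None

# Four-node stack where the master (sw1) carries an extra half vote.
weights = {"sw1": 1.5, "sw2": 1, "sw3": 1, "sw4": 1}

# Clean 2/2 split: the half containing the master wins.
print(winner(weights, [{"sw1", "sw2"}, {"sw3", "sw4"}]))  # ['sw1', 'sw2']

# Master dead and the survivors split 2/1: nobody reaches 2.25 of 4.5 votes.
print(winner(weights, [{"sw2", "sw3"}, {"sw4"}]))  # None
```

In this toy model the extra half vote breaks the tie on an even split, but it does not help once the master itself is gone and the rest of the cluster partitions.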
Of course every vendor's MLAG implementation is a little different & I wouldn't be surprised if at least one of them messed up an important detail of the split brain scenario in a subtle but unfortunate way.
The only sane way to handle MLAG cluster (or switch stack) splits is to shut down half of it, but then of course you don't know which part to shut down. QED.
Brooks Jr., Frederick P. The Mythical Man-Month, Anniversary Edition: Essays On Software Engineering (Kindle Locations 754-758). Pearson Education. Kindle Edition.
Especially chapter 5, The Second-System Effect; all network architects and engineers should consider a read.
The essay on Interactive Discipline for the Architect is still very applicable in today's fast-paced design-to-implementation environment, and when dealing with a vendor (or vendors) acting as the architect or implementer.
The essay on Self-Discipline: The Second-System Effect
Especially about the concept of Stretch.
“Consider as a stronger case the architecture, implementation, and even the realization of the Stretch computer, an outlet for the pent-up inventive desires of many people, and a second system for most of them. As Strachey says in a review: I get the impression that Stretch is in some way the end of one line of development. Like some early computer programs it is immensely ingenious, immensely complicated, and extremely effective, but somehow at the same time crude, wasteful, and inelegant, and one feels that there must be a better way of doing things.”
--------------
“How does the architect avoid the second-system effect? Well, obviously he can't skip his second system. But he can be conscious of the peculiar hazards of that system, and exert extra self-discipline to avoid functional ornamentation and to avoid extrapolation of functions that are obviated by changes in assumptions and purposes.”
Compare the concept of Stretch from The Mythical Man-Month above with the definition of stretch in network architectures from White, Russ; Donohue, Denise. The Art of Network Architecture: Business-Driven Design (Networking Technology) (p. 81). Pearson Education. Kindle Edition.
“Modularization and Optimization If modularization brings so many benefits to network architecture, why shouldn’t every network be modularized at every point possible? Isn’t more aggregation always better than less aggregation? Network design is, as all things, a matter of choosing trade-offs— there is no such thing as a free lunch!
-----------------------
One of the trade-offs we deal with all the time is state versus stretch. Stretch, quite simply, is the difference between the optimum path through the network (for any pair of hosts) and the actual path through the network. For instance, if the shortest actual path available is 2 hops, but traffic is flowing along a 3 hop path, the stretch is 1. Why should we ever have stretch in a network? It seems like you’d just never, ever, want stretch, because stretch always represents suboptimal use of available resources. But you always end up with stretch, because one of the other fundamental concepts of network design is the use of information hiding to break up failure domains. Hierarchical network design, in fact, is the intentional use of aggregation to reduce the state information— the routing table size, in most cases— in the control plane, so that changes in one area of the network don’t cause changes in the routing table halfway around the world. How does this relate to stretch? Anytime you hide state you increase stretch. This might not be obvious in all networks— specifically, anytime 100% of your traffic flows north/ south, decreasing state will not impact stretch.
--------------------
But if you have a combination of north/ south and east/ west traffic, then aggregation— reducing state— will always cause traffic to take a suboptimal path through the network— thus increasing stretch. Spanning tree is a perfect example of running to one extreme of the state/ stretch trade-off. Spanning tree reduces the state by forcing all traffic along a single tree in the network and blocking links that don’t belong to that tree. Control plane state is absolutely minimized at the cost of increasing the stretch through the network to the maximum possible— to the point that we often design network topologies around the elimination of links not used on the single tree.”
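The stretch definition quoted above can be worked through on a toy topology. The sketch below is an assumption-laden illustration, not anything from the book: a four-switch ring (A-B-C-D-A) where a spanning tree blocks the D-A link, with hop counts computed by a plain breadth-first search.

```python
from collections import deque

def hops(links, src, dst):
    """Hop count of the shortest path between src and dst over undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

def stretch(all_links, used_links, src, dst):
    """Actual path length minus optimum path length, in hops."""
    return hops(used_links, src, dst) - hops(all_links, src, dst)

ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
tree = ring[:-1]  # hypothetical spanning tree: the D-A link is blocked

print(stretch(ring, tree, "D", "A"))  # 2 (three hops instead of one)
print(stretch(ring, tree, "A", "C"))  # 0 (the tree path is already optimal)
```

Only the pair sitting on either side of the blocked link pays the penalty here; in larger topologies more pairs end up on suboptimal paths.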
Amazon S3 Availability Event: July 20, 2008
status.aws.amazon.com/s3-20080720.html
Some theoretical/practical background papers by Leslie Lamport:
The Byzantine Generals Problem
research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf
Time, Clocks, and the Ordering of Events in a Distributed System
research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf