Never Take Two Chronometers to Sea

One of the quotes I found in The Mythical Man-Month came from the pre-GPS days: “never go to sea with two chronometers; take one or three.” It’s amazing that the networking industry (and a few others) never got the message.

Wait, What?

If you’re not a naval history buff, you probably have no idea what I’m talking about, so here’s the TL;DR version:

What is a chronometer? The mechanical version of a Stratum 0 NTP server ;) – a very precise clock.

Why did they need it? It’s relatively easy to measure latitude while on open seas. It’s really hard to measure longitude, and the marine chronometers were the best (although expensive) solution.

Lunar distance navigation is a must-read for true geeks ;)

Why should you take three? Two things can go wrong with a chronometer: it can fail or it can be imprecise. If you have two chronometers, it’s impossible to figure out which one is imprecise and should be disregarded. You might decide to use one as primary and the other one as backup based on whatever criteria, but then you’re acting as the third party (witness) in this protocol.

If you have three, you take the average time of the two that are closest together.
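Here’s a minimal sketch of that rule (my illustration, not something from the original quote): with three readings you keep the pair that agrees most closely and average it, which automatically discards one wildly wrong clock. With only two readings you can see that they disagree, but not which one to trust.

```python
from itertools import combinations

def best_estimate(readings):
    """Three chronometer readings (e.g. seconds since noon): average the
    two that agree most closely, implicitly discarding a single bad clock."""
    a, b = min(combinations(readings, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2

print(best_estimate([43200, 43212, 43210]))  # clocks roughly agree -> 43211.0
print(best_estimate([43200, 43203, 50000]))  # one clock is way off -> 43201.5
```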

Long story short: it’s impossible to get a reliable high-availability solution with two components (or any even number of components).

Why Is This Relevant?

Have you ever deployed redundant firewalls or load balancers? How many nodes are in a typical cluster? Got my point?

How about data center switches implementing MLAG? Or stackable switches like HP IRF or Juniper Virtual Chassis that support at most 4 or 10 nodes (depending on the model)?

There’s a good reason two-node server clustering solutions use a disk as a witness. The networking industry obviously never got the memo, the obvious exceptions being the VMware NSX controller (because it was designed by people who actually understood voting protocols) and the Cisco ACI controller.
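To make the witness argument concrete, here’s a minimal quorum sketch (my own illustration, not modeled on any particular clustering product): a partition may keep running only if it holds a strict majority of the votes, which is exactly what two equal voters can never decide on their own.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition may keep the service running only with a strict majority."""
    return votes_held > total_votes // 2

# Two-node cluster without a witness: a split leaves one vote on each side.
print(has_quorum(1, 2))  # False -> neither half may continue (or you risk split brain)

# Two nodes plus a witness disk: the side that reaches the witness holds 2 of 3 votes.
print(has_quorum(2, 3))  # True  -> exactly one side keeps running
print(has_quorum(1, 3))  # False -> the isolated node shuts down
```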

Meh, You’re Just Spreading FUD

Sure. I’ve seen enough real-life failures to believe in simpler solutions, but of course you shouldn’t trust anything you read in a blog post. For a long list of split-brain failures from production environments, read this ACM Queue article. Enjoy!


15 comments:

  1. Even though I get your point, I would say that two is better than one: the mean value gets better with the number of components, and you would have half the error of a wrong chrono.
  2. I disagree. With one component, the failure modes and the mitigation (hot spare) are straightforward. With two, the problem space is greater, and split-brain issues are difficult to diagnose and mitigate (re-read Ivan's text).
    Replies
    1. I was not thinking about a hot spare but active-active. And IMHO, the cluster control problem and the measurement problem do not seem to be the same.
      Do you disagree with the only statement I made? That the error of a measurement would be halved?
  3. This is quite similar to the usual error detection and correction mechanisms we use, like CRC. With no parity bits you can't assert anything (single chrono). With 1 bit you can detect 1 error but not correct it (two chronos, the second is the "parity"). With 2 bits of parity you can detect and correct 1 error (three chronos where 2 agree and 1 doesn't) - see the sketch after this thread.

    As for a two-node cluster, you are golden if a node can detect its own failure and remove itself from the cluster, or if the second node can detect that the first one failed and take over the cluster by itself.

    I understand Marko's comment, but for practical purposes I would still rather have a 2-node cluster and prepare to respond to the problems that might eventually emerge from it than have a single node. I wouldn't count a "hot spare" as a single-node solution; that's a 2-node cluster in primary/standby and the same caveats apply (detecting failures for switchover, incorrectly becoming master during a split brain, etc.).
    Replies
    1. It's definitely better to have a spare than no spare, but the real problem is somewhere else: there's absolutely no reliable way of doing automatic failover in a 2-node cluster regardless of what the vendors are telling you.
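A tiny sketch of the parity analogy above (my illustration; it uses plain replication instead of real parity bits or CRCs): two copies of a value can only detect a single error, while three copies can outvote it.

```python
def detect_with_two(a: int, b: int) -> bool:
    """Two copies of a bit: we learn *that* they diverged, not which one is right."""
    return a != b

def correct_with_three(a: int, b: int, c: int) -> int:
    """Three copies of a bit: a majority vote recovers the original value."""
    return 1 if a + b + c >= 2 else 0

print(detect_with_two(1, 0))        # True -> something is wrong, but which copy?
print(correct_with_three(1, 0, 1))  # 1    -> the flipped copy is outvoted
```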
  4. Hey Ivan,

    Regarding the comments about stackables: trying to understand your thinking here, as I don't see this as really the same thing. In the case of the chronometers, this is a multi-master model where both are processing data (time) at the same time and capable of spitting out different outputs - more like a Cisco Nexus vPC MLAG or Arista MLAG scenario. In the case of Cisco VSS, Juniper Virtual Chassis, or HPE IRF, there's one master in the group that has full control, and all the other units are simply subscribed (more or less) to the master's view of the world. The failover will happen regardless of whether there are 2 or more boxes, because the next box in priority is going to become the master and life goes on.

    Not to say there aren't other issues with Borg scenarios, as I know you're well aware, but I just don't know if this particular comparison is fair.

    Thoughts?

    @netmanchris
    Replies
    1. The problems seem different, but they're pretty similar - in both cases the challenge is "what is the majority if we get divergence?", or in our case "what should each part of the cluster do if we get a partition?"

      While an even number of cluster members is always a challenge, it might not be as bad if you have 4 or 6. You might be able to fake it by giving the master another half vote ;) Have to think a bit more about it... or someone could send me a link to the result (which would be highly appreciated).
  5. Or maybe we can stop taking two control planes and binding them together? Stop with IRF and Virtual Chassis and VSS etc.? Stop multi-chassis LAGs and use routing for active-active links.
    Replies
    1. That is never going to happen. Vendors call that "differentiation"; I call it snake oil ;)
  6. Lol, couldn't agree more. L3 everywhere for the win. Can't wait until host-based routing is a reality. But until all the applications we're forced to work with run in a resilient way in an L3 environment, I'm afraid we're doomed to repeat the stupid network tricks over and over again.
  7. In the case of MLAG, the LACP peers act as witnesses. If both switches in an MLAG think they're the master at the same time, they report different LACP system IDs. The other end won't bond links with different IDs and becomes the "tiebreaker" (see the sketch after this thread).

    Of course every vendor's MLAG implementation is a little different & I wouldn't be surprised if at least one of them messed up an important detail of the split brain scenario in a subtle but unfortunate way.
    Replies
    1. And they both advertise the same subnet to the core, and when a packet arrives at one of the switches it cannot get to the host because that path is down due to the LACP system ID mismatch.

      The only sane way to handle MLAG cluster (or switch stack) splits is to shut down half of it, but then of course you don't know which half to shut down. QED.
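A rough sketch of the tie-breaking behaviour discussed in this thread (purely illustrative, not any vendor's implementation): the attached device only aggregates links whose partner advertises the same LACP system ID, so after an MLAG split one half's links fall out of the bundle.

```python
def links_to_bundle(links):
    """links: list of (link_name, partner_system_id) pairs seen by the server.

    Keep only the links whose partner advertises the same LACP system ID.
    Which ID wins after a split is implementation-dependent; this sketch
    simply sticks with the first ID it sees."""
    if not links:
        return []
    reference_id = links[0][1]
    return [name for name, sys_id in links if sys_id == reference_id]

# Before a split both MLAG members advertise the same system ID ...
print(links_to_bundle([("eth0", "aa:aa"), ("eth1", "aa:aa")]))  # ['eth0', 'eth1']
# ... after a split-brain they advertise different IDs, so one link is dropped.
print(links_to_bundle([("eth0", "aa:aa"), ("eth1", "bb:bb")]))  # ['eth0']
```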
  8. Great book recommendation by Ivan for network architects and engineers:
    Brooks Jr., Frederick P. The Mythical Man-Month, Anniversary Edition: Essays On Software Engineering (Kindle Locations 754-758). Pearson Education. Kindle Edition.
    Especially Chapter 5, The Second-System Effect - all network architects and engineers should consider reading it.
    The essay on Interactive Discipline for the Architect is still very applicable in today's fast-paced design-to-implementation environment, especially when dealing with vendors acting as the architect or implementer.
    The essay on self-discipline - The Second-System Effect - is particularly relevant for the concept of Stretch:
    “Consider as a stronger case the architecture, implementation, and even the realization of the Stretch computer, an outlet for the pent-up inventive desires of many people, and a second system for most of them. As Strachey says in a review: I get the impression that Stretch is in some way the end of one line of development. Like some early computer programs it is immensely ingenious, immensely complicated, and extremely effective, but somehow at the same time crude, wasteful, and inelegant, and one feels that there must be a better way of doing things.”

    --------------

    “How does the architect avoid the second-system effect? Well, obviously he can't skip his second system. But he can be conscious of the peculiar hazards of that system, and exert extra self-discipline to avoid functional ornamentation and to avoid extrapolation of functions that are obviated by changes in assumptions and purposes.”
    And compare the concept of Stretch from The Mythical Man-Month above to the definition of stretch in network architectures from White, Russ; Donohue, Denise. The Art of Network Architecture: Business-Driven Design (Networking Technology) (p. 81). Pearson Education. Kindle Edition:
    “Modularization and Optimization: If modularization brings so many benefits to network architecture, why shouldn’t every network be modularized at every point possible? Isn’t more aggregation always better than less aggregation? Network design is, as all things, a matter of choosing trade-offs— there is no such thing as a free lunch!



  9. -----------------------
    One of the trade-offs we deal with all the time is state versus stretch. Stretch, quite simply, is the difference between the optimum path through the network (for any pair of hosts) and the actual path through the network. For instance, if the shortest actual path available is 2 hops, but traffic is flowing along a 3 hop path, the stretch is 1. Why should we ever have stretch in a network? It seems like you’d just never, ever, want stretch, because stretch always represents suboptimal use of available resources. But you always end up with stretch, because one of the other fundamental concepts of network design is the use of information hiding to break up failure domains. Hierarchical network design, in fact, is the intentional use of aggregation to reduce the state information— the routing table size, in most cases— in the control plane, so that changes in one area of the network don’t cause changes in the routing table halfway around the world. How does this relate to stretch? Anytime you hide state you increase stretch. This might not be obvious in all networks— specifically, anytime 100% of your traffic flows north/ south, decreasing state will not impact stretch.
    --------------------
    But if you have a combination of north/ south and east/ west traffic, then aggregation— reducing state— will always cause traffic to take a suboptimal path through the network— thus increasing stretch. Spanning tree is a perfect example of running to one extreme of the state/ stretch trade-off. Spanning tree reduces the state by forcing all traffic along a single tree in the network and blocking links that don’t belong to that tree. Control plane state is absolutely minimized at the cost of increasing the stretch through the network to the maximum possible— to the point that we often design network topologies around the elimination of links not used on the single tree.”
  10. A spectacular case of clustering and protocol gone awry:

    Amazon S3 Availability Event: July 20, 2008
    status.aws.amazon.com/s3-20080720.html

    Some theoretical/practical background papers by Leslie Lamport:

    The Byzantine Generals Problem
    research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf

    Time, Clocks, and the Ordering of Events in a Distributed System
    research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf