Circular Dependencies Considered Harmful

A while ago, my friend Nicola Modena sent me another intriguing curveball:

Imagine a CTO who has invested millions in a super-secure data center and wants to consolidate all compute workloads. If you were asked to run a BGP Route Reflector as a VM in that environment, and would like to bring OSPF or IS-IS to that box to enable BGP ORR, would you use a GRE tunnel to avoid a dedicated VLAN or boring other hosts with routing protocol hello messages?

While there might be good reasons for doing that, my first knee-jerk reaction was:

You are making that VM a crucial part of your transport infrastructure. There are some security implications right there. You might want to run it on a dedicated management cluster (like vCenter), in which case adding that extra VLAN to those two ToR switches and running OSPF/IS-IS on them doesn’t sound too bad.

There’s an even more important fallacy in this line of thinking. You’re making your core transport infrastructure dependent on:

  • A data center,
  • A data center fabric,
  • A controller managing that data center fabric (if you bought into the SDN religion),
  • A virtualization environment,
  • A virtualization management/orchestration system,
  • One or more virtual machines that you cannot reach with a console cable.

You need a running network to get to most of these components, resulting in a nightmare circular dependency – once you lose that VM, it will be very hard to get it back.
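
To make that nightmare concrete, here's a minimal sketch (all component names are hypothetical, not taken from any real design) that models "recovering X requires Y to be up" as a directed graph and prints the first dependency cycle it finds. If your recovery runbook contains such a cycle, you'll typically discover it only when everything is down at once.

```python
# Minimal sketch: model "recovering X requires Y to be up" as a directed
# graph and report a dependency cycle. All component names are made up.

deps = {
    "route-reflector-vm": ["virtualization-cluster"],
    "virtualization-cluster": ["vcenter", "dc-fabric"],
    "vcenter": ["dc-fabric"],
    "dc-fabric": ["fabric-controller"],
    "fabric-controller": ["virtualization-cluster"],  # the controller runs as a VM
    "core-routing": ["route-reflector-vm"],           # ORR feeds the core routers
}

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    visited, stack = set(), []

    def dfs(node):
        if node in stack:                        # back at a node already on the path
            return stack[stack.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        stack.append(node)
        for nxt in graph.get(node, []):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        stack.pop()
        return None

    for start in graph:
        cycle = dfs(start)
        if cycle:
            return cycle
    return None

print(find_cycle(deps))
# ['virtualization-cluster', 'vcenter', 'dc-fabric', 'fabric-controller', 'virtualization-cluster']
```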

Seasoned networking architects were aware of the dangers of circular dependencies long before the centralized-controller SDN hype started, and virtualization vendors like VMware have always been careful to keep at least some semblance of an independent control plane on the hypervisor hosts, so the management system can be restarted from the outside¹. Networking vendors have no such reservations – judging by some slide decks, it seems perfectly fine to run a network control plane on infrastructure with heavy circular dependencies on the network itself (not to mention DNS or NTP)².

Think something like that can never happen, or that you could build enough resiliency into your design to survive any possible failure? Look no further than the October 2021 Facebook outage that disconnected Facebook from the Internet.

According to the official outage report:

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.

It seems (from the outside) like they had a circular dependency between DNS and BGP (here’s why), and never experienced a (real or simulated) failure that would expose that dependency.
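
From the outside, the control loop implied by that report looks roughly like the sketch below. This is not Facebook's code, just an illustration of the described behavior; the prefix and parameter names are invented.

```python
# Rough sketch of the behavior described in the outage report -- not
# Facebook's actual code. The prefix below is a documentation address.

ANYCAST_PREFIX = "192.0.2.53/32"   # stand-in for a public DNS anycast prefix

def bgp_action(backbone_reachable: bool, dns_healthy: bool) -> str:
    """What an edge POP does with its DNS anycast prefix."""
    if dns_healthy and backbone_reachable:
        return f"advertise {ANYCAST_PREFIX}"
    # Meant as a health check: don't attract queries you can't answer.
    # When the *entire* backbone disappears, every POP takes this branch,
    # all anycast prefixes get withdrawn, and the perfectly healthy DNS
    # servers become unreachable from the Internet.
    return f"withdraw {ANYCAST_PREFIX}"

print(bgp_action(backbone_reachable=True,  dns_healthy=True))   # normal operation
print(bgp_action(backbone_reachable=False, dns_healthy=True))   # October 2021: DNS is fine, nobody can reach it
```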

In the meantime, there were reports on Twitter that Facebook employees couldn’t enter buildings (doors not being able to reach authentication servers), and couldn’t use third-party services like Google Docs or Zoom because those required Facebook authentication.

Based on how long it took them to get to the affected routers, it looks like their out-of-band network was useless (OOB access servers relying on RADIUS servers?). After the outage was over, there were claims that the final tool needed to get to the bricked router(s) was an angle grinder (same source: none of the doors have keyholes, so what happens if that system goes down?).

Every large-enough system is full of circular dependencies (someone should make a law out of that). Kripa Krishnan (Google) mentioned a few they discovered during Disaster Recovery Testing in a (must-read) ACM Queue article:

  • Failovers failed because the alternate location was unavailable;
  • Unavailable authentication servers locked users out of their workstations;
  • The configuration server for the alerting and paging system went offline, making it impossible to redirect alerts to other locations.

Couldn’t we avoid the dependencies? Of course we could, if only someone could visualize the whole picture³, but that tends to be impossible in large-enough systems. Another root cause might be the stability of the infrastructure⁴ – when infrastructure is stable enough, its users take it as a given (see also: the first fallacy of distributed computing).

What else could we do? Test, test, test. Trigger real failures (don’t fake them), learn from them, and fix stuff. All the big players do that; maybe it’s time for you to start as well.
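
If you need a starting point, here's a deliberately simple sketch of such an exercise. The scenario names and systemctl commands are placeholders for whatever "break it for real" means in your environment, and the end-to-end health check is left as an exercise.

```python
# Bare-bones "game day" sketch. Scenario commands and the health check are
# placeholders -- replace them with whatever is real in your environment.

import random
import subprocess
import time

DRY_RUN = True   # flip to False to actually break things

SCENARIOS = {
    # scenario -> (command that triggers a real failure, command that restores it)
    "lose-oob-radius": ("systemctl stop radiusd", "systemctl start radiusd"),
    "lose-local-dns":  ("systemctl stop unbound", "systemctl start unbound"),
}

def run(cmd: str) -> None:
    print(f"[{'dry-run' if DRY_RUN else 'exec'}] {cmd}")
    if not DRY_RUN:
        subprocess.run(cmd.split(), check=True)

def still_working() -> bool:
    """Placeholder for an end-to-end check (e.g., can an operator still log in?)."""
    return True

name = random.choice(list(SCENARIOS))
trigger, restore = SCENARIOS[name]

run(trigger)                                  # real failure, not a simulation
try:
    time.sleep(1 if DRY_RUN else 300)         # give failover mechanisms time to react
    if still_working():
        print(f"{name}: survived")
    else:
        print(f"{name}: found a hidden dependency -- fix it and re-test")
finally:
    run(restore)                              # always put the component back
```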


  1. Cisco Nexus 1000v architects learned that lesson the hard way – after the 1000v control plane VM failure, the ESXi hosts would get (permanently) disconnected from the network because they ran LACP from the control plane. ↩︎

  2. Proving (yet again) RFC 1925 rules 4 and 5 ↩︎

  3. This is a perfect time to mention AI/ML as the fix to all problems. ↩︎

  4. Does it look like I’m saying it’s Ops’ fault for being too good? Get used to it – it’s always Ops’ fault, and within the infrastructure, it’s always the network. Within the network? DNS, of course… unless it’s BGP. ↩︎


3 comments:

  1. There is a nice story in the O'Reilly book Site Reliability Engineering providing an example of "[...] when infrastructure is stable enough, its users take it as a given":

    "In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added."

    Source: The Global Chubby Planned Outage

    Replies
    1. Fantastic solution. Thanks a million for the pointer!

  2. I refrained from commenting on the BGP ORR post because it reeks of shit from the way you described it, sounding like more BS LFA nonsense. The Cisco doco that goes over it in detail makes it clear that ORR is just a solution looking for (imaginary) problems, problems already solved by other existing methods. And even as a solution, ORR is a terrible one with a circular dependency -- look how many other features you need to enable just for it to work, and if that's what's on the surface, what kind of hellscape would it be code-wise -- and a poor attempt to turn BGP into an LS protocol. Why BGP-LS, why? What's next, path-vector OSPF?

    And indeed, ORR resorts to the LFA/SPB trick, and due to the intensity of running SPF multiple times, they have to limit it with designated roots. On top of that, ORR is the wrong way to solve the problem, a stupid nerd knob resulting from a poor understanding of addressing. MPLS is the right way to solve the problem, because it solves the real one by creating multiple logical address spaces out of one underlying physical IP space. A good network design with appropriate placement of RRs is another right way to do things. Inventing rubbish band-aids like ORR for the lame excuses mentioned in the Cisco doc is just dumb. Optimal RR my ass.

    Re the FB outage, I incidentally commented on your NSF post some 3 weeks before this happened, on the unpredictable nature of nonlinear effects resulting from optimization-induced complexity. Their outage just drives home the point that optimization is a dumb process and leads to combinations of circular dependency that no one can account for and test. The combinations can be infinite given the parameter space in a large network with lots of features turned on; who has the capability to enumerate all of the twists and turns, let alone test them? Let's face it: tail risks in a complex system aka Black Swan events, by their non-linear nature, cannot be predicted -- think Fukushima -- so instead of trying to rationalize them after the fact, it's much better to go simple in the first place.

    Reading the many papers that FB IT teams have published over the years, one can't help but get the notion that a lot of what these guys do is change for the sake of change. One has to wonder if that's part of their job security? If so, the very same reason is the cause behind the downfall of Big Science, where the profit incentive drives people away from doing real science and into shitty career-building and grant-winning parlor tricks. FB IT, incl. their network teams, always seem to be chasing one sort of optimization or another -- I suppose the same thing can be said about other hyperscalers as well, and the Agile movement in general, so this is not directed just at FB -- but are most of their optimizations optimal, or even necessary? I tend to seek out heretical discoveries due to my big distrust of the mainstream, and in one example, it looks like a lot of advanced routing tricks fare no better than plain old BGP:

    https://homepages.dcc.ufmg.br/~cunha/papers/arnold19hotnets-bgp.pdf

    So much for optimization. The upside is questionable, while the downside is blatantly obvious, exemplified by this outage and the like. When you cram too much crap in, at some stage a bifurcation point will be reached, where even a tiny change will propagate throughout the entire system, causing a phase transition and potential collapse, depending on the nature of the change. That is the true butterfly effect, not the popular, jaded butterfly effect people like to brag about in the mainstream. But IT people, being ignorant of this, don't seem to care much about adding complexity on top of complexity, because it makes them look smart.

    This, IMO, is one consequence of fucking around too much with software and computers, of the BS software-eating-the-world mentality. People who spend all day in front of a PC screen lose touch with physical reality and become stupid nerds thinking all the stuff they read in science fiction can be realized. Over-reliance on technology, attributing non-existent power to it, is a manifestation of this WOKE mentality -- I said over-reliance because look at how they couldn't even get into the office during the outage. Maybe making RFC 1925 mandatory reading for all IT workers could be a step in the right direction to help reverse this trend.

    "Every large-enough system is full of circular dependencies (someone should make a law out of that)."

    Yes, they do. Here is one, Ivan :))

    https://how.complexsystems.fail/

    and RFC 3439 as well. RFC 3439 is essentially RFC 1925 BY EXAMPLES. RFC 1925 raises the thesis; 3439 gives it detailed treatment. It is, therefore, crucial reading that everyone serious about networking must read. Read and re-read it, time and time again, to appreciate the lessons in it and see what a mess the current state of networking really is.

  3. When the WAP gateway was a very critical component of our services, after designing proper redundancy and failover we initiated failure testing every month. We also tested restoration of failed components. Sometimes you get surprises at that point... :-) The tests were meant to prove that our assumptions about having configured everything correctly were still valid. Service availability improved significantly.

    People should also not forget failures caused by single events. These are rarely taken into account... With circular dependencies, such failures could be amplified and would look extremely mysterious, because you cannot repeat or recreate them...
