Is OSPF Unpredictable or Just Unexpected?

I was listening to a very interesting Future of Networking with Fred Baker a long while ago and enjoyed Fred’s perspectives and historical insight until Greg Ferro couldn’t possibly resist the usual bashing of traditional routing protocols and praising of intent-based (or flow-based or SDN or…) whatever.

Here’s what I understood him to say around 35:17:

The problem with the dynamic or distributed algorithms is that they quite often do unexpected things.

You might think it was a Freudian slip of the tongue, but it seems to be a persistent mantra. Recently it became “a fallacy that a network will ever be reliable or predictable.”

Well, I totally believe that routing algorithms like OSPF would surprise Greg or me (as I often admit during my network automation workshops), but that only means that with all the nerd knobs we added they became too complex for mere mortals to intuitively grasp their behavior.

On a side note, I would love to see how expected the results of complex intent-based systems will be.

Anyway, let’s move from subjective unexpected to objective unpredictable or non-deterministic.

Interestingly, with the clear split between information distribution (LSA flooding) and route computation (SPF algorithm), link-state routing protocols are among the most predictable distributed algorithms out there, and in the worst case result in temporary forwarding loops due to the eventual consistency of the topology database.

Seemingly simpler hop-by-hop protocols like distance- or path-vector routing protocols are much worse and can result in permanent forwarding loops or persistent oscillations.
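To see why hop-by-hop metric propagation misbehaves, consider the classic count-to-infinity scenario: after a link failure, two routers keep bouncing a stale route between each other, each adding one to the other’s metric, until the metric hits the protocol’s notion of infinity (16 in RIP). Here’s a toy simulation (the topology and synchronous update rounds are made up for illustration, not a model of any real implementation):

```python
INFINITY = 16  # RIP's notion of "unreachable"

# Topology A -- B -- C: A and B have routes toward prefix C.
# dist[r] = r's current metric to C; via[r] = r's next hop.
dist = {"A": 2, "B": 1}
via = {"A": "B", "B": "C"}

# The B-C link fails. B loses its direct route, but still hears
# A advertising a (now stale) metric-2 route, so B points at A...
dist["B"], via["B"] = dist["A"] + 1, "A"

rounds = 0
while max(dist.values()) < INFINITY:
    # ...and in each round both routers add 1 to the other's metric,
    # while traffic to C loops between A and B the whole time.
    dist["A"] = min(dist["B"] + 1, INFINITY)
    dist["B"] = min(dist["A"] + 1, INFINITY)
    rounds += 1

print(f"converged on 'unreachable' after {rounds} rounds")
# → converged on 'unreachable' after 7 rounds
```

Split horizon, poisoned reverse, and hold-down timers are exactly the kind of fixes bolted on to paper over this behavior; remove the metric ceiling and the loop never ends.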

Assuming you have infinite patience, it’s quite easy to predict what an OSPF network will look like:

  • Take topology database;
  • Follow all the intricate rules in various OSPF-related RFCs;
  • Get the final forwarding table.

Speaking of the intricate rules: many of them seem like Rube Goldberg fixes introduced to correct unexpected OSPF behavior, probably proving my “lack of intuitive grasp” hypothesis.

Nobody in his right mind would do something like that, but once the steps to a solution are well-defined, it’s trivial (from the perspective of a mathematical proof, not the actual implementation) to carry them out… and there are tools like Cariden’s MATE that do exactly that.

However, because it’s easier to not spend money on something that would prevent an event with uncertain probability (network going down due to misconfigured OSPF, or losing customer data due to an intrusion), vendors like Cariden have relatively few customers, resulting in expensive tools.

Of course, there’s another way of dealing with the “unexpectedness” of OSPF: stop being a MacGyver, forget the nerd knobs, keep your network design as simple as possible, and use the absolute minimum subset of features you need to get the job done.

Unfortunately, it seems like only a small minority of engineers or architects want to follow this particular advice. It’s so much easier to believe in yet another technology wonder.


  1. Well said.

    Speaking of "persistent oscillations" (in BGP, for example) - most examples of those that I have seen are "we have added more knobs, and people build brittleness by twisting too many knobs". BGP or EIGRP in itself are not as deterministic as OSPF (since "on which link was the update heard first?" becomes relevant) but *stable* they are...
    1. I would dismiss "persistent oscillations" not caused by redistribution loops (where it's extremely easy to shoot yourself) as myths a few years ago. In the meantime, I've seen scenarios where they can happen in organically-grown networks with just the right sequence of failures.

      Rare? Sure. Impossible? Definitely not.
  2. Take the CAP theorem, which your friend Russ is teaching us, plus rule 7a from RFC 1925, and you've got the answer.
    1. CAP theorem applies only to the "eventual consistent" part of the discussion - any distributed control-plane protocol solves the problem by dropping consistency (C in CAP) in favor of availability (A) and partition tolerance (P). Eventual consistency causes microloops, but has nothing to do with overall complexity or OSPF nerd knobs.

      RFC 1925 Rule 7a is more applicable: we chose to have Fast routing protocol on Cheap processors and sacrificed Good (or simple) in the process.
  3. OSPF is for losers. We use RIFT and OpenFabric in our network. Just turn those 2 protocols on and they do all the magic for you.
    1. Welcome, traveler from the future. Hope you won't get lost in the clumsy world of 2018 ;)
    2. What switches do you use?
      What operating system do you use on your switches?
    3. I don't think you got it. There's a very early version of RIFT for Junos (see the discussion in RIFT podcast), I don't think there's a released implementation of OpenFabric anywhere... and it wouldn't make sense to run both anyway.
  4. I don't like Ferro's mindset that everything we have now is wrong and only his SDN-WAN paradigm will rule the world. I understand that he blogs for money, but it's starting to cross the line (at least from my perspective).
  5. Having used OSPF as the interior routing protocol for many customers for a long time now, I'm still impressed by its robustness. From time to time we made things way too simple and then wondered how we could filter routes better, which led to a more sophisticated approach - but thinking of a multi-vendor scenario, I wonder if Greg would ever get the same result using a totally closed SDN-thing approach. :-)
    1. BTW: SDN comes with the Achilles' heel of a centralized control plane, which is a real sword of Damocles given today's software quality. An error within it could cause the whole thing to collapse, and I don't think we would gain robustness using such approaches.
  6. Greg Ferro is great in that his last podcast was sponsored by Cisco, and he spent at least 10 minutes bashing them. I expect that in the normal podcasts, but not in the one sponsored by Cisco! Gives me a smile on the way to work.
