Why Is OSPF (and BGP) More Complex than STP?

I got this question from one of my readers:

Why are OSPF and BGP are more complex than STP from a designer or administrator point of view? I tried everything to come to a conclusion but I couldn’t find a concluded answer, ChatGPT gave a circular loop answer.

There are numerous reasons why a protocol, a technology or a solution might be more complex than another seemingly similar one (or as Russ White would have said, “if you haven’t found the tradeoffs, you haven’t looked hard enough”):

TL;DR generated by ChatGPT GPT-4 ?
OSPF and BGP are more complex than STP because they solve different problems and have different performance goals. STP focuses on preventing loops in forwarding topology, while OSPF and BGP find optimal forwarding topologies and propagate edge information. OSPF’s complexity arises from various optimizations, while BGP complexity comes from implementing routing policies in large-scale environments.

Someone failed to find a simpler solution. Every now and then, someone finds an amazingly simple solution to what seemed to be a hard problem. Considering the long history of all three protocols1, the time spent on theoretical foundations of routing (example: graph theory), and numerous routing protocols developed in the past, we can probably conclude this one doesn’t apply.

Trading performance for complexity. Faster implementations or implementations with lower memory requirements or stricter performance guarantees tend to be more complex than simplistic implementations (assuming we looked hard enough for a simpler solution). The long history of sorting algorithms is an excellent example. Cache-based forwarding and vector packet processing versus simple packet-by-packet forwarding is another one.

They are solving a different problem. This is obviously the case for STP, OSPF, and BGP.

The only job STP has is to shut down block2 interfaces to prevent loops in a forwarding topology, and it does that in the simplest possible way: if a bridge hears about a more prominent bridge3 through multiple interfaces, it shuts down blocks all but one of them. STP does not advertise edge prefixes4, has no concept of neighbors and no reliability5, and the only way to make such a simplistic protocol work is to wait a while after each change to make sure things settle down.

OSPF has to find the optimal forwarding topology from the perspective of any node in the network and propagate edge information (IP subnets) to all forwarding nodes to allow them to build deterministic optimal forwarding tables. While simpler routing protocols like RIP get the same job done, OSPF does it faster and more reliably. To reach all those goals, OSPF has to:

  • Find adjacent routers (neighbors)
  • Exchange its network topology information with adjacent routers
  • Figure out when the adjacent routers are no longer reachable and track other changes in local network topology.
  • Propagate those changes to all its neighbors, and flood changes generated by remote routers
  • Use the network topology information and edge reachability information to build optimal forwarding table.

That sounds reasonable, but the designers of OSPF wanted to get it implemented on 16-bit processors with 2MB of memory running at 4 MHz clock rate. To get there, they had to implement tons of optimizations like areas, inter-area summarization, and stub areas. Like that wouldn’t be hard enough, people started using OSPF in weird scenarios and added features like not-so-stubby-areas, totally not-so-stubby areas, and on-demand circuits to make those scenarios work. Long story short, exploring all the potential nooks in the solution space turned OSPF into bloatware6.

Finally, the designers of BGP wanted to implement routing policies in large-scale environments, and they did a marvelous job – BGP can carry millions of routes advertised by tens of thousands of autonomous systems. To get there, they:

  • Decided to use a well-understood transport protocol (TCP),
  • Used distance-vector7 approach because one cannot build an Internet-wide link state graph8, and
  • Traded speed of convergence for scalability.

You’ll learn more about the routing protocol basics in the Routing Protocols part of How Networks Really Work webinar. I also collected links to several open-source networking textbooks in the Textbooks and Other Resources part of Networking Fundamentals roadmap.

As for ChatGPT gave a circular answer: there’s still no substitute for hard work and understanding how things really work.

Revision History

2023-04-28
Correction: STP does not shut down interfaces but puts them into blocking state.

  1. And assuming networking engineers working in IETF aren’t stupid or blindsided ;) ↩︎

  2. As an anonymous commenter pointed out, STP does not shut down the interfaces (or it wouldn’t be able to listen to BPDUs) but puts them into a BLOCKING state in which they don’t receive or send anything that is not a layer-2 control-plane traffic. ↩︎

  3. Sometimes called ’the root of the spanning tree' ↩︎

  4. Relying instead on dynamic MAC learning that does not scale beyond a few hundred attached nodes ↩︎

  5. It happily forms a forwarding loop if a neighbor forgets to announce what it’s doing. ↩︎

  6. The same thing happened to STP with Rapid STP, per-VLAN Spanning Tree, and Multiple STP. ↩︎

  7. OK, path vector for the pedants. ↩︎

  8. Plus nobody wants their competitors to see the innards of their networks or their external connectivity. ↩︎

3 comments:

  1. As I recall, DecNet Phase V was considered by its designers to be "the easier solution" for complex networks and was intended to be the network layer routing protocol in OSI. http://www.bitsavers.org/pdf/dec/decnet/EK-DNAPV-GD_DECnet_Phase_V_General_Description_Sep87.pdf

    My experience was limited to DecNet Phase IV and the wonders of interconnecting multiple sites all of which had chosen to be in Area 1 (IIRC - this was in the 1990s).

    DecNet Phase V and OSI didn't get widely adopted despite a major push including government requirements to deploy it. This wasn't surprising given the cost of the OSI software licenses et al compared to the zero cost of open source TCP/IP.

  2. I thought that STP does not shutdown interfaces (except BPDU guard) for loop prevention but put the interface into blocking state so that no MAC addresses were learnt from that interface.

    Replies
    1. Of course you're correct (although the results are almost the same apart from the ability to receive and send BPDUs). Thank you, fixed.

    2. > […]it blocks down one of them[…]

      I think this should be: "it blocks all but one of them".

      (No more "down" from the former "shut down" wording, and only one bridge port may stay unblocked.)

    3. Hope I got it right this time. Thank you!

      Note to self: fix blog posts after the second coffee of the day 🥴

  3. When comparing one optimal design to another optimal design, complexity is a manifestation of the information required to execute an equation/information function. Those things that require more information, and are more complex, are often doing more.

    Cannot reduce complexity beyond necessary information and function: https://internetdynamics.substack.com/p/cannot-reduce-complexity-beyond-necessary

    Axioms - Information https://internetdynamics.substack.com/p/axioms-information

Add comment
Sidebar