Why Is OSPF not Using TCP?
A Network Artist sent me a long list of OSPF-related questions after watching the Routing Protocols section of our How Networks Really Work webinar. Starting with an easy one:
From historical perspective, any idea why OSPF guys invented their own transport protocol instead of just relying upon TCP?
I wasn’t there when OSPF was designed, but I have a few possible explanations. Let’s start with the “what functionality should the transport protocol provide” reasons:
TCP has point-to-point sessions. That’s more than good enough today when everything is a router (or a layer-3 switch), but in the early days routers were expensive, and it was quite common to have numerous edge routers connected to a shared layer-2 segment. OSPF DR/BDR uses IP multicast to send the same information to all neighbors at once.
TCP provides a single stream, whereas OSPF has per-LSA retransmission capabilities. A dropped OSPF LSA does not prevent other LSAs from being sent to a neighbor, whereas a dropped TCP packet stalls the TCP session.
TCP provides a byte stream, and expects a higher-layer protocol to provide message boundaries. That’s also why a single dropped packet stalls a TCP session – as there are no message boundaries in the TCP byte stream, TCP cannot deliver out-of-order packets to higher layers.
Head-of-line blocking. A corollary of “TCP provides a single stream”. TCP has a hard time delivering urgent messages ahead of the usual chatter. Yes, it has urgent data, but that functionality is “somewhat” limited. OSPF could transmit LSAs in any order it wishes (not that I would be aware of any implementation doing so, but at least the functionality is there).
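The difference between single-stream delivery and per-LSA delivery can be illustrated with a toy simulation. This is only a sketch: the function names are made up for illustration, sequence numbers stand in for TCP segments or LSAs, and nothing here resembles a real TCP or OSPF implementation.

```python
def in_order_delivery(arrivals):
    """Model a single byte stream: data is released strictly in sequence order.

    Returns, for each arrival, the list of sequence numbers handed to the
    application at that moment.
    """
    buffered = set()
    next_seq = 0
    timeline = []
    for seq in arrivals:
        buffered.add(seq)
        released = []
        # Release everything that is now contiguous with what was delivered.
        while next_seq in buffered:
            buffered.remove(next_seq)
            released.append(next_seq)
            next_seq += 1
        timeline.append(released)
    return timeline


def per_message_delivery(arrivals):
    """Model independently acknowledged messages: each one is usable on arrival."""
    return [[seq] for seq in arrivals]


# Hypothetical scenario: item 1 is lost and its retransmission arrives last.
arrivals = [0, 2, 3, 1]
print(in_order_delivery(arrivals))     # [[0], [], [], [1, 2, 3]]
print(per_message_delivery(arrivals))  # [[0], [2], [3], [1]]
```

In the single-stream model, items 2 and 3 sit in the receive buffer until the retransmitted item 1 shows up; with per-message delivery they are processed as soon as they arrive.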
There’s also neighbor discovery: As Enrique Vallejo pointed out in the comments, you still need a multicast-based hello protocol to discover adjacent routers if you find it unacceptable to configure them. That doesn’t mean you can’t use TCP to establish the sessions once the neighbors are discovered – LDP uses UDP-over-multicast to discover neighbors, and TCP to exchange labels.
Finally, there might have been other considerations, including:
TCP was considered an overkill. After all, TCP provided decent reliable end-to-end transport under a variety of conditions while all we needed in OSPF was a single-hop quick fix.
Straight from OSPF: Anatomy of an Internet Routing Protocol by John T. Moy (quote provided by Paul Hare):
We did not need the reliability of TCP; link-state routing protocols have their own reliability built into the flooding algorithms, and TCP would just get in the way. Also, the ease of applications in UNIX and other operating systems to send and receive UDP packets was seen by some as a disadvantage; the necessity of gaining OS privileges was seen as providing some small amount of security. The additional small benefits of UDP encapsulation were outweighed by the extra 8 bytes of UDP header overhead that would appear in every protocol packet. So we decided to run OSPF directly over the IP network layer, and we received an assignment of IP protocol number 89 from the IANA.
TCP was considered to be a resource hog by people writing networking code. I never understood this one, as it persisted long after WWW took off for real.
CPU cycles were precious in early routers, as they used the same CPU for control-plane activities and packet forwarding. Networking vendors cutting costs and using the cheapest CPU they could get away with didn’t help either. Keep in mind that running an O(|E| + |V| log |V|) algorithm on a graph with a hundred nodes was considered to be a big deal in those days.
Networking is special. We couldn’t simply reuse a protocol that works. We have to invent something more optimal (leading to tons of protocols with unique binary encodings instead of everyone using the same markup language). Lack of understanding of what the presentation layer should provide didn’t help either (considering the alternative could be ASN.1, maybe I shouldn’t complain too much).
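To make the encapsulation choice from the Moy quote more tangible, here is a minimal sketch that packs the fixed 24-byte OSPFv2 packet header and compares it with the 8 bytes a UDP header would add. The helper name and the sample router ID are made up for illustration, the checksum is left at zero, and no packet is actually sent.

```python
import socket
import struct

OSPF_IP_PROTOCOL = 89   # IP protocol number assigned to OSPF
UDP_HEADER_BYTES = 8    # the per-packet overhead mentioned in the quote

# OSPFv2 common header: version, type, packet length, router ID, area ID,
# checksum, authentication type, 8 bytes of authentication data.
OSPF_HEADER_FORMAT = "!BBHIIHH8s"


def build_ospf_header(packet_type, router_id, area_id, body_length=0):
    """Pack an OSPFv2 header (illustrative helper; checksum is NOT computed)."""
    header_length = struct.calcsize(OSPF_HEADER_FORMAT)
    return struct.pack(
        OSPF_HEADER_FORMAT,
        2,                                   # OSPF version 2
        packet_type,                         # 1 = Hello
        header_length + body_length,         # total OSPF packet length
        int.from_bytes(socket.inet_aton(router_id), "big"),
        int.from_bytes(socket.inet_aton(area_id), "big"),
        0,                                   # checksum placeholder (assumption)
        0,                                   # AuType 0 = no authentication
        bytes(8),                            # empty authentication field
    )


header = build_ospf_header(1, "10.0.0.1", "0.0.0.0")
print(len(header))                           # 24
print(UDP_HEADER_BYTES / len(header))        # UDP header relative to OSPF header

# Sending this directly over IP would need a raw socket, for example
# socket.socket(socket.AF_INET, socket.SOCK_RAW, OSPF_IP_PROTOCOL),
# which typically requires elevated OS privileges (the "small amount of
# security" in the quote). It is deliberately not executed here.
```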
Keep Reading
You MUST read the extensive comments:
- Henk Smit explaining the efforts to use TCP transport with IS-IS
- Tony Przygienda describing tons of things that could go wrong in a transport protocol used by a routing protocol
- Minh Ha debunking the TCP is a resource hog myth
I also received pointers to:
- Experiments with IS-IS flooding by Sarah Chen and Tony Li (PDF) from IETF108
- More tests of IS-IS flooding from IETF109
Have I missed anything? Got it all wrong? Please write a comment or send me an email.
Related to the first point about point-to-point sessions in TCP, I would also add that OSPF neighbor discovery (via broadcast of Hello messages) cannot be implemented over TCP. You might replace IP multicast with multiple individual TCP sessions to disseminate LSA information (your first point), but you cannot implement neighbor discovery using TCP, because you need to know the other endpoint in order to establish the TCP connection.
I've got a copy of "OSPF: Anatomy of an Internet Routing Protocol" by John T Moy ISBN 0-201-63472-4.
I hope the quote below proves of interest.
Section 3.2 which discusses the choice of encapsulation (Page 51 in my copy)
"We did not need the reliability of TCP; link-state routing protocols have their own reliability built into the flooding algorithms, and TCP would just get in the way. Also, the ease of applications in UNIX and other operating systems to send and receive UDP packets was seen by some as a disadvantage; the necessity of gaining OS privileges was seen as providing some small amount of security. The additional small benefits of UDP encapsulation were outweighed by the extra 8 bytes of UDP header overhead that would appear in every protocol packet. So we decided to run OSPF directly over the IP network layer, and we received an assignment of IP protocol number 89 from the IANA"
@Enrique & Paul: Thanks a million. Added to the blog post.
Note that recently there were idiots who suggested using TCP for the flooding of LSPs in IS-IS. https://tools.ietf.org/html/draft-hsmit-lsr-isis-flooding-over-tcp-00
This won't work over a multi-point network (like a switched Ethernet with more than 2 routers). This would work only over p2p links. But there are benefits (see draft):
- flow control, improving the speed of flooding of large numbers of LSPs
- using the reliability of TCP, which allows the IS-IS implementation to be a bit simpler and keep less state, and removes scaling factors like having to use large numbers of precise timers
- retransmission is done by the kernel, not by the IS-IS process, which gives you a form of multi-threading for free
- all future improvements to TCP can be used automatically by IS-IS
But nah. The IETF LSR workgroup has decided they would rather invent their own new wheel.
ASN.1 has nothing to do with the real-time behaviour of the network. It is a domain-specific protocol design language that is compiled to some programming language to create a packet assembler or interpreter. Similar to yacc/lex, just optimized for networking protocols. Then you just need to add your state machines and you have a working protocol stack. You could create both the transmit and the receive sides from the same source code, so it would be guaranteed to be consistent.
You could also specify your protocols in any other language, including English text files, although this gives fewer opportunities for automation than ASN.1. You can also have a JSON presentation of ASN.1. It is also possible to translate between XML and ASN.1. ASN.1 also enables developing TTCN-3 based automated protocol testing.
ASN.1 is a widely used tool even today. Most of the 3GPP protocols are specified by that. It has its own decades long evolution path and it will not disappear for a long time.
Of course, ASN.1 is not for the faint-hearted, and it does not support well the popular ad hoc, impulsive, political protocol development.
@Henk: Considering that we're already using BGP on multi-access networks, and that flooding goes through DIS or DR, I don't see any reason why you couldn't use IS-IS over TCP over LLA on multi-access networks... but as you wrote, reinventing the wheels is so much more fun (and CV-generating).
Well, I cannot resist adding some mud to clarify ;-)
So, the story is very, very multi-faceted and talking about what makes sense depends very much WHEN you talk WHAT made sense and @ WHAT scale/speed you want things to happen.
First, architecturally, running control plane over TCP is a mild abomination that cost us dearly and I had my share of arguments with Yakov about that then (well, I was a young punk @ that time ;-). Short term, it was a great temporary shortcut. Long term it cost us a lot of pain in things like NSR (and still does). Anyway, for BGP it's history ;-) and we can't even get multiple parallel sessions standardized so we'll never wean it off TCP.
So, let's talk about IGPs over TCP (and there are some, Open/R does it so it CAN be done). Pluses & minuses modulo time scale here.
In John's time bits were incredibly expensive for IGPs, really, incredibly: links were thin, memory was small, CPUs were glowing @ 600 MHz ;-) TCP for that purpose was massive overhead simply in terms of context switching, all the formatting, fast timers, and whatever else the kernel burns. Today, it's arguably less of an issue.
Then, stock kernel TCP is fun but it loses its appeal real, real quick once you hit serious peer scale or look for really fast stuff. The reasons are many, but basically kernel TCP is mostly tuned to be a fairly docile animal and it does not allow you to fiddle with lots of the stuff you want to fiddle with to get it to start fast, support NSR and the bazillion other little twists you need on a serious-scale, high-end device. And then you pay for kernel context switches as well. So you end up pulling TCP into user space and tweaking the hell out of it, or leaving it in kernel space and tweaking the hell out of it as well; neither of those is in any sense trivial.
TCP also forces you onto an addressing scheme and that's not exciting; IGPs sometimes run on fun stuff like a single link-local address on all links, unnumbered, overlapping and so on. Teaching TCP to do that properly can cause unexpected effects of limited fun.
Then, if you don't use TTL 255 or similar special hacks, TCP will happily build sessions all over the place for you ;-) So more kernel/user space special knobs.
Head of line blocking is to some extent a problem @ very high speeds and low flooding rates but I'd consider it marginal. On very slow links it was an issue, yes.
Framing is an issue; anybody who wrote a really solid, good BGP framer resilient to all kinds of misformatting knows it's not trivial. And the bigger the BGP buffer gets (64K updates, anyone?) and the more peers you have, the higher the bloat, to the point where it starts to become noticeable. Yes, we're talking 1000s of peers and arguably, IGP is almost never used in such a setup but neither is BGP except they are ;-)
What more? Well, lots of warts and things like flushing, linger on going down and trying to restart peers fast, nightmarish collision FSMs and funky issues on fast interface address changes, next-hop changes TCP uses to resolve (hmm, where is that once TCP is in user space ;-) and other things that make you tear your hair out as a practitioner (MD5 TCP option hacks anyone, SYN attacks anyone) and pretty soon you wish you wrote your own flooding and pumped raw frames over an interface. Is that hard? Yeah, moderately so, especially the first time. The real hard part is that it's unforgiving: you can't get it "almost" right, it has to be 105% right just like TCP is. Performance tuning is its own little game; the IGP WG is enlightening people to an extent as we speak and that's good work, albeit arguably an implementation detail that matters @ scale only.
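For readers who have never written one, the following is a deliberately small sketch of what a framer over a TCP byte stream has to do, using the classic BGP message header (16-byte marker, 2-byte length, 1-byte type, lengths between 19 and 4096 bytes). The function and exception names are invented for illustration, the 4096-byte limit ignores the larger messages mentioned above, and a production framer needs far more validation and buffer management than this.

```python
import struct

BGP_MARKER = b"\xff" * 16
BGP_HEADER_LEN = 19      # marker (16) + length (2) + type (1)
BGP_MAX_LEN = 4096       # classic limit; assumption: no extended messages


class FramingError(Exception):
    """Raised when the byte stream cannot be a valid message sequence."""


def extract_messages(buffer):
    """Split buffered stream bytes into complete messages.

    Returns (messages, leftover) where messages is a list of
    (type, body) tuples and leftover holds the bytes of a message
    that has not fully arrived yet.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= BGP_HEADER_LEN:
        marker = buffer[offset:offset + 16]
        length, msg_type = struct.unpack_from("!HB", buffer, offset + 16)
        if marker != BGP_MARKER:
            raise FramingError("bad marker, stream is out of sync")
        if not BGP_HEADER_LEN <= length <= BGP_MAX_LEN:
            raise FramingError(f"impossible message length {length}")
        if len(buffer) - offset < length:
            break  # message body still in flight; wait for more bytes
        body = buffer[offset + BGP_HEADER_LEN:offset + length]
        messages.append((msg_type, body))
        offset += length
    return messages, buffer[offset:]


# A KEEPALIVE (type 4) is just the 19-byte header.
keepalive = BGP_MARKER + struct.pack("!HB", 19, 4)
messages, leftover = extract_messages(keepalive + keepalive[:10])
print(messages)        # [(4, b'')]
print(len(leftover))   # 10
```

Every peer needs its own reassembly buffer sized for the largest possible message, which is where the per-peer memory cost described above comes from.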
I BTW disagree with the UDP angle (in today's times), John was right in his time but UDP gives you basically zero work to get off the ground on anything & allows for very easy multiplexing. Anyone who implemented SNAP for ISIS on anything will know what kind of fun is better avoided ;-)
So, I probably forgot a bunch of things but in sum, in the underlay I vastly prefer a special reliable session protocol (because that's what discovery/flooding gives you compared to TCP) on a protocol, whereas when things move closer to database land (think BGP), where session bringup speed is less important, sessions live for a very long time and pump lots of data, and fastest convergence is not the primary concern, I wish we had a proper specialized session protocol for control plane (actually, that has been done if you look up e.g. RFC4960 AFAIR; recently QUIC is treading new ground).
Ivan, re the "TCP was considered a resource hog" point, there's a seminal paper debunking that myth 31 years ago, in 1989 (It was also around this time that OSPF and ISIS came to be, IIRC):
https://groups.csail.mit.edu/ana/Publications/PubPDFs/An%20Analysis%20of%20TCP%20Processing%20Overhead.pdf
The paper went into the painstaking detail of breaking down the performance components of TCP into instruction level, so it's a pretty good read.
What I like the most about the paper is one of its conclusions: "It is not enough to be better than TCP and to be compatible with the form and function of silicon technology. The protocol itself is a small fraction of the problem." Looks like this statement is as applicable today as it was 30 years ago.
People these days talk down TCP a lot and come up with all sorts of alternatives, but how much stress-testing have those protocols gone through, in real-world environments, with real traffic, not in test/benchmark labs? Real-world traffic is neither Bernoulli nor Poisson; it's not beautifully distributed but bursty on many levels, aka scale-invariant/long-range dependent. Just because a protocol works well in a lab or in controlled environments doesn't mean it'll perform nice and clean in the wild. And then we have the problem of code-level maturity too. A protocol that has lovely functionality might have a horrible implementation under the hood, with all sorts of race conditions and other concurrency bugs.
And while a lot of Tony's points are correct, esp. those re context-switch overhead, I'm just wondering how many of them have been mitigated as TCP implementations matured over the years? Also, assuming OSPF runs on top of TCP, the problem of accidental full-BGP-table redistribution would be mitigated, as TCP has built-in flow control and the session would not flap due to overload, avoiding the massive clean-up issue. Using TCP as the transport would also remove the need to maintain multiple, confusing link-state timers, as Henk rightly pointed out. Overall, these 2 points mean improved scalability for OSPF, and ISIS as well.
"TCP has built-in flow control and the session would not flap due to overload, avoiding the massive clean-up issue" -- well, it can also contribute to massive routing mess instead ;) https://mailarchive.ietf.org/arch/msg/idr/L9nWFBpW0Tci0c9DGfMoqC1j_sA/