When I wrote the Why Does Internet Keep Breaking? blog post a few weeks ago, I claimed that FRR still uses single-threaded routing daemons (after a too-cursory read of their documentation).
Donald Sharp and Quentin Young politely told me I was an idiot and that I should get my facts straight. I removed the offending part of the blog post, promised to write another one going into the details, and Quentin improved the documentation in the meantime, so here we are…
Why Does It Matter?
In a word1: sanity, performance and responsiveness.
Networking engineers love to build artisanal wheels, and routing protocol designers are no better. Every routing protocol has a bespoke implementation of the same three major functionalities:
- Deal with neighbors: discover them, keep them happy, and figure out when one of them keeps quiet for too long.
- Deal with updates: receive them, acknowledge them (when the protocol designer thought he could do better than TCP), send new information out, and retransmit it if needed (yet again, only for people who think TCP sucks)2
- Deal with changes: Update internal topology information based on received updates, calculate new routing tables, push new stuff into routing table to compete with other stuff.
Does it make sense to do all three things in a single monolithic blob of code? Sure it does if you think juggling is a great pastime. Everyone else tries to stay sane by decoupling things into smaller bits that can be executed independently, and according to JeffT (see comment below) most modern routing protocol stacks are implemented that way3.
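To make the decoupling idea concrete, here's a toy sketch (in no way FRR's actual code; every name is invented for illustration) of the three concerns split into independent workers that hand work to each other over queues:

```python
import queue
import threading

# Toy sketch: three decoupled workers connected by queues. Not FRR code;
# every name here is invented for illustration.

updates = queue.Queue()   # raw updates received from neighbors
changes = queue.Queue()   # topology changes waiting for route calculation
rib_log = []              # stand-in for "push new stuff into routing table"

def neighbor_worker():
    # Deal with neighbors: receive updates and hand them off untouched.
    for raw in ["prefix A", "prefix B", None]:   # None = shutdown marker
        updates.put(raw)

def update_worker():
    # Deal with updates: parse them and emit topology changes.
    while True:
        raw = updates.get()
        if raw is None:
            changes.put(None)
            break
        changes.put(f"recalculate after {raw}")

def route_worker():
    # Deal with changes: run the route calculation independently of I/O.
    while True:
        change = changes.get()
        if change is None:
            break
        rib_log.append(change)

workers = [threading.Thread(target=w)
           for w in (neighbor_worker, update_worker, route_worker)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(rib_log)   # ['recalculate after prefix A', 'recalculate after prefix B']
```

Each stage can fall behind (or be rewritten) without the others caring, which is the whole point of breaking up the monolith.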
For the sake of completeness: Cisco IOS programmers figured that out decades ago. For example, Cisco IOS OSPF implementation uses two processes per routing protocol: a hello process and a router process:
```
s1#show ip protocols summary
Index Process Name
0     connected
1     static
2     application
3     ospf 1
s1#show processes | include OSPF
 224 Mwe 2D2CE17   23   285   80  8956/12000  0 OSPF-1 Router
 228 Mwe 2D30EB5   15   288   52  9044/12000  0 OSPF-1 Hello
```
Processes or Threads?
The Cisco IOS printout talks about processes, FRR documentation talks about threads. What’s the difference?
The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.
From that perspective, Cisco IOS processes are really threads as Cisco IOS does not have inter-process isolation4.
The only correct answer to “when should one use threads instead of processes” is “it depends”; we could debate the nuances for hours. To make a long story short: whenever independent bits of code share sockets (including TCP sessions) or memory structures, it’s easier to say “who cares about memory isolation” and use threads.
Now that I mentioned Cisco IOS, I have to add another bit of trivia: Cisco IOS is a non-preemptive (or run-to-completion) operating system. As long as one process keeps running, nobody else can jump in to get something done (like sending a badly needed HELLO message)5.
Modern operating systems like Linux can do better. Processes or threads can be interrupted or run in parallel on multiple CPU cores or sockets, which means that a keeping neighbors happy thread can keep sending HELLO messages or BGP keepalives while the updating the BGP table thread scratches its head trying to figure out what to do with another half a million updates that just came in.
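Here's a quick sketch of that behavior (all names invented for the illustration): because the OS preempts threads instead of letting one run to completion, a dedicated keepalive thread keeps firing even while another thread burns CPU on a long calculation.

```python
import threading
import time

# Sketch (invented names): a keepalive thread keeps ticking while another
# thread is busy with a long-running calculation, because the OS preempts
# threads instead of letting one run to completion.

keepalives = []
done = threading.Event()

def keepalive_thread():
    while not done.is_set():
        keepalives.append(time.monotonic())   # stand-in for sending a HELLO
        time.sleep(0.01)

def update_thread():
    # Simulate churning through a pile of BGP updates (pure CPU work).
    deadline = time.monotonic() + 0.2
    total = 0
    while time.monotonic() < deadline:
        total += 1
    done.set()

t1 = threading.Thread(target=keepalive_thread)
t2 = threading.Thread(target=update_thread)
t1.start(); t2.start()
t2.join(); t1.join()

# Several keepalives went out while the update thread was busy.
print(len(keepalives))
```

On a run-to-completion system the keepalive code would have had to wait for the update churn to finish, and the neighbors would have started wondering whether we're still alive.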
Coming back to FRR: according to the process architecture documentation, they split BGP functionality into three threads: an I/O thread, a keepalive processing thread, and a rest of the stuff thread.
Other BGP implementations use even more threads, see for example IOS XR BGP threads.
I mentioned that you could use threads to increase performance when you happen to have too many CPU cores. For example, you could have multiple threads processing incoming BGP updates in parallel, and another bunch of threads building outgoing updates.
Whenever you want to increase the performance of a software solution with a scale-out threading architecture you have to split the problem you’re facing into smaller (hopefully independent6) shards7. There are at least three solutions real-life BGP routing daemons use8:
- Per-neighbor thread handling all neighbor-related I/O operations. RustyBGP took that approach, resulting in phenomenal performance in an environment with many neighbors (as compared to other open-source BGP stacks). IOS XR Distributed BGP goes a bit further, performing as much work as feasible (down to MED comparison) within the speaker threads.
- Per address family thread. BGP tables of individual address families are totally independent of each other, apart from next-hop references between VPN address families (VPNv4/VPNv6/EVPN) and the IPv4 unicast address family. Having a thread per address family is thus a (conceptual) no-brainer9. Cisco IOS XR uses this approach in their Distributed BGP implementation.
- RIB sharding. Numerous threads run in parallel on smaller chunks of the BGP RIB. Junos release 19.4R1 introduced RIB sharding together with update threading (packing outgoing BGP updates in parallel threads). To learn more, read the Deploying BGP RIB Sharding and Update Threading Day One book – chapter 1 does a great job of explaining the concepts.
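The sharding concept from the list above can be sketched in a few lines: hash every prefix into one of N shards, and let each worker own a shard so the workers never have to lock each other's data. This is an illustration of the idea, not any vendor's implementation; all names and numbers are made up.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Sketch of the RIB-sharding concept (not any vendor's implementation):
# hash each prefix into one of NUM_SHARDS buckets so that every worker
# owns an independent chunk of the RIB and no cross-shard locking is needed.

NUM_SHARDS = 4

def shard_of(prefix: str) -> int:
    # A stable hash keeps a given prefix in the same shard across updates.
    return zlib.crc32(prefix.encode()) % NUM_SHARDS

def process_shard(shard_id, shard_updates):
    # Each worker builds its own shard of the RIB; no shared state.
    rib_shard = {}
    for prefix, nexthop in shard_updates:
        rib_shard[prefix] = nexthop
    return shard_id, rib_shard

updates = [("10.0.0.0/24", "A"), ("10.0.1.0/24", "B"),
           ("10.0.2.0/24", "C"), ("10.0.0.0/24", "D")]

# Partition the updates by shard, then process the shards in parallel.
per_shard = {i: [] for i in range(NUM_SHARDS)}
for prefix, nexthop in updates:
    per_shard[shard_of(prefix)].append((prefix, nexthop))

with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    results = dict(pool.map(lambda item: process_shard(*item), per_shard.items()))

full_rib = {p: nh for shard in results.values() for p, nh in shard.items()}
print(full_rib["10.0.0.0/24"])   # prints D: the later update for the prefix won
```

The stable hash is what makes the shards independent: both updates for 10.0.0.0/24 land in the same shard and are therefore processed in order by the same worker, with no synchronization against the other workers.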
Interestingly, it looks like the scale-out BGP daemons were implemented primarily in high-end routers used to run the Internet core, but not in data center switches10.
There might be no need for high BGP performance in data center switches considering the forwarding table sizes in merchant silicon ASICs… although I do wonder how long it takes to bring up a new BGP session in large-scale EVPN deployments considering how many vendors insist on running BGP sessions with EVPN address family between loopback interfaces.
Another reason could be the underlying hardware – I have a feeling that the data center switches still get the cheapest reasonable CPU the vendor can buy, in which case it would make no sense to optimize a routing daemon for many-core performance.
Revision History

- Totally rewrote the Performance section
- Added a pointer to IOS XR threads (HT: Kristian Larsson)
- Added a pointer to thread-per-neighbor RustyBGP implementation (HT: Kristian Larsson)
- Added a pointer to Junos RIB sharding (HT: Adam Chappell)
- Added a bit of a discussion on the viability of non-multi-threaded routing protocol implementations (based on comment from JeffT and the Facebook paper mentioned by Mario).
- Added a “BFD can be offloaded” remark for Twitter pedants who found this blog post outdated because of that omission.
- Reworded the last paragraph a bit because it painted a pessimistic view of multi-threading/multi-core designs. Can’t do more than what I did; sometimes the reality isn’t bright and shiny.
1. Three to be precise, or is it four, but who’s counting. ↩︎
2. The second half of this functionality might be tightly coupled with the next bullet, in which case we’re talking about a distance vector protocol. ↩︎
3. … or not. Facebook published a conference paper in early 2021 in which they proudly compared their multi-threaded custom-built BGP implementation with single-threaded Bird or Quagga. Bird is (AFAIK) still one of the most popular IXP route server implementations. ↩︎
4. I’m positive generations of TAC engineers and software developers loved the fun-to-squash bugs you get that way. ↩︎
5. And if the offending process doesn’t give up in a reasonable time, you get the much-admired CPUHOG syslog message. ↩︎
6. If the shards you’re working on aren’t independent enough, you’ll spend a lot of time locking the data structures and waiting for other threads to unlock them, effectively wasting CPU cycles on synchronization activities. ↩︎
7. That tends to be a really hard problem unless you started with a routing protocol specification and an architecture that considered scalability. I’m also positive that anyone taking a monolithic routing daemon and implementing multi-threading on top of that code would get some really nice bugs on the first try. ↩︎
8. We’ll focus on BGP; most other routing protocols are trivial (performance-wise) compared to what we’re throwing at BGP. For example, you don’t have to build outgoing updates in OSPF or IS-IS; all you have to do is flood what came in. ↩︎
9. There’s probably a large gap between theory and practice. ↩︎
10. … even though some data center pundits think BGP is the answer regardless of what the question is. ↩︎