Multi-Threaded Routing Daemons

When I wrote the Why Does Internet Keep Breaking? blog post a few weeks ago, I claimed that FRR still uses single-threaded routing daemons (after a too-cursory read of their documentation).

Donald Sharp and Quentin Young politely told me I was an idiot I should get my facts straight, I removed the offending part of the blog post, promised to write another one going into the details, and Quentin improved the documentation in the meantime, so here we are…

Why Does It Matter?

In a word1: sanity, performance and responsiveness.

Networking engineers love to build artisanal wheels, and routing protocol designers are no better. Every routing protocol has a bespoke implementation of the same three major functionalities:

  • Deal with neighbors: discover them, keep them happy, and figure out when one of them keeps quiet for too long.
  • Deal with updates: receive them, acknowledge them (when the protocol designer thought he could do better than TCP), send new information out, and retransmit it if needed (yet again, only for people who think TCP sucks)2
  • Deal with changes: Update internal topology information based on received updates, calculate new routing tables, push new stuff into routing table to compete with other stuff.

Does it make sense to do all three things in a single monolithic blob of code? Sure it does if you think juggling is a great pastime. Everyone else tries to stay sane by decoupling things into smaller bits that can be executed independently, and according to JeffT (see comment below) most modern routing protocol stacks are implemented that way3.

For the sake of completeness: Cisco IOS programmers figured that out decades ago. For example, Cisco IOS OSPF implementation uses two processes per routing protocol: a hello process and a router process:

Multiple OSPF-related processes are running on a Cisco IOS box with a single OSPF routing process
s1#show ip protocols summary
Index Process Name
0     connected
1     static
2     application
3     ospf 1

s1#show processes | include OSPF
 224 Mwe  2D2CE17           23        285      80 8956/12000  0 OSPF-1 Router
 228 Mwe  2D30EB5           15        288      52 9044/12000  0 OSPF-1 Hello

Processes or Threads?

The Cisco IOS printout talks about processes, FRR documentation talks about threads. What’s the difference?

Here’s a sweet-and-short answer I found on (where else) StackOverflow:

The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.

From that perspective, Cisco IOS processes are really threads as Cisco IOS does not have inter-process isolation4.

The only correct answer to “when should one use threads instead of processes” is “it depends”, or we could keep going for hours. To make a long story short: whenever independent bits of code share sockets (including TCP sessions) or memory structures, it’s easier to say “who cares about memory isolation” and use threads.

Responsiveness?

Now that I mentioned Cisco IOS, I have to add another bit of trivia: Cisco IOS is a non-preemptive (or run-to-completion) operating system. As long as one process keeps running, nobody else can jump in to get something done (like sending a badly needed HELLO message)5.

Modern operating systems like Linux can do better. Processes or threads can be interrupted or ran in parallel on multiple CPU cores or sockets, which means that a keeping neighbors happy thread can keep sending HELLO messages or BGP keepalives while the updating the BGP table thread scratches its head trying to figure out what to do with another half a million updates that just came in.

The benefits of being able to jump it at any time and get a small job done are even higher once we start talking about almost-real-time stuff like BFD… and yes, I’m aware that high-end platforms offloaded BFD to linecard CPUs or hardware. Replace BFD with any other time-critical control-plane protocol of your choice.

Coming back to FRR: according to process architecture documentation, they split BGP functionality into three threads: an I/O thread, a keepalive processing thread, and a rest of the stuff thread.

Other BGP implementations use even more threads, see for example IOS XR BGP threads.

Performance!

I mentioned that you could use threads to increase performance when you happen to have too many CPU cores. For example, you could have multiple threads processing incoming BGP updates in parallel, and another bunch of threads building outgoing updates.

Whenever you want to increase the performance of a software solution with a scale-out threading architecture you have to split the problem you’re facing into smaller (hopefully independent6) shards7. There are at least three solutions real-life BGP routing daemons use8:

  • Per-neighbor thread handling all neighbor-related I/O operations. RustyBGP took that approach resulting in phenomenal performance in an environment with many neighbors (as compared to other open-source BGP stacks). IOS XR Distributed BGP goes a bit further, performing as much work as feasible (down to MED comparison) within the speaker threads.
  • Per address family thread. BGP tables of individual address families are totally independent from each other apart from next-hop references between VPN address families (VPNv4/VPNv6/EVPN) and IPv4 unicast address family. Having a thread per address family is thus a (conceptual) no-brainer9. Cisco IOS XR uses this approach in their Distributed BGP implementation.
  • RIB sharding. Numerous threads are run in parallel on smaller chunks of BGP RIB. Junos release 19.4R1 introduced RIB sharing together with update threading (packing outgoing BGP updates in parallel threads). To learn more, read the Deploying BGP RIB Sharding and Update Threading Day One book – chapter 1 does a great job of explaining the concepts.

Interestingly, it looks like the scale-out BGP daemons were implemented primarily in high-end routers used to run the Internet core, but not in data center switches10.

There might be no need for high BGP performance in data center switches considering the forwarding table sizes in merchant silicon ASICs… although I do wonder how long it takes to bring up a new BGP session in large-scale EVPN deployments considering how many vendors insist on running BGP sessions with EVPN address family between loopback interfaces.

Another reason could be the underlying hardware – I have a feeling that the data center switches still get the cheapest reasonable CPU the vendor can buy, in which case it would make no sense to optimize a routing daemon for many-core performance.

Revision History

2021-11-26
Totally rewrote the Performance section
2021-11-24
  • Added a pointer to thread-per-neighbor RustyBGP implementation (HT: Kristian Larsson)
  • Added a bit of a discussion on the viability of non-multi-threaded routing protocol implementations (based on comment from JeffT and the Facebook paper mentioned by Mario).
  • Added a “BFD can be offloaded” remark for Twitter pedants who found this blog post outdated because of that omission.
  • Reworded the last paragraph a bit because it painted a pessimistic view of multi-thread/multi-code. Can’t do more than what I did; sometimes the reality isn’t bright and shiny.

  1. Three to be precise, or is it four, but who’s counting. ↩︎

  2. The second half of this functionality might be tightly coupled with the next bullet, in which case we’re talking about a distance vector protocol. ↩︎

  3. … or not. Facebook published a conference paper in early 2021 in which they proudly compared their multi-threaded custom-built BGP implementation with single-threaded Bird or Quagga. Bird is (AFAIK) still one of the most popular IXP route server implementation. ↩︎

  4. I’m positive generations of TAC engineers and software developers loved the fun-to-squash bugs you get that way. ↩︎

  5. And if the offending process doesn’t give up in a reasonable time, you get the much-admired CPUHOG syslog message. ↩︎

  6. If the shards you’re working on aren’t independent enough, you’ll spend a lot of time locking the data structures and waiting for other threads to unlock them, effectively wasting CPU cycles on synchronization activities. ↩︎

  7. That tends to be a really hard problem unless you started with a routing protocol specifications and an architecture that considered scalability. I’m also positive that anyone taking a monolithic routing daemon and implementing multi-threading on top of that code would get some really nice bugs on the first try. ↩︎

  8. We’ll focus on BGP; most other routing protocols are trivial (performance-wise) compared to what we’re throwing at BGP. For example, you don’t have to build outgoing updates in OSPF or IS-IS, all you have to do is to flood what came in. ↩︎

  9. There’s probably a large gap between theory and practice. ↩︎

  10. … even though some data center pundits think BGP is the answer regardless of what the question is. ↩︎

5 comments:

  1. Junos used to spawn a dedicated thread only if the precision-timers knob (sub-15ms hold time) was applied, now I think it is baked in by default. Design sanity is good but I think it's still a trivial job to choke keepalives..on any vendor platform not properly protected. If there occurs a mega-failure someday it will probably be related to this. Even the processes meant to protect systems (policers) can also smother it.

    Replies
    1. Yeah, that's another huge can of worms. Unless you can do policing per interface you're always open to a nasty DoS attack.

      Years ago it was trivial to kill box-wide ARP handling on a GRS - policer would kick in (protecting the CPU), but nobody would get their ARP replies because most requests were dropped, eventually resulting in loss of service.

  2. Ivan - all modern routing protocols implementations are multi-threaded, with a minimum separation of adj handeling, route calculations and update generation. Note - writing multi-threaded code for complex tasks is a non trivial exercise (you could search for thread safety and similar artifacts and what happens when not implemented correctly). Moving to a multi-threaded code in early 2010s resulted in a multi-release (year) effort and 100s of related bugs all around. FYI non preemptive is usually called “run to completion”

    Replies
    1. Thanks for the feedback - will add to the article. Would you happen to be aware of scale-out implementations (example: multiple threads computing updates in parallel)? It would be nice to add a few examples in that category.

    2. Imho, multi-threading where you divide your workload in a fixed number of threads doesn't count. That's relatively trivial. E.g. a hello thread, an update thread and a route-computation thread. That's still O(1) scalability. To be able to brag, your code should be able to used a large number of cores on your route-processor.

      That being said, for link-state IGPs, it does not make sense to go beyond a fixed number of threads. E.g. as we mentioned, the hello, the update and the spf thread. Maybe a route-installation thread too. But that's about it. You don't need more.

      BGP is a whole different story. That's where the challenge is.

      Another thing to consider is how router OS's deal with multiple VRFs. Suppose you have a 1000 VRFs on a PE, and each VRF runs a routing protocol with a CE. What are you going to do? Spawn a 1000 processes? That doesn't scale really. Have one process per routing protocol, with 1 thread per VRF? Or are you going to use worker-threads? These are the harder questions.

      I have no idea how my current employer's BGP implementations are. Sorry. But I can tell you that my previous employer has a BGP implementation that does do "scale-out multi-threading" in BGP. Their CPM (route-processor) has 10 cores. Their BGP will use 6 cores for route-generation. That's very nice when you are a route-reflector. Alas other parts of their BGP code are still single threaded. Far from perfect.

      I am surprised about the lack of true improvement BGP implementations have made in the last 20 years. I mean architectual and performance wise. A lot of work has gone into the protocol (writing RFCs). But not in the implementations themselves, it seems. I guess it is easer to write drafts than to write code. As far as I know, there is no BGP implementation that does everything multi-threaded at scale. Reading from sockets, doing ingress policy, bestpath-computation, route-installation, egress-policy, generating output updates. It should be possible to do all of that on multiple cores, in every stage. Some stages require locking, or must be single-threaded. E.g. installing new routes in the Adj-RIB-In. But other things (policy, bestpath computation, rib-install, update-generation) you can do on many cores in parallel.

      I wonder why nobody has attempted to write such a "perfect" implementation yet. And I wonder why nobody has asked for one. Maybe current implementations are deemed "good enough"?

    3. I agree Henk, I'm disappointed that there aren't more scalable solutions. I want BGP daemons to catch up to modern databases. How do we get Network router vendors to think the same way that current database people do, they should not be held back by hardware and they should take advantage of hardware. At what point do I take an in memory no-sql database and hook on a simple BGP protocol parser?

      I wonder how often we can't even consider new architectures because we assume we have software from 1987.

      Also, I want data more than I want a discussion in English about how best to break it up. I don't want to dismiss the good work that the FRR team is doing, but they still have a long way to go to take advantage of modern hardware. https://elegantnetwork.github.io/posts/bgp-perf5-1000-internet-neighbors/. I want the BGP industry to catch up to database technology.

      While we are at it, I also want protocol stacks made in memory safe languages like Rust and not c.

      Don't get me started about how I want them to change the way that they test software. (I know, I know, I've heard that Arista has modern approaches to testing.)

  3. I found this paper released from Facebook fairly interesting. They do mention the creation of their own standards based BGP agent written in C++. I found it interesting they compared the performance of BIRD/QUAGGA with their own implementation. The paper also mentions Facebook's approach to ASN reuse with BGP confederations, spine pods and hierarchical per POD ipv6 prefix suppression. Pretty interesting approach to yet another BGP use case in the DC.

    An excerpt from their paper:

    "Our implementation employs multiple system threads, such as the peer thread and RIB thread, to leverage the multi-core CPU. The peer thread maintains the BGP state machine for each peer and handles parsing, serializing, sending, and receiving BGP messages over TCP sockets."

    https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf

    Replies
    1. Thanks for the link. Looks like they went down the same path as FRR - splitting the routing protocol functionality into independent threads along the lines of the comment by JeffT.

      Couldn't figure out from the article whether they implemented anything beyond that, for example a scale-out architecture with parallel threads handling (for example) outbound updates.

  4. Modern OSes and hypervisor can have fractional virtual CPUs. Your mentioned limitations on one thread per CPU core is a kind of problem slowly fading away. So when there is I/O blocking another thread can be scheduled on the same CPU core in modern systems. With real processes you do not have this problem for ages, since multiple processes could be easily scheduled in a single CPU.

    However, process isolation is not done as good on Unix/Linux as on VMS or ESA. So they had to invent containers. I did not need such tricks on my VAX/VMS already in 1987. I had proper isolation and full resource allocation control. You could also have hard real-time systems on VAXELN. I also used QNX with nicely isolated, robust processes already in the 80s. It was a pity when Cisco dropped QNX from IOS XR...

    IBM has had virtual fractional CPU cores already for decades, but people are usually not willing to pay for good quality engineering... :-)

  5. Hi Ivan, re process vs thread, the easiest (but still technically correct) way to think about them, is that process is the environment (address space, register set etc) where a collection of worker threads do work. A process is physically represented by PCB, while a thread, by a TCB object. So a one-thread process is still technically not a thread.

    Juniper implements all of their routing protocols inside RPD, and while it's clear they do this for performance reason as well as the fact that in Junos, there's no such thing as interaction between protocols say in redistribution, but all protocols interact only with the RIB, by housing many threads doing quite a diversity of stuff (OSPF, ISIS, BGP...) into one process, they risk a failure of one thread bringing down everything else since they all share the same process environment, esp. when protocols get more and more complex with new features being introduced.

    Wrt to their BGP RIB sharding document, while it's obviously great, it also has some downsides worth mentioning, mostly due to multi-threading overhead. The RIB to FIB ratio of 4:1 or higher for good performance means it only works great when the number of multipaths are very large.

    But the FIB download time is most crucial. Again this problem has been known for over 10 yrs; it's the FIB download and installation time that's the biggest bottleneck in modern routers, not the control-plane side of thing. And this problem gets exponentially worse the higher the number of routes one has in the RIB. In short, according to Juniper, RIB sharding may or may not help with FIB download time. I suppose there's only so much maths/algorithmics can do in the face of physical constraints.

    Re classic IOS, it's true it's non-preemptive, again for performance reason as memory and CPU were scarce 35 yrs ago, and context switch very expensive. But it does partially compensate for it by implementing per-process watchdog timer as some coarse-grained form of quantum/time slice, to guard against runaway processes. After two watchdog quantums expire, IOS scheduler terminates a process and brings another one in. Two seconds would be an eternity now, but not too terrible back in the day.

    IOS also has a crude form of protecting against memory corruption by erroneous process by implementing gaps between memory regions. The show region command can display the regions and their gaps. If a problematic process starts writing garbage into memory, it's forced to stop when encountering a gap. This trick can be useful even today, but of course IOS-XR implementing microkernel and multithreading is overall a much better architecture than the monolithic IOS, cleaner for one.

    Henk's point on the lack of improvement in BGP implementation in the last 20 yrs is very much worth paying attention to, and his remark "it is easer to write drafts than to write code" is spot-on. Could it be that due to the explosion of the code base, now in the hundred thousand lines of code, it's simply led to architectural dead end due to complexity and therefore, too hard to convert this code base into a multithreading equivalent?

    Among the biggest issues of multithreading are synchronization and inter-dependency, and this gets much harder to solve as the code gets more and more complex. Inter-thread synchronization overhead and OS-scheduler inefficiency are the main reasons why as we start to add more core, performance will hit a peak and then reverse as more cores are added. In fact, Juniper's RIB sharding touches on this topic as well.

    So not only do we need better implementation of protocols, don't forget the centralized (again, centralization doesn't scale) OS scheduler will be one of the biggest, if not the biggest bottleneck, as you have more and more cores at your disposal. This problem is exactly the same one plaguing router's crossbar fabric, as the central scheduler hits its limit when interface speed improves by leaps and bounds.

    And don't forget the compiler. Just because CPU vendors come up with more cores, doesn't mean they can come up with a superb compiler that can generate codes that take advantage of the cores. The failure of VLIW/EPIC Itanium is a glaring example; certain things only work in PPT. When it comes to massive parallelism, we can't omit any factor as they're not isolated, but interplay into complex outcomes.

    And Bela's fractional virtual CPUs just don't scale. That's why VM vendors like VMware highly recommend matching the physics, that is having congruent vCPU and pCPU topologies. I don't know anything about VAX scheduler to have a comment, but given 50 yrs of scheduling research has so far failed solve the problem, I won't hold my breath on VAX (or OS for that matter) weaving magic.

    In a word, imho, don't expect any significant improvement in quality of BGP implementations anytime soon. Plus pay more attention to the FIB download and insertion bottleneck. This can be the most painful part of the problem and can get really nasty at the million-route scale or higher.

Add comment
Sidebar