Cache-Based Packet Forwarding

In the previous blog post in this series I described how convoluted routing table lookups could become when you have to deal with numerous layers of indirection (BGP prefix ⇨ BGP next hop ⇨ IGP next hop ⇨ link bundle ⇨ outgoing interface). Modern high-end hardware can deal with the resulting complexity; decades ago we had to use router CPU to do multiple (potentially recursive) lookups in the IP routing table (there was no FIB at that time).

Network devices were always pushed to the bleeding edge of performance, and smart programmers always tried to optimize the CPU-intensive processes. One of the obvious packet forwarding optimizations relied on the fact that within a short timeframe most packets have to be forwarded to a small set of destinations. Welcome to the wonderful world of cache-based forwarding.

The blog post has been updated based on extensive feedback from Henk Smit (see comments).

The first cache-based packet forwarding mechanism I encountered was Cisco’s venerable fast switching (now long dead). The first packet toward an unknown destination would be process switched. The results of the lookups needed to forward the packet (IP routing table + ARP/ND1 table) would be stored in a fast switching cache to be used for subsequent packets. The following information was stored in the fast switching cache (a rough sketch follows the list):

  • IP prefix (using a calculated subnet mask based on an “interesting” algorithm2)
  • IP next hop (used only for cache invalidation)
  • Outgoing interface
  • Complete layer-2 header
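
Here’s a minimal Python sketch of what such a cache entry and lookup could look like. It’s purely illustrative – the field names and data structures are my assumptions, not actual IOS internals – but it shows the key property: a cache hit returns everything needed to rewrite and send the packet, while a miss punts the packet to process switching.

```python
# Illustrative sketch of a fast-switching-style cache (assumed field names,
# not actual IOS code). A hit returns everything needed to rewrite and send
# the packet; a miss punts the packet to the slow path (process switching),
# which then installs a new entry.
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_address, ip_network

@dataclass
class FastCacheEntry:
    prefix: IPv4Network     # destination prefix (mask derived heuristically, see footnote 2)
    next_hop: str           # stored only so the entry can be invalidated later
    out_interface: str      # outgoing interface
    l2_header: bytes        # precomputed layer-2 rewrite (destination MAC and friends)

fast_cache = [
    FastCacheEntry(ip_network("192.0.2.0/24"), "10.0.0.1", "Ethernet0", b"\x00\x11\x22..."),
]

def fast_switch(dst):
    """Return the cache entry covering dst, or None to force process switching."""
    addr = ip_address(dst)
    for entry in fast_cache:
        if addr in entry.prefix:
            return entry
    return None     # miss: the IP Input process forwards the packet and seeds the cache

print(fast_switch("192.0.2.99").out_interface)   # Ethernet0
print(fast_switch("198.51.100.1"))               # None -> process switching
```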

The performance difference between process switching and fast switching was significant. Fast switching was performed within interrupt processing (packet forwarding could interrupt whatever else the router was doing), whereas process switching (as the name implies) was done within the IP Input process, which had to compete for CPU cycles with control-plane processes. To make matters worse, Cisco IOS used non-preemptive scheduling, leading to hilarious results: transit packets had to wait for the goodwill of other processes like an OSPF SPF run or a BGP table scan to release the CPU before they could be processed3.
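
To illustrate the scheduling aspect, here’s a rough analogy in Python (nothing like actual IOS code): a control-plane task runs until it voluntarily yields the CPU, and process-switched packets sit in the queue until it does.

```python
# Rough analogy for non-preemptive (cooperative) scheduling (not IOS code).
# The control-plane task decides when to yield; only then does "IP Input"
# get a chance to drain the queue of process-switched packets.
from collections import deque

packet_queue = deque(f"packet-{i}" for i in range(5))

def spf_run(nodes=10_000, yield_every=2_000):
    """Pretend SPF computation written as a generator that periodically yields."""
    for n in range(nodes):
        # ... per-node shortest-path work would go here ...
        if (n + 1) % yield_every == 0:
            yield n + 1                  # voluntary process_yield(): scheduler may run

def scheduler(task):
    for progress in task:                # one iteration == one run-until-yield slice
        drained = len(packet_queue)
        packet_queue.clear()             # "IP Input" finally gets the CPU
        print(f"SPF at {progress} nodes, drained {drained} waiting packets")

scheduler(spf_run())
```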

Fast switching implementations had to deal with the usual caching headaches:

  • Incomplete cached information – in an ECMP scenario, only a single path toward a destination would be cached, resulting in suboptimal load balancing4.
  • Cache invalidation – it was pretty easy to invalidate fast switching entries whenever a directly-connected next hop disappeared, or whenever a prefix in the routing table changed. Figuring out which entries to invalidate when one of the interim steps in a recursive routing table lookup changes must have been an interesting challenge5 (see the sketch right after this list).
  • Cache thrashing – when the number of active destinations exceeded the size of the cache, most packets became process switched, resulting in a significant performance drop.
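
To make the invalidation headache a bit more tangible, here’s a tiny sketch (the data layout is my assumption, not how fast switching actually stored things): every cached entry bakes in the final result of a recursive resolution, so when an intermediate step changes you either need reverse-dependency bookkeeping or you give up and flush everything.

```python
# Sketch of the invalidation dilemma (data layout is an assumption, not IOS
# internals): each cached entry bakes in the result of a recursive resolution.
forwarding_cache = {
    # destination prefix -> (BGP next hop it resolved through, interface, L2 rewrite)
    "198.51.100.0/24": ("192.0.2.1", "Serial0", b"\x00\x11..."),
    "203.0.113.0/24":  ("192.0.2.1", "Serial0", b"\x00\x11..."),
    "192.0.2.128/25":  ("192.0.2.9", "Serial1", b"\x00\x22..."),
}

def invalidate(changed_next_hop):
    # Precise invalidation needs a reverse map "next hop -> dependent prefixes"...
    stale = [p for p, (nh, _, _) in forwarding_cache.items() if nh == changed_next_hop]
    for prefix in stale:
        del forwarding_cache[prefix]
    # ...and without that bookkeeping the only safe fallback is a full flush
    # (which, per footnote 5, is probably what really happened):
    # forwarding_cache.clear()

invalidate("192.0.2.1")    # drops the two entries resolved through that next hop
```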

To make matters worse, Cisco implemented fast switching in hardware (autonomous switching), resulting in an orders-of-magnitude performance drop when cache thrashing forced a router to fall back to process switching.

Fast switching worked pretty well on low-end edge routers and failed miserably in high-end (Internet) core routers, the last nail in its coffin being the ubiquitous port scans. A single hacker scanning the Internet at a reasonably low scan rate could bring a high-end router to its knees by polluting its fast switching cache.
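
If you want to convince yourself how little it takes, here’s a toy simulation (all parameters are made up) of an LRU forwarding cache serving a small set of hot destinations, with and without a slow scan of random addresses in the background. The scan churns the cache and drags most of the regular traffic back onto the slow path.

```python
# Toy simulation of forwarding-cache pollution by an address scan. All numbers
# (cache size, hot set, scan rate) are made-up illustrations, not measurements.
import random
from collections import OrderedDict

def run(scan_prob, cache_size=1000, packets=200_000):
    cache = OrderedDict()                            # destination -> entry, LRU order
    hot = [f"10.0.{i // 256}.{i % 256}" for i in range(800)]
    hits = total = 0
    rnd = random.Random(1)
    for _ in range(packets):
        dsts = [rnd.choice(hot)]                     # regular traffic
        if rnd.random() < scan_prob:                 # plus an occasional scan packet
            dsts.append(f"198.51.{rnd.randrange(256)}.{rnd.randrange(256)}")
        for dst in dsts:
            total += 1
            if dst in cache:
                cache.move_to_end(dst)               # hit: fast path
                hits += 1
            else:
                cache[dst] = None                    # miss: slow path builds the entry
                if len(cache) > cache_size:
                    cache.popitem(last=False)        # evict least-recently-used entry
    return hits / total

print(f"no scan:   {run(0.0):.1%} cache hit rate")
print(f"with scan: {run(1.0):.1%} cache hit rate")
```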

The Internet core experienced numerous regional brownouts in those days until Cisco changed the packet forwarding paradigm and rolled out Cisco Express Forwarding (CEF) – a packet forwarding mechanism that relies on a fully evaluated forwarding table6 instead of a forwarding cache.
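
The conceptual difference is easy to show: instead of populating a cache on demand, you pre-resolve every prefix in the routing table (recursion, ARP and all) into a forwarding table and do a longest-prefix-match lookup on every packet. A minimal sketch, using a linear scan where a real FIB obviously uses a much smarter lookup structure:

```python
# Minimal sketch of a fully evaluated forwarding table (the CEF idea, not the
# CEF implementation). Every prefix is installed up front with its recursion
# already flattened, so any destination gets a longest-prefix-match answer
# without a cache miss/punt path.
from ipaddress import ip_address, ip_network

fib = {
    ip_network("0.0.0.0/0"):   ("eth0", "192.0.2.1"),
    ip_network("10.1.0.0/16"): ("eth1", "192.0.2.9"),
    ip_network("10.1.1.0/24"): ("eth2", "192.0.2.17"),
}

def lookup(dst):
    """Longest-prefix match over the full table."""
    addr = ip_address(dst)
    matches = [(net, info) for net, info in fib.items() if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("10.1.9.9"))   # ('eth1', '192.0.2.9')
print(lookup("10.1.1.1"))   # ('eth2', '192.0.2.17') – no way to get this wrong with LPM
```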

The laws of physics haven’t changed in the interim7 – whenever you encounter a cache-based forwarding scheme (conversation learning, original LISP, Cisco SDA…) you can kill its performance with a simple address scanning tool, or as I said in the past: “cache-based forwarding never scaled.”

There’s an obvious exception to that rule: the forwarding cache might be large enough to hold all potential destinations, and the number of changes might be so small that cache maintenance doesn’t overload the CPU… but then any cache-based forwarding scheme becomes nothing but a steaming pile of unnecessary complexity.

But wait, it gets worse. Every few years someone gets the awesome idea of caching individual TCP/UDP flows. What could possibly go wrong?

Another Data Point for RFC 1925 Rule 11

When I was updating this blog post to ensure everyone understands I’m referring to the original LISP ideas, I remembered an extensive comment Victor Moreno wrote in 2017 on my Why Is Cisco Pushing LISP in Enterprise Campus? blog post.

The impact of mobility events in a LISP network (as you know from past reviews published in your blog) is limited to signaling amongst the network elements involved in active connections between the devices. However, the impact of mobility events in a BGP network is unbound. Even if you have conditional FIB programming, all changes are pushed to all participants.

And also…

I happened to be in the process of posting a document that describes a wealth of other functionality that is possible by the simple principle of the demand based control plane and a discussion on why this is best realized with a demand protocol.

Please note that the demand-based control plane is a nice euphemism for cache-based forwarding. Now compare what Victor wrote in 2017 to what Bela wrote in 2021:

LISP is a more complex animal nowadays. Nowadays it is used with reliable transport and full PubSub. There is no caching behavior at all. Each xTR has a full routing table.

I would say that the evolution of LISP proves (A) the point of this blog post as well as (B) RFC 1925 rule 11.

More Details

  • The Inside Cisco IOS Software Architecture book has an excellent overview of process switching, fast switching, optimum switching, and autonomous switching. It was published in 2000 and hasn’t been updated since, but you could still browse it for a few days with a trial Safari subscription.
  • You’ll find an extensive discussion of various packet forwarding mechanisms in the Switching, Routing and Bridging part of How Networks Really Work webinar.

Revision History

2022-02-28:
Reworded a number of things and fixed the inaccuracies based on extensive feedback by Henk Smit, including:
  • Changed “walking the routing table” into “multiple lookups in the IP routing table”
  • Fixed the “packets waiting for SPF run” part
  • Didn’t touch the fast switching details – checked with Henk, and we agreed that our memories are too hazy for a definitive answer. Have to find someone who remembers how things were done.
2022-02-25:
Clarified that I had the original LISP ideas in mind, and added a paragraph about evolution of LISP proving RFC 1925 Rule 11.

  1. Contrary to Henk’s comment, it seems IPv6 had a fast switching implementation. I even found a document explaining IPv6 multicast process- and fast switching on Catalyst switches. That must have been ultra-fast ;) ↩︎

  2. Fast switching did not use the usual tree-based data structure that would allow it to deal with variable-length subnet masks. The details are best forgotten; a few of them are documented in Inside Cisco IOS Software Architecture book. It might even be that the early versions of fast switching created host entries (not prefixes) in the fast switching cache. Will try to find someone whose memory is better than mine. ↩︎

  3. A failure to release CPU in a timely manner would result in much-beloved CPUHOG syslog messages. Not that you could do anything if one kept popping up; they were just an indication of suboptimal code. ↩︎

  4. I could write a whole blog post on the intricacies of load balancing with fast switching, but it makes no sense to waste everyone’s time on the details of an obsolete implementation. ↩︎

  5. I have a weird feeling that they simply flushed the whole cache when they couldn’t decide what to do, but have zero evidence for that claim. See also comments by Henk Smit. ↩︎

  6. A limited amount of router memory is the only reason I can think of that made Cisco go down the cache-based forwarding rabbit trail. Many routers didn’t have enough memory to store two or three copies of routing information from the Internet default-free zone (BGP table, IP routing table, CEF table). Henk Smit disagrees with me (see comments), but I still remember routers that would get kicked out of CEF switching due to memory shortage and subsequently blackhole all MPLS services that relied on CEF switching. ↩︎

  7. That might surprise believers in the magical powers of unicorn dust and PowerPoint slides. I’m pretty sure the regular readers of this blog are immune against that hallucination. ↩︎


8 comments:

  1. This is how OpenVSwitch (https://blog.ipspace.net/2013/04/open-vswitch-under-hood.html) works, too: A first packet triggers a lookup and evaluation of ACLs, then the flow entry is cached for X seconds. ECMP forwarding is preserved because there is a 'multipath' action that is evaluated for each packet; I know because I personally fixed the hashing (https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/359489.html)

    This implementation is not quite so "obsolete" as one might think; for example, 1000s of Nuage VRS/NSGs use it as we speak

  2. LISP is a more complex animal nowadays. Your knowledge of LISP seems to be quite outdated.

    Nowadays it is used with reliable transport and full PubSub. There is no caching behavior at all. Each xTR has a full routing table.

    Replies
    1. Thanks for the comment. Updated the blog post accordingly.

  3. In general, I support the main point of the blog post: cache based forwarding is a problematic scalability mechanism, i.e., scalability is limited and performance collapses beyond the limit.

    Topology based forwarding has scalability limits, too. Once there are more routes than can fit the (hardware) forwarding tables, networking devices crash, fall back to CPU based forwarding (i.e., performance collapses), or deliver some packets to the wrong destination (I've seen all of those…).

    Firewalls, i.e., devices where network topology is just a small part of the forwarding decision, often employ flow caches (for individual data flows) for performance optimization. In some "high-end" firewalls this flow cache is offloaded to hardware. Since per flow offload is often the best such a device can use, it is done in practice, and usually works well enough. [A stateless packet filter in front of a stateful firewall can help staying inside the performance envelope of said firewall.]

    Regarding networking devices primarily used for routing and bridging, the Enterasys Networks N-Series and their successors, K-Series and S-Series, based on the CoreFlow and CoreFlow2 ASICs respectively, used caching of individual flows in hardware forwarding tables as their only forwarding architecture. In practice, those switches worked well in the "enterprise" networks they were designed for. Of course it was possible to create overload with specific tests to demonstrate the potential for problems.

    Since per flow hardware offload allowed implementing complex, but still high performance, traffic filtering policies, replacing CoreFlow(2) based devices with networking devices using topology based forwarding often provided challenges regarding traffic filtering policies (i.e., either keep complex line-rate traffic filters, or use a different networking device with faster interfaces and topology based forwarding, but not both).

    I provide the above examples to illustrate the complexities involved. In theory, cache based forwarding does not scale. In practice, it may scale sufficiently, at least for a while (e.g., a decade inside an "enterprise" network before deploying the next device generation). It definitely breaks down when overloaded.

  4. > we had to use router CPU to walk

    I think you are mixing up two things. The forwarding entries had a simple combination of: (destination, outgoing interface, mac-address (actually encaps header)). So during packet-forwarding, no walk needed to be done. Just a simple lookup. The "walking" you are referring to happened during "routing table maintenance". E.g. BGP walking its NLRI entries, to check if the nexthop still had the best IGP-metric, and whether nexthops were still valid. This was not done during route-lookup.

    Two-stage lookups during forwarding were introduced later, because of PIC. This allows the RIB to send 1 message to the FIB, and the FIB makes 1 adjustment, impacting many (possibly tens or hundreds of thousands) of routes. The trade-off here is: slightly more complex lookup (2 stages) versus much quicker updating of the FIB. Note, FIBs in hardware can do in the order of 10K+ changes per second. Not very fast.

    Replies
    1. Bad wording. Should have said "multiple lookups in the IP routing table". Fixed.

  5. > IP prefix (using a calculated subnet mask based on an “interesting” algorithm.

    Are you sure it was a prefix? That is not what I remember. (Note, I have never worked on forwarding, I probably don't know what I am talking about. And I am not speaking on behalf of previous, current or future employers).

    1) I remember doing "show ip fast-switching" or some sort command. Seeing destination ip-addresses. Not prefixes. It's been almost 30 years, so maybe I remember it wrong.

    2) When I heard about OpenFlow doing on-demand caching, I was already wondering how they would scale that. Because with on-demand caching, you can't do summarization. Well, I wouldn't know how to do that. Let's take an example:

    A router has 3 routes: 0/0 -> eth0, 10.1/16 -> eth1 and 10.1.1/24 -> eth2. Suppose a packet arrives for 10.1.9.9. A cache lookup happens -> no entry. So the packet is sent to the controller (aka process-switched). 10.1/16 -> eth1 matches. So an entry for 10.1/16->eth1 is created in the forwarding cache. Next a packet for 10.1.1.1 arrives. The box does a lookup in the cache, 10.1/16->eth1 matches. The packet is forwarded out eth1. That is wrong. I have no idea how to fix this, unless you a) do caching per ip-address instead of per prefix, or b) prepopulate the cache (aka FIB-switching). If someone can solve this problem, please let us know.

    > packets had to wait for things like OSPF SPF run or BGP table scan to complete before they could be processed

    Utter nonsense. Sorry. The IOS scheduler would indeed not interrupt processes. But that didn't mean processes would keep running forever. On the contrary. One of the arts of writing router-software was to cut your task up into smaller steps, and do so called "process_yield()"s between every step. The idea was that a process could run for a few milliseconds, and then voluntarily release the CPU. Next time that process was scheduled (e.g. after packets were forwarded or someone did a show-command), the process would continue with the SPF or table-walk where it left off last time. If a process didn't give up the CPU, a CPUHOG warning was displayed. And the IOS-programmer could introduce more sched_yield()s in her/his code.

  6. > ECMP scenario, only a single path would be cached

    Not true. A single path for a particular IP-address would be cached. Very similar to today's hashing-function. The only difference is that in fast-switching, the port-number was not used in the hashing. What happened is that when a destination-address was encountered that was not in the cache yet, the packet was process-switched. And in the RIB, when a destination had multiple next-hops, the entry had a pointer to the last next-hop used. So process-switching used the next next-hop to install the fast-switch cache-entry. That is why it was said: process-switching load-balances per packet, fast-switching load-balances per ip-address. I remember giving advice to TAC-customers to disable fast-switching when they wanted per-packet load-balancing over 2 slow lines (e.g. 2 ISDN channels).

    > Figuring out what entries one needs to invalidate must have been an interesting challenge

    I think this problem was never solved. Cisco tried a few approaches. And then realized it wasn't possible. I think this process lasted about a year (or less) (somewhere in 1995). In the end, the only valid solution was to invalidate the whole cache on any change to the routing table.

    Cisco then started with FIB-switching (which later was called CEF).

    > they simply flushed the whole cache when they couldn’t decide what to do

    That is what I remember too.

    > ARP/ND table

    I think fast-switching was dead before IPv6 took off. I think all cisco platforms had switched to FIB-switching very quickly. The CEF project started spring 1996. I think all platforms, including low-end, had switched to CEF in 2, maybe 3 years. (Someone else might remember better than me). Nobody was seriously using IPv6 in 1998. I doubt fast-switching ever supported IPv6.

    > Internet core experienced numerous regional brownouts in those days until Cisco managed to get its act together

    This makes it look like cisco was clueless. May I remind you that in those years (1994-1997) there was the biggest explosion of growth of the Internet. And that besides cisco there were zero companies that could build routing software that came even close to working properly. It wasn't until 1998-1999 that another company started selling routers that were deployable (and that software was mostly written by ex-cisco programmers).

    > exception .... cache might be large enough to hold all potential destinations

    No, even then caching doesn't work. Not if the cache-entries change once in a while. The problem is not as much the size of the cache, it is the maintaining of the cache.

    An old joke about programming: "There are only two hard things in Computer Science: cache invalidation and naming things". (Quote by Phil Karlton, according to the Internet). Fast-switching is another proof of this.

    > Limited memory was only reason Cisco went down the cache-based forwarding

    I don't think so. I think the reason was that fast-forwarding was done during an interrupt-handler. And thus had to be quick. A simple hash-table was used as cache. If fast-switching had been using the RIB, the locking (or actually disabling interrupts) when changing the RIB would have been a nightmare. And as lookup-entries were a combination of recursive-lookup, arp-lookup and pre-computed encaps-header, why not put that in a separate table?

  7. > This implementation is not quite so "obsolete" as one might think; for example, 1000s of Nuage VRS/NSGs use it as we speak

    That is because you are talking about forwarding on a host. Not on a router. With current CPU and RAM, almost any solution works to solve a toy problem on a toy device in a toy network.

    > Firewalls, i.e., devices where network topology is just a small part of the forwarding decision, often employ flow caches (for individual data flows) for performance optimization.

    As you (Erik) note, firewalls are different from routers. Because 1) a firewall typically has fewer flows through it than a (core) router. 2) A typical firewall has only 2 interfaces, "inside" and "outside". And those interfaces don't change often. The challenges of a firewall are totally different than the challenges of a (core) router.

    > In practice, it may scale sufficiently

    Again, when solving toy problems on toy devices in toy networks, almost anything will work. :)

  8. The future opportunity is to use selective subscription in LISP. Then you have full control of the destinations that are interesting to you. So you can still reduce the memory needs, you do not need to have a full routing table everywhere, but you will not suffer from the generic caching algorithm issues. Your LISP map-cache will be under your full control.

    A combination of PubSub with selective subscription might be an interesting solution for large-scale mobility support, especially with active multi-link moving networks. We are currently studying these options...

    Unfortunately, selective subscription implementation is lagging behind. But it might come soon...
