Living with Small Forwarding Tables
A friend of mine working for a mid-sized networking vendor sent me an intriguing question:
We have a product using an old ASIC that has 12K forwarding entries, and would like to extend its lifetime. I know you were mentioning some useful tricks, would you happen to remember what they were?
This challenge has no perfect solution, but there are at least three tricks I’ve encountered so far (as always, comments are most welcome):
- Conversational learning
- Virtual aggregation
- Selective route download
Conversational learning is what you use when you failed to learn the packet forwarding history lessons:
- Build a forwarding table (Forwarding Information Base – FIB) in software
- Start with an empty hardware FIB, and punt all forwarded packets to the CPU
- Whenever a new packet arrives at the CPU, find the corresponding forwarding entry in the software FIB and install it in the hardware FIB
Congratulations, you reinvented cache-based forwarding, and you’ll have to deal with cache coherence, cache aging and eviction, and you’ll fail miserably when someone starts scanning the address space. It’s also a bit hard to implement a default route in hardware FIB. The proof is left as an exercise for the reader.
All that obviously doesn’t stop the networking vendors from trying to reinvent this particular broken wheel whenever their hardware designers mess up (see also: Nexus 7000 F1 linecard), or whenever they have a bit of hardware they’re desperate to sell (see also: SmartSwitches).
Conversational learning or any other cache-based forwarding might work within a small network where the number of potential destinations is comparable to the hardware FIB size. Trying to use the same trick with the Internet Default Free Zone (DFZ) is a recipe for disaster as Cisco discovered ages ago when their fast switching mechanism caused severe brownouts in large ISPs.
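To make the failure mode concrete, here is a minimal Python sketch of cache-based forwarding. Everything in it is hypothetical (the `SoftwareFib` and `HardwareCache` classes, the LRU eviction policy, the next-hop names), and it simplifies the mechanism by caching per-destination host entries instead of prefixes, which sidesteps the more-specific-prefix coherence problem but not the scanning problem.

```python
import ipaddress
from collections import OrderedDict


class SoftwareFib:
    """Full forwarding table kept in software; lookup is longest-prefix match."""

    def __init__(self, routes):
        # routes: {"192.0.2.0/24": "next-hop-name", ...}
        self.routes = {ipaddress.ip_network(p): nh for p, nh in routes.items()}

    def lookup(self, addr):
        ip = ipaddress.ip_address(addr)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


class HardwareCache:
    """Hypothetical small hardware FIB modelled as an LRU cache of host entries."""

    def __init__(self, soft_fib, capacity):
        self.soft_fib = soft_fib
        self.capacity = capacity
        self.entries = OrderedDict()  # destination address -> next hop
        self.punts = 0                # packets that had to go to the CPU
        self.evictions = 0

    def forward(self, addr):
        if addr in self.entries:
            # Cache hit: forwarded in "hardware", refresh the LRU position.
            self.entries.move_to_end(addr)
            return self.entries[addr]
        # Cache miss: punt to the CPU and consult the software FIB.
        self.punts += 1
        next_hop = self.soft_fib.lookup(addr)
        if next_hop is not None:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
                self.evictions += 1
            self.entries[addr] = next_hop
        return next_hop


# An address scan: every packet goes to a new destination.
fib = SoftwareFib({"0.0.0.0/0": "isp-a", "192.0.2.0/24": "isp-b"})
hw = HardwareCache(fib, capacity=4)
for i in range(1, 11):
    hw.forward(f"192.0.2.{i}")
print(hw.punts, hw.evictions)  # 10 6
```

Every packet of the scan is punted to the CPU, and once the cache is full each new destination evicts an entry that legitimate traffic might still need.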
Here’s another idea from the MacGyver & Co:
- Imagine an edge router connected to two ISPs that happens to have a small FIB and a full DFZ BGP table (for whatever crazy reason).
┌────────────┐   ┌────────────┐
│   ISP-A    │   │   ISP-B    │
└────────────┘   └────────────┘
      ▲                ▲
      │                │
      │ ┌────────────┐ │
      └─┤    EDGE    ├─┘
        └────────────┘
- Now assume that one of the ISPs is the transit ISP, and use a default route toward it. Bonus points if the default points to 22.214.171.124 or 126.96.36.199 to cope with the ISP’s bad hair day[1].
- Once you have a more-specific and a less-specific prefix pointing to the same next hop in your routing table, you don’t have to install the more-specific prefix in the hardware FIB[2].
Obviously you’re trading FIB size for convergence time. For example, you cannot use Prefix Independent Convergence. You could also get into a situation where a particular failure scenario explodes the hardware FIB size beyond its capabilities. An example might be the primary ISP losing most of the DFZ BGP table while still announcing the prefix that covers the IP address you use as the next hop of the default route.
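The suppression rule can be sketched in a few lines of Python. This is an illustration under assumptions, not any vendor's implementation: the `suppress_covered` function, the string next-hop names, and the choice to compare each prefix against its closest covering prefix in the routing table are all mine.

```python
import ipaddress


def suppress_covered(rib):
    """Return the routes that still have to be installed in the hardware FIB.

    rib maps prefix strings to next hops. A prefix is skipped when its
    closest covering (less-specific) prefix in rib has the same next hop,
    because the longest-prefix match then yields the same result without it.
    """
    table = {ipaddress.ip_network(p): nh for p, nh in rib.items()}
    fib = {}
    for net, next_hop in table.items():
        parent = None
        # Walk up from the immediate supernet toward the default route.
        for plen in range(net.prefixlen - 1, -1, -1):
            candidate = net.supernet(new_prefix=plen)
            if candidate in table:
                parent = candidate
                break
        if parent is not None and table[parent] == next_hop:
            continue  # covered by a less-specific route with the same next hop
        fib[str(net)] = next_hop
    return fib


rib = {
    "0.0.0.0/0": "isp-a",
    "198.51.100.0/24": "isp-a",   # same next hop as the default: suppressed
    "203.0.113.0/24": "isp-b",
    "203.0.113.128/25": "isp-b",  # same next hop as the covering /24: suppressed
    "203.0.113.0/26": "isp-a",    # differs from the covering /24: kept
}
print(suppress_covered(rib))
```

The failure scenario described above shows up directly in this model: remove the default route (or change its next hop) and every previously suppressed prefix has to be installed at once.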
Selective Route Download
The DFZ BGP table has almost a million entries, but you could safely ignore most of those prefixes. After all, do you really care about clients in Fiji or Madagascar trying to reach your e-commerce server when you don’t even ship to those countries? It turns out that in most cases a few thousand prefixes cover more than 90% of your Internet traffic. Combine that with a reliable default route and you’re done. Now for the tricky question: how do you select those prefixes?
You could rely on a good network design. For example, if you’re peering at an Internet Exchange Point (IXP) and use an upstream ISP, then you don’t care about any prefix not advertised by the IXP peers:
- Set up a default route pointing to the upstream ISP
- Filter out all other prefixes received from the upstream ISP
- Set a limit on the number of prefixes accepted from every IXP peer[3]
- Go have a well-deserved beer[4].
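The design above (minus the beer) could be modelled roughly like this. It is a toy sketch, not router configuration: the `build_routing_table` function, the peer names, and the limit value are hypothetical, and a real router would typically tear down the BGP session of a peer exceeding its limit instead of silently ignoring its routes.

```python
def build_routing_table(upstream_next_hop, ixp_routes_by_peer,
                        max_prefixes_per_peer=1000):
    """Static default toward the upstream ISP plus routes from IXP peers.

    Prefixes received from the upstream ISP are dropped on purpose, which
    is why they do not even appear as an argument.
    """
    table = {"0.0.0.0/0": upstream_next_hop}
    for peer, routes in ixp_routes_by_peer.items():
        if len(routes) > max_prefixes_per_peer:
            # The peer leaked far more than expected: ignore it entirely.
            continue
        # Best-path selection between peers is out of scope for this sketch.
        table.update(routes)
    return table


ixp = {
    "peer-a": {"198.51.100.0/24": "peer-a"},
    "peer-b": {f"10.{i}.0.0/16": "peer-b" for i in range(5)},  # fat fingers
}
print(build_routing_table("isp", ixp, max_prefixes_per_peer=3))
```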
Unfortunately it’s a bit hard to package that idea into a shipping product when you know your customers will try to misuse your software in every imaginable way (and a gazillion others). That’s when it’s time to fall back to traffic monitoring:
- Using the forwarding table and whatever traffic monitoring technology you have available, identify the “hot” prefixes.
- Create a prefix list and use it as a filter between the routing table and the hardware forwarding table.
- Periodically update the prefix list to cope with shifts in traffic patterns.
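The selection step could look something like the following sketch. It assumes you already have per-prefix byte counters from your traffic monitoring tool of choice; the `select_hot_prefixes` function, the 90% coverage target, and the sample counters are illustrative assumptions, while the 12,000-entry cap simply mirrors the ASIC from the original question.

```python
def select_hot_prefixes(bytes_per_prefix, coverage=0.9, max_entries=12000):
    """Pick the busiest prefixes until they carry `coverage` of the traffic
    or the hardware FIB budget (`max_entries`) is exhausted."""
    total = sum(bytes_per_prefix.values())
    selected, covered = [], 0
    ranked = sorted(bytes_per_prefix.items(), key=lambda kv: kv[1], reverse=True)
    for prefix, byte_count in ranked:
        if covered >= coverage * total or len(selected) >= max_entries:
            break
        selected.append(prefix)
        covered += byte_count
    return selected


counters = {
    "192.0.2.0/24": 700,
    "198.51.100.0/24": 250,
    "203.0.113.0/24": 30,
    "203.0.113.128/25": 20,
}
print(select_hot_prefixes(counters))
# ['192.0.2.0/24', '198.51.100.0/24']
```

Everything not on the resulting list follows the default route. Rerunning the selection too eagerly churns the hardware FIB, so a production version would probably want some hysteresis before swapping prefixes in and out.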
The selective route download[5] functionality is available in (at least) Arista EOS, Junos, and Cisco IOS XE. If your favorite box supports it, please leave a comment.
For more details:
- Listen to the SDN Router @ Spotify chat with David Barroso, and follow the related links.
- In a follow-up episode, David described the operational experience (spoiler: it turned out in most cases they didn’t have a problem at all).
- I also covered the idea in the SDN Use Cases webinar.
[1] I wrote a ton of blog posts dealing with similar scenarios ages ago. Search for BGP blog posts written between 2006 and 2010.
[2] With a few caveats left for the reader to figure out. You could cheat and use RFC 6769 as an inspiration.
[3] You don’t want them to dump the whole DFZ BGP table into your lap due to a fat-finger incident.
[4] Or another beverage of your choice. You can even make it a non-alcoholic beer.
[5] Selective Route Download usually works as a filter between BGP table and routing table (RIB), not between routing table and FIB. If that’s the case on your platform, you can only use it for BGP routes.
Another mechanism used in some routers ("switches") is to use CAM table space for one prefix length, e.g., /24, in addition to TCAM. The route lookup then uses both a longest prefix match via TCAM and a CAM lookup for the designated prefix length to determine the next hop.
Of course this requires CAM in addition to TCAM in the ASIC. I would expect that the ASIC's forwarding pipeline needs support for this scheme, too (e.g., CAM lookup results usually do not include the next hop rewrite information required for IP forwarding). So this is probably not applicable to the ASIC in question.
Yeah, Arista managed to use ARP or MAC table to do that on some merchant silicon ASIC. At the very least, that requires the ability of the lookup table to match a field at unusual offset.
As for next hop rewrite, the result of that lookup could point to a dummy per-next-hop VRF with a default route pointing to the next hop.
The same trick, perhaps using the same merchant silicon ASIC, was used by Brocade (now Extreme). Others using said merchant silicon ASIC may have used this as well.
AFAIR the actual prefix length stored in CAM was configurable. As such this would probably need the ability to apply a mask while addressing CAM.
Using IP forwarding entries ("ARP table") would probably suffice to save next hop information. I did not think of that, thanks! But unlike an ARP or multicast group entry, the longest prefix match via TCAM might return a better match, for example when /24 prefixes are added to the fixed table but a /28 is found via TCAM.
So a piece that may be missing in other ASICs could be the ability to use two different lookup operations and use the longest match from those for the forwarding decision.
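A software model of that two-lookup scheme might look like the sketch below. It is purely illustrative: the `build` and `dual_lookup` functions, the table contents, and the tie-breaking rule (the exact-match table wins when both matches have the designated length) are assumptions, and it says nothing about what any particular ASIC pipeline can actually do.

```python
import ipaddress


def build(table):
    """Convert {"prefix": next_hop} into a dict keyed by network objects."""
    return {ipaddress.ip_network(p): nh for p, nh in table.items()}


def dual_lookup(addr, exact_table, lpm_table, exact_len=24):
    """Combine an exact-match lookup for one prefix length (the CAM) with a
    longest-prefix match (the TCAM) and use whichever match is longer."""
    ip = ipaddress.ip_address(addr)
    # CAM key: the destination address masked to the designated length.
    key = ipaddress.ip_network(f"{addr}/{exact_len}", strict=False)
    exact_hit = exact_table.get(key)
    # TCAM: longest matching prefix, or None when nothing matches.
    best = max((net for net in lpm_table if ip in net),
               key=lambda net: net.prefixlen, default=None)
    if exact_hit is not None and (best is None or best.prefixlen <= exact_len):
        return exact_hit
    return lpm_table[best] if best is not None else None


cam = build({"192.0.2.0/24": "nh-1"})
tcam = build({"0.0.0.0/0": "nh-default", "192.0.2.16/28": "nh-2"})
print(dual_lookup("192.0.2.20", cam, tcam))    # nh-2 (the /28 beats the /24)
print(dual_lookup("192.0.2.200", cam, tcam))   # nh-1 (the /24 beats the default)
print(dual_lookup("198.51.100.1", cam, tcam))  # nh-default
```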
I had some indirection like the one you described in mind when writing my comment. I have no idea whether typical ASICs allow adding something like this to the processing pipeline if it was not envisioned when the ASIC was designed. This is probably highly implementation-specific.
An alternative to SRD is to use BGP to manage the FIB. BGP allows the active routes to be quickly updated and when coupled with real-time top prefix measurements can significantly increase the effectiveness of a small hardware forwarding table, see Internet router using merchant silicon
Combination of selective route download and MPLS tunnels:
1. Advertise the peers’ next hops (/32) through BGP-LU.
2. Do not change the BGP next hop received from the peers.
3. Use BGP add-path for multiple paths to the same prefix.
Local IP lookup is bypassed by the label operation (pop and forward) and is effectively delegated to remote intra-AS BGP routers. The idea is borrowed from EPE concepts.
This might very creatively solve the problem for a core or egress router (assuming the ASIC in question supports MPLS). What about the ingress router?
As I mentioned the IP lookup is effectively offloaded to the ingress routers.
Let’s move that ingress labeling function to the server generating the content! Wait, what? We have now rebuilt the Fastly CDN with a small change in packet format 🙃 https://www.fastly.com/blog/building-and-scaling-fastly-network-part-1-fighting-fib
What I’ve done most frequently is to engineer selective route download as intelligently as possible. This assumes 2 routers and iBGP between them (the 2 customer edge routers), and then at least 1 tier 1 upstream connected to each. This is a fairly common scenario for small/medium entities who need that sort of redundancy, but can’t (or won’t) buy a box large enough to handle the full table.
On each router I filter out any prefix not originating from the tier 1 or its customers, and accept the rest. Depending on size you may have to filter on length as well. Then using LP you prefer those prefixes on each router (depending on the customer side you may have to vary this somewhat if you prefer 1 carrier over the other because of bandwidth or other reasons).
Then on each router you have a default that you’re checking reachability to for everything not in the table.
The virtual aggregation option reminded me of the ViAggre Paper.