Packet Forwarding 101: Header Lookups
Whenever someone asks me about LISP, I answer, “it’s a nice idea, but cache-based forwarding never worked well.” Oldtimers familiar with the spectacular failures of fast switching and various incarnations of flow switching usually need no further explanation. Unfortunately, that lore is quickly dying out, so let’s start with the fundamentals: how does packet forwarding work?
Packet forwarding used by bridges and routers (or Layer-2/3 switches if you believe in marketing terminology) is just a particular case of statistical multiplexing – a mechanism where many communication streams share the network resources by slicing the data into packets that are sent across the network. The packets are usually forwarded independently; every one of them must contain enough information to be propagated by each intermediate device it encounters on its way across the network.
Packet forwarding in an intermediate device is always composed of these steps:
- Receive a packet;
- Extract forwarding information from the received packet;
- Find the next hop or outgoing interface from a lookup table;
- Enqueue the forwarded packet into an output queue where it waits to be sent to the next device.
The lookup step is often the most complex part of packet processing:
- Exact matches of a small set of values (VLANs, MPLS labels) are straightforward: allocate a large enough table to accommodate all possible values and populate it with forwarding information.
- Exact matches of a large set of values (MAC addresses, IP addresses, IPv4 /32 prefixes…) usually use hash tables. A data-munging function produces a unique small number out of the (too large) value and uses that as an entry into the forwarding table. Sounds easy until you realize that the function could produce the same result for multiple input values. Welcome to the wonderful world of hash collisions and hacks like Cuckoo hashing.
- Lookups of network prefixes (using longest-prefix matching) traditionally used a tree-like data structure (example: trie) that requires several accesses to get the answer, significantly reducing the packet forwarding performance. Alternate solutions include hash tables for prefixes with known length (example: /24 IPv4 prefixes) or Bloom filters (see Longest-Prefix Matching Using Bloom Filters for details).
- More complex lookups use TCAM – an extraordinarily versatile but also power-hungry and expensive bit of hardware – or dirty hacks like “let’s walk down an access control list and compare every line with the header of the packet we’re trying to forward.”
Compared to the awkwardly slow way we were taught access control lists work, TCAM works almost like magic:
- The network operating system stores matching values and corresponding bit masks into TCAM
- When presented with an input value, the best match (considering the match values, bitmasks, and matching priorities) pops up in a single lookup.
While we usually associate TCAM with hardware-based packet forwarding, I’ve seen what seems to be an efficient software implementation of TCAM in Open vSwitch. It was elegant but so complex that I don’t want to look at it again ;)
Ready to deal with even more complexity? How about recursive lookups, from BGP prefixes to BGP next-hops (which can be BGP prefixes themselves, adding another layer of indirection) to IGP next-hops or static routes. Modern hardware can accept most of that complexity through a hierarchical forwarding table1. In the good old days, complex hardware was too expensive, so the routers had to do the recursive walk through the routing table for every packet until someone got a bright idea: what if we would cache the results of the recursive lookup?
Coming up next: cache-based forwarding
- You’ll find an extensive discussion of various packet forwarding mechanisms in the Switching, Routing and Bridging part of How Networks Really Work webinar.
- Hierarchical FIB and its use in prefix-independent convergence is covered in the Fast Failover part of the same webinar.
As expected it’s hard to get the actual details. Minh Ha pointed me to ASR9K documentation; if you’re aware of a similar document describing other hardware platforms please write a comment. ↩︎
LISP is a difficult animal. It has multiple variants. Nowadays, lot of people use it with PubSub and reliable transport. This is very much different from the original LISP. It is not a pull model anymore.
Since with PubSub new routes are pushed as soon as possible, there is no real caching problem.
However, selective subscription is not implemented yet by Cisco, so it is a half baked solution. All of your xTRs will have a full routing table, that might be OK for a LAN, but not for a bigger WAN. This global PubSub is creating a potential scalability problem. Let us hope, that selective subscription will be implemented sometimes in the future... Then it might become useful in bigger WANs.
LISP has a big advantage compared to BGP. No best path selection, so it is much faster. It is the ideal protocol for network mobility, especially for multi-link, multi-homed scenarios. However, this use case requires a lot of further development yet...
It seems to me that the PubSub LISP you described is functionally equivalent to a well-implemented BGP route reflector with RT-based ORF (or whatever that thing is called).
The only icing missing from that cake seems to be a fast BGP route selection mechanism -- after all, if a node only has sessions with IBGP RRs, and if the network designer claims they should be sending identical information, why go through the route selection at all? Just take whatever comes in. What am I missing?
Thanks a million! Ivan
I always enjoy Bela's great insights, esp. on hardware and transport networks, but this time I beg to differ. LISP, is a false economy. It was twisted from the start, unscalable right from the getgo. In Networking and OS, to name (ID) something is to locate it, and vice versa. So the name LISP itself reflects a false distinction. Due to this misconception, LISP proponents are unable to establish the right boundary conditions, leading to the size of xTRs' RIB diverging (going unbounded). In a word, it has come full circle back to BGP, an exemplary manifestation of RFC 1925 rule 6.
As always, misunderstanding the fundamentals leads to exploding complexity and dead ends, so LISP has problems with path liveness check and state synchronization as well:
All 3 problems severely limit scalability, so LISP essentially fails in its original goal. One can argue that LISP works fine for small networks, but small networks need no design. There, a brute-force sequential search (flat) method for routing is good enough, in a word, anything goes. It's the big networks that need hierarchy, and LISP can't enforce this hierarchy essential for scaling, because it can't impose the right boundary conditions.
LISP is in a pretty similar situation to TCP congestion control, wherein people, due to a lack of understanding, naively think it can be solved by "careful tuning" of parameters. It cannot, because end-to-end CC is a dead end. So as long as you keep clinging to it, all you do is putting lipstick on a pig.
Just as the key with CC is understanding that end-to-end CC attempts to solve problem on the wrong layer, and so all end-to-end transport protocols, try as they may, are fundamentally incapable of resolving it, and what has to be done, is a renormalization of the CC's length scale and layering, the key to understanding why LISP is unworkable on large scale, is realizing that people have been asking the wrong question. Routing explosion is an addressing problem; it has to be solved based on an understanding of how addressing should be structured.
As it stands, IP is missing more than half the structure, with IP and MAC redundantly naming the same thing (the interface). This, coupled with provider-based addressing, plus one global address space for the entire Internet, will always lead to unbounded RIB sizes, and routing update explosion. Topological addressing can deal with that, and when we think in terms of topology, new structures start to emerge, including the understanding that provider-independent address assignment, is the right way to go. Topological PI addressing will help set up the proper boundary conditions, and RIB size can go down by several times and potentially be bounded too.
Tony P's comments on Valley-free routing, is essentially a description of topology-based (resilient) addressing, where distortion on the network graph has no impact on underlying addressing structure -- it's topologically protected:
Since this scaling metric is universal enough, the same solution applies equally to DC networks, which essentially are just one type of SP network. It can lead to better simplification and less pain.