Feedback: Recursive BGP Next Hop Resolution
The Recursive BGP Next Hops: an RFC 4271 Quirk blog post generated tons of feedback (thanks a million to everyone writing a comment on my blog or LinkedIn).
Starting with Robert Razsuk who managed to track down the original email that triggered the (maybe dubious) text in RFC 4271:
The text in section 5.1.3 was not really targeting to prohibit load balancing. Keep in mind that it is FIB layer which constructs actual forwarding paths.
The text has been suggested by Tom Petch in discussion about BGP advertising valid paths or even paths it actually installs in the RIB/FIB. The entire section 5.1.3 is about rules when advertising paths by BGP.
As expected, Dr. Tony Przygienda added tons of behind-the-scenes details:
First, if you go over vanilla IGP (which is normal) you’ll get ECMP anyway ;-) The discussion starts to get very hairy if you resolve over next-hop that has multiple protocols underlying and similar interesting things like MPLS shortcuts (that’s the preference stuff and multiple RIB resolutions that no RFC really explains since it’s really implementation dependent and what differentiates a heavy lift routing device vs. a home router ;-).
And more such non-trivial cases exist so first/best entry only is a safe choice (but Dr. Li as co-author may have some further comment here ;-) Beyond that, all serious stacks do all kind of load balancing across e-bgp, i-bgp and all kind of mixed 2547 paths (@ least our does ;-). Just look for the sharp-edged knobs like “multipath vpn-unequal-cost equal-external-internal” ;-) Mileage may vary and when doing this one has to understand that those knobs are peeling all kind of yellow stickers off that prevent looping in default case ;-)
A quick illustration of how complex things can get was supplied by Blake Willis: guess what happens when a BGP next hop is resolved over another BGP prefix that has multiple resolved paths. Junos has a nerd knob just for that edge case.
Finally, Bela Varkonyi documented that if one reads the BGP RFC with rosy (and a bit creative) glasses, it’s perfectly fine to do load balancing over IGP next hops for a particular BGP next hop… just imagine you’re doing the recursive next hop calculation every time you need it (example: for every packet):
Load balancing is still possible depending on the implementation. If you make a single lookup for a specific next hop address for all occurrences and cache this even for later use, then of course this would disable load balancing since you would get the same answer for all occurrences. But it is not prescribed. You can do an independent recursive lookup for each next hop occurrence when it is needed. Then you can pickup a different single lookup result for each individual query from multiple possible choices. This is still load balancing that is not violating section 5.1.3.
The behavior all depends on how do you generate FIB entries from the RIB. You should not store and cache next hop lookups, but rather do the lookup every time independently when you need it. However, you would need some logic that returns a different value for the lookup on the same next hop at each query.
Obviously you wouldn’t do anything along those lines if you need high-speed forwarding, but you could precalculate all possible results and store them into a forwarding structure, which we happen to call hierarchical FIB.
And yes, we are discussing how many angels can dance on an ASIC, but one should have a bit of nerdy fun every now and then ;)
This topic is of crucial engineering importance, and advanced as it is, it's so fundamental to router architecture I don't think we should label it the network equivalent of Angel on a pin, or nerdy-fun variety Ivan ;) . IMO, anyone who calls themselves network engineer, needs a at least a passing familiar with these kinds of topic in general, and THIS ONE in particular. And since you belong in the top-tier of Networking experts, it's only natural you'd get interested in this area.
The feature described in the Juniper doc pointed to by Blake isn't a nerd knob. It's a fundamental part of Hierarchical FIB. What they describe in the doc is a simple case. It can get more complex when we involve MPLS VPN or EVPN, which in that case is called Chained Composite Next Hop in Junos, and it can have between 5 and 6 levels of nested lookups before you get to the outgoing direct interface, depending on whether you have port channel enabled or not.
IOS-XR also has similar FIB structure, starting from VRF prefix -> label -> BGP protocol next hop -> IGP next hop -> Port channel -> OIF, give or take one for two levels depending on the feature(s) one's using.
That's why the more features you turn on, the more hardware is needed, the more heat is produced, the bigger the power bill gets etc etc. In a word, complexity explosion. And that's why engineering a high-end router is very, very difficult, as mentioned by Tony P last year. The notion that throwing money at a problem is enough to solve it, and so deep-pocket Big Tech like MS, Google, Amazon, will hold great advantage if they decide to create a carrier-class router, is one big misconception. Take MS: they can't even make a perfect OS or any software product for that matter, with their endless resources, and their research on topological material has been a total failure.
As always, thanks for an insightful comment!
> It can get more complex when we involve MPLS VPN or EVPN, which in that case is called Chained Composite Next Hop in Junos, and it can have between 5 and 6 levels of nested lookups before you get to the outgoing direct interface
Well, we had something like that "forever". The only question is how many levels of indirection can be copied into the hierarchical FIB, which obviously depends on the underlying hardware (MX would be different than QFX10K or QFX5K). Links to relevant documents would be highly appreciated.
FWIW, this draft includes an interesting discussion of flattening the FIB hierarchy: https://datatracker.ietf.org/doc/html/draft-ietf-rtgwg-bgp-pic-17
Re the ietf doc, in section 2.1.1 they mentioned:
"An alternative method consists in "flattening" the dependencies when programming the BGP destinations into HW FIB resulting in potentially eliminating both the BGP Path-List and IGP Path-List consultation. Such an approach decreases the number of memory lookups per forwarding operation at the expense of HW FIB memory increase (flattening means less sharing thereby less duplication), loss of ECMP properties (flattening means less pathlist entropy) and loss of BGP-PIC properties."
Yes, that's classic engineering trade-off, similar to what I put in the 2020 comment, as increasing the number of lookup per packet will increase fowarding delay, pps performance, and increase heat. This is an example of the ugliness of low-level complexity invisible to the software guys. They can't just keep inventing kludges and be naïve that the hardware will just cope. That's why it's best to keep the features to a minimum, just like Bruce Lee's motto: the height of cultivation always runs through simplicity ;).
Sorry for the late reply Ivan, only saw your reply just now. Yes, we went thru this discussion back in late 2020, and I linked to one of the Cisco docs describing hFIB in my comment here:
https://blog.ipspace.net/2020/11/fast-failover-without-lfa-frr.html#224
Keep in mind that in a hFIB, TCAM only appears in the first level of of memory. The rest is SRAM as they all seem to involve pointers to data structures.
As for how many levels of indirection, I'd say this much (as mentioned in the Cisco doc) is about as far as one can go. Manufacturability is a big issue, because as you include more hardware, you have to increase the wriring, which increases the layers of the PCB, making it much more time-consuming to design and test, and delay time to to market.
Heat dissipation also becomes a big problem then, as anyone who has ever entered a DC and listened to rumbling and screeching of these devices can attest. More complicated hardware jammed into that limited space which is a router's enclosure, and the air flow may no longer be smooth, it can get turbulent. At that stage, laminar flow of Fluid Dynamics can no longer model the whole situation of heat flow, and turbulence will have to be used. But despite way more than a hundred years of intensive study, turbulence is not well understood because it's nonlinear and chaotic. So while it's all fun game to design a router on paper and talk about blue-sky features in a router OS, at the end of the day it means nothing if it can't be realized in hardware. And that's why engineering a high-end router is a very difficult task.