Recursive BGP Next Hops: an RFC 4271 Quirk
All BGP implementations I’ve seen so far use recursive next hop lookup:
- The next hop in the IP routing table is the BGP next hop advertised in the incoming update
- That next hop is resolved into the actual next hop using one or more recursive lookups into the IP routing table.
Furthermore, all BGP implementations I’ve seen used multiple recursive next hops (if available) to implement load balancing toward the BGP next hop – that’s how we made EBGP load balancing work in the Stone Age of networking.
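To make the recursive lookup concrete, here is a minimal Python sketch. The routing table contents, the `("via", ...)`/`("connected", ...)` entry format, and the function names are all made up for illustration; no real implementation is claimed to work this way. It resolves a BGP next hop through equal-cost routes and returns every immediate next hop it finds, which is the behavior that enables load balancing:

```python
import ipaddress

# Hypothetical routing table: prefix -> list of equal-cost entries.
# ("via", ip) needs another lookup; ("connected", ifname) ends the recursion.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): [("via", "192.0.2.1")],  # BGP route
    ipaddress.ip_network("192.0.2.1/32"): [("via", "10.0.1.2"),      # two equal-cost
                                           ("via", "10.0.2.2")],     # routes to the peer
    ipaddress.ip_network("10.0.1.0/30"): [("connected", "eth0")],
    ipaddress.ip_network("10.0.2.0/30"): [("connected", "eth1")],
}


def longest_match(table, address):
    """Return the most specific prefix covering address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def resolve(table, address, depth=0, max_depth=8):
    """Return all (immediate next hop, interface) pairs for address."""
    if depth > max_depth:
        raise RecursionError("next hop resolution loop")
    net = longest_match(table, address)
    if net is None:
        return []  # unresolvable next hop
    paths = []
    for kind, value in table[net]:
        if kind == "connected":
            paths.append((address, value))
        else:
            paths.extend(resolve(table, value, depth + 1, max_depth))
    return paths


print(resolve(ROUTES, "192.0.2.1"))
# [('10.0.1.2', 'eth0'), ('10.0.2.2', 'eth1')]
```

An implementation that installs both results gets load balancing toward the BGP next hop; one that keeps only the first element of the list behaves like a literal reading of "selecting one entry".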
Life was good… until Dmitry Perets sent me an email with a disturbing observation: unless both of us can’t read standards anymore, all those implementations violate RFC 4271.
Here’s the weird part of Section 5.1.3 of RFC 4271 (note the final clause about selecting one entry):
The immediate next-hop address is determined by performing a recursive route lookup operation for the IP address in the NEXT_HOP attribute, using the contents of the Routing Table, selecting one entry if multiple entries of equal cost exist.
Interestingly, nothing was said about recursive lookup in RFC 1771 or early drafts of RFC 4271. To make the whole thing even more mysterious, interim draft versions of RFC 4271 contained this text:
The NEXT_HOP attribute is used by the BGP speaker to determine the actual outbound interface and immediate next-hop address that should be used to forward transit packets to the associated destinations. The immediate next-hop address is determined by performing a recursive route lookup operation for the IP address in the NEXT_HOP attribute using the contents of the Routing Table.
The final text of RFC 4271 first appears in draft-ietf-idr-bgp4-18 from October 2002.
Now a plea to my grumpy old readers: if anyone remembers why that change was made, please add a comment.
The text in section 5.1.3 was not really intended to prohibit load balancing. Keep in mind that it is the FIB layer that constructs the actual forwarding paths.
The text was suggested by Tom Petch in a discussion about BGP advertising valid paths, or even paths it actually installs in the RIB/FIB. The entire section 5.1.3 is about the rules BGP follows when advertising paths.
Please see the archived email I found as proof of the above: https://mailarchive.ietf.org/arch/msg/idr/OHlGLdQOF5lSa_NR7oOaDjse8y8/
In my opinion, section 5.1.3 has nothing to do with load balancing. It just expresses the natural fact that one address can be resolved to only one address, not to a list of addresses.
Whether load balancing is still possible depends on the implementation. If you make a single lookup for a specific next hop address, cache the result, and reuse it for all occurrences of that next hop, then of course load balancing is disabled, since you get the same answer for every occurrence. But that is not prescribed. You can do an independent recursive lookup for each next hop occurrence when it is needed. Each individual query can then pick a different single result from the multiple possible choices. This is still load balancing, and it does not violate section 5.1.3.
The behavior all depends on how you generate FIB entries from the RIB. You should not store and cache next hop lookups, but rather do the lookup independently every time you need it. However, you would need some logic that returns a different value for the lookup of the same next hop at each query.
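The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical (the class names, the `build_fib` helper, the addresses, and the round-robin selection, which is just one possible "logic that returns a different value at each query"); it only illustrates the argument, not any vendor's RIB-to-FIB code:

```python
from itertools import cycle

# Hypothetical equal-cost resolutions of a single BGP next hop.
EQUAL_COST = {"192.0.2.1": ["10.0.1.2", "10.0.2.2"]}


class CachingResolver:
    """Resolves each BGP next hop once and reuses the answer."""

    def __init__(self, candidates):
        self.candidates = candidates
        self.cache = {}

    def lookup(self, next_hop):
        if next_hop not in self.cache:
            self.cache[next_hop] = self.candidates[next_hop][0]
        return self.cache[next_hop]


class PerQueryResolver:
    """Returns one entry per query, rotating through the equal-cost set."""

    def __init__(self, candidates):
        self._choices = {nh: cycle(paths) for nh, paths in candidates.items()}

    def lookup(self, next_hop):
        return next(self._choices[next_hop])


def build_fib(resolver, prefixes, bgp_next_hop):
    """One independent lookup per BGP prefix, each yielding a single next hop."""
    return {prefix: resolver.lookup(bgp_next_hop) for prefix in prefixes}


prefixes = ["198.51.100.0/24", "203.0.113.0/24", "100.64.0.0/16", "100.65.0.0/16"]

cached = build_fib(CachingResolver(EQUAL_COST), prefixes, "192.0.2.1")
spread = build_fib(PerQueryResolver(EQUAL_COST), prefixes, "192.0.2.1")

print(sorted(set(cached.values())))  # ['10.0.1.2']
print(sorted(set(spread.values())))  # ['10.0.1.2', '10.0.2.2']
```

In both cases every lookup returns exactly one immediate next hop, yet only the per-query variant spreads the prefixes across both paths. Note that this balances per prefix, not per packet or per flow.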
Older implementations might prefer saving CPU cycles by caching the lookup results, but new implementations do not need to do that.