Two weeks ago I started with a seemingly simple question:
If a BGP speaker R is advertising a prefix A with next hop N, how does the network know that N is actually alive and can be used to reach A?
… and answered it for the case of directly-connected BGP neighbors (TL&DR: Hope for the best).
Jeff Tantsura provided an EVPN perspective, starting with “the common non-arguable logic is reachability != functionality”.
Now let’s see what happens when we add route reflectors to the mix. Here’s a simple scenario:
Assuming we’re running IBGP sessions between loopback interfaces, and all AS edge routers use next-hop-self so all the BGP next hops within the AS are the loopback interfaces, can we trust that the advertised next hops are safe for packet forwarding? TL&DR: Heck NO.
In Trust We Trust
In our scenario R1 advertises subnet A with next hop L1, the update is sent to R2 and reflected to R3. Here’s the chain of implicit trust that leads to R3 selecting L1 as the next hop for A:
- R2 has to trust that R1 did the right thing, can reach A, and can forward packets to A (we discussed this part in the last blog post);
- R3 has to trust that R2 (the route reflector) did its job correctly;
- R3 has to have L1 in its routing table;
- R3 has to trust that the intra-AS routing protocol did its job and calculated the correct next hop to reach L1;
- R3 has to trust that all the intermediate nodes on the IGP-computed path between itself and R1 know how to forward the traffic toward A (not just toward L1).
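The last item in that list is the easiest one to overlook, so here is a minimal sketch of it. It assumes a hypothetical intermediate router P between R3 and R1, made-up addressing (A = 203.0.113.0/24, L1 = 10.0.0.1/32), and a toy FIB that maps prefixes straight to router names. None of this comes from a real device; it only illustrates that every hop does its own lookup on the packet's destination, not on the BGP next hop.

```python
import ipaddress

def lookup(fib, destination):
    """Longest-prefix match: return the next-hop router name, or None."""
    addr = ipaddress.ip_address(destination)
    best_net, best_hop = None, None
    for prefix, hop in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, hop
    return best_hop

def trace(fibs, start, destination, max_hops=8):
    """Forward hop by hop; every router does its own lookup on the destination."""
    path, router = [start], start
    for _ in range(max_hops):
        hop = lookup(fibs[router], destination)
        if hop is None:
            return path, "dropped at " + router
        if hop == "local":  # the router handles the packet itself
            return path, "delivered"
        path.append(hop)
        router = hop
    return path, "loop suspected"

# Hypothetical FIBs. R3 resolved the BGP next hop L1 via the IGP and points A at P.
fibs = {
    "R3": {"203.0.113.0/24": "P", "10.0.0.1/32": "P"},
    "P":  {"10.0.0.1/32": "R1"},  # P knows how to reach L1, but has no route for A
    "R1": {"203.0.113.0/24": "local", "10.0.0.1/32": "local"},
}

print(trace(fibs, "R3", "10.0.0.1"))      # traffic toward L1
print(trace(fibs, "R3", "203.0.113.10"))  # traffic toward a host in A
```

In this toy model the traffic toward L1 gets through while the traffic toward A dies at P, even though everything R3 can check locally (A in the BGP table, L1 in the routing table) looks perfectly fine.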
In the BGP over MPLS core design the situation is slightly different. After verifying that L1 is in R3’s routing table, R3 has to:
- Hope that it has an LSP toward L1 in its LFIB, or that everyone in the forwarding path knows how to reach A.
- Having an LSP toward L1, trust that the LSP is not broken, and that everyone on the LSP does proper label swapping.
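The same idea can be sketched for the MPLS case. The example below assumes two hypothetical core routers (P1 and P2), made-up label values, and a simplified LFIB that maps an incoming label to an outgoing label and the next router, with None standing in for a popped label. It is an illustration of label swapping along an LSP, not a model of any particular implementation.

```python
def walk_lsp(ftn, lfibs, ingress, fec, max_hops=8):
    """Follow label swaps from the ingress router toward the LSP egress."""
    if fec not in ftn:
        return [ingress], "no LSP for " + fec
    label, router = ftn[fec]  # label pushed by the ingress router
    path = [ingress, router]
    for _ in range(max_hops):
        entry = lfibs.get(router, {}).get(label)
        if entry is None:
            return path, "broken at " + router
        label, router = entry  # swap the label, move to the next router
        path.append(router)
        if label is None:  # label popped: the next router is the LSP egress
            return path, "ok"
    return path, "loop suspected"

# Hypothetical ingress mapping on R3: FEC (L1) -> (pushed label, next router)
r3_ftn = {"10.0.0.1/32": (100, "P1")}

# Hypothetical LFIBs: incoming label -> (outgoing label, next router)
healthy = {"P1": {100: (200, "P2")}, "P2": {200: (None, "R1")}}
broken = {"P1": {100: (200, "P2")}, "P2": {201: (None, "R1")}}  # stale entry on P2

print(walk_lsp(r3_ftn, healthy, "R3", "10.0.0.1/32"))
print(walk_lsp(r3_ftn, broken, "R3", "10.0.0.1/32"))
```

Note that R3's own view (the r3_ftn entry in this sketch) is identical in both cases. The broken entry on P2 is invisible from the ingress router, which is exactly why R3 can only trust that the LSP works.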
What could possibly go wrong?
In the last blog post I mentioned two things that can go wrong in a directly-connected BGP scenario:
- Access control lists that would drop traffic toward A;
- RIB-to-FIB mismatch (lovingly called ASIC wedgie). Obviously that’s not just a myth, managing that mismatch was presented as one of the first use cases for Cisco’s Network Assurance Engine.
Let’s add a few other minor details to the mix:
- Best path selection on BGP route reflectors could generate a persistent loop.
- Ever heard of BGP Wedgies? There’s a whole RFC on the topic.
- Then there’s BGP-to-IGP synchronization and IGP-to-LDP synchronization.
- Finally, you could get corrupted LFIB anywhere on the path.
For even more BGP fun, read Considerations in Validating the Path in BGP (RFC 5123).
After considering all that, do you really care whether R1 advertises a prefix with the next hop equal to the source IP address of its IBGP session, or with a third-party next hop that it believes works?
Back to EVPN
I’m guessing that the original question that triggered this series of blog posts had a hidden assumption (and I apologize in advance if I got it wrong):
In EBGP-only data centers, it’s better to run IBGP between loopback interfaces of leaf and spine switches than to advertise loopback VTEPs over EBGP sessions on leaf-to-spine links, because we can rely on a direct next hop (loopback VTEP advertised over IBGP between loopbacks) more than on a third-party next hop (loopback VTEP advertised as third-party next hop over an EBGP session).
Considering everything I wrote above, reachability != functionality, and the myriad things that can go wrong, I would consider this a minor detail, and the least of your worries. Also, remember the conclusion of my previous blog post on this topic: “You might as well stop bothering and get a life, networks usually work reasonably well.”
More to Explore
I’ll slowly get to the routing protocols in the How Networks Really Work webinar (parts of it are available with free ipSpace.net subscription), and we have tons of content on leaf-and-spine fabric designs (including routing protocol selection) and EVPN.
You might also want to explore other BGP resources we’ve created in the last decade and a half.