Unexpected Interactions Between OSPF and BGP

Tuesday, June 22, 2021 07:16 UTC

It started with an interesting question tweeted by @pilgrimdave81:

I’ve seen on Cisco NX-OS that it’s preferring a (ospf->bgp) locally redistributed route over a learned EBGP route, until/unless you clear the route, then it correctly prefers the learned BGP one. Seems to be just ooo but don’t remember this being an issue?

Ignoring the “why would you get the same route over OSPF and EBGP, and why would you redistribute an alternate copy of a route you’re getting over EBGP into BGP” aspect, Peter Palúch wrote a detailed explanation of what’s going on and allowed me to copy it into a blog post to make it more permanent:

I suspect that this may be a race condition. To BGP, a locally originated (includes locally redistributed) route is preferred to an eBGP-learned route. This is because locally originated BGP routes have the weight of 32768 while all others use weight 0 by default.

With this in mind, there are two possible scenarios:

RIB already contains the OSPF-learned route and BGP redistributes it before it learns the eBGP route. Then, even with the eBGP-learned route, BGP bestpath picks the locally originated one, so the OSPF route stays.
RIB does not contain the OSPF-learned route before BGP learns the eBGP route. Then the BGP bestpath picks the eBGP route (since there’s no alternative) and installs it into RIB with AD=20. Even if OSPF tries to offer its route to RIB later, it will fail due to AD=110.

So based on the initial state (the OSPF-learned route being or not being in the RIB), we have two possible outcomes under otherwise identical configuration and conditions - hence, a race condition. Note that this would also happen with other IGPs redistributed into BGP.
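
The two scenarios can be sketched as a toy Python model. This is only an illustration under stated assumptions, not how NX-OS is implemented: the `ToyRouter` class and its method names are hypothetical, it tracks a single prefix, and BGP best path is reduced to its first step (highest weight wins). The AD and weight values are the Cisco defaults quoted above.

```python
CISCO_AD = {"ospf": 110, "ebgp": 20}      # administrative distances quoted above
LOCAL_WEIGHT, LEARNED_WEIGHT = 32768, 0   # default BGP weights quoted above


class ToyRouter:
    """Hypothetical single-prefix model: one RIB slot and one BGP table."""

    def __init__(self):
        self.rib = None       # protocol that currently owns the RIB entry
        self.bgp_paths = {}   # BGP path type -> weight
        self.best = None      # current BGP best path type

    def _offer_to_rib(self, protocol):
        # The RIB keeps whichever offer has the lowest AD.
        if self.rib is None or CISCO_AD[protocol] < CISCO_AD[self.rib]:
            self.rib = protocol

    def _run_bgp(self):
        # Redistribution only sees the OSPF route while OSPF owns the RIB entry.
        if self.rib == "ospf":
            self.bgp_paths["redistributed"] = LOCAL_WEIGHT
        if not self.bgp_paths:
            return
        # Best path reduced to its first step: highest weight wins.
        self.best = max(self.bgp_paths, key=self.bgp_paths.get)
        # Only a learned best path is offered to the RIB; a redistributed
        # one is never installed back.
        if self.best == "ebgp":
            self._offer_to_rib("ebgp")

    def learn(self, protocol):
        if protocol == "ospf":
            self._offer_to_rib("ospf")
        else:
            self.bgp_paths["ebgp"] = LEARNED_WEIGHT
        self._run_bgp()


def outcome(order):
    """Return (RIB owner, BGP best path) after routes arrive in `order`."""
    router = ToyRouter()
    for protocol in order:
        router.learn(protocol)
    return router.rib, router.best


print(outcome(["ospf", "ebgp"]))  # OSPF route arrives first
print(outcome(["ebgp", "ospf"]))  # eBGP route arrives first
```

With the OSPF route arriving first, the model ends with the OSPF route in the RIB and the redistributed path as BGP best path; with the eBGP route arriving first, the eBGP route wins in both places. The configuration is identical in both runs, only the order of events differs.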

Of course Peter was right, but pilgrimdave81 kept wondering about the underlying reasons for that behavior:

Yes, this is exactly what happens, and changing timing changes it, however I’d have thought it would work out that the “locally originated” bgp route actually came from an inferior source protocol and corrected itself?

Here’s Peter with another detailed explanation:

Assuming there is an “inferior source protocol” is the mistake here:

OSPF has its route in RIB
BGP redistributes it
BGP decides the locally originated route is the best
hence, no eBGP route can beat it based on BGP bestpath rules

BGP does not think speculatively: “If the route to redistribute would not be there, the eBGP route would win, and then it would win even over the OSPF route to redistribute since the AD is lower”. This kind of what-if decision making does not exist in routing protocols.

There are three moving parts here: BGP, OSPF, and RIB. Each one of them acts based on the momentary state. OSPF is really unperturbed – its SPF always produces the same result, regardless of what’s going on in BGP and RIB. At worst, its routes won’t be picked as best by RIB.

BGP, on the other hand, is the destination of the redistribution, and so its bestpath results depend on what routes are there to choose from. By default, in presence of a locally injected route, any other path will lose. The type of the chosen path also determines its AD.

And here’s the trick: If the timing allows BGP to install an eBGP route in the RIB, then OSPF has no chance of beating it, so BGP won’t ever see that there is another option for that route to consider. Essentially, from this point of view, BGP locks itself out.

The concept of an “inferior source protocol” is based on the AD, but neither OSPF nor BGP is in the business of comparing ADs, ever. In the case of redistribution taking place, it’s actually BGP deciding that the OSPF route redistributed into BGP is better, due to weight.

AD of the routes offered to RIB by BGP is the result of the best path selection, not an input to it. If BGP decides that an iBGP variant is the best, then it’ll offer it to RIB with AD=200; if eBGP variant is the best, then with AD=20. AD comes as a consequence.

And as a universal rule, a protocol that redistributed into itself a route from another protocol won’t attempt to install that very same route back into RIB - because it could change its origin protocol, so the redistribution would no longer apply to it. It’d flap wildly.
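
The last two points can be sketched in a few lines of Python. This is a simplified illustration, not a vendor implementation: the `Path` type and both function names are hypothetical, the comparison covers only a handful of best-path steps (weight, local preference, locally originated, AS-path length, eBGP over iBGP), and the default local preference of 100 is an assumption. What it tries to show is that AD never appears in the comparison; it is looked up only after the winner is known.

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    kind: str               # "ebgp", "ibgp", or "local" (redistributed)
    weight: int = 0
    local_pref: int = 100   # assumed default
    as_path_len: int = 0


AD_BY_KIND = {"ebgp": 20, "ibgp": 200}   # values quoted above


def best_path(paths):
    # Simplified ordering: weight, local preference, locally originated,
    # shorter AS path, eBGP over iBGP. AD appears nowhere in the key.
    return max(paths, key=lambda p: (p.weight, p.local_pref, p.kind == "local",
                                     -p.as_path_len, p.kind == "ebgp"))


def ad_offered_to_rib(paths) -> Optional[int]:
    # AD is looked up only after the winner is known. A locally originated
    # winner yields None: BGP does not install it back into the RIB.
    return AD_BY_KIND.get(best_path(paths).kind)


ebgp = Path("ebgp", as_path_len=2)
ibgp = Path("ibgp", local_pref=200, as_path_len=2)
local = Path("local", weight=32768)

print(ad_offered_to_rib([ebgp]))               # eBGP is the only path
print(ad_offered_to_rib([ebgp, ibgp]))         # iBGP wins on local preference
print(ad_offered_to_rib([ebgp, ibgp, local]))  # redistributed path wins on weight
```

The three calls yield 20, 200 and `None` respectively: the AD follows from whichever path won, and a redistributed winner is not offered to the RIB at all.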

Lesson learned: having too many moving parts results in interesting and hard-to-grasp behavior… more so when the details are implementation-specific, as Jeff Tantsura explained in a follow-up tweet:

Junos doesn’t differentiate (AD wise) between eBGP and iBGP routes, both have AD of 170 (OSPF 10/150), so IGP always wins (predictably). EOS, while supporting weight, doesn’t set it to higher value for redistributed routes.
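
To see why the preference values matter, here is a small Python sketch of the RIB-level comparison alone. The table layout and the `rib_winner` function are hypothetical; the numbers are the ones quoted in this post (Cisco: OSPF 110, eBGP 20, iBGP 200; Junos: OSPF 10/150, BGP 170). It deliberately ignores everything BGP does before offering a route to the RIB.

```python
# Route preference (AD) per platform, using only the values quoted in this post.
PREFERENCE = {
    "cisco": {"ospf": 110, "ebgp": 20, "ibgp": 200},
    "junos": {"ospf": 10, "ospf_external": 150, "ebgp": 170, "ibgp": 170},
}


def rib_winner(platform, offered):
    """Return the offered route source with the lowest preference value."""
    table = PREFERENCE[platform]
    return min(offered, key=table.__getitem__)


# On Junos the IGP wins regardless of the BGP path type...
print(rib_winner("junos", ["ebgp", "ospf"]))
print(rib_winner("junos", ["ibgp", "ospf_external"]))
# ...while on Cisco the result depends on which BGP path type was selected.
print(rib_winner("cisco", ["ebgp", "ospf"]))
print(rib_winner("cisco", ["ibgp", "ospf"]))
```

With the Junos values the OSPF route wins every comparison, so the order in which routes arrive cannot change the result at the RIB level. With the Cisco values the winner flips between eBGP and OSPF depending on the BGP path type.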

In other words: a perfect scenario to troubleshoot when your network is down on a Sunday night. Don’t try to fix it by using BGP as the universal answer to life, the Universe and everything. Having a sane and simple design is a much better alternative.

5 comments:

Donald Sharp 22 June 2021 01:18

I was actually looking at just this situation with static routes and BGP a few weeks back. In my opinion this is a bug plain and simple. Any situation where an order of operations causes different results is something that needs to be fixed. There is no need for speculation. BGP already knows the admin distance of the route that won and can easily do a comparison to see if it would win and if so send its route down to the RIB. This would solve the problem.

Anonymous 22 June 2021 08:02

The question is: Does the affected subnet belong to the company or not? It's either one or the other (or you would have a problem). Then based on whether it's internal or external to the company you would choose the right routing protocol. The numbers of administrative distance for the various routing protocols are there for a reason. They are based on reliability and trustworthiness of the routing protocol. I see no need for complicated redistribution on multiple points. So definitely rethink your design.

AW 23 June 2021 10:55

There may be an aspect I'm missing here but this looks like the same issue we would run into frequently at an MSP I worked at using "floating" backup static routes for customers. The AD of the static route may be set higher (e.g. 250) but due to the behavior of BGP redistributed routes you'll have an issue. If you add the static route after your BGP route was already loaded, it's especially insidious because all looks fine until BGP bounces. Then the redistributed static route gets loaded into BGP with the default weight of 32768, and when BGP comes back up the BGP-learned route loses to the redistributed route and the static stays the active route.

The solution we used was a route-map / policy that strips the weight of those redistributed floating static routes by setting it to 0. That way, when BGP compares the redistributed route with the learned one, you don't run into that issue. I believe we also set a lower LP on the redistributed route to make the outcome predictable: fixing only the weight and leaving the LPs equal would simply move the decision on to the locally originated and then AS-PATH length checks, both of which would still prefer the redistributed route over the eBGP one.

I thought at the time this was a well known problem but to be fair I haven't run into much discussion on it on the internet so glad to see it talked about here! Looking forward to seeing more comments!

Replies

James 20 July 2022 10:07

Yes, we use this same approach with EIGRP and BGP on nx-os, unsetting the weight and updating LP. I also didn't find much discussion online at the time, but it's been operating successfully for about 5 years now and put through its paces.

Dmytro Shypovalov 24 June 2021 07:32

EOS in single-agent (gated) mode doesn't check whether the route is local (redistributed/aggregate) or received from a BGP peer. But multi-agent prefers local over received, much like Cisco.

30 June 2021 03:34

I used to see this a lot with MPLS WAN, where the BGP at a site redistributed OSPF or EIGRP site routes promiscuously, and there was no filter so that remote prefixes would also be learned. When a remote link goes down, one of two site BGP peers loses the prefix and learns the local copy, which stays stuck then, even when the remote link comes back up. My conclusion (consistent with this blog) is to filter ALWAYS ALWAYS and don't redistribute into BGP remote routes from a local routing protocol. And use IBGP for lateral handoffs.
