Can BGP Route Reflectors Really Generate Forwarding Loops?

TL&DR Summary: Yes (if you’re clumsy enough).

A while ago I read Impact of Graceful IGP Operations on BGP – an article that described how changes in IGP topology result in temporary (or sometimes even permanent) forwarding loops in networks using BGP route reflectors.

Is the problem real? Yes, it is. Could you generate a BGP RR topology that results in a permanent forwarding loop? Yes. It’s not that hard.

I Think I Broke It

Here’s a particularly broken topology (connections with arrows are BGP sessions, R1 and R2 are route reflectors, all other routers are their clients).

Assuming E1 and E2 advertise an identical external prefix (IP-External) over their IBGP sessions, this topology results in a permanent forwarding loop. Can you figure out why that’s the case?

Although the above topology looks utterly silly, you might get dangerously close to it in a well-designed network after one or more IBGP session failures.

Route Reflector Is Just a Router

A BGP route reflector is nothing more than a regular BGP router with a few route propagation rules bolted onto the update generation code. It still follows all the rules of BGP route processing:

  • Receive routes from BGP peers;
  • Install all (accepted) routes into BGP RIB;
  • For every IP prefix, select the best route from all RIB routes describing that prefix;
  • Advertise the best routes to BGP peers (and here’s where a route reflector deviates from standard BGP behavior – it advertises best routes to its clients regardless of how it got them).

BGP Route Selection Rules Reexamined

Route reflectors in our diagram (R1 and R2) receive BGP routes for the IP-External prefix from E1 and E2. Assuming the updates are identical, R1 and R2 select their best BGP route based on lowest IGP cost – R1 will select the route advertised by E1; R2 will select the route advertised by E2.

BGP Route Reflectors and Forwarding Topology

Next step in our forwarding loop creation: R1 advertises the best route to IP-External with E1 as the BGP next hop to all its clients (E1, E2 and C2). E1 and E2 ignore the update (they both have a better route), C2 uses it.

Similarly, R2 advertises its best route to IP-External with E2 as BGP next hop to E1, E2 and C1. C1 uses the advertised route.

Based on its BGP table (copied into IP routing table and FIB) C1 sends the traffic for IP-External toward BGP next hop E2. Recursive next hop resolution results in C2 being the true next hop. Likewise, C2 sends the traffic for IP-External to C1. Bingo! A permanent forwarding loop.

Can you prevent the forwarding loop? You can avoid most of them with a good design – make sure your BGP sessions don’t deviate too much from the forwarding topology. Ideally, all BGP route reflectors in a cluster would have equal IGP cost to all BGP next hops.

Alternatively, if you’re OK with a slightly slower convergence, you could establish IBGP sessions across physical links (so they’re guaranteed to follow the forwarding topology) and turn every single router into a BGP route reflector ... effectively turning BGP into RIP.

MPLS is the answer. What was the question?

OK, good design clearly helps. But can you prevent the forwarding loops? Sure – MPLS is always the right answer (these days it might be LISP, but I’m digressing).

If you deploy LDP-based MPLS forwarding in your BGP network, the packets toward the BGP next hops always carry an LDP-generated label for the BGP next hop (not the external prefix) and thus get across your autonomous system without a single additional L3 lookup. Even if the forwarding tables derived from BGP information don’t make much sense, the packets still get to some egress router (although they might take a suboptimal path).

You don’t need to deploy a BGP-free core to make MPLS work. MPLS-based forwarding gets rid of forwarding loops even in traditional networks with BGP running on every router.

9 comments:

  1. Try to follow the physical topology, always peer clients with at least two RRs. If you can afford it (RIB wise) use unique cluster-ID. If using MPLS VPN use unique RD per VRF per PE.

    Would you agree with the above statements?
    Replies
    1. Sounds like a good plan ;)
  2. Just ensure routers peer with both RRs. They either have both or none.
    It's difficult to emulate something similar in production...

    Nikolay
  3. I solve this by setting SOO community on the E1/E2 routers. SOO is copied to OSPF using a route tag (if OSPF is used as PE-CE protocol).

    http://www.cisco.com/en/US/docs/ios/12_4t/12_4t11/htbgpsoo.html
  4. I suppose AddPath (http://tools.ietf.org/html/draft-walton-bgp-add-paths-06) is elegant approach for resolving this issue.
  5. Couldn't E1/2 use next-hop-self in this scenario to avoid even introducing the external prefix into the mix?
    Replies
    1. Doesn't solve the problem - root cause is route selection process in route reflectors, not BGP next hops.
  6. Idea with best external , add-path with n or all , shadow RRs or session all adds extra state and require extra memory. If logical bgp topology cannot follow phsycail ( it should ) , then I believe best two approach is , first R1 could take the C1's IGP cost to external peering router into consideration and calculate best path from C1's point of view or there is one another draft which is angular distance aproach which puts the al the nodes basically in a circle and calculate the angular distance from RR to other nodes. First approach is like secondary SPF calculation of remofe LFA or Not-Via tunnels so it adds maybe extra initial processing load but end result is less memory need, agree Ivan ?

    Thanks
  7. Did you actually test this? Based on BGP route selection criteria, the E routers should pick the eBGP route over the reflected iBGP route. So, from what I can see, there shouldn't be any loops in your set-up.
    Replies
    1. Reread the blog post. The forwarding loop is between C1 and C2.
  8. is there no BGP peering between C1 and E1 ? why won`t C1 forward the traffic to E1 directly ?
  9. Good post!

    But I would be really surprised if I saw anything like this in real life. RR is there to resolve the overhead caused by IBGP full-mesh. And the reason for IBGP full-mesh is to ensure all IBGP speakers “see” the same. Back to this topology, E1,E2 are the common clients for R1 and R2, however, C1 and C2 belong to only one RR. If we remove the RR from the picture, we would see a topology that IBGP full mesh is not there. So to resolve this problem, just have C1 join the Cluster with R1 and C2 join the cluster with R2.

    Replies
    1. I hope you wouldn't see anything like this in real life -- as I wrote, you have to be clumsy enough ;)

      However, were you unlucky enough to have just the right combination of BGP sessions failing... Nah, that would never happen, we all monitor the state of our BGP sessions and react the moment they flap, don't we 😜

Add comment
Sidebar