BGP Route Reflectors in the Forwarding Path

Bela Varkonyi left two intriguing comments on my Leave BGP Next Hops Unchanged on Reflected Routes blog post. Let’s start with:

The original RR design has a lot of limitations. For usual enterprise networks I always suggested following the topology with RRs (every interim node is an RR), since this would be the most robust configuration, where a link failure would have the least impact.

He’s talking about the extreme case of hierarchical route reflectors, a concept I first encountered when designing a large service provider network. Here’s a simplified conceptual diagram (lines between boxes are physical links as well as IBGP sessions between loopback interfaces):

┌──────┐         ┌──────┐     ┌──────┐
│      ├─────────┤      ├─────┤      │
│  C1  │         │  P1  │     │  A1  │
│      ├──┐ ┌────┤      ├─┐ ┌─┤      │
└──────┘  │ │    └──────┘ │ │ └──────┘
          │ │             │ │
          └─┼─────┐       └─┼────┐
            │     │         │    │
┌──────┐    │    ┌┴─────┐   │ ┌──┴───┐
│      ├────┘    │      ├───┘ │      │                          
│  C2  │         │  P2  │     │  A2  │
│      ├─────────┤      ├─────┤      │
└──────┘         └──────┘     └──────┘
  • A1 and A2 are access routers
  • P1 and P2 are POP aggregation routers. They are route reflectors, and A1 and A2 are their clients.
  • P1 and P2 have IBGP sessions with C1 and C2, but have no idea that they are not participating in the IBGP full mesh
  • C1 and C2 are route reflectors. P1 and P2 are their clients.

Taken to the extreme, every router in the network is a route reflector, and all its IBGP neighbors are its route reflector clients.
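To make the reflection rules concrete, here is a minimal Python sketch that traces how a prefix originated by A1 spreads through the topology in the diagram. It is a toy model under stated assumptions, not a BGP implementation: the session list is copied from the diagram, the router name doubles as the cluster ID, there is no best-path selection (a router keeps the first path it hears), and `SESSIONS`, `CLIENTS`, `peers`, and `propagate` are names invented for this illustration.

```python
from collections import deque

# IBGP sessions as drawn in the diagram (assumption: nothing else is configured).
SESSIONS = [
    ("C1", "P1"), ("C1", "P2"), ("C2", "P1"), ("C2", "P2"),
    ("P1", "A1"), ("P1", "A2"), ("P2", "A1"), ("P2", "A2"),
]

# Route reflector -> set of its clients. Routers without an entry are not reflectors.
CLIENTS = {
    "C1": {"P1", "P2"},
    "C2": {"P1", "P2"},
    "P1": {"A1", "A2"},
    "P2": {"A1", "A2"},
}


def peers(router):
    """Return the sorted list of IBGP neighbors of a router."""
    return sorted({b if a == router else a for a, b in SESSIONS if router in (a, b)})


def propagate(origin):
    """Return {router: cluster_list} for a prefix originated by `origin`."""
    learned = {origin: ()}
    # The originating router advertises its own prefix to all IBGP peers.
    queue = deque((peer, origin, ()) for peer in peers(origin))
    while queue:
        receiver, sender, cluster_list = queue.popleft()
        # Simplified ORIGINATOR_ID and CLUSTER_LIST loop checks.
        if receiver == origin or receiver in cluster_list:
            continue
        # Toy best-path rule: keep the first path received.
        if receiver in learned:
            continue
        learned[receiver] = cluster_list
        my_clients = CLIENTS.get(receiver, set())
        if not my_clients:
            continue  # not a route reflector: IBGP routes are not re-advertised
        if sender in my_clients:
            targets = peers(receiver)    # from a client: reflect to all peers
        else:
            targets = sorted(my_clients)  # from a non-client: reflect to clients only
        for peer in targets:
            if peer != sender:
                # Prepend our own cluster ID before reflecting the route.
                queue.append((peer, receiver, (receiver,) + cluster_list))
    return learned


if __name__ == "__main__":
    for router, clusters in sorted(propagate("A1").items()):
        print(router, "cluster list:", list(clusters))
```

In this model every router ends up with A1's prefix even though A1 has only two IBGP sessions. The sketch only traces a prefix originated by an access router, and it does not model any session that is not drawn in the diagram.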

Does that work? When I asked a friend in Cisco TAC in the 1990s whether such a setup would work, the best answer I could get was “looking at the code, I couldn’t see why it wouldn’t work.” He was right: hierarchical route reflectors work surprisingly well. We used them numerous times (but without any next-hop magic), and they are pretty common in large networks.

Obviously, this design only makes sense when the intermediate routers have to know the prefixes advertised by the edge routers. Using it with MPLS/VPN or EVPN is a waste of CPU cycles and memory [1].

Interestingly, if your IBGP sessions follow the physical topology, then it’s OK to change the BGP next hop on reflected routes [2]. Back to Bela:

If you implement a native RR topology following the physical topology, then changing the next hop might be meaningful, since the RR is always in the user data plane path and shares the fate of user data packets. BTW, this is the original IP networking concept. The routing protocol shall follow the same path as the user data plane.

While I agree with him that routing should rely on shared fate, we usually use an IGP to meet that requirement. In ye olden days we redistributed BGP into the IGP, and believed a BGP prefix was reachable if the IGP brought it to the other edge of the autonomous system. In the meantime, the global BGP table exploded, and the best we can do is to configure the routers to accept only loopback prefixes as the BGP next hops (as opposed to the default route or aggregated prefixes). In the MPLS world, one could go a step further and consider a BGP next hop valid only when the ingress router has a valid LSP to the next hop.
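The "accept only loopback prefixes as BGP next hops" idea can be sketched in a few lines of Python. This is an illustration only, assuming IPv4 and an IGP table given as a plain list of prefixes; `resolve_next_hop`, `IGP_ROUTES`, and the `loopbacks_only` flag are hypothetical names, and real routers implement the check with vendor-specific next-hop resolution policies.

```python
import ipaddress


def resolve_next_hop(next_hop, igp_routes, loopbacks_only=True):
    """Return the IGP prefix that resolves `next_hop`, or None if it is rejected."""
    address = ipaddress.ip_address(next_hop)
    matches = [net for net in igp_routes if address in net]
    if not matches:
        return None  # no route to the next hop at all
    # Longest-prefix match, as a routing table lookup would do.
    best = max(matches, key=lambda net: net.prefixlen)
    # Reject next hops that resolve only through a default route or an aggregate.
    if loopbacks_only and best.prefixlen != best.max_prefixlen:
        return None
    return best


# Hypothetical IGP table: a default route, an aggregate, and one loopback.
IGP_ROUTES = [
    ipaddress.ip_network(prefix)
    for prefix in ("0.0.0.0/0", "10.0.0.0/8", "10.0.0.1/32")
]

if __name__ == "__main__":
    for hop in ("10.0.0.1", "10.0.0.2", "192.0.2.1"):
        print(hop, "->", resolve_next_hop(hop, IGP_ROUTES))
```

With the loopback-only check enabled, a next hop covered only by the aggregate or the default route is treated as unreachable, which is the behavior the paragraph above describes. The MPLS variant (requiring a valid LSP) is not modeled here.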

An irrelevant aside: if you run IBGP sessions over the physical links, and change BGP next hops on every IBGP session, you’ve managed to implement the “BGP as a better IGP” design paradigm with IBGP. Congratulations, you saved an incredible amount of memory by shortening the AS path, and created a design that will keep you employed (and on pager duty) for decades.

Finally, Bela compared centralized BGP route reflectors with SDN controllers:

Ivan loves criticizing the centralized SDN controller design. The centralized RR design has almost the same issues… :-) I know people still prefer doing that, but this was not the original intention of RR. It is a misuse of the original RR concept.

For the record: I did not criticize the idea of centralized SDN controllers, but the stupid idea of a centralized control plane. It’s well known [3] that some problems (example: optimal bandwidth reservations) cannot be solved with a distributed system of autonomous agents.

As for the “original intention of route reflectors” (as I happened to be around at that time): route reflectors were designed to be a more scalable replacement for IBGP full mesh, not a mechanism to implement shared fate in IBGP. See the Introduction section of RFC 1966 for more details. How we use them today and what we think is the right way of using them is a different story ;)
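The scalability argument is easy to check with a bit of arithmetic: a full mesh of n routers needs n(n-1)/2 IBGP sessions. The Python sketch below compares that with a single-level design in which every client peers with every reflector and the reflectors are fully meshed among themselves. That particular layout and the function names are assumptions made for the illustration, not something prescribed by the RFC.

```python
def full_mesh_sessions(routers):
    """Number of IBGP sessions in a full mesh of `routers` routers."""
    return routers * (routers - 1) // 2


def reflected_sessions(clients, reflectors):
    """Sessions when every client peers with every reflector (assumed design)."""
    # Client-to-reflector sessions plus the full mesh between the reflectors.
    return clients * reflectors + full_mesh_sessions(reflectors)


if __name__ == "__main__":
    print("full mesh, 100 routers:", full_mesh_sessions(100))
    print("98 clients, 2 reflectors:", reflected_sessions(98, 2))
```

For 100 routers the full mesh needs 4950 sessions, while 98 clients with two reflectors need 197.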


  1. Running BGP route reflectors for EVPN address family on spine switches is still OK – you have to run them somewhere.

  2. That would go against the SHOULD NOT in section 10 of RFC 4456, but that never stopped vendors trying to win a deal or network engineers throwing kludges at a badly broken design.

  3. A wonderful handwaving phrase used when you’re too lazy to find supporting evidence.

1 comment:

  1. If you look at Figure 3 in RFC 1966, you can still see that the RRs follow the physical topology. RRs were selected for their central positions in the network; they were not placed in a separate plane, independent from the data plane and centralized in an offset position. At that time it would have been too expensive anyway.

    Please also remember that at that time the networks were much smaller, so following the physical topology was straightforward, and you had no automation because of the small scale. The centralized RR has one strong motivation: it makes writing automation scripts easier.

    And as you already identified, seamless MPLS is a kind of hybrid of the two approaches. However, in the old thinking, all kinds of BGP sessions were assumed to share fate with the data plane. There was no separation between the BGP and the physical topology. On the contrary, the RR was invented to keep the physical topology and the BGP topology aligned, since the full-mesh requirement of iBGP violated this basic principle. The RR restored compliance with the routing principle; it was more than just a scalability solution. Do not forget that rebuilding a BGP session and sending all advertisements again was a costly and slow operation. A centralized RR with dozens of sessions in those days would need a very long recovery time with random sequencing. An RR aligned with the physical topology reduced the CPU requirements and convergence time to the minimum. Later, with more CPU resources available, this consideration became less important...

    Of course, changing the next hop is more of an academic possibility and in most networks an unnecessary complication. I just wanted to point out that there is no theory against it. However, an engineering design should take into account a lot of other considerations, even business issues.
