Leave BGP Next Hops Unchanged on Reflected Routes
Here’s the last question I’ll answer from that long list Daniel Dib posted weeks ago (answer to Q1, answer to Q2).
I am trying to understand what made the BGP designers decide that RR should not change the BGP Next Hop for IBGP-learned routes.
As always, there are (at least) two answers to that question:
- BGP route reflectors are solving the “propagation of information” problem, not the “finding the optimal path” problem. Think of them as the flooding part of OSPF, not the SPF part.
- You might not want to bring all the traffic to the route reflector, inspect it there[1], and send it to the egress router.
That last bit was especially true in the days of yore when the routers forwarded packets in software[2]. Core routers needed all the performance they could squeeze out of their software, which meant more expensive CPUs and really expensive high-speed memory. Nobody thought about wasting all that hardware on calculating and disseminating BGP routes – route reflectors were often placed outside of the forwarding path. Changing BGP next hop on reflected routes to put an underpowered device into the forwarding path was a perfect recipe for disaster.
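To make the forwarding-path argument concrete, here is a minimal Python sketch. It is a toy model, not a BGP implementation: the router names (`PE1`, `PE2`, `RR`), the prefix, and the helper functions are all made up for illustration, and IGP hops between BGP next hops are ignored.

```python
from dataclasses import dataclass, replace

PREFIX = "203.0.113.0/24"  # documentation prefix, purely illustrative


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str  # router that traffic for this prefix is sent toward


def reflect(route, rr_name, next_hop_self=False):
    """Return the route as the reflector would pass it on to a client."""
    if next_hop_self:
        return replace(route, next_hop=rr_name)
    return route  # default behavior: next hop left unchanged


def build_tables(next_hop_self):
    """BGP tables of the RR and of ingress router PE1 (hypothetical names)."""
    learned = Route(PREFIX, next_hop="PE2")  # as the RR learns it from PE2
    reflected = reflect(learned, "RR", next_hop_self)
    return {
        "RR": {learned.prefix: learned.next_hop},
        "PE1": {reflected.prefix: reflected.next_hop},
    }


def bgp_hops(tables, ingress, egress, prefix):
    """Follow BGP next hops from ingress until the egress router is reached."""
    hops = [ingress]
    while hops[-1] != egress:
        hops.append(tables[hops[-1]][prefix])
    return hops


print(bgp_hops(build_tables(False), "PE1", "PE2", PREFIX))  # ['PE1', 'PE2']
print(bgp_hops(build_tables(True), "PE1", "PE2", PREFIX))   # ['PE1', 'RR', 'PE2']
```

With the next hop unchanged, the reflector only distributes information. Once it rewrites the next hop to itself, every packet for the prefix detours through it and needs a second lookup there.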
The days have changed, high-speed packet forwarding is done in hardware[3], and it doesn’t matter which device in the network is used as a route reflector. Most data center switching vendors that publish designs combining IBGP with IGP deploy BGP route reflectors on spines (because the CPU on the spines has nothing to do anyway). In that case, it might be OK to change the BGP next hops on spines – every IP packet has to cross a spine switch anyway, so who cares. However, the kids who want to look really cool build their fabrics with EBGP, and don’t care about route reflectors anyway.
Even in leaf-and-spine fabrics you don’t want to mess with the BGP next hops if you’re doing anything else but pure IP transport. MPLS is so last millennium, so let’s take the new kid on the block: SRv6. Assume you’re building an Ethernet-over-SRv6 service with an EVPN control plane[4]. Do you really want all tenant packets to land on the spine switches for further inspection and forwarding? You do? Then why exactly did you waste millions on SRv6-capable gear in the first place? You could have bought the cheapest white box switches and run bridging on them.
Also, are you sure you want to have all VLANs and VRFs defined on the spine switches? If you don’t have them configured, the spine switch cannot figure out what to do with the packets rushing to it.[5] Finally, do you want to pollute the spine forwarding tables with MAC+IP addresses and prefixes from every tenant?
The same reasoning applies to more mundane technologies like VXLAN.
Long story short: Don’t you dare to change the BGP next hop in the middle of the fabric.
Totally off-topic: The above requirement trips all the cool kids who were so proud of their EBGP-only fabrics, because EBGP loves changing BGP next hops. There’s a single vendor I’m aware of who realized the SNAFU and implemented EVPN in a way that violates the usual EBGP expectations. Everyone else has to deal with awkward configurations or crazy stuff like IBGP sessions between loopbacks advertised with EBGP… and every now and then a senior manager working for a large vendor gets extremely upset when I call that concoction stupid.
Anyway, back to BGP route reflectors. By now you should be able to explain why it’s a bad idea to change the BGP next hop on reflected routes, but there’s a single scenario I’m aware of where doing that is a must: hub-and-spoke DMVPN tunnels. I’m guessing a similar reasoning applies if you bought hub-and-spoke (E-tree) Carrier Ethernet service. Anything else? Please leave a comment!
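The hub-and-spoke exception can be sketched with an equally simplified Python model. It assumes that a spoke can deliver packets only to next hops it has a tunnel toward, and that the hub is the only such neighbor. The device names, the prefix, and both helper functions are hypothetical.

```python
def hub_reflect(routes, hub, next_hop_self):
    """Routes as the hub (acting as route reflector) passes them to a spoke."""
    return {
        prefix: (hub if next_hop_self else next_hop)
        for prefix, next_hop in routes.items()
    }


def usable_routes(routes, reachable):
    """Keep only the routes whose BGP next hop the router can actually reach."""
    return {
        prefix: next_hop
        for prefix, next_hop in routes.items()
        if next_hop in reachable
    }


from_spoke_b = {"10.2.0.0/16": "spoke-b"}  # advertised by spoke B to the hub
spoke_a_tunnels = {"hub"}                  # assumption: spoke A reaches only the hub

print(usable_routes(hub_reflect(from_spoke_b, "hub", False), spoke_a_tunnels))
# {}
print(usable_routes(hub_reflect(from_spoke_b, "hub", True), spoke_a_tunnels))
# {'10.2.0.0/16': 'hub'}
```

In this model the reflected route is useless to spoke A until the hub sets itself as the next hop, which matches the topology: all spoke-to-spoke traffic has to cross the hub anyway.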
[1] What else could you do once you pulled it there? The poor route reflector has to figure out where to send it next, and to do that it needs to do a full lookup.
[2] … and we weren’t arrogant enough to call that Software-Defined Networking.
[3] … and Software-Defined Networking means whatever a funding-desperate startup is doing.
[4] What else could give you such a cushy job security?
[5] Such a setup would be a doozy to troubleshoot.
> There’s a single vendor I’m aware of who realized the SNAFU and implemented EVPN in a way that violates the usual EBGP expectations.
I know of two, with a hint regarding a third (with the third based on work from the first). I expect the first to be the one you have in mind. ;-)
> Anything else? Please leave a comment!
I have seen an MPLS design using RFC 3107 BGP-LU to build transport LSPs across IGP regions. They used IBGP with RRs. The region border routers were RRs that changed the next hop to themselves and generated a BGP-LU label.
P.S. Sorry for the acronym soup.
Hi Ivan, great post as always, thanks for sharing
>> [...] but there’s a single scenario I’m aware of where doing that is a must [...]
Reiterating Erik's comment above, I believe Seamless MPLS is a reference architecture for Service Providers that uses in-band RR with next-hop adjustments, towards creating "islands" of MPLS.
Discrete instances of IGP and MPLS share Border Nodes [BN] (like an ABR in LS protocols), with PEs peering to BN then BN peering to RR in the central island. BGP-LU runs inter-island. You end up with 3 layers of MPLS: Transport, Service, and VPN. I think this is what Erik is describing.
All major vendors have good articles on the reference design, including great diagrams by Juniper and Huawei. (fwiw, Cisco have historically called this "Unified" MPLS, but it's the same thing).
You might suspect that controllers like Crosswork/WAE and Paragon/Northstar suit this design well when you want to conjoin RSVP paths.
> BGP route reflectors are solving the “propagation of information” problem, not the “finding the optimal path” problem
I guess they tried to do both, since BGP RR have been designed to send only the best path (from their point of view) to the network. And from a design point of view, this does not work well with moving the RR outside the data path.
> there’s a single scenario I’m aware of where [changing the BGP next hop] is a must: hub-and-spoke DMVPN tunnels
I would prefer to NOT configure the Hubs as RR (except maybe for the inter-Hub sessions): you only need to collect remote routes and you do not want to send every route to every spoke when a single aggregate can do the job. To enforce that, I usually use a new AS for each Hub and the same AS (65000) for all the spokes.
> I guess they tried to do both, since BGP RR have been designed to send only the best path (from their point of view) to the network. And from a design point of view, this does not work well with moving the RR outside the data path.
Sending only the best path was the best they could do -- it was a side effect of the format of BGP update messages. A RR could send multiple paths but all but the last one would be ignored without the AddPaths support (read AddPaths RFC for more details). Changing the BGP update message format would open a whole other barrel of worms.
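The "all but the last one would be ignored" behavior is easy to model. The Python sketch below is a toy table of received routes, not real BGP code. The class name and its methods are invented for illustration; the only protocol facts it relies on are that a plain BGP update for a prefix replaces the previous one from the same peer, and that Add-Paths adds a path identifier to distinguish them.

```python
class ReceivedRoutes:
    """Toy model of the routes a BGP speaker keeps from ONE peer."""

    def __init__(self, add_path=False):
        self.add_path = add_path
        self.paths = {}

    def receive(self, prefix, next_hop, path_id=0):
        # Without Add-Paths the prefix alone identifies the route, so a new
        # advertisement silently replaces the previous one. With Add-Paths
        # the path identifier becomes part of the key.
        key = (prefix, path_id) if self.add_path else (prefix,)
        self.paths[key] = next_hop

    def next_hops(self, prefix):
        return [nh for key, nh in self.paths.items() if key[0] == prefix]


plain = ReceivedRoutes()
plain.receive("203.0.113.0/24", "PE1", path_id=1)
plain.receive("203.0.113.0/24", "PE2", path_id=2)
print(plain.next_hops("203.0.113.0/24"))       # ['PE2']

with_add_path = ReceivedRoutes(add_path=True)
with_add_path.receive("203.0.113.0/24", "PE1", path_id=1)
with_add_path.receive("203.0.113.0/24", "PE2", path_id=2)
print(with_add_path.next_hops("203.0.113.0/24"))  # ['PE1', 'PE2']
```

Since the receiver keeps only one path per prefix from each peer, a route reflector without Add-Paths has no choice but to pick its own best path and send that one.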
Do remember that RFC 1966 was published in 1996, years before IPv6-over-BGP4 or BGP4 capabilities RFCs came along.
> I would prefer to NOT configure the Hubs as RR (except maybe for the inter-Hub sessions): you only need to collect remote routes and you do not want to send every route to every spoke when a single aggregate can do the job
That's assuming you can aggregate the address space into a few well-defined prefixes.
The original RR design has a lot of limitations. For usual enterprise networks I always suggested following the topology with RRs (every interim node is an RR), since this would be the most robust configuration, where a link failure would have the least impact. Of course, this would work well only with relatively small routing tables. It is also more difficult to automate. But for small networks this would be a safe bet...
In large, centralized RR designs, a single link failure might cause a lot of transitions that would take a lot of time. Not nice for high availability.
For safety-critical networks we always have to enable additional paths for RRs. Otherwise, your convergence time would usually be too slow.
If you implement a native RR topology following the physical topology, then changing the next hop might be meaningful, since the RR is always in the user data plane path and shares the fate of user data packets. BTW, this is the original IP networking concept. The routing protocol shall follow the same path as the user data plane. The centralized RR design violates the basic IP network principles with all its consequences. The centralized design is a step back to the classical TDM telco architecture. :-)
Seamless MPLS is a hybrid between the fully distributed and centralized RR approaches.
Ivan loves criticizing the centralized SDN controller design. The centralized RR design has almost the same issues... :-) I know people still prefer doing that, but this was not the original intention of RR. It is a misuse of the original RR concept.