Next Hops of BGP Routes Reflected by Arista EOS
Imagine a suboptimal design in which:
- A BGP route reflector also serves as an AS edge (PE) router [1];
- You want to use next-hop-self on AS edge routers.
Being exposed to Cisco IOS for decades, I considered that to be a no-brainer. After all, section 10 of RFC 4456 is pretty specific:
In addition, when a RR reflects a route, it SHOULD NOT modify the following path attributes: NEXT_HOP, AS_PATH, LOCAL_PREF, and MED.
Arista EOS is different – a route reflector happily modifies NEXT_HOP on reflected routes (but then, did you notice the “SHOULD NOT” wording? [2])
The behavior is easy to reproduce in a 4-router lab with the following BGP topology:
I configured BGP on RR the way I would have done it on Cisco IOS:
router bgp 65000
   router-id 10.0.0.1
   bgp cluster-id 10.0.0.1
   bgp advertise-inactive
   neighbor 10.0.0.2 remote-as 65000
   neighbor 10.0.0.2 next-hop-self
   neighbor 10.0.0.2 update-source Loopback0
   neighbor 10.0.0.2 description e1
   neighbor 10.0.0.2 route-reflector-client
   neighbor 10.0.0.2 send-community standard extended
   neighbor 10.0.0.3 remote-as 65000
   neighbor 10.0.0.3 next-hop-self
   neighbor 10.0.0.3 update-source Loopback0
   neighbor 10.0.0.3 description e2
   neighbor 10.0.0.3 route-reflector-client
   neighbor 10.0.0.3 send-community standard extended
   neighbor 10.1.0.10 remote-as 65100
   neighbor 10.1.0.10 description x1
   neighbor 10.1.0.10 send-community standard
   !
   address-family ipv4
      neighbor 10.0.0.2 activate
      neighbor 10.0.0.3 activate
      neighbor 10.1.0.10 activate
The only difference I noticed when comparing the Arista EOS configuration with the Cisco IOS one was the need to specify route-reflector-client and next-hop-self per neighbor and not within an address family. That might be a good choice: it makes little sense to have some neighbors as RR clients in one address family but not in another, and specifying attributes per neighbor rather than per AF per neighbor helps you avoid stupid mistakes.
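For comparison, here is a sketch of how the same two settings are typically entered on Cisco IOS, where they live inside the address family. Only neighbor 10.0.0.2 is shown, and the fragment is an illustration built from the RR configuration above, not a configuration taken from the lab:

```
router bgp 65000
 neighbor 10.0.0.2 remote-as 65000
 neighbor 10.0.0.2 update-source Loopback0
 !
 address-family ipv4
  neighbor 10.0.0.2 activate
  neighbor 10.0.0.2 route-reflector-client
  neighbor 10.0.0.2 next-hop-self
```

Session-level parameters (remote-as, update-source) stay under the BGP process; the RR-client and next-hop settings are repeated per address family.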
The BGP table on E1 was a shocker: prefix 10.0.0.3/32 (the route reflected from E2) had RR as the next hop. The originator-id is set to 10.0.0.3, proving the route was originated by E2, but the next hop is set to the RR's address (10.0.0.1, which is also the cluster-id), proving the next hop was changed by RR when reflecting the route.
e1#sh ip bgp | begin Network
      Network        Next Hop   Metric  AIGP  LocPref  Weight  Path
 * >  10.0.0.1/32    10.0.0.1   0       -     100      0       i
 * >  10.0.0.2/32    -          -       -     -        0       i
 * >  10.0.0.3/32    10.0.0.1   0       -     100      0       i Or-ID: 10.0.0.3 C-LST: 10.0.0.1
 * >  10.0.0.4/32    10.0.0.1   0       -     100      0       65100 i
I almost made a perfect mess creating a route map to change next hops on external BGP routes (but not on internal ones) when I noticed the nerd knob I needed to get Arista EOS behavior more in line with the recommendation of RFC 4456: bgp route-reflector preserve-attributes. All of a sudden, the BGP table changed to what I expected to see:
e1#sh ip bgp | begin Network
      Network        Next Hop   Metric  AIGP  LocPref  Weight  Path
 * >  10.0.0.1/32    10.0.0.1   0       -     100      0       i
 * >  10.0.0.2/32    -          -       -     -        0       i
 * >  10.0.0.3/32    10.0.0.3   0       -     100      0       i Or-ID: 10.0.0.3 C-LST: 10.0.0.1
 * >  10.0.0.4/32    10.0.0.1   0       -     100      0       65100 i
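For context, a minimal sketch of where that knob goes on RR, assuming the same AS number as in the RR configuration shown earlier. The command is entered directly under the BGP routing process, not under a neighbor or an address family:

```
router bgp 65000
   bgp route-reflector preserve-attributes
```

The per-neighbor next-hop-self statements stay in place. As the E1 table above shows, the EBGP route 10.0.0.4/32 still arrives with RR (10.0.0.1) as the next hop, while the reflected route 10.0.0.3/32 keeps its original next hop.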
Reproducibility Is the Key
You’ll find the lab topology and configuration files on GitHub. The tar archives contain device configurations (initial and fixed) and the containerlab configuration needed to set up the lab [3].
Alternatively, you can use netlab to set up the lab:
- Install netlab and your preferred lab environment
- Copy the topology.yml file into an empty directory
- Execute netlab up
You can specify the virtualization provider or the default device type with netlab up, making it easy to test the route reflector behavior on a dozen devices supported by netlab.
[1] Because you ran out of budget, or because you forgot you needed a route reflector in your BGP network, and then randomly chose one of the routers to do that.
[2] Maybe that should be upgraded to REALLY SHOULD NOT?
[3] Some Assembly Required: you’ll have to install Docker, containerlab and Arista EOS container on a Linux host.
Different vendor defaults can be surprising, indeed.
Many vendors use a default different from the Arista EOS default described above. Some let you configure similar behavior:
- On Cisco IOS-XR, the command ibgp policy out enforce-modifications gives you the behavior you described for Arista EOS above.
- On Cisco IOS, the neighbor <IP> internal-vpn-client command enables this for iBGP PE<--->CE connections.
- Huawei VRP has the configuration command reflect change-path-attribute to enable changing path attributes of reflected routes via policy.
To be fair, 'nexthop-self' isn't the default behavior when advertising towards the RR client. Note the config: neighbor 10.0.0.2 next-hop-self for the RR client 10.0.0.2. If you don't configure that, the next hop is left unchanged, which complies with the 'SHOULD NOT' wording.
So in a way, the difference in defaults is really whether a configured command is taken at face value, or whether a stricter higher-level rule overrides it by default.
Yeah, you could say that I asked for it ;)
I definitely found the behavior unexpected, more so as other platforms with very similar syntax behave in a different way. Will reword it a bit (give me a few days).
I've also found that when doing an eBGP Route Server setup across a shared subnet (Third Party Next Hop), Arista changes the next hop to self while Cisco doesn't. To make it worse, adding next-hop-unchanged didn't work (though the command was accepted); you had to set it via a route-map. Even worse, routes not learned over the shared interconnect were swept up in this and dropped because of RFC 4271, section 5.1.3, clause 2:
2) When sending a message to an external peer, X, and the peer is one IP hop away from the speaker:
- By default (if none of the above conditions apply), the BGP speaker SHOULD use the IP address of the interface that the speaker uses to establish the BGP connection to peer X in the NEXT_HOP attribute.
sh ip bgp nei x.x.x.x showed: Nexthop invalid for single hop eBGP: 1
Making the peering eBGP multihop, even though it's 1 hop away, allowed the route in, per another part of the same RFC: 3) When sending a message to an external peer X, and the peer is multiple IP hops away from the speaker (aka "multihop EBGP"):
Or change the matching criteria in the NH unchanged prefix list.