Real Life BGP Route Origination and BGP Next Hop Intricacies

During one of the ExpertExpress engagements I helped a company implement the BGP Everywhere concept, significantly simplifying their routing by replacing unstable route redistribution between BGP and IGP with a single BGP domain running across MPLS/VPN and DMVPN networks.

They had a pretty simple core site network, so we decided to establish an IBGP session between the DMVPN hub router and the MPLS/VPN CE router (managed by the SP).

Unfortunately, things are never as easy as they seem during the initial discussion – inevitably a few skeletons everyone forgot about fall out of a dusty closet. In this case, it was a VPN concentrator connecting remote users to the core network.

In the past, the customer used static routes on MPLS/VPN CE router and DMVPN hub router to reach the prefixes behind the VPN concentrator.

As they have a single VPN concentrator, static routes proved to be sufficient… but the customer had to open a ticket with the ISP (and wait a few days for a maintenance window) every time they wanted to change the static routes on the provider-controlled CE router.

With the IBGP session between DMVPN hub router and MPLS/VPN CE router, the customer started wondering: “Would it be possible to originate the prefixes reachable through the VPN concentrator on the DMVPN hub router?”

After warning them that their idea creates a single point of failure (which they accepted; it looks like there’s nothing mission-critical behind the VPN box), we started discussing the technical details, in particular what the BGP next hop of the route originated on the DMVPN hub router would be and how the traffic would flow.

A Slight Digression into Layer-3 World

If the customer had a layer-3 core network, there would be no discussion. The layer-3 switch (replacing the layer-2 network drawn as an Ethernet segment in my pictures) would advertise the prefixes behind the VPN concentrator and we could easily set the BGP next hop to the IP address of the layer-3 switch with neighbor next-hop-self command. Alas, we were facing a slightly more interesting challenge.
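
A minimal sketch of what that could look like on the layer-3 switch, assuming it runs IBGP with both edge routers. The neighbor addresses 10.0.1.1 and 10.0.2.1 are hypothetical; the AS number, the prefix and the VPN concentrator address are taken from the lab printouts later in this post:

router bgp 65000
 network 192.168.4.0
 neighbor 10.0.1.1 remote-as 65000
 neighbor 10.0.1.1 next-hop-self
 neighbor 10.0.2.1 remote-as 65000
 neighbor 10.0.2.1 next-hop-self
!
ip route 192.168.4.0 255.255.255.0 10.0.0.42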

BGP Route Origination Details

The question we had to answer before deploying the proposed solution was “What would the BGP next hop of a static route redistributed into BGP be?”

If the DMVPN hub router advertises a static route pointing to the VPN concentrator with the BGP next hop set to the IP address of the VPN concentrator, traffic between sites in MPLS/VPN network and sites behind VPN concentrator flows directly. If the DMVPN hub router sets itself as the BGP next hop, the traffic takes an undesired detour. Time for a lab test.

Virtual Lab to the Rescue

Having access to cloud-based beta version of Cisco Modeling Lab turned out to be one of the best things I got from a vendor in the last few months.

The blog post was written in 2014. A decade later I’d use Tailscale VPN to access the Intel NUC sitting in my office and build the test topology in netlab.

Even though I was presenting at a conference in Germany, I managed to create a simple lab during one of the breaks and verify my understanding of how BGP route origination works (on second thought, it might have been faster to read what I wrote in 2011):

In our case, DMVPN hub router shouldn’t change the BGP next hop of the static routes – the BGP next hop should point to the VPN concentrator, resulting in optimal traffic flow.

A Few Printouts

I was using this configuration on DMVPN hub router to generate the desired BGP route:

router bgp 65000
 network 192.168.4.0
!
ip route 192.168.4.0 255.255.255.0 10.0.0.42

As you can see, the BGP next hop in the BGP table matches the IP next hop in the IP routing table:

DMVPN#show ip route 192.168.4.0
Routing entry for 192.168.4.0/24
  Known via "static", distance 1, metric 0
  Advertised by bgp 65000
  Routing Descriptor Blocks:
  * 10.0.0.42
      Route metric is 0, traffic share count is 1
DMVPN#show ip bgp 192.168.4.0
BGP routing table entry for 192.168.4.0/24, version 11
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     1         
  Refresh Epoch 1
  Local
    10.0.0.42 from 0.0.0.0 (192.168.0.1)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, sourced, local, best

BGP next hop is not modified on IBGP sessions; the managed CE router thus receives the correct value for the BGP next hop:

CE#show ip bgp 192.168.4.0
BGP routing table entry for 192.168.4.0/24, version 11
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    10.0.0.42 from 10.0.0.1 (192.168.0.1)
      Origin IGP, metric 0, localpref 100, valid, internal, best

Meanwhile on Planet Earth

Unfortunately, the reality quickly diverged from the optimistic theoretical results – the BGP next hop on the MPLS/VPN CE router pointed to the DMVPN hub router.

A few minutes of troubleshooting identified the culprit: neighbor next-hop-self that we had to configure on the IBGP session between DMVPN hub router and MPLS/VPN CE router to circumvent lack of IGP between the two BGP routers (IBGP is supposed to be run in combination with an IGP that resolves BGP next hops).
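
A sketch of the relevant part of the DMVPN hub router configuration, assuming the CE router's address is 10.0.0.2 (a hypothetical value; the printouts above only show the hub router as 10.0.0.1):

router bgp 65000
 neighbor 10.0.0.2 remote-as 65000
 neighbor 10.0.0.2 next-hop-self

With this in place, the hub router announced itself as the next hop for the locally originated 192.168.4.0/24 prefix as well, which is what we saw on the CE router.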

We could have solved the problem with a route-map that would set BGP next hop for non-local routes and keep it unchanged for locally originated routes, but decided that the extra complexity simply isn’t worth it (sometimes you have to know when to give up).
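
For illustration, here is an untested sketch of what such a route-map might look like on the DMVPN hub router. The prefix-list and route-map names and the CE router address 10.0.0.2 are assumptions; 10.0.0.1 is the hub router address seen in the CE printout. It matches the locally originated prefix by prefix-list and leaves it untouched, and sets the hub router as the next hop for everything else:

ip prefix-list VPN-CONCENTRATOR seq 5 permit 192.168.4.0/24
!
route-map IBGP-TO-CE permit 10
 ! locally originated prefix: no set clause, next hop stays unchanged
 match ip address prefix-list VPN-CONCENTRATOR
!
route-map IBGP-TO-CE permit 20
 ! all other routes: advertise the hub router as the next hop
 set ip next-hop 10.0.0.1
!
router bgp 65000
 no neighbor 10.0.0.2 next-hop-self
 neighbor 10.0.0.2 route-map IBGP-TO-CE out

Whether an outbound route-map may set the next hop on an IBGP session depends on the platform and software release, so the result would have to be verified with show ip bgp 192.168.4.0 on the CE router. Every new prefix behind the VPN concentrator would also need a prefix-list entry, which is exactly the kind of extra complexity we wanted to avoid.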

3 comments:

  1. Hello,

    The case is interesting, but I find the solution in the category "we do things, but we don't know why we do them". Let me explain:
    - Wasn't *i*BGP supposed to be used within an *autonomous* (routing) system ? Such as "same entity controls all equipment within the AS" ?
    - Another question is why the CE (*customer* edge) is managed by the provider and not by the customer (which happens to already be able to run/operate BGP on other equipment) ?

    Globally it is a workaround to a problem that was supposed to be a solution in the first place.
    Replies
    1. Good points, but sometimes reality intervenes ;) Provider-controlled CE-router (yeah, I know, an oxymoron) is a pretty common scenario, usually due to SLA issues, so I would say this is in the "we do things because we have to do them this way" category.
      .
      BTW, do you object to IBGP being run within the main site or not being run across DMVPN?
  2. As a general rule, neither of the two aspects can be objected to, but this assumes that "within the main site" means "within the same location of the autonomous system" and "one autonomous system, one AS number" (basic BGP-related assumption). This is obviously not the case. It's like saying "red is blue".

    Concerning the "provider-controlled CE" (a.k.a. "managed services"), this is something that can be negotiated; it's not easy, but it may be possible (evoking a "possible deal-breaker condition" helps). As for the SLAs, "waiting the next maintenance window" for static routes modification, this sounds like a serious issue to me. So yes, reality hits, and sometimes for the wrong reasons.

    I would also like to present some other possible scenarios (hopefully they do not apply in this case) :
    - The provider operating the CE decides to force (ex. via route-map) the next-hop for the incoming routes to a "known-good" gateway (BGP peer, previously-agreed "Default gateway, .....
    - The provider filters inbound routes. If you need to announce new routes you may need their prefix-list updated, with the same reactivity as adding static routes.

    Those being said, this may be an interesting school-case for learning how BGP signalling works, iBGP edition (for the eBGP edition I would definitely choose the route-servers from an IXP).