BGP next hop processing

Following my IBGP or EBGP in an enterprise network post, a few people have asked for a more graphical explanation of IBGP/EBGP differences. Apart from the obvious ones (AS path does not change inside an AS) and the more arcane ones (local preference is only propagated on IBGP sessions, MED received from an EBGP neighbor is not propagated to other EBGP neighbors), the most important difference between IBGP and EBGP is BGP next hop processing.

It’s best to explain BGP next hop processing through a set of examples; mine will be based on the following small network:

BGP next hop of locally originated routes

When a router originates a BGP route configured with a network router configuration command or through route redistribution (redistribute router configuration command), it sets the BGP next hop to the IGP next hop (the same value you’d find in the IP routing table). BGP next hop is set to 0.0.0.0 for routes with unknown next hops: connected interfaces, static routes to null 0, or summary routes configured with the aggregate-address router configuration command.

You can set the BGP next hop of a locally originated BGP route to any value you like with a route-map applied to network, redistribute or aggregate-address router configuration command. But remember: just because you could doesn’t mean that you should.

When a BGP route with a missing next hop (0.0.0.0 in the BGP table) is sent to BGP neighbors, the BGP next hop is set to the source IP address of the BGP session.

Example: PE-A originates BGP prefix 10.0.1.0/24 based on a static route to null 0. When it sends this BGP prefix to X1 and X2, BGP next hop is set to 192.168.0.3. BGP next hop in update sent to RR is 10.0.0.1.

If you use common BGP design recipes (IBGP sessions configured between loopback addresses and EBGP sessions configured across directly-connected subnets), and the BGP next hop is unknown, the BGP router advertises its loopback address as the BGP next hop on IBGP sessions, making the BGP table resilient to topology changes inside your network.

For routes with known next hops, the router applies standard IBGP/EBGP next hop processing rules (see below) when sending the BGP updates to its neighbors.
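
The rules for locally originated routes can be sketched in a few lines of Python. This is only an illustrative model, not router code: the function names and the use of None for "no IGP next hop" are made up for the sketch, and the addresses come from the PE-A example above.

```python
from typing import Optional

UNKNOWN_NEXT_HOP = "0.0.0.0"  # stored for connected, null 0 and aggregate routes


def originated_next_hop(igp_next_hop: Optional[str]) -> str:
    """Next hop placed in the BGP table for a locally originated route."""
    return igp_next_hop if igp_next_hop is not None else UNKNOWN_NEXT_HOP


def next_hop_in_update(table_next_hop: str, session_source: str) -> str:
    """Fill in a missing next hop when the update is sent to a neighbor."""
    if table_next_hop == UNKNOWN_NEXT_HOP:
        return session_source
    # Known next hops are subject to the IBGP/EBGP rules, which are not modeled here.
    return table_next_hop


# PE-A originates 10.0.1.0/24 from a static route to null 0.
nh = originated_next_hop(None)
print(next_hop_in_update(nh, "192.168.0.3"))  # update sent to X1 and X2
print(next_hop_in_update(nh, "10.0.0.1"))     # update sent to RR
```

The session source address is the only input that differs between the two updates, which is why the same prefix shows up with a LAN address on the EBGP side and a loopback address on the IBGP side.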

Route reflector cannot change BGP next hop of reflected routes

Large autonomous systems use BGP route reflectors. A route reflector should not change the attributes of the routes it reflects (it only adds the ORIGINATOR_ID and CLUSTER_LIST attributes used for loop prevention). The BGP next hop advertised by an edge router is thus propagated unchanged across the whole AS.

Exception: you can change BGP next hop on a route reflector with an inbound route-map. Don’t do this outside of a CCIE lab.

Example: Prefix 10.0.1.0/24 originated by PE-A is propagated by RR to PE-B. BGP next hop is still 10.0.0.1.

BGP next hop is set to router’s own address on EBGP sessions

The internal details of an AS should not influence packet forwarding between autonomous systems (and we cannot assume that a router external to our AS would know our internal details). The BGP next hop is thus changed to the router’s own IP address (the source address of the EBGP session) in outgoing EBGP updates.

Example: When PE-B sends BGP prefix 10.0.1.0/24 (with next-hop 10.0.0.1) to X3, it sets BGP next hop to 192.168.2.1.

You can always set the BGP next hop to any value you like with an outbound route-map. It’s risky (because it’s hard to check whether the next hop you advertise is actually reachable), but ensures pretty decent job security.

Next hop optimization on EBGP sessions

EBGP next hop is not changed if the BGP next hop in the BGP table belongs to the same IP subnet as the EBGP neighbor to which the update is sent. This rule ensures optimum packet forwarding in partially-meshed EBGP deployments (example: internet exchange points).

Example: X1 sends BGP prefix 172.16.0.0/16 to PE-A. Next hop is set to the source address of the EBGP session between X1 and PE-A (192.168.0.1). When PE-A propagates the BGP prefix to X2, it does not change the next hop (X1, PE-A and X2 are in the same subnet).

You can disable the EBGP next hop optimization with the neighbor next-hop-self router configuration command. This command is particularly useful in partially meshed multi-access networks (Frame Relay, ATM, Phase 1 DMVPN, private VLANs); see the Using BGP in Phase 1 DMVPN Networks post for more details.

Example: Assuming neighbor 192.168.0.2 next-hop-self is configured on PE-A, the BGP next hop of all BGP routes sent to X2 from PE-A will be 192.168.0.3 and the traffic between X1 and X2 will flow through PE-A.
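
Here is a rough Python model of the EBGP rules described so far (own address by default, unchanged next hop on a shared subnet, next-hop-self as the override). It is a sketch, not a description of any real implementation: the function and parameter names are invented, and the /24 mask on the PE-B to X3 subnet is an assumption, since only the 192.168.2.1 address is given above.

```python
import ipaddress


def ebgp_next_hop(table_next_hop: str, session_source: str,
                  session_subnet: str, next_hop_self: bool = False) -> str:
    """Next hop advertised to an EBGP neighbor reachable over session_subnet."""
    on_shared_subnet = (
        ipaddress.ip_address(table_next_hop) in ipaddress.ip_network(session_subnet)
    )
    if on_shared_subnet and not next_hop_self:
        return table_next_hop   # next hop is on the neighbor's subnet: leave it alone
    return session_source       # default EBGP behavior: router's own address


# PE-B sends 10.0.1.0/24 (next hop 10.0.0.1) to X3; the /24 mask is assumed.
print(ebgp_next_hop("10.0.0.1", "192.168.2.1", "192.168.2.0/24"))

# PE-A sends 172.16.0.0/16 (next hop X1) to X2 across the shared subnet.
print(ebgp_next_hop("192.168.0.1", "192.168.0.3", "192.168.0.0/24"))

# Same update with neighbor 192.168.0.2 next-hop-self configured on PE-A.
print(ebgp_next_hop("192.168.0.1", "192.168.0.3", "192.168.0.0/24",
                    next_hop_self=True))
```

A missing next hop (0.0.0.0) never falls inside the session subnet, so the same function also returns the session source address for locally originated routes.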

BGP next hop is not changed on IBGP sessions

All routers within an autonomous system are assumed to be able to reach the same set of subnets (advertised through IGP). Consequently, when an AS edge router propagates external BGP prefixes to internal BGP peers, it does not change the BGP next hop.

Example: X1 sends BGP prefix 172.16.0.0/16 with next hop 192.168.0.1 to PE-A. When PE-A propagates that prefix to RR, the BGP next hop is still 192.168.0.1. When the same prefix is reflected to PE-B, the next hop is still unchanged. PE-B therefore needs an IGP path toward 192.168.0.0/24 or it cannot forward the traffic toward 172.16.0.0/16.

You could make BGP next hops reachable via BGP paths. While it might work, don’t do this at home (or in your production network).

As with EBGP sessions, you can force the AS edge router to become BGP next hop by using neighbor next-hop-self router configuration command on all IBGP sessions (I would usually use an IBGP peer session and peer policy template to simplify my configuration).

Example: X1 sends BGP prefix 172.16.0.0/16 with next hop 192.168.0.1 to PE-A. Assuming neighbor 10.0.0.2 next-hop-self has been configured on PE-A, the BGP next hop of the BGP route sent to RR will be 10.0.0.1.
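
The IBGP side of the story is even simpler to model. The following Python sketch (hypothetical function names, addresses taken from the examples above) follows 172.16.0.0/16 from PE-A through RR to PE-B, with and without next-hop-self on PE-A. It deliberately ignores route-maps and other ways of overriding the default behavior.

```python
def ibgp_next_hop(table_next_hop: str, session_source: str,
                  next_hop_self: bool = False) -> str:
    """Next hop an AS edge router sends to an IBGP peer."""
    return session_source if next_hop_self else table_next_hop


def reflect(next_hop: str) -> str:
    """A route reflector passes the next hop on unchanged."""
    return next_hop


# Default behavior: PE-B still sees X1's address as the next hop.
default_at_rr = ibgp_next_hop("192.168.0.1", "10.0.0.1")
print(reflect(default_at_rr))

# With neighbor 10.0.0.2 next-hop-self on PE-A, PE-B sees PE-A's address.
nhs_at_rr = ibgp_next_hop("192.168.0.1", "10.0.0.1", next_hop_self=True)
print(reflect(nhs_at_rr))
```

Whatever next hop the edge router picks is the one every other router in the AS ends up with, which is why the decision has to be made on the AS edge.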

IBGP next hop design rules

You can design IBGP in your autonomous system in two fundamentally different ways:

  • IBGP routes point to external BGP next hops (default behavior)
  • IBGP routes point to loopback interfaces of AS edge routers (next-hop-self is configured on IBGP sessions on AS edge routers).

If you don’t change the BGP next hop on AS edge routers, you have to propagate external subnets with your IGP. You can either configure external subnets as passive interfaces or redistribute them into your IGP. The two methods are almost identical if you use IS-IS; OSPF is a slightly different story. A flap of a passive OSPF interface causes a full SPF run, whereas the addition or removal of an external route (type-5 or type-7 LSA) results in a partial SPF run. Redistribution of external subnets is thus preferred if you use OSPF.

However, it’s never a good idea to allow external events (like link flaps in your access network) to influence the stability of your core IGP. Using next-hop-self on AS edge routers (and changing the external next hops to the edge router’s loopback address) is thus almost always the preferred design.
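
To see what the two designs mean for the IGP, here is a small Python check of whether a BGP next hop can be resolved through IGP prefixes. It is a simplified illustration: the helper name is invented, the /32 loopback prefixes are assumptions, and a real router performs a full recursive longest-match lookup rather than a simple containment test.

```python
import ipaddress
from typing import Iterable


def resolves_via_igp(bgp_next_hop: str, igp_prefixes: Iterable[str]) -> bool:
    """True if at least one IGP prefix covers the BGP next hop."""
    address = ipaddress.ip_address(bgp_next_hop)
    return any(address in ipaddress.ip_network(prefix) for prefix in igp_prefixes)


# IGP carrying only core loopbacks (assumed to be /32 prefixes).
core_only_igp = ["10.0.0.1/32", "10.0.0.2/32"]
# IGP that also carries the external subnet between PE-A, X1 and X2.
igp_with_edge = core_only_igp + ["192.168.0.0/24"]

# Default design: external next hop, usable only if the edge subnet is in the IGP.
print(resolves_via_igp("192.168.0.1", core_only_igp))
print(resolves_via_igp("192.168.0.1", igp_with_edge))

# next-hop-self design: PE-A's loopback resolves with the core-only IGP.
print(resolves_via_igp("10.0.0.1", core_only_igp))
```

With next-hop-self, the IGP only has to carry the loopbacks of the AS edge routers, so a flapping external link never touches it.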

16 comments:

  1. Hi,

    Isn't MED actually propagated on EBGP routes?

  2. MED that is received on a prefix from a(n EBGP) neighbor is not propagated to an EBGP neighbor. I.e. if PE-A receives a prefix from X1 with a MED set, and then advertises that same prefix to X2, the MED attribute will not be set (by default).

  3. Hi Ivan, excellent summary, but there are few statements that may require a little bit of clarification.

    Firstly, for leaking the edge link IP prefixes into IGP. A while ago, it made total sense to either change the next-hop to self or leak edge prefixes into BGP, to maintain reachability to provider managed devices at customer premises. This ensured network stability, to some extent. Nowadays, the requirements for fast convergence based on BGP NHT/PIC may dictate that edge link prefixes are leaked into IGP, for the purpose of fast event propagation. Furthermore, preserving the eBGP next-hop has some useful accounting implications, e.g. when exporting BGP next-hop in Netflow and looking to construct "external" traffic matrix. And network stability could be still controlled by using exponential event dampening (low-pass filtering).

    Secondly, using redistribution no longer has advantage of "faster SPF" over type-1 LSA injection with the introduction of iSPF (invented in ARPANet!) to both OSPF/ISIS. Even without iSPF, SPF takes insignificant time of overall convergence process on modern CPU's - the majority of time is spent updating distributed forwarding tables after a change. Furthermore, redistribution might be even considered dangerous due to type-5 LSA's having larger flooding scope (there have been well-known precedents with that), not to mention that type-5 LSAs consume more memory and create more flooding overhead (less of concern, though).

    Thirdly, changing next-hop to self on a route-reflector *may* be required even in production network if you need to ensure that RR is in the forwarding path to avoid route deflection (not the best design, though). This operation is also a key component for building hierarchical LSPs using BGP-based label propagation for overlay LSPs.

    Regards,

    Petr

  4. Ivan,

    Excellent post. Loved it! Really brought closure to the previous post. Thanks so much.

    Will

  5. Thanks for this post, I eventually sat down and read through it.
    Just to add a minor point for other readers, the "Next hop optimization on EBGP sessions" is also known as the "third-party next-hop" feature. :)

  6. The observation of SPF run for advertising the DMZ link into IGPs (IS-IS and OSPF) is really SUPERB! :-)

    also found 2 nice linkz about SPF...
    http://routingfreak.wordpress.com/2008/03/04/shortest-path-first-calculation-in-ospf-and-is-is/
    http://routingfreak.wordpress.com/2008/03/06/the-complete-and-partial-spf-in-is-is/

  7. All of your posts are top notch - thank you and keep up the great blog! 8-)

  8. Hi Ivan,

    I just ran into an issue I was not expecting that made me search on google about BGP routes with next-hop resolved through another BGP route.
    In my case an iBGP route's next-hop was resolved through another iBGP route. Both routes were installed in the routing table. The next-hop itself was reachable (ping/traceroute). However the destination was not: when debugging with 'debug ip packet', the router picked a loopback as the source (instead of the outgoing interface) and then declared the packet unroutable. Making the next-hop known through IGP fixed the problem.
    Did you ever run into such an issue ? I could not find any explanation for it. My release is 12.4(15)T10.


    Thanks for all the great information you share on your site.
    Cheers,
    Mat

    Replies
    1. Effectively what you've discovered is that an IBGP next hop must be an IGP route, which makes perfect sense, otherwise you could get into all sorts of recursive routing.

      No, I've never run into such an issue, I never tried to do something like that.

  9. Yes, nobody ever wants to do that, it just happens that in some infamous lab environment they make you run into that kind of issue ! :)
    I am still surprised that the routes were installed in the routing table though ! It's a strange logic to install a route in the RIB and then decide that it is not valid.

    Thanks,
    Mat

  10. Confusing... I am pulling my hair...

    Can we say it like this?

    In EBGP, the next hop address is changed as the routes are passed to the neighbor routers. At last when the routes reach an IBGP, the next hop address is kept as the last EBGP routers address?

    Simply, EBGP will update the next-hop address as the routes are passed to the neighbor routers but IBGP will NOT update the next-hop address as the routes are passed to the neighbor routers?

    We have to go to the each router inside IBGP network and run the command "next-hop-self" to inform the next downstream router that "handover the packet to me" to reach the destination network ???

    Replies
    1. You're almost correct.

      Next hop will be changed on EBGP session __unless__ the neighbor's IP address is in the same subnet as the current next hop.

      Next hop will NOT be changed on IBGP session __unless__ you specify next-hop-self, which you'd do only on the AS boundary (usually you want to use IGP to control intra-AS routing toward BGP next hops).

      Then there are the weird cases including 'next-hop-unchanged' for inter-AS MPLS/VPN or setting the next hop to a bogus value for remote-triggered black holes ;)

      As for pulling your hair - we all went through that phase when faced with BGP.

    2. Thanks a lot Sir...

      The BGP feature "EBGP session __unless_ the neighbor's IP address is in the same subnet as the current next hop" may be enabled because in Multi-access environments the BGP routers may need to pass the packets one or two extra hops.

      What I meant was in EBGP Multiaccess environments all the other routers in a subnet are reachable from any router, but may not be BGP neighbors. This may cause additional hops?

      Am I right Sir?

    3. EBGP next hop processing tries to avoid unnecessary packet forwarding across a single subnet. It works (recursively) even when there's only a partial mesh of EBGP sessions across the subnet _as long as_ nobody changes the next-hop manually.

    4. Got it Sir..

      Thank you very much...


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.