Disable flapping BGP neighbors

It looks like we’re bound to experience a widespread BGP failure once every few months. These failures all follow the same pattern:

  • A “somewhat” undertested BGP implementation starts advertising paths with an “unexpected” set of attributes.
  • A specific downstream BGP implementation (and it could be a different implementation every time) a few hops down the road hiccups and sends a BGP notification message to its upstream neighbor.
  • The BGP session must be reset following a notification message; the routes advertised over it are lost and withdrawn, causing widespread ripples across the Internet.
  • The offending session is reestablished seconds later and the same set of routes is sent again, causing the same failure and a session reset. If the session stays up long enough (because the routers have to exchange approximately 300,000 routes), some of the newly received routes might get propagated and will flap again when the session is reset.
  • The cyclical behavior continues until someone intervenes manually.

I don’t understand why Cisco IOS allows the cyclical BGP session behavior (as it’s pretty obvious things will not get better by themselves once the same session is reset a few times), but there’s at least something we can do with Embedded Event Manager: shut down the offending neighbor after seeing the BGP-3-NOTIFICATION syslog message often enough in a short period of time.
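To make the "often enough in a short period of time" logic concrete, here is a minimal Python sketch of a sliding-window counter over syslog lines. It is not the EEM applet itself. The message layout matched by the regular expression, the threshold of 3 and the 300-second window are all assumptions chosen for illustration, and `FlapDetector` is a hypothetical name.

```python
import re
from collections import defaultdict, deque

# Assumed syslog shape (not taken from the article), for example:
# "%BGP-3-NOTIFICATION: received from neighbor 192.0.2.1 3/1 (update malformed) 0 bytes"
NOTIFICATION_RE = re.compile(
    r"%BGP-3-NOTIFICATION: (?:sent to|received from) neighbor (\S+)"
)


class FlapDetector:
    """Flag a neighbor once it logs `threshold` notifications within `window` seconds."""

    def __init__(self, threshold=3, window=300.0):
        self.threshold = threshold  # assumed value, tune to taste
        self.window = window        # assumed value, in seconds
        self._seen = defaultdict(deque)

    def observe(self, timestamp, line):
        """Return the neighbor address that should be shut down, or None."""
        match = NOTIFICATION_RE.search(line)
        if match is None:
            return None
        neighbor = match.group(1)
        times = self._seen[neighbor]
        times.append(timestamp)
        # Drop notifications that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # start counting afresh after triggering
            return neighbor
        return None
```

Whatever `observe` returns would then be handed to the mechanism that actually disables the session, for example something that issues `neighbor <address> shutdown` under the BGP routing process.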

You’ll find the EEM applet that shuts down a flapping BGP neighbor in the CT3 wiki.

8 comments:

  1. Maybe BGP dampening can help?

  2. When the neighbor is shut down after matching the criteria, is there any way to unshut it programmatically after some time?

    regards
    shivlu jain

  3. Ivan Pepelnjak, 29 August, 2009 17:24

    Not directly. EBGP session loss does not count as a BGP route flap. BGP flap dampening would only kick in on the downstream routers.

  4. Ivan Pepelnjak, 29 August, 2009 17:25

    Sure. Just some more programming ;)

  5. The cyclical BGP session behavior exists for one simple reason: if you are working after hours (usually at night) and you break something, you don't need to call the upstream technician every time you change something.

  6. Sounds reasonable, but why does it have to flap every 10 seconds for eternity? Built-in exponential backoff with a cap would make a lot of sense.

  7. I would like to thank everyone who is active on this website.

    thanks

  8. You're welcome ;) It's nice to receive feedback like this 8-)
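
The capped exponential backoff suggested in comment 6 can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how any BGP implementation behaves. The 10-second starting delay comes from that comment, while the doubling factor, the one-hour default cap and the name `reconnect_delays` are assumptions.

```python
from itertools import islice


def reconnect_delays(base=10.0, cap=3600.0, factor=2.0):
    """Yield the wait (in seconds) before each successive session re-establishment.

    base comes from the "every 10 seconds" remark; cap and factor are assumed values.
    """
    delay = base
    while True:
        yield min(delay, cap)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)


# First six delays with a deliberately small cap of 100 seconds.
print(list(islice(reconnect_delays(cap=100.0), 6)))
# prints [10.0, 20.0, 40.0, 80.0, 100.0, 100.0]
```

A real implementation would also need a rule for resetting the delay after the session has stayed up for a while, which this sketch leaves out.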


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.