It looks like we’re bound to experience a widespread BGP failure once every few months. These failures all follow the same pattern:
- A “somewhat” undertested BGP implementation starts advertising paths with an “unexpected” set of attributes.
- A specific downstream BGP implementation (and it could be a different implementation every time) a few hops down the road hiccups and sends a BGP notification message to its upstream neighbor.
- The BGP session must be reset following a notification message; the routes advertised over it are lost and withdrawn, causing widespread ripples across the Internet.
- The offending session is reestablished seconds later and the same set of routes is sent again, causing the same failure and a session reset. If the session stays up long enough (because the routers have to exchange approximately 300,000 routes), some of the newly received routes might get propagated and will flap again when the session is reset.
- The cyclical behavior continues until someone intervenes manually.
I don’t understand why Cisco IOS allows the cyclical BGP session behavior (as it’s pretty obvious things will not get better by themselves once the same session is reset a few times), but there’s at least something we can do with Embedded Event Manager: shut down the offending neighbor after seeing the BGP-3-NOTIFICATION syslog message often enough in a short period of time.
You’ll find the EEM applet that shuts down a flapping BGP neighbor in the CT3 wiki.
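To make the "often enough in a short period of time" idea concrete, here is a minimal Python sketch of the counting logic such an applet would need. It is not the EEM applet from the wiki, just an illustration: the `FlapDetector` class, the `shutdown_commands` helper, the threshold and window values, and the exact shape of the syslog line matched by the regular expression are all assumptions made for this example.

```python
import re
from collections import defaultdict, deque

# Assumed message shape (not taken from the article), for example:
#   %BGP-3-NOTIFICATION: sent to neighbor 192.0.2.1 3/1 (update malformed) 21 bytes
NOTIFICATION = re.compile(
    r"%BGP-3-NOTIFICATION: (?:sent to|received from) neighbor (\S+)"
)


class FlapDetector:
    """Hypothetical helper: counts notifications per neighbor in a sliding window."""

    def __init__(self, threshold=3, window=300.0):
        self.threshold = threshold        # notifications needed to trigger
        self.window = window              # window length in seconds
        self.seen = defaultdict(deque)    # neighbor -> recent timestamps
        self.shut = set()                 # neighbors already flagged

    def feed(self, timestamp, line):
        """Return the neighbor address if it should be shut down now, else None."""
        match = NOTIFICATION.search(line)
        if not match:
            return None
        neighbor = match.group(1)
        if neighbor in self.shut:
            return None                   # already flagged, do not trigger twice
        times = self.seen[neighbor]
        times.append(timestamp)
        # Forget notifications that fell out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            self.shut.add(neighbor)
            return neighbor
        return None


def shutdown_commands(asn, neighbor):
    """Build the IOS commands that administratively shut down one BGP neighbor."""
    return [
        "configure terminal",
        f"router bgp {asn}",
        f"neighbor {neighbor} shutdown",
        "end",
    ]
```

Feeding each syslog line and its arrival time into `feed()` returns a neighbor address only once, at the moment the threshold is crossed; the caller would then push the output of `shutdown_commands()` to the router. Tracking the count per neighbor keeps one misbehaving session from shutting down healthy ones.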