Root cause analysis: oversized AS paths

Tuesday's BGP experiment has made quite a splash. Cisco has discovered a new BGP bug that can be triggered only if you have a long enough AS-path and do outbound AS-path prepending (and a few of us learned more BGP intricacies than we ever wanted to know). Lots of people have (hopefully) discovered the importance of the bgp maxas-limit configuration command, and at least some ISPs have implemented the inbound prepending filters that I wrote about almost a year ago. However, most of us thought that the original problem arose due to inexperienced operators of a leaf AS.

Mikael Abrahamsson was the first to notice that the number of prepends matches the low-order 8 bits of the offending AS number. Other contributors to the NANOG mailing list confirmed that two autonomous systems with very long prepends are using BGP routers from Mikrotik. You configure those boxes with commands whose syntax is deceptively close to Cisco's, but they expect the number of times the AS number should be prepended, not the AS-path to prepend. Obviously no range checking is done on the configuration parameter, so if you enter the AS number, its high-order 8 bits are ignored and the low-order 8 bits become the prepend count.
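The truncation described above is easy to reproduce. The following is a minimal Python sketch, not the router's actual code: the function name is made up for illustration, and the AS number 64764 is a hypothetical value from the private range, chosen only because its low-order 8 bits are large.

```python
def truncated_prepend_count(configured_value: int) -> int:
    """Model a prepend-count parameter stored in 8 bits with no range check.

    Assumption for illustration: the value is simply masked to its
    low-order 8 bits, as the observed prepend counts suggest.
    """
    return configured_value & 0xFF


# Hypothetical private-range AS number (not taken from the incident).
hypothetical_asn = 64764  # 0xFCFC

# The operator enters the AS number where a repeat count is expected.
count = truncated_prepend_count(hypothetical_asn)
prepends = [hypothetical_asn] * count

print(f"configured value: {hypothetical_asn}")
print(f"effective prepend count: {count}")

# A small, sane value passes through unchanged.
print(f"configured value 3 -> {truncated_prepend_count(3)} prepends")
```

An AS number whose low-order 8 bits are small would produce only a handful of prepends and go unnoticed, which is why the problem needs an unlucky AS number to become visible.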

So it looks like the incident started with a box that accepts an invalid configuration parameter, used in an AS whose number has a very high value in its low-order 8 bits (very improbable, but obviously not impossible). Numerous ISPs that did not limit the AS-path length of the BGP updates they were propagating and an IOS bug did the rest.
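The chain of events can be illustrated with a toy model in Python. It is only a sketch under stated assumptions: the function names, the limit of 75 and all AS numbers are hypothetical, and the code does not reproduce the IOS behavior. It only shows how an AS-path length limit stops an oversized path, and how outbound prepending can push a path that is just under 255 entries over that boundary.

```python
AS_PATH_BOUNDARY = 255  # length near which the bug was reportedly triggered


def accept_update(as_path: list[int], max_len: int) -> bool:
    """Toy equivalent of an AS-path length limit: drop overlong paths."""
    return len(as_path) <= max_len


def length_after_prepend(as_path: list[int], prepend_count: int) -> int:
    """AS-path length once the local AS is prepended prepend_count times."""
    return len(as_path) + prepend_count


def crosses_boundary(as_path: list[int], prepend_count: int) -> bool:
    """True if the received path fits, but the prepended path does not."""
    return (len(as_path) <= AS_PATH_BOUNDARY
            < length_after_prepend(as_path, prepend_count))


# Hypothetical received path: 253 copies of a private-range AS number.
long_path = [64764] * 253
short_path = [64764] * 10

# A router with a length limit (75 is an arbitrary example) drops the update.
print(accept_update(long_path, max_len=75))   # False
print(accept_update(short_path, max_len=75))  # True

# A router without a limit that prepends 3 times crosses the boundary.
print(crosses_boundary(long_path, prepend_count=3))   # True
print(crosses_boundary(short_path, prepend_count=3))  # False
```

In this model a single router applying the length limit anywhere between the origin and the prepending router would have stopped the update before it reached the critical length.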

2 comments:

  1. Hi Ivan,

    Very interesting observation, looking at the modulo to determine if these share the same root cause.
    I created a list of longest ASpaths per day, starting with February 1st.
    I added a feature that checks whether the number of prepends matches the low-order 8 bits of the offending AS number, which indicates that it's likely caused by the same bug.
    The list can be found here:

    http://bgpmon.net/maxASpath.php

    Interesting is that the first time this was observed was actually on February 9th. I didn't hear anyone talk about that. However I do know that one of my upstreams experienced BGP problems (flaps) exactly at the time of the update.

    Keep up the good work,
    Andree

  2. Fantastic tool. Will you post it to the NANOG mailing list?

    The huge impact on the Internet was caused by a Cisco bug, which was only triggered if the AS-path length was close enough to 255 and the AS-path prepending was used.

    Obviously someone in the core was doing just the right amount of prepending to reset the BGP sessions and trigger the flood.

    The first time, the offending AS was far enough from the core that the eventual flaps were not noticed (although there had to be some non-IOS boxes in the path for you to get the AS-path length of 257).



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.