Root Cause Analysis: Oversized AS Paths
The “BGP experiment” a small European ISP performed in February 2009 made quite a splash. Cisco discovered a new BGP bug that is triggered only if you have a long enough AS-path and do outbound AS-path prepending (and a few of us learned more BGP intricacies than we ever wanted to know). Lots of people have (hopefully) discovered the importance of the bgp maxas-limit configuration command, and at least some ISPs have implemented the inbound prepending filters that I wrote about almost a year ago. However, most of us thought that the original problem was caused by inexperienced operators of a leaf AS.
Mikael Abrahamsson was the first to notice that the number of prepends matches the low-order 8 bits of the offending AS number. Other contributors to the NANOG mailing list confirmed that two autonomous systems with very long prepends were using BGP routers from Mikrotik. You configure those boxes with commands whose syntax is deceptively close to Cisco's, but they expect the number of times the AS number should be prepended, not the AS-path to prepend, so an operator who enters the AS number Cisco-style ends up configuring a prepend count. Obviously no range checking is done on the configuration parameter and the high-order 8 bits are ignored.
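To make the arithmetic concrete, here is a minimal sketch of the truncation, assuming the box simply keeps the low-order 8 bits of whatever number is entered. The function name and the private AS number 65020 are hypothetical, chosen only for illustration; they do not describe the actual router code or the AS involved in the incident.

```python
def effective_prepend_count(configured_value: int) -> int:
    """Return the prepend count a box would use if it kept only the
    low-order 8 bits of the configured value (assumed behavior)."""
    return configured_value & 0xFF  # same as configured_value % 256


# Hypothetical example: the operator types the AS number (65020, from the
# private range) where the box expects a prepend count.
print(effective_prepend_count(65020))  # prints 252
print(effective_prepend_count(3))      # prints 3 (small values are unaffected)
```

An AS number with a high value in the low-order 8 bits thus turns into a very long prepend, while most AS numbers would produce something far less dramatic.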
So it looks like the incident started with a box that accepts an invalid configuration parameter, used in an AS whose number has a very high value in the low-order 8 bits (quite improbable, but obviously not impossible). Numerous ISPs that did not limit the BGP updates they were propagating and an IOS bug did the rest.
Very interesting observation: looking at the modulo to determine whether these incidents share the same root cause.
I created a list of the longest AS paths per day, starting with February 1st.
I added a feature that checks whether the number of prepends matches the low-order 8 bits of the offending AS number, which would indicate that it's likely caused by the same bug.
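A check like the one described above could be sketched as follows. This is only an illustration under stated assumptions: the AS path is a list of integers with the origin AS last, the "number of prepends" is taken to be the number of times the origin AS appears at the end of the path (the real tool may count it differently, for example one less), and the function names are hypothetical.

```python
def trailing_run_length(as_path: list[int]) -> int:
    """Count how many times the origin AS repeats at the end of the path."""
    if not as_path:
        return 0
    origin = as_path[-1]
    count = 0
    for asn in reversed(as_path):
        if asn != origin:
            break
        count += 1
    return count


def matches_low_order_bits(as_path: list[int]) -> bool:
    """True if the origin's repeat count equals the low-order 8 bits of its AS number."""
    if not as_path:
        return False
    origin = as_path[-1]
    return trailing_run_length(as_path) == (origin & 0xFF)


# Hypothetical paths using documentation and private AS numbers.
suspicious = [64500, 64501] + [65020] * 252   # 65020 & 0xFF == 252
ordinary = [64500, 65020, 65020]              # two copies of the origin AS

print(matches_low_order_bits(suspicious))  # prints True
print(matches_low_order_bits(ordinary))    # prints False
```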
The list can be found here:
Interestingly, the first time this was observed was actually February 9th. I didn't hear anyone talk about that. However, I do know that one of my upstreams experienced BGP problems (flaps) exactly at the time of the update.
Keep up the good work,
The huge impact on the Internet was caused by a Cisco bug, which was triggered only if the AS-path length was close enough to 255 and AS-path prepending was used.
Obviously someone in the core was doing just the right amount of prepending to reset the BGP sessions and trigger the flood.
The first time, the offending AS was far enough from the core that any resulting flaps were not noticed (although there had to be some non-IOS boxes in the path for you to get the AS-path length of 257).