On November 22nd, 2023, AMS-IX, one of the largest Internet exchanges in Europe, experienced a significant drop in traffic lasting more than four hours. Its peak traffic is usually around 10 Tbps; during the outage, it dropped to about 2.1 Tbps.
AMS-IX published a very sanitized and diplomatic post-mortem incident summary in which they explained the outage was caused by LACP leakage. That phrase should be a red flag, but let’s dig deeper into the details.
Reading the incident report, it seems to me (and I would love to be corrected) that Juniper switches used by AMS-IX forward LACP packets received on a non-LAG port to other bridged¹ ports on the same switch. Try as I might, I can’t figure out in which universe that would be anywhere close to a sane choice.
LACP (Link Aggregation Control Protocol) was designed to be used between adjacent devices. While I know we have to live with abominations like two devices pretending they’re a single system, I lack polite words to describe the idea of forwarding layer-2 control packets that are supposed to be used between adjacent devices onto other links. Unfortunately, I’m also aware of potential MacGyver-type use cases for that monstrosity: let’s buy two Carrier Ethernet links and pretend we can bundle them into an end-to-end Link Aggregation Group.
However, even if vendor account teams dazzled by a humongous purchase order can be persuaded that a bridge needs such a dangerous nerd knob, one would hope that configuring it would be hard and would generate all sorts of “if you do this, the universe might collapse into a black hole” type of warnings². One can only hope that flooding packets sent to well-known IEEE-defined MAC addresses is not the default behavior, but then boxes from the same company happily talk BGP with total strangers. Feedback from anyone familiar with the Junos layer-2 implementation would be most welcome.
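To make the “well-known IEEE-defined MAC addresses” point concrete, here is a minimal Python sketch of the filtering decision a standards-following customer bridge is expected to make. It relies on two facts from the IEEE standards: LACP uses the Slow Protocols destination address 01-80-C2-00-00-02 with EtherType 0x8809 and subtype 1, and the block 01-80-C2-00-00-00 through 01-80-C2-00-00-0F is reserved for link-local protocols that a bridge should not relay. The function names are made up for illustration, and this is emphatically not a description of what Junos does.

```python
# Illustrative sketch only: function names are hypothetical, and this models
# generic IEEE 802.1Q customer-bridge behavior, not any vendor's implementation.

RESERVED_PREFIX = bytes.fromhex("0180c20000")   # 01-80-C2-00-00-xx
SLOW_PROTOCOLS_ETHERTYPE = 0x8809               # IEEE 802.3 Slow Protocols
LACP_SUBTYPE = 0x01                             # Slow Protocols subtype for LACP


def parse_mac(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' into six raw bytes."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def is_link_local_reserved(dst_mac: str) -> bool:
    """True for destinations in 01-80-C2-00-00-00 .. 01-80-C2-00-00-0F."""
    raw = parse_mac(dst_mac)
    return raw[:5] == RESERVED_PREFIX and raw[5] <= 0x0F


def is_lacp_frame(dst_mac: str, ethertype: int, subtype: int) -> bool:
    """True if the frame looks like an LACPDU sent to the Slow Protocols address."""
    return (
        parse_mac(dst_mac) == RESERVED_PREFIX + b"\x02"
        and ethertype == SLOW_PROTOCOLS_ETHERTYPE
        and subtype == LACP_SUBTYPE
    )


def may_forward(dst_mac: str) -> bool:
    """A bridge may relay the frame only if the destination is not link-local."""
    return not is_link_local_reserved(dst_mac)


if __name__ == "__main__":
    for mac in ("01:80:c2:00:00:02", "01:80:c2:00:00:10", "ff:ff:ff:ff:ff:ff"):
        print(mac, "forward" if may_forward(mac) else "consume or drop")
```

In this model, an LACPDU arriving on a non-LAG port is either handed to the local control plane or dropped; it never reaches another bridged port. Any knob that changes that outcome is an explicit departure from the default.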