Here’s how the AMS-IX failure impacted ATLAS probes (a world-wide monitoring system run by RIPE) – no wonder, as RIPE uses AMS-IX for their connectivity.
My friend Jeremy Stretch saved the daily traffic graph for posterity in one of his tweets:
As you can see from the graph, the Internet lost 2 Tbps of transit capacity, and many networks using AMS-IX (including some cloud service providers) were severely impacted.
You might wonder what the root cause of the outage was. Here’s the relevant tweet:
Also, keep in mind that there are a few significant differences between AMS-IX and the clueless geniuses who tell you to build a large layer-2 fabric (hopefully stretched across two data centers):
- AMS-IX is one of the large Internet exchanges and they usually know what they’re doing… and still bad things happen;
- AMS-IX has been in business for almost 20 years and thus has significant operational experience. They’ve learned loads of lessons during past outages and have built their own tools (like ARP sponge) to make their infrastructure more reliable;
- Internet exchanges that don’t want to dictate routing policies of their members have to be layer-2 fabrics (the proof is left as an exercise for the reader), while your data center doesn’t have to be.
Want Even More Horror Stories?
Jay Swan pointed me to a recent Cisco Live presentation (BRKDCT-3102), which documented several interesting layer-2 failures, including a split-brain cluster – I’ve been telling people about these scenarios for years, and it’s so nice to have corroboration from a major vendor (not sure what the evangelists of layer-2 fabrics and DCI solutions working for that same vendor think about that presentation ;).