Another Spectacular Layer-2 Failure

Matjaž Straus started the SINOG 2 meeting I attended last week with a great story: during the RIPE70 meeting (just as I was flying home), Amsterdam Internet Exchange (AMS-IX) crashed.

Here’s how the AMS-IX failure impacted RIPE Atlas probes (the worldwide measurement network run by RIPE NCC) – no wonder, as RIPE NCC uses AMS-IX for its connectivity.

My friend Jeremy Stretch saved the daily traffic graph for posterity in one of his tweets:

As you can see from the graph, the Internet lost roughly 2 Tbps of capacity, and many networks using AMS-IX (including some cloud service providers) were severely impacted.

You might wonder what the root cause for the outage was. Here’s the relevant tweet:

As I’ve said many times before, it’s not a question of whether a large layer-2 fabric will crash, but only a question of when and how badly.

Also, keep in mind that there are a few significant differences between AMS-IX and the clueless geniuses who tell you to build a large layer-2 fabric (preferably stretched across two data centers):

  • AMS-IX is one of the largest Internet exchanges, and they usually know what they’re doing… and still bad things happen;
  • AMS-IX has been in business for almost 20 years and thus has significant operational experience. They’ve learned loads of lessons during past outages and have built their own tools (like ARP sponge) to make their infrastructure more reliable;
  • Internet exchanges that don’t want to dictate routing policies of their members have to be layer-2 fabrics (the proof is left as an exercise for the reader), while your data center doesn’t have to be.

Want Even More Horror Stories?

Jay Swan pointed me to a recent Cisco Live presentation (BRKDCT-3102) that documents several interesting layer-2 failures, including a split-brain cluster – I’ve been telling people about these scenarios for years, and it’s so nice to have corroboration from a major vendor (not sure what the evangelists of layer-2 fabrics and DCI solutions working for that same vendor think about that presentation ;).


10 comments:

  1. It seems that BRKDCT-3102 does not have the mentioned case studies...
    -bgolab
    Replies
    1. They're at the very end of the slide deck.
  2. I once read that they don’t even rely on L2 protocols like STP for redundancy (which may have hit them in this case); instead they handle it at L1 with photonic switches (tiny mirrors redirecting the fiber links).
    Replies
    1. Yes, they don’t use STP or any other L2 control-plane protocol. Instead, they use VPLS to create the L2 fabric (a rough sketch of that part follows below). And yes, they use photonic switches to provide redundancy on the customer-facing ports of the VPLS PE routers, but that has nothing to do with L2 loops.
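      For illustration only, here’s roughly what a VPLS-based peering LAN could look like on a PE router. This is a minimal sketch in legacy Cisco IOS-style syntax with made-up names and addresses; it is not meant to reflect AMS-IX’s actual platform or configuration.

        ! Hypothetical VPLS instance for the peering LAN (all values are examples)
        l2 vfi PEERING-LAN manual
         vpn id 100
         neighbor 192.0.2.2 encapsulation mpls
         neighbor 192.0.2.3 encapsulation mpls
        !
        interface Vlan100
         description Peering LAN bridge domain
         xconnect vfi PEERING-LAN
        !
        interface GigabitEthernet1/1
         description Customer-facing port (behind the photonic switch)
         switchport
         switchport mode access
         switchport access vlan 100

      The pseudowires form a full mesh between the PE routers, and split horizon on that mesh prevents forwarding loops without running STP across the fabric.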
  3. Anonymous since I don’t want to be linked to my customer, but the AMS-IX guys seem to have been a little clumsy lately:
    multiple mistakes during upgrades, missing communication, loss of monitoring, lack of responsiveness to customer requests/complaints.

    We have been reassured they are addressing these issues, but we still see human-related faults too often, the RIPE70 case being just the tip of the iceberg.

    Other IXes seem to handle things a little better, even if I have to admit their networks are a lot simpler. Nevertheless, these faults are not related to network complexity, but rather to user actions.
  4. And this is the reason why I always tell people that when a port is unused you should configure it as an L3 port ("no switchport" in Cisco/Arista terms) rather than just shutting it down (see the sketch after this thread)...
    Replies
    1. That's a nice best practice ;)
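      A minimal sketch of what such an unused-port template might look like (Cisco/Arista-style syntax, hypothetical interface name):

        ! Unused-port template (example only)
        interface Ethernet10
         description UNUSED
         no switchport
         shutdown

      With no switchport the interface can't accidentally end up bridging a VLAN even if someone brings it up; keeping it shut down as well is a common belt-and-braces addition.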
  5. Well, operator error could cause collateral damage at any layer, not just L2.
    See the recent BGP route leak from June 12th, which caused serious problems for two hours:

    www.bgpmon.net/massive-route-leak-cause-internet-slowdown/
  6. I have to address the assertion regarding IXPs being stuck with L2 ("Internet exchanges that don’t want to dictate routing policies of their members have to be layer-2 fabrics").

    On an L3 IXP, you’d presumably peer with the IXP router’s BGP instance and add your network’s routes to the IXP’s routing tables.

    Now, you'd want to establish a direct session to another member to offer different routes and discard the common routing information for your prefixes.

    Why couldn't you encapsulate that with any protocol supported in hardware by your router (GRE, IPIP, L2TPv3…) ? Of course it looks like a waste of ressources, but it would remove complexity from the IXP itself, making it even more robust…
    Replies
    1. In theory, you could. In practice, high-speed reasonable-cost linecards don't always support tunneling in hardware... and you'd end up with a total mess of tunnels (not to mention the MTU issues) – see the sketch below.

      There are probably other considerations that I'm not aware of as well – in any case, all big IXPs use the L2 approach (and some new ones use customer-L2-over-transport-IP for internal stability) – either they're all stupid, or we're missing something ;)
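      To illustrate the tunneling idea (and the MTU tax that comes with it), here's a hypothetical GRE tunnel between two members across a routed IXP fabric – Cisco-style syntax, with all addresses and AS numbers made up:

        interface Tunnel0
         description Direct peering with member B across the L3 IXP fabric
         ip address 198.51.100.1 255.255.255.252
         ip mtu 1476
         ip tcp adjust-mss 1436
         tunnel source 203.0.113.10
         tunnel destination 203.0.113.20
        !
        router bgp 64500
         neighbor 198.51.100.2 remote-as 64501
         neighbor 198.51.100.2 description eBGP session inside the GRE tunnel

      Every member pair needs a tunnel like this, the GRE header takes 24 bytes out of every packet, and the encapsulation isn't always done in hardware – which is pretty much the "mess of tunnels plus MTU issues" argument above.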