Large Layer-2 Domains Strike Again…

I started January 2018 blogging with a major service provider failure. Why should 2019 be any different? Here’s what CenturyLink claimed was causing a two-day outage (more comments here).

Supposedly it was a problem with the management network used by their optical gear, but it looks a lot like a layer-2 network spanning 15 data centers and no control-plane policing on the managed devices… proving yet again that large-scale layer-2 networks are a really bad idea.

Please note that it doesn’t matter whether they had problems with a stretched Ethernet segment or something else. According to their explanation, a single device broadcasting packets was able to affect devices across multiple locations. As I’ve been trying to explain for years (not that many people would listen and/or care), a single broadcast domain is a single failure domain no matter what $vendor PowerPoints or whitepapers claim, and it’s not a question of whether the concoction will fail but when. Keep that in mind the next time your $vendor rep brings dancing unicorns into the room.
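To make the "single broadcast domain is a single failure domain" argument concrete, here is a toy sketch that computes which devices receive a broadcast frame sent by one misbehaving device. Everything in it is an assumption made for illustration: the `broadcast_domains` and `blast_radius` helpers, the `"bridged"`/`"routed"` link labels, and the `dc1`..`dc3` device names are hypothetical and do not describe the actual network involved in the outage.

```python
from collections import defaultdict


def broadcast_domains(links):
    """Group devices into broadcast domains.

    `links` is a list of (device_a, device_b, kind) tuples. A "bridged"
    link is a layer-2 adjacency (broadcasts are flooded across it); any
    other kind, e.g. "routed", is a layer-3 boundary where broadcasts stop.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, kind in links:
        root_a, root_b = find(a), find(b)
        if kind == "bridged" and root_a != root_b:
            parent[root_a] = root_b  # merge the two layer-2 segments

    domains = defaultdict(set)
    for device in list(parent):
        domains[find(device)].add(device)
    return list(domains.values())


def blast_radius(links, source):
    """Return every device that sees a broadcast sent by `source`."""
    for domain in broadcast_domains(links):
        if source in domain:
            return domain
    return {source}


# Hypothetical topology: three sites joined by stretched layer-2 links...
stretched = [("dc1", "dc2", "bridged"), ("dc2", "dc3", "bridged")]
# ...versus the same sites joined by routed (layer-3) links.
routed = [("dc1", "dc2", "routed"), ("dc2", "dc3", "routed")]

print(sorted(blast_radius(stretched, "dc1")))
print(sorted(blast_radius(routed, "dc1")))
```

In the stretched case the broadcast from `dc1` reaches all three sites; in the routed case it stays local. The model ignores everything that makes real networks interesting (loops, storm control, control-plane policing), so treat it as an illustration of the failure-domain boundary and nothing more.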

On a tangential note, cloud providers that know what they’re doing don’t support anything but unicast routing for a really good reason – check out the details in the AWS Networking webinar.

Finally, just in case you think failures like this one are a black swan event, check the list of post-mortems and associated lessons learned collected by Dan Luu… keeping in mind that most of the failures are never reported.

  1. There is never a single cause to any problem. It is a combination of multiple things that aggregate into an outage. In this case, a lack of visibility.
    Old token-ring had visibility on management frames that you could troubleshoot using any old promiscuous DOS-based frame capture program. It took me a few minutes to find a problem, not two days.
    The two days implies they had no decent tiger teams. I suggest the following for them:
  2. There is also the complexity of operating a complex system patched together by years of mergers and acquisitions. CL is an amalgam of CL, L3 and Time Warner Cable, plus more that I don't have the time to research. Perhaps a large L2 domain is not as big a problem if you have the right visibility and processes in place. But since introducing humans into a highly complex system is bound to end up with a human making an honest error, it's best to engineer the human out of the equation and/or learn the hyperscalers' lesson of limiting the failure domain.