Large Layer-2 Domains Strike Again…

I started January 2018 blogging with a major service provider failure. Why should 2019 be any different? Here’s what CenturyLink claimed was causing their two-day outage (more comments here).

Supposedly it was a problem with the management network used by their optical gear, but it looks a lot like a layer-2 network spanning 15 data centers and no control-plane policing on the managed devices… proving yet again that large-scale layer-2 networks are a really bad idea.

Please note that it doesn’t matter whether they had problems with a stretched Ethernet segment or something else. According to their explanation, a single device broadcasting packets was able to affect devices across multiple locations – as I’ve been trying to explain for years (not that many people would listen and/or care), a single broadcast domain is a single failure domain no matter what $vendor PowerPoints or whitepapers claim, and it’s not a question of whether the concoction will fail but when. Keep that in mind the next time your $vendor rep brings dancing unicorns into the room.
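To make the "single broadcast domain is a single failure domain" point concrete, here is a minimal sketch that counts how many devices have to process a broadcast frame sent by one misbehaving device. It is a toy model, not a description of the actual network involved: the 15 sites come from the explanation above, while the 10 devices per site, the `Device` class, and the `blast_radius` function are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    site: int  # hypothetical site index, 0..14


def build_devices(sites: int, per_site: int) -> list[Device]:
    """Create `per_site` devices in each of `sites` locations."""
    return [Device(f"s{s}-d{d}", s) for s in range(sites) for d in range(per_site)]


def blast_radius(devices: list[Device], source: Device, stretched_l2: bool) -> set[str]:
    """Names of the devices that receive a broadcast frame sent by `source`."""
    if stretched_l2:
        # One broadcast domain spanning every site: everyone gets the frame.
        return {d.name for d in devices if d != source}
    # Routed boundary at each site: the broadcast stops at the first router.
    return {d.name for d in devices if d != source and d.site == source.site}


# Assumed topology: 15 sites, 10 managed devices per site.
devices = build_devices(sites=15, per_site=10)
source = devices[0]

flat = blast_radius(devices, source, stretched_l2=True)
routed = blast_radius(devices, source, stretched_l2=False)

print(f"stretched layer-2: {len(flat)} devices affected")
print(f"routed per site:   {len(routed)} devices affected")
```

In this toy model the stretched design exposes all 149 other devices to the broadcasting one, while a routed boundary at each site limits the damage to the 9 neighbors in the same location.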

On a tangential note, cloud providers that know what they’re doing don’t support anything but unicast routing for a really good reason – check out the details in the AWS Networking webinar.

Finally, just in case you think failures like this one are a black swan event, check the list of post-mortems and associated lessons learned collected by Dan Luu… keeping in mind that most of the failures are never reported.


2 comments:

  1. There is never a single cause to any problem. It is a combination of multiple things that aggregate into an outage. In this case, it was a lack of visibility.
    Old token-ring had visibility on management frames that you could troubleshoot using any old promiscuous DOS-based frame capture program. It took me a few minutes to find a problem, not two days.
    The two days implies they had no decent tiger teams. I suggest the following for them: https://www.linkedin.com/pulse/six-tiger-team-structures-used-operations-centers-deal-ronald-bartels/
  2. There is also the complexity of operating a complex system patched together by years of mergers and acquisitions. CL is an amalgam of CL, L3 and Time Warner Cable, plus more that I don't have the time to research. Perhaps a large L2 domain is not as big a problem if you have the right visibility and processes in place. But since introducing humans into a highly complex system is bound to end up with a human making an honest error, it's best to engineer the human out of the equation and/or learn the lessons of the hyper-scalers by limiting the failure domain.