Large Layer-2 Domains Strike Again…
I started January 2018 blogging with a major service provider failure. Why should 2019 be any different? Here’s what CenturyLink claimed was causing their two-day outage (more comments here).
Supposedly it was a problem with the management network used by their optical gear, but it looks a lot like a layer-2 network spanning 15 data centers with no control-plane policing on the managed devices… proving yet again that large-scale layer-2 networks are a really bad idea.
Please note that it doesn’t matter whether they had problems with a stretched Ethernet segment or something else. According to their explanation, a single device broadcasting packets was able to affect devices across multiple locations. As I’ve been trying to explain for years (not that many people would listen and/or care), a single broadcast domain is a single failure domain no matter what $vendor PowerPoints or whitepapers claim, and it’s not a question of whether the concoction will fail but when. Keep that in mind the next time your $vendor rep brings dancing unicorns into the room.
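To make the argument a bit more tangible, here's a toy Python sketch. It is not a model of the actual network involved (I don't know those details); the site names, packet rates, and the `Policer`, `punted_to_cpu`, and `blast_radius` helpers are all made up for illustration. It shows two things: the set of sites hit by a broadcasting device is the whole broadcast domain it sits in, and a simple token-bucket policer (one common way to think about control-plane policing) caps how many of those frames ever reach a device's CPU.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Policer:
    """Hypothetical token-bucket policer for frames punted to a device CPU."""
    rate_pps: float    # tokens (packets) added per second
    burst: float       # bucket depth in packets
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, never exceeding the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def punted_to_cpu(storm_pps: int, seconds: int,
                  policer: Optional[Policer]) -> int:
    """Count broadcast frames one device's CPU has to process during a storm."""
    punted = 0
    for i in range(storm_pps * seconds):
        now = i / storm_pps  # frames arrive evenly spaced
        if policer is None or policer.allow(now):
            punted += 1
    return punted


def blast_radius(domains: List[Set[str]], source: str) -> Set[str]:
    """Return every site sharing a broadcast domain with the source site."""
    for domain in domains:
        if source in domain:
            return domain
    return set()


# Assumed topology: 15 sites, either one stretched domain or one domain per site.
sites = [f"dc{n}" for n in range(1, 16)]
flat = [set(sites)]
segmented = [{s} for s in sites]

print(len(blast_radius(flat, "dc1")))       # every site is in the failure domain
print(len(blast_radius(segmented, "dc1")))  # only the source site is affected

unpoliced = punted_to_cpu(10_000, 2, None)
policed = punted_to_cpu(10_000, 2, Policer(rate_pps=100, burst=50))
print(unpoliced, policed)
```

The numbers are arbitrary; the point is only that without segmentation the failure domain is everything, and without policing every flooded frame lands on the control plane of every device in it.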
Finally, just in case you think failures like this one are a black swan event, check the list of post-mortems and associated lessons learned collected by Dan Luu… keeping in mind that most of the failures are never reported.
Old token-ring had visibility on management frames that you could troubleshoot using any old promiscuous DOS-based frame capture program. It took me a few minutes to find a problem, not two days.
The two days implies they had no decent tiger teams. I suggest the following for them: