My latest spanning tree protocol (STP) posts generated numerous comments, some of them so relevant that I decided to summarize them in another blog post.
Weird Things Happen
The unidirectional link scenario mentioned by Antonio is pretty well known:
I'd like to add unidirectional links to the list of STP problems, this can happen because a bad transceiver or a fiber is broken or badly manipulated, and only vendor kludges can solve you, adding an intervendor compatibility problem
However, Christoph described a scenario I never considered (or heard about):
Perfect example: any Cisco router with EtherSwitch-based (HWIC-...ESW, EHWIC-...ESG and associated ISR routers) switchports. All ports on a module are one single broadcast domain & are up/up at power on no matter what was configured - until IOS is running and has parsed the configuration.
Don’t Turn It Off
Several readers pointed out how disastrous the idea of turning off STP is. The winner is the example posted by Bela Varkony:
I have seen big meltdowns when someone turned off STP. Once it was not even in the Ethernet switches (since they were oversized and could cope with the increased traffic), but in the connected firewalls. Whoever it was has stopped the air travel for a half a day in a whole country. So be careful!
Bela (and a few others) also spelled out why it’s important to keep STP running at the network edges:
Even that STP has some challenges, never ever turn it off. You might not know who creates a loop accidentally. Let it be a misconnected cable or a booting device with strange port interconnects temporarily. STP is there as a last resort safety tool.
Kerry Thompson was even more explicit:
The biggest problem with STP that I've seen is when people who aren't exactly experts configure new switches and then disable spanning tree - because they heard from the experts that it's bad. I try to tell them that spanning tree is like smoke detectors - yes, they can be annoying and sometimes need maintenance, but don't just turn them off.
If you’re interested in this topic, make sure to read the great “Killing the Spanning Tree Canary” analogy by Kurt Bales.
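To see why an accidental loop is so destructive once STP is turned off, here's a minimal Python sketch of broadcast flooding. It's a toy model, not a switch implementation: the `flood` function and the topologies are made up for illustration, every pair of switches is assumed to have at most one link between them, and each switch simply forwards a broadcast out of every link except the one it arrived on. Ethernet frames have no TTL field, so nothing in the model ages a frame out.

```python
from collections import defaultdict

def flood(links, source, max_steps=10):
    """Return the number of frame copies in flight at each step.

    links  -- iterable of (switch_a, switch_b) pairs, all forwarding
    source -- switch that originates a single broadcast frame
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # A frame in flight is a (sender, receiver) pair.
    in_flight = [(source, n) for n in sorted(neighbors[source])]
    history = []
    for _ in range(max_steps):
        history.append(len(in_flight))
        next_flight = []
        for sender, receiver in in_flight:
            # Flood out of every link except the one the frame came in on.
            for n in sorted(neighbors[receiver]):
                if n != sender:
                    next_flight.append((receiver, n))
        in_flight = next_flight
        if not in_flight:
            break  # the flood died out on its own
    return history

triangle = [("A", "B"), ("B", "C"), ("C", "A")]  # loop, nothing blocked
tree = [("A", "B"), ("B", "C")]                  # same switches, C-A link blocked

print(flood(triangle, "A"))  # copies keep circulating until max_steps
print(flood(tree, "A"))      # flood ends once the frame reaches the leaf
```

In the triangle the two copies never disappear; in a denser mesh the number of copies grows with every step. Removing a single link from the forwarding topology (which is what STP does when it blocks a port) is enough for the flood to terminate.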
In totally unrelated news, VMware keeps telling everyone how dropping BPDUs is the greatest idea since sliced bread, including a link to another article describing how STP might cause temporary loss of network connectivity. It’s amazing how VMware marketing always blames someone else for problems caused by their developers choosing to abuse perfectly well-known technologies.
Keep Layer-2 Domains Small
Another point of very vocal agreement was the need to keep layer-2 domains small. Again starting with Bela:
The good practice is to keep the STP domain as small as possible. And remember: STP was designed originally for maximum 7 hops and few switches and some dozens of host devices in mind. It is definitely not designed for connecting big data centers into a single bridge domain. Do not use something for a use case it was not designed for and not fully tested and analyzed...
Mario had a similar opinion:
I for one will always try to minimize any L2 STP or bridge domains as far away from the core as possible, down to a pair of ToR switches, or if possible at the virtual switch edge. If it's one thing I've learned throughout these very informative blog posts, it's that you should never extend that L2 domain across the DC; STP will bite you sooner than later. It's just not worth the risk.
Finally, you MUST read Anonymous' tips on working with large bridging domains.
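If you want a quick sanity check of Bela's seven-hop remark against your own topology, something like the following Python sketch could do it. The function names and the `MAX_BRIDGE_DIAMETER` constant are hypothetical, the limit of 7 is taken straight from the comment above, and counting the switches on the path (rather than the links between them) is my assumption. The sketch also assumes a connected topology and measures shortest paths in the physical topology, so treat the result as a lower bound: the path along the active spanning tree, particularly after a link failure, can be longer.

```python
from collections import defaultdict, deque

MAX_BRIDGE_DIAMETER = 7  # assumption: the limit quoted in the comment above

def hops_from(start, neighbors):
    """Breadth-first search: link count from start to every reachable switch."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for n in neighbors[node]:
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    return dist

def switches_on_longest_path(links):
    """Number of switches on the longest shortest path between two switches."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    neighbors = dict(neighbors)  # freeze the key set before iterating

    longest = 0
    for switch in neighbors:
        longest = max(longest, max(hops_from(switch, neighbors).values()))
    return longest + 1  # links on the path + 1 = switches on the path

# Hypothetical daisy chain of eight switches: s1 - s2 - ... - s8
chain = [(f"s{i}", f"s{i + 1}") for i in range(1, 8)]
diameter = switches_on_longest_path(chain)
print(diameter, "switches; within limit:", diameter <= MAX_BRIDGE_DIAMETER)
```

A daisy chain is the worst case for this metric, which is one more reason to keep the bridged part of the network down to a pair of ToR switches as Mario suggests.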
Even More Information
If you want to know more about data center fabric architectures, attend the half-day workshop in Zurich in late March.
We’ll also be talking about layer-2 fabrics (unfortunately we still have to talk about them) in one of the upcoming sessions of the Leaf-and-Spine Designs webinar.