Maxim Gelin sent me an interesting question:
Can you please explain to me, why is STP supposed to be evil? What's wrong with STP?
STP’s fundamental problem is that it’s a fail-close, not a fail-open protocol.
Ethernet bridges (later renamed to layer-2 switches) were designed to be transparent plug-and-pray devices that you could drop anywhere into the network and hope they’ll work. They could not rely on having a control-plane protocol between adjacent nodes (like most modern routing protocols do) – the absence of control-plane traffic on a port was taken to mean there was no adjacent bridge behind it.
That’s all nice and dandy until a bridge loses its mind and stops sending BPDUs (control plane activity) while still forwarding traffic (data plane activity). Adjacent bridges think they have hosts plugged into the affected ports (this is the fail-close part) and start forwarding traffic through those ports, resulting in a nice forwarding loop (been there, seen that).
A bridge with a hung control plane would not even forward other bridges’ BPDUs between its ports (which would let its neighbors detect and stop the forwarding loop), because the forwarding entry for the STP multicast address still punts those frames to the hung CPU.
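The failure mode is easy to reproduce on paper. Here is a minimal Python sketch of it, assuming a toy triangle of bridges A, B and C in which C blocks its port toward B; the function names and the topology are made up for illustration and have nothing to do with a real STP implementation.

```python
# Toy model: three bridges in a triangle. A link carries traffic only if
# the ports on both of its ends are forwarding. All names are illustrative.

def link_states(port_forwarding):
    """Map each link to True if both of its ports forward traffic."""
    links = [("A", "B"), ("B", "C"), ("A", "C")]
    return {
        (x, y): port_forwarding[(x, y)] and port_forwarding[(y, x)]
        for x, y in links
    }

def has_loop(port_forwarding):
    """In a triangle, a forwarding loop exists only when all three links are active."""
    return all(link_states(port_forwarding).values())

def classic_stp_port(was_blocking, hears_bpdus):
    """Classic STP behavior: a port that hears no BPDUs assumes no bridge is attached."""
    if not hears_bpdus:
        return True            # fail-close: silence is read as "no bridge here"
    return not was_blocking    # otherwise keep the state STP computed

# Healthy network: every port forwards except C's port toward B.
ports = {(x, y): True for x in "ABC" for y in "ABC" if x != y}
ports[("C", "B")] = False
loop_before = has_loop(ports)

# B's control plane hangs: it stops sending BPDUs but keeps forwarding.
ports[("C", "B")] = classic_stp_port(was_blocking=True, hears_bpdus=False)
loop_after = has_loop(ports)

print(loop_before, loop_after)  # False True
```

The only thing that changed between the two checks is that C stopped hearing BPDUs from B. Nothing in the data plane failed, yet the loop appeared.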
Fail-open or fail-close?
This section was inserted on August 1st 2014 to (hopefully) reduce the terminology confusion.
As Chris Marget mentioned in his comment, "fail-open" and "fail-close" are clunky terms bound to be misunderstood (as evidenced by numerous other comments).
Being an old-timer, I always see computer networks as part of the generic electrical circuits and switching landscape – for me, "fail-close" = "pass current or traffic on failure" and "fail-open" = "stop passing current or traffic".
Other people think about computer networks in valve or door analogies. For them, "fail-close" means "the door or valve is closed on failure – there’s no traffic" and "fail-open" obviously means "the door or valve is opened on failure, and the traffic passes".
In the context of this blog post, "fail-close" means "a failed/confused bridge continues to forward the traffic, and the bridged network will send the traffic across such a bridge." You might have a different opinion on what "open" or "close" means, and it’s as valid as any… but quoting Cisco’s documentation won’t make your point any more valid (it just proves that the writer of that document agrees with your view of what opens or closes on failure). I would however appreciate a pointer to a more authoritative source (although I doubt it exists).
Back to bridging and STP
The solution to the confused bridge traffic forwarding problem is quite simple: Cisco IOS has bridge assurance – you configure a port to expect an adjacent bridge, and the port doesn’t forward traffic if it doesn’t receive BPDUs from the other end.
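To make the difference concrete, here is a rough Python sketch of the per-port decision with and without a bridge-assurance-style check. It is a simplification under stated assumptions: `port_forwards` and its three parameters are hypothetical names, and real implementations use timers and port states rather than three booleans.

```python
def port_forwards(stp_blocked: bool, expects_bridge: bool, bpdus_arriving: bool) -> bool:
    """Decide whether a port forwards traffic (illustrative model only).

    stp_blocked    -- the state STP computed while BPDUs were still arriving
    expects_bridge -- True if the port is configured to expect an adjacent bridge
    bpdus_arriving -- True if BPDUs are currently being received on the port
    """
    if not bpdus_arriving:
        # Without the check, silence means "no bridge here" and the port forwards.
        # With the check, silence from an expected bridge is treated as a failure.
        return not expects_bridge
    return not stp_blocked

# A blocked port whose neighbor goes silent:
print(port_forwards(stp_blocked=True, expects_bridge=False, bpdus_arriving=False))  # True
print(port_forwards(stp_blocked=True, expects_bridge=True, bpdus_arriving=False))   # False
```

The first call is the classic behavior that opens the loop; the second keeps the port closed until the neighbor starts talking again.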
The fail-close nature of STP isn’t its only drawback. The original STP had numerous other challenges, from slow convergence to lack of VLAN awareness. Unfortunately, the IEEE decided to keep heaping kludges on top of STP until the whole thing nearly toppled over – it’s like trying to build the global Internet by tinkering with RIP ad nauseam instead of designing BGP.
The generic solution to this particular problem (and a few others, including hosts turning into bridges) seems to be extremely simple: allow a switch port to be a host-facing port (implicitly configuring BPDU guard and a few other things) or a fabric port (implicitly configuring bridge assurance and VLAN trunking). Why hasn’t any vendor implemented such a simple concept? I can’t figure it out – your comments are most welcome!
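For what it’s worth, the whole idea fits in a few lines. The following Python sketch is purely hypothetical (`PortRole` and `port_action` are names I made up, and no vendor CLI or API is implied): a single role per port decides how the port reacts to the presence or absence of BPDUs.

```python
from enum import Enum

class PortRole(Enum):
    HOST = "host"      # implies BPDU guard
    FABRIC = "fabric"  # implies bridge assurance (and VLAN trunking, not modeled here)

def port_action(role: PortRole, bpdu_received: bool) -> str:
    """Return what a port should do, given its role and whether it hears BPDUs."""
    if role is PortRole.HOST:
        # BPDU guard: a host-facing port should never see a BPDU.
        return "disable" if bpdu_received else "forward"
    # Bridge assurance: a fabric port must keep hearing its neighbor.
    return "forward" if bpdu_received else "block"

for role in PortRole:
    for bpdu in (False, True):
        print(role.value, bpdu, port_action(role, bpdu))
```

Either way, the unexpected case stops traffic instead of passing it, which is exactly the property the original STP lacks.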