Spanning Tree Protocol (STP) and Bridging Loops
Continuing our bridging loops discussion, Christoph Jaggi sent me another question:
Theoretically STP should avoid bridging loops, and yet you claim they cause data center meltdowns. What am I missing?
In theory, STP avoids bridging loops. In practice, there are numerous reasons STP got a bad name.
Blocked alternate paths. That’s a design choice you have to accept if you want to have plug-and-pray networking instead of proper routing protocols. Not much we can do here.
Forward-on-failure behavior. This is the only real grudge I have against STP as a protocol – all links forward traffic until BPDUs cause some of them to be blocked.
If you insert a device that drops BPDUs in the forwarding path, or if a switch loses its control plane (for example, due to a memory leak), a forwarding loop is almost guaranteed.
A unidirectional link (due to a bad transceiver or cable) could also result in a forwarding loop when the bridge that should have put the link in blocking state doesn’t receive the BPDUs (thanks to Antonio Ojea for pointing this out).
Slow link establishment. Vanilla STP waits 30 seconds before it starts forwarding traffic onto a link, and vendors ignorant of how networks work start sending their precious traffic as soon as they see layer-1 carrier on a server NIC, forcing networking vendors to implement all sorts of kludges like portfast and bpduguard.
The kludges implemented by networking vendors are not reliable. For example, BPDU Guard kicks in after the first BPDU is received, potentially resulting in a temporary forwarding loop before the first BPDU reaches the switch.
Too many kludges cause configuration errors. Understanding all possible kludges vendors implemented around STP and the relationship between them is hard, and even the big guys sometimes get it wrong.
Virtualization vendors drop BPDU frames. What could possibly go wrong if someone configures bridging between two VM NICs? It’s definitely better to pretend it’s someone else’s problem and blame the network instead of explaining to the sysadmin why his VM was kicked off the virtual switch after he made a stupid configuration error.
Some fabric vendors ignore(d) STP and propagate(d) BPDUs across the fabric, dramatically increasing the blast radius of any misconfiguration.
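The forward-on-failure behavior described above can be illustrated with a minimal Python sketch. It is not an STP implementation: the `Bpdu` class, the `port_forwards` function and the three-switch triangle are assumptions made up for this illustration, and the timers are ignored (a real bridge first ages out the stored BPDU and then walks the port through listening and learning before it forwards).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class Bpdu:
    """Simplified priority vector (hypothetical): lower compares as better."""
    root_id: int
    root_cost: int
    sender_id: int


def port_forwards(own: Bpdu, heard: Optional[Bpdu], is_root_port: bool) -> bool:
    """Decide whether a port forwards in this heavily simplified model."""
    if is_root_port:
        return True   # the port toward the root bridge always forwards
    if heard is None:
        return True   # fail-open: silence is read as "I am the designated port"
    return own < heard  # otherwise forward only if our own BPDU is the better one


# Triangle S1 (root, ID 1), S2 (ID 2), S3 (ID 3); look at the S2-S3 link.
s2_advert = Bpdu(root_id=1, root_cost=1, sender_id=2)
s3_advert = Bpdu(root_id=1, root_cost=1, sender_id=3)

# Healthy link: S3 hears the better BPDU from S2 and blocks its port.
healthy = port_forwards(own=s3_advert, heard=s2_advert, is_root_port=False)

# BPDUs dropped or link unidirectional: S3 hears nothing and forwards.
bpdus_lost = port_forwards(own=s3_advert, heard=None, is_root_port=False)

print("S3 forwards on healthy link:", healthy)
print("S3 forwards when BPDUs are lost:", bpdus_lost)
```

In the second case all three links of the triangle forward in both directions, which is exactly the loop STP was supposed to prevent.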
Want to Know More?
The How Networks Really Work, Data Center Infrastructure for Networking Engineers and Leaf-and-Spine Fabric Architectures webinars discuss large-scale bridging and routing on layer-2, including SPB, TRILL, VXLAN and EVPN.
Perfect example for this:
Any Cisco router with EtherSwitch-based switchports (HWIC-...ESW, EHWIC-...ESG and associated ISR routers). All ports on a module are in one single broadcast domain and are up/up at power-on no matter what was configured, until IOS is running and has parsed the configuration.
I thought that load balancing at Layer 2 as implemented in TRILL has nothing to do with Layer 3 load balancing. Layer 2 load balancing is about having greater capacity, i.e. using redundant links for traffic.
Layer 3 load balancing is about using several L3 gateways; we do not care how layer 2 traffic is distributed at the physical layer (balanced or not). How the routing protocols cope with multiple equal-cost L3 paths is a different topic.
Could you comment on your original statements to clarify what you had in mind?
Bogdan Golab
Also, you have to start thinking in layers and not agglutinate multiple problems into a bigger mess (see also RFC 1925 section 2.5).
Bogdan Golab
I try to tell them that spanning tree is like smoke detectors - yes, they can be annoying and sometimes need maintenance, but don't just turn them off.
I have seen big meltdowns when someone turned off STP. Once the meltdown was not even in the Ethernet switches (since they were oversized and could cope with the increased traffic), but in the connected firewalls. However, it stopped air travel for half a day in a whole country. So be careful!
The good practice is to keep the STP domain as small as possible. And remember: STP was originally designed with a maximum of 7 hops, a few switches and some dozens of host devices in mind. It was definitely not designed for connecting big data centers into a single bridge domain.
Do not use something for a use case it was not designed, fully tested and analyzed for...
And do not forget, if you create large bridge domains you also create large error domains. Cut-through switching is a tool for distributing and amplifying errors...
Spanning Tree:
TRILL RBridges block spanning tree and provide a new level above bridging but below Layer 3 routing.
SPB bridges run at the bridging level. They continue to maintain a spanning tree (or multiple spanning trees) hooking together any attached bridging to produce one huge spanning tree. Frames are forwarded by spanning tree or by shortest path depending on VLAN.
And regarding load-balancing over parallel paths - TRILL uses standard L3 per-hop routing decisions including usual ECMP support, while 802.1aq computes multiple trees by XORing SYSIDs with different bitmasks which produces suboptimal results and doesn't consider per-hop specifics.
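A rough Python sketch of the XOR-based tie-breaking mentioned above, under loud assumptions: the 8-bit system IDs, the two masks and the function names `masked_path_id` and `pick_path` are made up for illustration, and the real 802.1aq procedure works on full bridge identifiers with a defined set of masks. The idea shown is only that each tree picks, among equal-cost paths, the one whose sorted list of masked node IDs compares lowest.

```python
from typing import Iterable, Sequence, Tuple


def masked_path_id(path: Sequence[int], mask: int) -> Tuple[int, ...]:
    """Sorted tuple of node IDs after XOR with the mask (simplified path ID)."""
    return tuple(sorted(node ^ mask for node in path))


def pick_path(paths: Iterable[Sequence[int]], mask: int) -> Sequence[int]:
    """Select the equal-cost path whose masked ID list compares lowest."""
    return min(paths, key=lambda p: masked_path_id(p, mask))


# Hypothetical IDs: leaves 0x10 and 0x20, spines 0x01 through 0x04.
paths = [(0x10, spine, 0x20) for spine in (0x01, 0x02, 0x03, 0x04)]

# Illustrative masks, one per tree (assumed values, not taken from the post).
for mask in (0x00, 0xFF):
    chosen = pick_path(paths, mask)
    print(f"mask {mask:#04x}: via spine {chosen[1]:#04x}")
```

With these made-up IDs the two masks select the lowest-numbered and the highest-numbered spine. The choice is a static, end-to-end property of the tree, unlike a per-hop ECMP decision made independently by every node along the path.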
As I said, your understanding of the data planes is correct, but I disagree with the control plane summary. In most cases the fabric switches would run STP with the access layer for numerous reasons, and you could either integrate the fabric with STP or make the fabric switches root bridges in the access layer.
Also, while the ECMP argument is valid, I don't think there's any significant difference between the two mechanisms in leaf-and-spine fabrics.
Obviously, if you have real-life experience please share it.
Finally, I did cover both in the last session of the leaf-and-spine fabric designs webinar 2 days ago, so I'm a bit fluent on the differences ;)
Yes, in leaf&spine fabrics, SPB's weird load-balancing approach gives similar results - as long as all links are up. But try to disconnect one of the links... And the real fun starts with more complex / arbitrary topologies ;-)
"unidirectional links" cause STP problems? Well they do if you don't configure UDLD on your STP links. So that is a configuration problem. It isn't STP's fault.
STP has no concept of adjacencies, and treats lack of incoming BPDUs as a permission to use the link for forwarding, so I would say we're talking about a suboptimal protocol design.
UDLD is just a vendor-specific kludge solving a problem that shouldn't have existed in the first place.
Obviously you could claim that "if you haven't configured UDLD, it's your fault that STP brought down your network", but the root cause for the need to have two protocols configured when one should do is still STP.