Spanning Tree Protocol (STP) and Bridging Loops

Continuing our bridging loops discussion, Christoph Jaggi sent me another question:

Theoretically STP should avoid bridging loops, and yet you claim they cause data center meltdowns. What am I missing?

In theory, STP avoids bridging loops. In practice, there are numerous reasons STP got a bad name.

Blocked alternate paths. That’s a design choice you have to accept if you want to have plug-and-pray networking instead of proper routing protocols. Not much we can do here.

It doesn’t matter if you’re doing routing on layer-2 (TRILL, SPB) or on layer-3 (IP) – you need a proper routing protocol to use alternate paths.

Forward-on-failure behavior. This is the only real grudge I have against STP as a protocol – all links forward traffic until BPDUs cause some of them to be blocked.

If you insert a device that drops BPDUs in the forwarding path, or if a switch loses its control plane (for example, due to a memory leak), a forwarding loop is almost guaranteed.

A unidirectional link (due to a bad transceiver or cable) could also result in a forwarding loop when the bridge that should have put the link in blocking state doesn’t receive the BPDUs (thanks to Antonio Ojea for pointing this out).
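
The fail-open behavior described above can be illustrated with a deliberately simplified model. The sketch below is not a full 802.1D implementation: the `Bpdu` class, the `port_states` function, the unit link cost, and the two-switch topology are all assumptions made for illustration. Switch B (bridge ID 2) has two parallel links to the root switch A (bridge ID 1) and blocks one of them only as long as it keeps hearing A's BPDUs on it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True, order=True)
class Bpdu:
    """Simplified BPDU priority vector; lower compares as better."""
    root_id: int
    cost: int
    sender_id: int


def port_states(bridge_id: int,
                heard: Dict[int, Optional[Bpdu]]) -> Dict[int, str]:
    """Return a state per port given the best BPDU heard on it (None = silence)."""
    live = {port: bpdu for port, bpdu in heard.items() if bpdu is not None}
    root_port = None
    own = Bpdu(bridge_id, 0, bridge_id)  # until told otherwise, we are the root
    if live:
        best = min(live, key=lambda port: (live[port], port))
        if live[best].root_id < bridge_id:
            root_port = best
            # Hypothetical unit link cost added to the root path cost.
            own = Bpdu(live[best].root_id, live[best].cost + 1, bridge_id)

    states = {}
    for port, bpdu in heard.items():
        if port == root_port:
            states[port] = "forwarding"   # root port
        elif bpdu is not None and bpdu < own:
            states[port] = "blocking"     # a better bridge owns this segment
        else:
            states[port] = "forwarding"   # designated port: silence means "go"
    return states


root_bpdu = Bpdu(root_id=1, cost=0, sender_id=1)

# Healthy case: B hears the root on both parallel links and blocks one of them.
healthy = port_states(2, {1: root_bpdu, 2: root_bpdu})

# Link 2 turns unidirectional (or something in the path drops BPDUs).
broken = port_states(2, {1: root_bpdu, 2: None})

print(healthy)
print(broken)
```

In the second case both of B's ports forward, and A (being the root) forwards on both of its ports as well, so the two parallel links form a loop.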

Slow link establishment. Vanilla STP waits 30 seconds before it starts forwarding traffic onto a link. Vendors ignorant of how networks work start sending their precious traffic as soon as they see layer-1 carrier on a server NIC, forcing networking vendors to implement all sorts of kludges like PortFast and BPDU Guard.
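
The 30-second figure follows from the default 802.1D timers. The sketch below just does the arithmetic; the constant and function names are made up for illustration, while the timer values are the standard defaults (hello 2 s, max age 20 s, forward delay 15 s).

```python
HELLO_TIME = 2       # seconds between BPDUs originated by the root
MAX_AGE = 20         # seconds before stored BPDU information expires
FORWARD_DELAY = 15   # spent once in listening and once in learning


def time_to_forwarding(forward_delay: int = FORWARD_DELAY) -> int:
    """Seconds a newly enabled port spends in listening plus learning."""
    return 2 * forward_delay


def worst_case_reconvergence(max_age: int = MAX_AGE,
                             forward_delay: int = FORWARD_DELAY) -> int:
    """Seconds to unblock an alternate path after an indirect failure:
    the stored BPDU has to age out before listening and learning start."""
    return max_age + 2 * forward_delay


print(time_to_forwarding())        # 30
print(worst_case_reconvergence())  # 50
```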

The kludges implemented by networking vendors are not reliable. For example, BPDU Guard kicks in after the first BPDU is received, potentially resulting in a temporary forwarding loop before the first BPDU reaches the switch.
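
The race can be sketched as a tiny state machine. This is an illustrative model, not any vendor's implementation: the `edge_port_timeline` and `forwarding_window` names, the event names, and the timestamps are assumptions. It assumes an edge port that starts forwarding as soon as the link comes up and is shut down only when a BPDU actually arrives.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (time in seconds, "link_up" or "bpdu")


def edge_port_timeline(events: List[Event], bpdu_guard: bool = True):
    """Return (time, event, resulting state) for a portfast-style edge port."""
    state = "down"
    timeline = []
    for time, event in sorted(events):
        if event == "link_up" and state == "down":
            state = "forwarding"      # no listening/learning delay
        elif event == "bpdu" and state == "forwarding" and bpdu_guard:
            state = "err-disabled"    # guard reacts only once a BPDU is seen
        timeline.append((time, event, state))
    return timeline


def forwarding_window(events: List[Event]) -> float:
    """Seconds the port forwarded before BPDU Guard shut it down (inf if never)."""
    started = None
    for time, _event, state in edge_port_timeline(events):
        if state == "forwarding" and started is None:
            started = time
        if state == "err-disabled" and started is not None:
            return time - started
    return float("inf") if started is not None else 0.0


# Hypothetical timestamps: a looped cable is plugged into an edge port at t=0,
# and the first BPDU shows up 1.7 seconds later.
print(forwarding_window([(0.0, "link_up"), (1.7, "bpdu")]))
```

How long the window lasts in a real network depends on when the first BPDU is sent; the 1.7 seconds used here is an arbitrary value.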

Too many kludges cause configuration errors. Understanding all possible kludges vendors implemented around STP and the relationship between them is hard, and even the big guys sometimes get it wrong.

Virtualization vendors drop BPDU frames. What could possibly go wrong if someone configures bridging between two VM NICs? It’s definitely better to pretend it’s someone else’s problem and blame the network instead of explaining to the sysadmin why his VM was kicked off the virtual switch after he made a stupid configuration error.

Some fabric vendors ignore(d) STP and propagate(d) BPDUs across the fabric, dramatically increasing the blast radius of any misconfiguration.

Want to Know More?

How Networks Really Work, Data Center Infrastructure for Networking Engineers and Leaf-and-Spine Fabric Architectures webinars discuss large-scale bridging and routing on layer-2, including SPB, TRILL, VXLAN and EVPN.

9 comments:

  1. >>"or if a switch loses its control plane (for example, due to a memory leak)"

    Perfect example for this:
    Any Cisco router with EtherSwitch-based (HWIC-...ESW, EHWIC-...ESG and associated ISR routers) switchports. All ports on a module form a single broadcast domain and are up/up at power-on no matter what was configured, until IOS is running and has parsed the configuration.
  2. Ivan, I think it's time you write one more post on STP. I humbly suggest this title "The Death of Spanning Tree Protocol"...
    Replies
    1. Sadly, until everyone stops the bad practices of allowing bridging across multiple interfaces (including Windows and Linux hosts), or implements a proper signaling protocol (which even the data center fabric vendors can't really agree on), we still need STP to detect forwarding loops.
    2. "It doesn’t matter if you’re doing routing on layer-2 (TRILL, SPB) or on layer-3 (IP) – you need a proper routing protocol to use alternate paths."

      I thought that load balancing at Layer 2 as implemented in TRILL has nothing to do with Layer 3 load balancing. Having Layer 2 load balancing is about having greater capacity i.e. using redundant links for traffic.
      Layer 3 load balancing is about using several L3 gateways; we do not care how layer 2 traffic is distributed at the physical layer (balanced or not). How the routing protocols cope with multiple equal-cost L3 paths is a different topic.

      Could you comment on your original statement to clarify what you had in mind?
      Bogdan Golab
    3. Apart from the flooding behavior (and summarization implications) layer-2 and layer-3 forwarding are today just two sides of the same coin.

      Also, you have to start thinking in layers and not agglutinate multiple problems into a bigger mess (see also RFC 1925 section 2.5).
    4. Probably I am not smart enough to understand it;)
      Bogdan Golab
  3. The biggest problem with STP that I've seen is when people who aren't exactly experts configure new switches and then disable spanning tree - because they heard from the experts that it's bad.
    I try to tell them that spanning tree is like smoke detectors - yes, they can be annoying and sometimes need maintenance, but don't just turn them off.
  4. I think STP (and its various kludges) is here to stay and will not be going anywhere anytime soon. Yes, a pure layer 3 solution would be a nice goal for many (after you weigh the pros and cons, e.g. VM mobility, cost of L3 ToR switch, staff experience, etc.). Until then, as Ivan had mentioned, there are always alternatives which may require a forklift upgrade in hardware (TRILL, FabricPath, SPB), or you can wait for the good ol' SDN market/products to mature and prove themselves (hopefully not with the same track record as IPv6). I for one will always try to keep any L2 STP or bridge domains as far away from the core as possible, down to a pair of ToR switches, or if possible at the virtual switch edge. If there's one thing I've learned throughout these very informative blog posts, it's that you should never extend that L2 domain across the DC: STP will bite you sooner rather than later, and it's just not worth the risk. Use OTV (point solution), or better yet have L3 between DCs in place and take advantage of some form of scripting/automation for VM mobility.
  5. Even though STP has some challenges, never ever turn it off. You might not know who will create a loop accidentally. It could be a misconnected cable or a booting device that temporarily interconnects its ports in strange ways. STP is there as a last-resort safety tool.
    I have seen big meltdowns when someone turned off STP. Once the meltdown was not even in the Ethernet switches (since they were oversized and could cope with the increased traffic), but in the connected firewalls. However, it stopped air travel for half a day in a whole country. So be careful!

    The good practice is to keep the STP domain as small as possible. And remember: STP was originally designed with a maximum of 7 hops, a few switches, and some dozens of host devices in mind. It was definitely not designed for connecting big data centers into a single bridge domain.

    Do not use something for a use case it was not designed for and for which it was not fully tested and analyzed...

    And do not forget, if you create large bridge domains you also create large error domains. Cut-through switching is a tool for distributing and amplifying errors...
  6. I'd like to add unidirectional links to the list of STP problems. They can happen because of a bad transceiver or a broken or mishandled fiber, and only vendor kludges can save you, which adds an intervendor compatibility problem.
  7. 802.1aq shortest path bridging was a great idea, implemented about 10 years too late...
    Replies
    1. Disagree. Bridging was never a good (let alone great) idea. However, among all possible options, 802.1aq or TRILL are the least horrible.
    2. 802.1aq does not solve a lot of the serious problems and still runs MSTP through the network. Its load balancing over parallel links is at best weird, and the overall usefulness of 802.1aq is very questionable. On the other hand, TRILL effectively removes all typical L2 problems, since it in fact transparently converts the L2 network into an L3 routed domain: the ingress switch packs the incoming L2 packet into a TRILL container, the TRILL network routes the container using the usual L3 routing principles, protocols (IS-IS) and safety belts (TTL, RPF check), and the egress switch discards the container and sends the L2 packet to its final destination.
    3. Marian, I would suggest that before comparing two technologies you get your facts right. While your description of TRILL data plane is correct, the rest of your claims aren't.
    4. Ivan, I have hands-on experience with both. OK I should have formulated it more precisely - so I better copy the whole paragraph from RIPE TRILL tutorial:

      Spanning Tree:
      TRILL RBridges block spanning tree and provide a new level above bridging but below Layer 3 routing.
      SPB bridges run at the bridging level. They continue to maintain a spanning tree (or multiple spanning trees) hooking together any attached bridging to produce one huge spanning tree. Frames are forwarded by spanning tree or by shortest path depending on VLAN.

      And regarding load-balancing over parallel paths - TRILL uses standard L3 per-hop routing decisions including usual ECMP support, while 802.1aq computes multiple trees by XORing SYSIDs with different bitmasks which produces suboptimal results and doesn't consider per-hop specifics.
    5. Do keep in mind that the RIPE TRILL tutorial was delivered by a TRILL evangelist, so you can't possibly hope for an unbiased view.

      As I said, your understanding of the data planes is correct, but I disagree with the control plane summary. In most cases the fabric switches would run STP with the access layer for numerous reasons, and you could either integrate the fabric with STP or make the fabric switches root bridges in the access layer.

      Also, while the ECMP argument is valid, I don't think there's any significant difference between the two mechanisms in leaf-and-spine fabrics.

      Obviously, if you have real-life experience please share it.

      Finally, I did cover both in the last session of the leaf-and-spine fabric designs webinar 2 days ago, so I'm a bit fluent on the differences ;)
    6. Well, large STP domains are a major problem in L2 which SPB doesn't solve. TRILL's approach of removing STP from the core and limiting it to the smallest possible islands is IMHO much safer.

      Yes, in leaf&spine fabrics, SPB's weird load-balancing approach gives similar results - as long as all links are up. But try to disconnect one of the links... And the real fun starts with more complex / arbitrary topologies ;-)
  8. I don't have strong feelings either way on spanning tree but I did have an expectation that it would be on by default for any switch I purchased. Cisco Ethernet Switch modules for the ISR4K have spanning tree turned off by default. Let the buyer beware. http://www.cisco.com/c/en/us/products/collateral/routers/3900-series-integrated-services-routers-isr/data_sheet_c78-612808.html
  9. "unidirectional links" cause STP problems? Well they do if you don't configure UDLD on your STP links. So that is a configuration problem. It isn't STP's fault.

    Replies
    1. STP has no concept of adjacencies, and treats lack of incoming BPDUs as a permission to use the link for forwarding, so I would say we're talking about a suboptimal protocol design.

      UDLD is just a vendor-specific kludge solving a problem that shouldn't have existed in the first place.

      Obviously you could claim that "if you haven't configured UDLD, it's your fault that STP brought down your network", but the root cause for the need to have two protocols configured when one should do is still STP.
