Reader Comments: Spanning Tree Woes

My latest spanning tree protocol (STP) posts generated numerous comments, some of them so relevant that I decided to summarize them in another blog post.

Weird Things Happen

The unidirectional link scenario mentioned by Antonio is pretty well known:

I'd like to add unidirectional links to the list of STP problems. This can happen because of a bad transceiver or a fiber that is broken or badly handled, and only vendor kludges can save you, adding an intervendor compatibility problem.

However, Christoph described a scenario I never considered (or heard about):

Perfect example: any Cisco router with EtherSwitch-based (HWIC-...ESW, EHWIC-...ESG and associated ISR routers) switchports. All ports on a module are one single broadcast domain and are up/up at power-on no matter what was configured, until IOS is running and has parsed the configuration.

Don’t Turn It Off

Several readers pointed out how disastrous the idea of turning off STP is. The winner is the example posted by Bela Varkony:

I have seen big meltdowns when someone turned off STP. Once it was not even in the Ethernet switches (since they were oversized and could cope with the increased traffic), but in the connected firewalls. Whoever it was stopped air travel for half a day in a whole country. So be careful!

Bela (and a few others) also spelled out why it’s important to keep STP running at the network edges:

Even though STP has some challenges, never ever turn it off. You might not know who creates a loop accidentally. It could be a misconnected cable or a booting device that temporarily interconnects its ports in strange ways. STP is there as a last-resort safety tool.

Kerry Thompson was even more explicit:

The biggest problem with STP that I've seen is when people who aren't exactly experts configure new switches and then disable spanning tree - because they heard from the experts that it's bad. I try to tell them that spanning tree is like smoke detectors - yes, they can be annoying and sometimes need maintenance, but don't just turn them off.

If you’re interested in this topic, make sure to read the great “Killing the Spanning Tree Canary” analogy by Kurt Bales.

In totally unrelated news, VMware keeps telling everyone how dropping BPDUs is the greatest idea since sliced bread, including a link to another article describing how STP might cause temporary loss of network connectivity. It’s amazing how VMware marketing always blames someone else for problems caused by their developers choosing to abuse perfectly well-known technologies.

Keep Layer-2 Domains Small

Another point of very vocal agreement was the need to keep layer-2 domains small. Again starting with Bela:

The good practice is to keep the STP domain as small as possible. And remember: STP was designed originally for maximum 7 hops and few switches and some dozens of host devices in mind. It is definitely not designed for connecting big data centers into a single bridge domain. Do not use something for a use case it was not designed for and not fully tested and analyzed...

Mario had a similar opinion:

I for one will always try to push any L2 STP or bridge domains as far away from the core as possible, down to a pair of ToR switches, or if possible to the virtual switch edge. If there's one thing I've learned throughout these very informative blog posts, it's that you should never extend that L2 domain across the DC; STP will bite you sooner rather than later. It's just not worth the risk.

Finally, you MUST read Anonymous’ tips on working with large bridging domains.

Even More Information

If you want to know more about data center fabric architectures, attend the half-day workshop in Zurich in late March.

We’ll also be talking about layer-2 fabrics (unfortunately we still have to talk about them) in one of the upcoming sessions of the Leaf-and-Spine Designs webinar.

12 comments:

  1. (disclaimer: i'm the anonymous referenced by Ivan above and i'd like to share more information. i'm posting anonymous again to protect the guilty)

    several vendors have commented that mlag (vlag, vpc, mc-ae, mct, we had them all) is "a way to get rid of spanning tree". don't fall into that trap. last network i worked on did this at the core. as a result there were a hundred mstp islands each with their own root as the core was silently gobbling the bpdus (since stp was fully disabled - seemed to me to be a retarded thing to do....). enabling stp was a no-go given the size of the l2 domain (500+ switches) and the impact to letting stp settle.

    the comment that you never know what device will cause a loop. how bout a vendor implementation of standard lacp which will forward frames on a port in an lacp bundle even when there are no lacp pdus seen, rather than go lacp-blocked. insta-loop in the face of cpu utilization spikes and missing lacp short timers. oops...

    oh, and mstp is nice and all, but when you put all vlans in the same instance you don't really gain a whole lot over rstp except the scale; some switches don't support a lot of rstp instances. the obvious takeaway there is, when the uplink port goes blocking because of a loop somewhere, all vlans drop at the same time. in some cases including the inband vlan you use to manage switches. debugging that network in the face of outage is a monstrous mess.

    and as an aside here:

    other things that bit us hard were small unknown things like etherchannel guard. i'd never heard of that one, but once a loop occurred it took a lot down because the Po went err-disabled and the gateway was over that link.

    and, shout out to the expert express. Ivan was a champ to do an hour+ call to try to help us work through our former stupidity. i'd *strongly* recommend him to anyone that asked.
    Replies
    1. Regarding your first paragraph, not that I necessarily disagree with it as a whole, but what's wrong with having a lot of MSTP islands with their own roots? They shouldn't interfere with each other, and to me it seems like it'd even scale better (not that you SHOULD need something that scales better than a single MSTP domain).
    2. You can use all the protection mechanisms available and still have issues if you don't control your L2 domain. I had a customer with both data centres melting down because a supervisor blade had some weird failure that caused a loop over the inter data centre links. I suggested switching over to the redundant supervisor which fixed the problem .... until the problem supervisor rebooted and took over again, even though it was configured not to take over. We had to eject that supervisor blade. The hardware vendor was very keen to get that one back and analyze it!
    3. I certainly agree that disabling STP (and not replacing it with some other L2 loop prevention scheme) is ridiculous, but this bit has me confused:

      "the core was silently gobbling the bpdus"

      Disabling STP shouldn't cause that. In fact, the only reason that STP BPDUs are limited to link scope is because 802.1D (and later) devices know they're supposed to sink BPDUs. A non-802.1D speaker should flood these frames to all ports because they have the I/G bit set. They're just regular frames to such a bridge.

      Section 5.8.2 of my favorite technical book addresses the issue somewhat humorously: http://www.slideshare.net/siswisnu/wileytheallnewswitchbookthecompleteguidetolanswitchingtechnologyaug2008/287

      Unless... Did these folks *disable* STP and then *enable* BPDUfilter?
    4. anonymous here. it wasn't cisco, and no, we didn't enable bpdufilter. hence the comment i thought it was silly that the vendor was doing that. it shouldn't have been.

    5. "we didn't enable bpdufilter"

      Wow. Sinking BPDUs while not participating in STP is the worst possible combination of behaviors. It's like the vendor was *trying* to melt the network.

      Why not name the vendor? Public shaming (or the threat of it) is sometimes the only way to get better behavior (too bad it doesn't work with VMware, eh Ivan?)

      The only device I've seen which does this was an SMC cable modem/gateway box provided by Comcast. It had four LAN ports which blocked BPDUs, but didn't speak STP.
  2. the worst part about "mlag kills STP" pitches is that many vendor reps *actually believe* it to be true
  3. UDLD can help prevent the unidirectional link scenario.
    Replies
    1. As Antonio wrote: "Only vendor kludges can save you".
  4. I think the worst decision in the protocol design of STP was making the lower switch ID win in root bridge election. If the priorities are set to the default, this generally means the oldest switch in the network becomes root. When I was in TAC I had numerous cases of customers having meltdowns because of some ancient Cabletron sitting in a back room somewhere and having fits. (Invariably these cases came to the routing protocols team because OSPF/EIGRP was flapping and it was considered a "routing protocol issue." Thus I spent a lot of time on the RP team troubleshooting spanning tree. But anyways...) These things being said, I agree that it is best left running and hopefully configured properly. I also saw some nasty cases when STP was left off because "spanning tree is bad." Often these folks would have a bridging loop meltdown and then call it a "spanning tree loop" and thus perpetuate an irrational hatred of STP!
  5. In principle, I agree with minimizing the STP blast radius or replacing with L3 built paths - but just wanted to say that large L2 domains can work.

    Admittedly, we use MST and some vendor-specific tweaks, but we have deployed up to ~300 switches per region.
  6. I'm glad I was not the only one who broke a country....

    https://medium.com/the-technology-burrow/the-day-the-country-came-to-a-standstill-80b85e0e0db8
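
The root election behavior described in comment 4 can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not any vendor's implementation: the bridge ID is modeled as a (priority, MAC address) pair, the numerically lowest ID wins, and the per-VLAN extended system ID is ignored. The function names, switch names, and MAC addresses are all made up for the example.

```python
# Illustrative sketch of STP root bridge election (all names are hypothetical).
DEFAULT_PRIORITY = 32768  # default bridge priority when nothing is configured


def bridge_id(priority: int, mac: str) -> tuple[int, int]:
    """Return a comparable bridge ID: priority first, MAC address as tie-breaker."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges: dict[str, tuple[int, str]]) -> str:
    """Return the name of the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Invented switches: both run with the default priority.
switches = {
    "new-core": (DEFAULT_PRIORITY, "00:1b:54:aa:bb:cc"),
    "old-closet": (DEFAULT_PRIORITY, "00:00:1d:01:02:03"),
}
print(elect_root(switches))  # old-closet: equal priorities, lower MAC wins

# Lowering the priority on the intended root overrides the MAC tie-breaker.
switches["new-core"] = (4096, "00:1b:54:aa:bb:cc")
print(elect_root(switches))  # new-core
```

With default priorities the election falls through to the MAC address comparison, which is why an old box in a back room can end up as root; explicitly lowering the priority on the switch you want as root avoids that.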