Weird: Ports on Linux Bridge Are Stuck
Just when you think you've gotten used to the weirdness of networking implementations, you get a curveball like this one. Life is never dull if you test network devices.
Before releasing netlab release 2.0, I ran the full suite of integration tests for all devices for which I have the images. Interestingly, most VXLAN tests failed for Cumulus Linux 4.x even though we haven’t touched that code for ages.
Next step: trying to figure out what changed. The configuration changes were minimal. Even worse, the failure was non-deterministic. Somehow, we managed to transform a Cumulus Linux 4.x VM into a Heisenberg switch.
I did the obvious thing: I rolled back the changes. VXLAN started working most of the time[1], but there were still failures.
After wasting a few more hours trying random things[2], I calmed down enough to start thinking. Instead of making configuration changes, I restarted the lab until it failed, and then tried to look under the hood… and there it was: the interfaces connected to the Linux bridge that we use in the Cumulus VM to build a VLAN were stuck in the learning state. I don’t know whether they would ever move to forwarding; I gave up after a few minutes.
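One way to spot such ports without eyeballing command output is to parse the JSON that the iproute2 `bridge -j link show` command prints. The following is a minimal Python sketch, not something netlab does: the `stuck_ports` helper and the sample data are made up for illustration, and I'm assuming the `ifname`, `master`, and `state` keys match what your iproute2 version emits.

```python
import json

# STP states a settled bridge port should not linger in.
TRANSIENT_STATES = {"listening", "learning"}


def stuck_ports(bridge_json: str, bridge: str) -> list[str]:
    """Return names of ports on `bridge` that are in a transient STP state.

    `bridge_json` is assumed to be the text printed by `bridge -j link show`.
    """
    ports = json.loads(bridge_json)
    return [
        p["ifname"]
        for p in ports
        if p.get("master") == bridge and p.get("state") in TRANSIENT_STATES
    ]


# Hypothetical sample, shaped like `bridge -j link show` output.
SAMPLE = json.dumps([
    {"ifname": "swp1", "master": "bridge", "state": "learning"},
    {"ifname": "swp2", "master": "bridge", "state": "forwarding"},
    {"ifname": "vni1000", "master": "bridge", "state": "learning"},
])

if __name__ == "__main__":
    print(stuck_ports(SAMPLE, "bridge"))
```

On a live device, you could feed the helper the output of `subprocess.run(["bridge", "-j", "link", "show"], capture_output=True, text=True).stdout`. A port that keeps showing up across several runs a few seconds apart is probably stuck.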
Comparing our configuration of that Linux bridge with the recommended Cumulus configuration, I found a major discrepancy: while the documentation recommends using mstpctl-portbpdufilter on the VXLAN interface[3], I disabled STP on the Linux bridge with VXLAN interfaces because the recommended solution didn’t work with Cumulus containers.
It appears that the version of the Linux bridge used in Cumulus Linux 4.x had an interesting bug: if you disabled STP, the ports remained in the state they were in at the time. That must have been fixed in the meantime; we use the same hack on FRR and never had any problems.
Anyway, there’s still the question of the root cause. Why did VXLAN work for years with the brute-force approach? It turns out that we changed the order of module configuration to accommodate a Junos quirk. Previously, VXLAN would be configured at any random point after VRFs; now we want it configured after BGP (which is configured after generic routing, which is configured after VLANs).
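The ordering constraint described above is a small dependency chain. Here is a hedged sketch of how such "configure after" rules can be turned into a deployment order with Python's standard `graphlib` module. The `CONFIGURE_AFTER` map and the `configuration_order` function are hypothetical names used for illustration; this is not netlab's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: module -> modules that must be configured first.
CONFIGURE_AFTER = {
    "vlan": set(),
    "routing": {"vlan"},
    "bgp": {"routing"},
    "vxlan": {"bgp"},
}


def configuration_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one module order that satisfies every 'configure after' rule."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(configuration_order(CONFIGURE_AFTER))
```

With this chain, VXLAN always lands after VLANs, which is exactly what moved the "disable STP" step behind the "attach interfaces to the bridge" step.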
In most cases, VXLAN configuration on Cumulus Linux disabled STP before the interfaces were added to the Linux bridge (part of the VLAN configuration), and therefore, the ports would never transition through STP states. With the change in configuration sequence, the interfaces were attached to the Linux bridge, and STP was disabled a second or two later – late enough to trigger the bug, but fast enough that it was not triggered consistently.
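To make the race easier to reason about, here is a toy Python model of the suspected behavior. It is emphatically not the Linux bridge code: the `ToyBridge` class, the `freeze_on_disable` flag, and the abstract `tick()` steps are assumptions made up to illustrate the hypothesis that disabling STP freezes ports in whatever state they are in at the time.

```python
from dataclasses import dataclass, field

# Simplified STP port progression; real timers and states are more involved.
STP_PROGRESSION = ["listening", "learning", "forwarding"]


@dataclass
class ToyBridge:
    """Toy model of the suspected bug, NOT the real bridge implementation."""
    stp_enabled: bool = True
    freeze_on_disable: bool = True  # assumption: models the suspected bug
    ports: dict[str, str] = field(default_factory=dict)

    def attach(self, port: str) -> None:
        # With STP off a port forwards immediately; with STP on it starts
        # at the beginning of the progression.
        self.ports[port] = STP_PROGRESSION[0] if self.stp_enabled else "forwarding"

    def tick(self) -> None:
        # Advance every port one step, but only while STP is running.
        if not self.stp_enabled:
            return
        last = len(STP_PROGRESSION) - 1
        for port, state in self.ports.items():
            self.ports[port] = STP_PROGRESSION[min(STP_PROGRESSION.index(state) + 1, last)]

    def disable_stp(self) -> None:
        self.stp_enabled = False
        if not self.freeze_on_disable:
            # Behavior of a hypothetical fixed bridge: ports start forwarding.
            for port in self.ports:
                self.ports[port] = "forwarding"


def old_order() -> str:
    """STP is disabled before the port is attached."""
    bridge = ToyBridge()
    bridge.disable_stp()
    bridge.attach("swp1")
    return bridge.ports["swp1"]


def new_order(ticks_before_disable: int) -> str:
    """The port is attached first; STP is disabled some time later."""
    bridge = ToyBridge()
    bridge.attach("swp1")
    for _ in range(ticks_before_disable):
        bridge.tick()
    bridge.disable_stp()
    return bridge.ports["swp1"]
```

In this model, `old_order()` always ends in forwarding, while the result of `new_order()` depends on how far the port got before STP was disabled. That would be consistent with a failure that shows up only some of the time.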