Can You Avoid Networking Software Bugs?
One of my readers sent me an interesting reliability design question. It all started with a catastrophic WAN failure:
Once a particular volume of encrypted traffic was reached, the data center WAN edge router crashed, and then the backup router took over, which also crashed. The traffic then failed over to the second DC, and you can guess what happened then...
Obviously they’re now trying to redesign the network to avoid such failures.
All kinds of random things are being suggested, such as deliberately maintaining different software revisions, having a different vendor in the second DC, etc.
I don’t see what one could do to avoid this type of failure apart from using different equipment in parallel, be it equipment from different vendors or sufficiently different boxes from the same vendor.
The multi-vendor approach avoids box-specific catastrophes, but brings its own set of complexities and inter-vendor interoperability challenges… unless you split the network into multiple independent availability zones linked with simple IP transport, and use a different set of vendors in each availability zone.
However, I’m not sure whether it would make more sense to deal with the ongoing complexity or with a once-in-a-blue-moon crash (assuming it really does happen only once in a blue moon).
Did you ever have to solve such a challenge? What did you do? What would you suggest? Oh, and keep in mind that SDN controllers might not be the right answer (see also this blog post).
As long as you stick with standards-based stuff (and using proprietary features is not a decision that should be undertaken lightly), this approach should be mostly okay.
Still, when you get down into the weeds, it can be tough to find feature parity when looking at things like per-member BFD for LACP links, and so on.
Waiting for a second occurrence before taking drastic measures would be best here, while trying to stay on the vendor's safer, more stable code revisions.
What I would ask myself:
1) Is my team qualified to support a new vendor?
2) Assuming we have a well-documented network, does my team have time to study or write design standards that will help when configuring or troubleshooting things like QoS?
3) Are we implementing procedures to validate whether a solution will perform as expected prior to deployment (for example, an automated smoke test like the sketch after this list)?
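
On point 3, a validation procedure doesn't have to be elaborate; it can be as simple as an automated smoke test run against a lab or staging device before the change goes live. The minimal Python sketch below is purely illustrative: the expected prefixes and the routing-table capture file are hypothetical placeholders, not anything from the reader's network.

```python
# Hypothetical pre-deployment smoke test: verify that the prefixes we expect
# actually appear in a routing-table dump captured from the lab/staging router.

EXPECTED_PREFIXES = [
    "10.10.0.0/16",   # placeholder: branch aggregate
    "192.0.2.0/24",   # placeholder: DC service subnet
]

def missing_prefixes(route_dump: str, expected: list[str]) -> list[str]:
    """Return the expected prefixes that do not appear in the routing-table text."""
    return [prefix for prefix in expected if prefix not in route_dump]

if __name__ == "__main__":
    # Text capture of "show ip route" (or equivalent) taken from the lab device.
    with open("show_ip_route.txt") as f:
        dump = f.read()

    missing = missing_prefixes(dump, EXPECTED_PREFIXES)
    if missing:
        raise SystemExit(f"Validation failed, missing prefixes: {missing}")
    print("All expected prefixes present.")
```

Even a trivial check like this forces you to write down what "working" means before you touch production.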
In addition, even in a multi-vendor environment you might face issues in corner-case situations. A real case: we have a branch with two border routers working in hot-standby fashion. The secondary (standby) router had been replaced a while back because it was EOL, and one beautiful day the primary link failed and the secondary did not take over. The reason? It was running a basic IOS version that supports only EIGRP stub routing, and the router was in a transit path, so the entire branch went down. Even if we had had a Juniper as primary and a Cisco as secondary, this problem would have happened anyway.
All said, I personally prefer to wait for Halley's Comet rather than spend too much energy trying to avoid it, only to have a new one (Hale-Bopp) suddenly appear!
If, after investigation, I found out that the issue wasn't going to be a once-in-a-blue-moon event, I would ask: "Do we know that a specific traffic volume is causing this issue?" If, for example, we know that 500 MB of aggregate encrypted traffic triggers the software bug and the router runs smoothly below that amount, can we consider using the backup WAN router in a load-sharing design?
If we absolutely have to solve this issue, I would look into load sharing across the WAN routers. It adds complexity, but it's a known set of complexity problems and a known set of bugs. Adding a second vendor presents a whole new set of bugs to discover and unforeseen complexity.
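
To make the load-sharing argument concrete, here is a back-of-the-envelope Python sketch. The crash threshold, safety margin, and traffic figure are made-up placeholders (the original report only mentions "a particular volume of encrypted traffic"), so treat it as a template for your own numbers rather than an analysis of this incident.

```python
# Back-of-the-envelope check for a load-sharing design.
# All numbers are hypothetical placeholders; plug in your own measurements.

CRASH_THRESHOLD = 500.0   # suspected per-router crash threshold (aggregate encrypted traffic)
SAFETY_MARGIN = 0.8       # stay at or below 80% of the threshold to leave headroom

def per_router_load(total_traffic: float, active_routers: int) -> float:
    """Traffic each router carries, assuming an even split across active routers."""
    return total_traffic / active_routers

def is_safe(total_traffic: float, active_routers: int) -> bool:
    """True if every active router stays below the derated crash threshold."""
    return per_router_load(total_traffic, active_routers) <= CRASH_THRESHOLD * SAFETY_MARGIN

if __name__ == "__main__":
    total = 600.0  # hypothetical peak aggregate encrypted traffic
    for routers in (2, 1):  # normal operation vs. single-router failure
        load = per_router_load(total, routers)
        verdict = "OK" if is_safe(total, routers) else "exceeds safe limit"
        print(f"{routers} router(s): {load:.0f} per router -> {verdict}")
```

The single-router case is the one that matters: load sharing only keeps you safe if the total traffic still fits under the threshold when one box has to carry everything; otherwise the first failover recreates the original crash.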