During a recent workshop I made a comment along the lines of “be careful with feature X from vendor Y because it took vendor Z two years to fix all the bugs in a very similar feature”, and someone immediately asked, “Are you saying it doesn’t work?”
My answer: “I never said that, I just drew inferences from other people’s struggles.”
A Step Back
Network operating systems are probably some of the most complex pieces of software out there. Distributed systems are hard. Real-time distributed systems are even harder. Real-time distributed systems running on top of eventually-consistent distributed databases are extra fun.
Systems optimized to work as fast as possible are by necessity full of weird bugs. Combine all that with programming custom hardware that few people in the world really understand, monolithic code bases (because history), and years if not decades of layered kludges heaped on top of each other, and one has to wonder how anything works at all.
It’s really easy to blame vendors for shipping buggy code. I’m positive every vendor occasionally ships a premature product, but I still want to believe that at least most of them test as thoroughly as they can… it’s just hard to figure out all the weird interactions and combinations of nerd knobs the software will be exposed to in real life.
What can you do?
Does that mean that you should only buy old stuff? Absolutely not, but you have to be aware of how the sausage is being made, and adjust your expectations:
- It’s perfectly fine to use new hardware or software in a pilot deployment or PoC… as long as you don’t expect to have it in production next week;
- New software features could be a life-saver… but keep in mind that they might also explode in your face;
- If you’re building a mission-critical infrastructure that needs to be rock-solid, don’t use the hardware that will start shipping next month and that is only supported by the newest software release.
Above all, KISS (Keep It Simple, Stupid). For example, in a data center environment:
- Don’t use MLAG (or similar) unless absolutely necessary. In most cases, you’re better off having active/standby links to your servers;
- Don’t use MLAG-with-STP to build a data center fabric. It’s much easier to build a robust IP network and run Ethernet across it encapsulated in VXLAN;
- If you still need link aggregation at the fabric edge, you might want to stick to MLAG (or vPC or whatever your vendor calls it) for a bit longer. It’s a decade old, and thus probably safer to use than EVPN multihoming (although the latter got more mature and definitely looks sexier on your resume);
- Use EVPN when justified by network size (to reduce flood lists), additional features (proxy ARP, VRFs), or your inability to automate switch configuration. Don’t use it in small fabrics with just a few switches, in particular if your vendor’s EVPN implementation results in a Dickens-size-novel configuration file;
- Use BGP as the only routing protocol only when it absolutely makes sense: either you’re building a FANG-sized network, or it’s simpler to configure BGP than OSPF, or you need BGP anyway because you decided to deploy EVPN.
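
The data center guidelines above can be read as a handful of if-then rules. Here is a rough Python sketch of that reading, not a design tool: the `Fabric` fields, the `recommend` function, and the four-switch cut-off for a “small” fabric are all assumptions made up for illustration (the list itself gives no numbers).

```python
from dataclasses import dataclass

# Assumed cut-off for "just a few switches"; the text gives no number.
SMALL_FABRIC_SWITCHES = 4


@dataclass
class Fabric:
    """Hypothetical fabric description; every field is illustrative."""
    switches: int
    needs_edge_lag: bool = False         # servers really need active/active uplinks
    needs_evpn_features: bool = False    # e.g. proxy ARP or VRFs
    config_automated: bool = True        # switch configuration is automated
    fang_sized: bool = False             # hyperscale network
    bgp_simpler_than_ospf: bool = False  # on the platform in use


def recommend(fabric: Fabric, small: int = SMALL_FABRIC_SWITCHES) -> dict:
    """Return one possible reading of the guidelines as a recommendation."""
    # EVPN is justified by size, extra features, or lack of automation.
    use_evpn = (
        fabric.switches > small
        or fabric.needs_evpn_features
        or not fabric.config_automated
    )
    # BGP-only makes sense at huge scale, when it is simpler than OSPF,
    # or when EVPN requires BGP anyway.
    bgp_only = fabric.fang_sized or fabric.bgp_simpler_than_ospf or use_evpn
    return {
        "server_links": "MLAG" if fabric.needs_edge_lag else "active/standby",
        "fabric": "IP network with VXLAN",  # never MLAG-with-STP
        "control_plane": "EVPN" if use_evpn else "VXLAN without EVPN",
        "routing": "BGP only" if bgp_only else "OSPF",
    }


if __name__ == "__main__":
    print(recommend(Fabric(switches=4)))
    print(recommend(Fabric(switches=40, needs_edge_lag=True)))
```

Note how the defaults all point to the simplest option; every more complex choice has to be justified by an explicit requirement.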
Feel free to expand this list in the comments (I hope you got the idea).