This blog post was initially sent to the subscribers of my SDN and Network Automation mailing list.
In late 2018 Juniper started aggressively promoting Network Reliability Engineering - the networking variant of software-driven operations, derived from the GIFEE SRE concept (because it must make perfect sense to mimic whatever Google is doing, right?).
There’s nothing wrong with promoting network automation, or infrastructure-as-code concepts, and Matt Oswalt and his team did an awesome job with NRE Labs (now defunct, huge “Thank you!” to whoever was financing them), but is that really all NRE should be?
Just look at the acronym. It has three words in it:
- Network (ok, we know what this is)
- Reliability (tougher one, ask network practitioners how to calculate reliability of a complex system and watch them squirm)
- Engineering (you probably know my opinion about this one).
Reliability Engineering is also a well-defined concept, and one would assume that Network Reliability Engineering applies that concept to computer networks. Really?
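To make the "how do you calculate reliability" question a bit more concrete, here is a minimal sketch of the textbook series/parallel availability arithmetic. It assumes component failures are independent (rarely true in real networks), and the function names and availability figures are made up for illustration, not taken from any vendor or from the NRE material.

```python
from math import prod


def series(*availabilities):
    """Every component must work, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one component must work, so the unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical per-component availabilities, for illustration only.
switch = 0.999
link = 0.9995

# One path: switch -> link -> switch, all three must be up.
single_path = series(switch, link, switch)

# Two such paths side by side, assuming they share no failure modes.
dual_path = parallel(single_path, single_path)

print(f"single path: {single_path:.6f}")
print(f"dual path:   {dual_path:.6f}")
```

Even this toy model hints at why the question is hard: the moment the two paths share a control plane, a software bug, or a human with a keyboard, the independence assumption (and the nice number that comes out of it) no longer holds.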
While I totally agree we need to replace repetitive (and error-prone) manual operations with repeatable automated processes, I also believe you should not automate an existing mess, but start with a reliable, minimalistic network design without one-off exceptions and a gazillion configuration drifts caused by late-night throwing-spaghetti-at-walls google-and-paste troubleshooting sessions.
Unfortunately (as I pointed out in the podcast I did with Matt a long while ago) the NRE blog as well as Juniper whitepapers and marketing collateral keep mum about that aspect of network reliability and focus on workflows and automation (no surprise, Juniper is doing a really good job there).
So, what do you think? Am I too radical, or should we start with a thorough cleanup (assuming you’re dealing with a brownfield environment) and reliable designs instead of rushing head-on into automating whatever we’re doing?