Network Automation Considered Harmful
Some of the blog comments never cease to amaze me. Here’s one questioning the value of network automation:
I think there is a more fundamental reason than the (in my opinion simplistic) lack of skills argument. As someone mentioned on Twitter:
“Rules make it harder to enact change. Automation is essentially a set of rules.”
We underestimated the fact that infrastructure is a value differentiator for many and that customization and rapid change don’t go hand in hand with automation.
Whenever someone starts using MBA-speak like “value differentiator” in a technical argument, I get an acute allergic reaction, but maybe he’s right.
Introducing network automation in a small company network with a single router and one or two access points is obvious overkill. But then, we were told only last year that we should embrace single points of failure; manual processes are the obvious next step [1].
I’m assuming that most of my readers run networks slightly larger than the one described above. I’m also assuming that for some of them “Unless you’re breaking stuff you’re not moving fast enough” [2] remains a soundbite worthy of Mark Zuckerberg [3] rather than their company culture. In environments that care about service availability, manual processes still represent a huge drawback: every change has to be planned, approved, and executed in a maintenance window, while every major cloud provider and residential ISP provisions new tenants or customers 24 hours a day.
Even worse, you have to go through the same processes when making identical changes because you can never be sure that the infrastructure is configured in exactly the same way. Obviously that results in rubber-stamping and complacency until you hit a bit of infrastructure that’s a different-enough snowflake, and then all hell breaks loose anyway.
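To make “configured in exactly the same way” concrete, here’s a minimal sketch of the kind of drift check automation makes trivial: render the intended configuration from your source of truth, pull the actual configuration from the device (or its latest backup), and diff the two. The file paths and layout are invented for illustration.

```python
# Minimal configuration-drift check: diff the intended (generated)
# config against what is actually running on the box. Paths are
# illustrative; the "actual" side would normally come straight from
# the device or its most recent backup.
import difflib
import pathlib

def config_drift(intended_path: str, actual_path: str) -> list[str]:
    """Return unified-diff lines; an empty list means no drift."""
    intended = pathlib.Path(intended_path).read_text().splitlines()
    actual = pathlib.Path(actual_path).read_text().splitlines()
    return list(difflib.unified_diff(
        intended, actual, fromfile="intended", tofile="actual", lineterm=""))

drift = config_drift("intended/core1.cfg", "backups/core1.cfg")
if drift:
    print("\n".join(drift))  # a snowflake in the making; investigate
else:
    print("core1 matches the intended configuration")
```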
I spent almost a decade writing about, talking about, and practicing network automation (resulting in over 380 blog posts, dozens of hours of video content, and a bunch of GitHub repositories). If you seriously want to dig into the various arguments we had during that time, you’ll have plenty of stuff to read and watch. In the meantime, I’ll conclude with a wonderful reply left on that comment:
We underestimated the fact that infrastructure is a value differentiator for many and that customization and rapid change don’t go hand in hand with automation.
Which, if sold right, can be a good thing. I’ve called this “inflexible flexibility”: we can do absolutely anything, but it’ll take some effort to add it to the data model and toolchain that generates the configurations. After that, it’ll be 100% supported and won’t ever be something that only the one engineer who implemented the one instance knows about, catching someone else off guard at 2am when it breaks. It means no more one-offs, and that changes are inherently documented, at the very least by virtue of being part of a codebase with a commit that can be tracked down.
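To illustrate what adding something “to the data model and toolchain” might look like, here’s a minimal sketch, assuming a Jinja2-based toolchain; the data model and template are invented for illustration, not the commenter’s actual setup.

```python
# Sketch of a data model + toolchain: desired state lives in a
# version-controlled data structure; a template turns it into device
# configuration. Both the model and the template are invented.
from jinja2 import Template

VLAN_TEMPLATE = Template(
    "{% for vlan in vlans %}"
    "vlan {{ vlan.id }}\n"
    " name {{ vlan.name }}\n"
    "{% endfor %}"
)

device = {
    "vlans": [
        {"id": 10, "name": "users"},
        {"id": 20, "name": "printers"},
    ]
}

# Adding a feature means extending the model and template once;
# every device (and every future engineer) gets it from the codebase.
print(VLAN_TEMPLATE.render(**device))
```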
I have sometimes gotten pushback on how this is inherently less nimble, since it takes longer than just throwing a manual config at a problem, but so far I haven’t had any issues with the counterargument that we don’t want untested configuration changes, and that developing a new feature for the automation process doesn’t take much longer than properly testing a new configuration in a lab before rolling it out. The additional time is easily made up by the increased reliability of the change being executed a second time (where I mean reliability in the sense of “the customer expected the change to work the first time”). Of course, this partly depends on your exact automation process; ours is amenable to fairly rapid changes, because our campus network is quite dynamic.
Yes, having to run everything through an automation pipeline is less rapid than just ssh’ing into a box and typing away. But it doesn’t have to be much less rapid if your staff is skilled, and raising the bar to entry for a change can often mean an opportunity for evaluation on whether the work should be done, which is a very different question than whether it can be done.
Finally, there must be some networking engineers running large networks who shun automation as much as the original commenter. I have great news for them: most vendors will gladly sell them all the licenses they need to build a Digital Twin of their network to practice on.
[1] Although Ubiquiti is still selling its Software-Defined (not really) cloud-based network management system. They must have awesome marketing.

[2] They were definitely moving fast enough in October 2021, when their outage caused Sky News to invent the Bridging the Gap Protocol.

[3] Or a lame excuse for your borked pull request.
But...but...but.... you have to be AGILE!!!
There's an easy way to solve that problem: send all of your staff to a Scrum class. Most of them will become developers, some of them will become Scrum masters, and in the end they'll all take a certification exam and pass. The company then has a lot of certified "agile" employees.
I really don't see how a network any larger and more complex than a small and simple enterprise or campus network can be developed and engineered in a consistent manner without full automation. Routing-intensive networks in particular can have very complex configurations related to, for example, routing policies, and it would be next to impossible to configure them manually without errors and in a consistent way.
With automation, even the most complex configurations can be presented through a simple abstraction, and should one need to implement any architectural change in the network, one can build the logic only once and let the automation multiply it into the running configs. We have plenty of examples of this: policy frameworks for BGP SoO filtering, BGP large-community-based selective route filtering, prefix segment injection with correct SIDs, RPKI ROA validation configuration, and so on (see the sketch below). Without automation it would also be very difficult to enforce the desired configurations, and the result would be an opportunistic network where certain configurations are applied only if an engineer has remembered to, or cared to, configure them.
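As a sketch of that "build the logic once" idea, assuming a large-community-based selective filtering policy similar to the one mentioned above (the ASN, community values, and CLI syntax are made up for illustration, not an actual production framework):

```python
# Hypothetical sketch: one function renders the large-community-based
# selective route filter for a neighbor; the data model decides where
# it is applied. ASN, community values, and CLI syntax are invented.
ASN = 64500

def large_community_filter(neighbor: str, region_id: int) -> str:
    """Render a selective route filter keyed on a region large community."""
    return "\n".join([
        f"ip large-community-list REGION-{region_id} permit {ASN}:1:{region_id}",
        f"route-map FROM-{neighbor.upper()} permit 10",
        f" match large-community REGION-{region_id}",
        f"route-map FROM-{neighbor.upper()} deny 90",
    ])

# The logic is written once; automation multiplies it into every config.
peers = {"pe1": 1, "pe2": 1, "pe3": 2}
for neighbor, region in peers.items():
    print(large_community_filter(neighbor, region))
    print()
```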
By the way, I think "automation" is not actually the best word to describe this whole thing; I'd rather talk about a programmatic approach to network configuration. Ironically, Software Defined Networking would be a perfect term, if only SDN hadn't gained such a bad reputation in recent years, when it was used for just about anything. Maybe "network configuration as code" would describe pretty well what is usually referred to as "automation".
It is interesting that, as a community, after so many years of limited adoption of automation, and having built the biggest networks in the world on human labor and intelligence, we revert to the same debates we had 10 years ago, as if automation were about managing configurations.
If you took the time to read the subsequent reply in that post, you'd notice that nobody is shunning automation: "If the problem is well known you can apply rules to it (automation)." The problem with networking is that it produces an unbelievable number of corner cases that you only come to appreciate once you work with networks at scale. Big systems are unpredictable and can't be automated (perhaps now with AI...).
And yes, I've dealt with networks slightly bigger than a couple of routers ;)