Network Automation Expert Beginners
Some network automation skeptics arrived at their skepticism the hard way: they got burned by half-baked, semi-tested systems. This is what one of my good friends had to say in a LinkedIn comment:
I am suspicious of automation, as I’ve unfortunately seen too many outages caused by either human error or faulty automation. Every time it required human CLI/GUI intervention to correct it. The problem is that the more automation we push, the fewer people know how to use the “old school” way to administer stuff.
Network automation is not the only IT discipline that could cause hard-to-correct errors requiring manual intervention. I’m positive everyone knows at least one horror story resulting in manual tweaking of the Windows registry, or a sequence of arcane SQL commands[1].
However, one would expect that catastrophic outages would be rare and one-off events – after all, we’ve been developing software for decades, and we should have learned a few lessons along the way. Concepts like version control, transactions, thorough testing, and input validation are considered table stakes for serious software development organizations. Unfortunately, that’s not how the pundits were selling network automation benefits to the networking engineers eager to get out of the manual configuration quagmire.
As every software engineer or architect worth their salary knows, one should start with (at least) the following:
- Requirements: What should the solution do? What services do we plan to automate? Who are the expected end-users? What can we expect from them?
- Data structures: How will we describe the services we’re planning to automate? What are the relations between objects in our data model? What integrity rules should we keep in mind?
- Overall architecture: Where will we store the data? How will the user interface look? How will we interact with the network devices? How will we recover from inevitable failures?
- Business logic: What needs to be done? How will we get from the current state of the network to the desired final state?
- Testing plan: How will we test our solution? How will we make sure the tests are relevant? How will we minimize the deployment risks?
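To make the "data structures" and "integrity rules" questions a bit more concrete, here is a minimal sketch of a data model for a single service (access ports assigned to VLANs). The class names (`Vlan`, `AccessPort`, `ServiceModel`) and the rules themselves are assumptions made up for this illustration; a real solution would derive them from its own requirements.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vlan:
    vlan_id: int
    name: str


@dataclass(frozen=True)
class AccessPort:
    device: str
    interface: str
    vlan_id: int  # reference to Vlan.vlan_id


@dataclass
class ServiceModel:
    """Hypothetical data model: VLANs plus the access ports that use them."""
    vlans: list[Vlan] = field(default_factory=list)
    ports: list[AccessPort] = field(default_factory=list)

    def integrity_errors(self) -> list[str]:
        """Return human-readable integrity violations (empty list = consistent)."""
        errors = []
        seen_ids = set()
        for vlan in self.vlans:
            if not 1 <= vlan.vlan_id <= 4094:
                errors.append(f"VLAN {vlan.vlan_id}: ID outside 1-4094")
            if vlan.vlan_id in seen_ids:
                errors.append(f"VLAN {vlan.vlan_id}: defined more than once")
            seen_ids.add(vlan.vlan_id)

        seen_ports = set()
        for port in self.ports:
            key = (port.device, port.interface)
            if key in seen_ports:
                errors.append(f"{port.device} {port.interface}: assigned more than once")
            seen_ports.add(key)
            # Referential integrity: a port may only use a VLAN that is defined
            if port.vlan_id not in seen_ids:
                errors.append(f"{port.device} {port.interface}: unknown VLAN {port.vlan_id}")
        return errors


model = ServiceModel(
    vlans=[Vlan(10, "users")],
    ports=[AccessPort("sw1", "Gi0/1", 10), AccessPort("sw1", "Gi0/2", 20)],
)
for problem in model.integrity_errors():
    print(problem)
```

The point of the sketch is that the relations between objects are checked before anything touches a device. A relational database would enforce similar rules with keys and constraints.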
Also, nobody in their right mind would skip input validation in a mission-critical application[2] or deploy new code in production without thorough testing. Some applications go as far as checking the status of executed actions and rolling back on errors[3] – something that is sorely missing in way too many home-grown network automation solutions.
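As a rough illustration of "check the status of executed actions and roll back on errors," the sketch below runs a list of steps and undoes the completed ones in reverse order when a step fails. `Step` and `run_with_rollback` are hypothetical names, and the `apply`/`undo` callables stand in for whatever mechanism actually talks to the devices. The sketch also assumes that `undo` itself never fails, which a production system could not take for granted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    apply: Callable[[], None]  # performs the change, raises on failure
    undo: Callable[[], None]   # reverts the change made by apply


def run_with_rollback(steps: list[Step]) -> bool:
    """Apply steps in order; on the first failure undo completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            print(f"{step.name} failed ({exc}); rolling back {len(done)} step(s)")
            for finished in reversed(done):
                finished.undo()
            return False
        done.append(step)
    return True


# Usage with an in-memory set pretending to be the configured VLANs
configured = set()


def broken_step() -> None:
    raise RuntimeError("device rejected the change")


steps = [
    Step("add VLAN 10", lambda: configured.add(10), lambda: configured.discard(10)),
    Step("add VLAN 20", broken_step, lambda: None),
]
print(run_with_rollback(steps), configured)
```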
I’ve seen organizations that approached network automation like any other software development project, combining networking engineers (the expert users) with software developers. Some of these projects had astonishing results, but you rarely hear about them – people working on those projects have better things to do than to deliver unpaid presentations at conferences[4].
Instead, we got a deluge of blog posts, podcasts, and conference talks explaining how easy it is to automate your network if only you embrace the “we all have to become Python programmers” mantra or master Ansible. Instead of an in-depth discussion of architectures, data structures, software development methodologies, and challenges of modifying the state of a distributed system, those motivational talks often resulted in a cargo cult of expert beginners focused on low-level tools. Unsurprisingly, a quick “Ansible is so easy to use” talk followed by a glitzy demo is always sexier than “these are the five mandatory steps you should take before you can start automating your network.”
The situation is getting a bit better since the days I started talking about these concepts in the Building Network Automation Solutions online course. Network-to-Code occasionally publishes a blog post focused on automation concepts or architectures[5], Anton Karneliuk seems to have a sound curriculum, and every now and then someone describes how to use NetBox (or a similar tool) in an automation solution to create the source of truth. However, it’s still common to see YAML files or Excel spreadsheets used together with Ansible playbooks in “production-grade” automation solutions.
Don’t get me wrong, I’m not saying you cannot start automating network operations until you’ve built a full-blown system with CLI, API, and GUI on top of a relational database. If you need a quick solution that will grab some data from the network devices and create a report or a graph, go for it. If you need a tool that will help you troubleshoot the network, write it. You can’t do too much harm if you’re executing read-only commands from an account that has no device configuration privileges[6].
Also, if you have to build a proof-of-concept to persuade your management that it makes sense to automate service deployment, don’t waste your time on integration with a transactional data store. YAML files are good enough for the minimum viable demo. It’s also OK to continue using that approach in a small team, assuming you use version control to track the changes to the data model, and thoroughly review all the changes before they’re promoted into production. Some people found that they don’t need more than GitOps, and that’s perfectly fine as long as you know what you’re doing and what the risks and limitations of that approach are.
However, once you get the approval to build a production solution that will be used by non-experts (including members of other IT teams), you must start using better tools. Using Ansible Tower as the GUI sitting in front of Ansible playbooks controlled with external variables will quickly get you to the point where you’ll start blaming your users for entering incorrect data. In reality, you should blame yourself for choosing suboptimal tools and not validating the input data in the first place.
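A minimal sketch of what "validating the input data in the first place" could look like, using only the Python standard library. The request fields (`hostname`, `vlan_id`, `address`) and the function name `validate_request` are assumptions chosen for illustration, not part of any particular tool. The idea is to collect every problem and report it to the user before a playbook or script ever runs.

```python
import ipaddress
import re

# A single lowercase DNS label (an assumed naming rule for this example)
HOSTNAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_request(request: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the input is usable."""
    errors = []

    hostname = request.get("hostname")
    if not isinstance(hostname, str) or not HOSTNAME_RE.fullmatch(hostname):
        errors.append("hostname: expected a lowercase DNS label")

    vlan = request.get("vlan_id")
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(vlan, int) or isinstance(vlan, bool) or not 1 <= vlan <= 4094:
        errors.append("vlan_id: expected an integer between 1 and 4094")

    address = request.get("address")
    if not isinstance(address, str) or "/" not in address:
        errors.append("address: expected address/prefix, for example 192.0.2.1/24")
    else:
        try:
            ipaddress.ip_interface(address)
        except ValueError:
            errors.append(f"address: {address!r} is not a valid IP address/prefix")

    return errors


print(validate_request({"hostname": "SW 1", "vlan_id": "10", "address": "192.0.2.300/24"}))
```

Returning all errors at once (instead of failing on the first one) is a design choice: a non-expert user gets one complete list of things to fix.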
Long story short: don’t blame network automation if a script that someone hacked over a weekend for personal use got adopted as a company-wide “automation solution.” Even AWS got burned when someone failed to implement input sanity checks in one of their automation playbooks.
Revision History
- 2023-01-11: David Gee and Cristian Sirbu provided extensive feedback that made this blog post much better than the original draft (all the errors and the snark are still mine 😉). Thanks a million!
1. Or hours of downtime anxiously waiting for the database to be restored from backup tapes.
2. Considering the role of networking in modern IT infrastructure, one might argue that any read/write network automation is by default a mission-critical application.
3. You wouldn’t want to have money taken out of your bank account and disappear because the target account of your payment request does not exist, would you?
4. Unless they work for a major technology vendor that uses technical conferences as recruitment drives.
5. No surprise there – they run a consulting practice and have to set the expectations if they want their projects to be successful.
6. Hoping that the days of shoddy software that would crash when faced with rapid SNMP polling or a barrage of show commands are long gone.
"I’ve...seen too many outages caused by either human error or faulty automation." I.e., I've seen outages caused when stuff gets changed, regardless of whether it's by humans or machines.
OK, but his point about automation is a good one. The problem is often (as in the Amazon case) runaway automation, i.e., the provisioning system starts doing things other than expected, to devices other than expected, and operators cannot figure out how to stop it. This is a serious problem--runaway automation is what brought down the 737-MAX, twice.
"...conference talks explaining how easy it is to automate your network..." Errr, guilty as charged. I work in technical marketing, though :) I think a lot of us who have done that are just trying to get people going using Python/Ansible/whatever, so that they understand what they can actually do with these tools. I often end my talks telling people to start writing code, even if it's bad code. Well, this isn't going to create software engineers with rigorous coding and testing processes. I think the hope is that our audiences will realize the possibilities and then spend the time to develop the needed discipline. And you're absolutely right, operational data is the best way to start because it is harmless. Part of the problem is just the limitation of time--in a 90 minute Cisco Live session, I spend most of it explaining YANG/NETCONF to people who don't understand it. If I spent a lot of time discussing coding practices I wouldn't be delivering the platform/OS-specific information I'm expected to deliver.
The other major problem with automation systems is that network engineers are used to working in a configure-verify cycle. Configure the BGP peer, then see if it came up. Oh, it didn't, add the update-source. Now it's up, let's add a network statement. Now let's go back and verify the network made it into the BGP table. Now did it show up in the peer? It didn't? Etc., etc. Pushing bulk config with automation tools breaks this cycle and makes network engineers nervous. Theoretically, if the config is tested and validated in advance (on your digital twin, right? :)) then it should be proven, but things often work differently when pushed to a real network.
There's a lot to be done here--I do believe we need automation for networks given the challenges of scale, etc., these days. But "ad experimentum" Python scripting of critical network components is not something to be taken lightly.
> I think a lot of us who have done that are just trying to get people going using Python/Ansible/whatever, so that they understand what they can actually do with these tools.
... and the problem is that nobody tells them how far away they would still be from doing things right.
The whole thing reminds me of a very smart pilot who claimed he can solve crew scheduling for his airline on a ZX Spectrum -- he figured out how to do a bit of programming, but had no idea how hard doing things reliably and at scale really is.
> I think the hope is that our audiences will realize the possibilities and then spend the time to develop the needed discipline.
In the ideal world, that would be the case. Meanwhile on planet Earth... 🤷‍♂️
'Configure the BGP peer, then see if it came up. Oh, it didn't, add the update-source. Now it's up, let's add a network statement. Now let's go back and verify the network made it into the BGP table. Now did it show up in the peer? It didn't? Etc., etc., '
The first thought I had was why can't this workflow be done by automation?
The second thought I had was of a software engineer I greatly respect telling me that if we were serious about automation, then the next generation of routing protocols, etc., would be designed with automation in mind.
Put another way, how easy would it be to replicate the workflow above without more programmatic changes to the network?
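For what it's worth, the configure-then-verify loop quoted above can be sketched in code: push one small change, check the resulting state with a read-only call, and stop at the first stage that does not verify. Everything here is hypothetical (`Stage`, `push_in_stages`, and the callables that would wrap real device access). It says nothing about how hard it is to write a reliable `verify` function for a given protocol, which is arguably the real question.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    configure: Callable[[], None]  # pushes one small change
    verify: Callable[[], bool]     # read-only check of the resulting state


def push_in_stages(stages: list[Stage], attempts: int = 3, delay: float = 0.0) -> Optional[str]:
    """Apply stages in order; return the name of the first stage that
    fails verification, or None if every stage verified."""
    for stage in stages:
        stage.configure()
        for _ in range(attempts):
            if stage.verify():
                break
            time.sleep(delay)  # give the network time to converge
        else:
            return stage.name  # stop here, later stages are never pushed
    return None


# Usage with a dict standing in for observed network state
observed = {"peer_up": False, "prefix_in_bgp_table": False}
stages = [
    Stage("configure BGP peer", lambda: observed.update(peer_up=True), lambda: observed["peer_up"]),
    Stage("add network statement", lambda: None, lambda: observed["prefix_in_bgp_table"]),
]
print(push_in_stages(stages))
```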
No one thinks of the little places with a handful of gear and a handful of sites. All these discussions only follow the money.
Cattools made a bunch of automation possible but it was still faster to change a vlan on a port by hand. And still is.
I wouldn't say that. Some vendors (Ubiquiti?) have decent solutions for small networks.
If you want to use traditional networking gear with traditional device configurations and configure it the way you like, then you're right, nobody cares about that. After all, we all have to pay the bills at the end of the month.
Still, there are tons of tools (free or commercial) that one can use to automate the smallest networks, but you'll be in the IKEA land (aka "some assembly required"). Once I got so sick-and-tired of Cisco IOS static DHCP mappings that I created an Ansible playbook that managed them on a single router, but you have to be in a very special state of mind to automate something that's done once every other blue moon... although I did do that a few times just so I wouldn't have to rediscover how it was done ever again.
That's sort of how I feel at my current job at times. We have about 40 sites, but we don't do total refresh cycles in bulk, just as needed. Everything we do is sporadic, and I'm trying to see the ROI on learning automation for things that are done once in a while, and don't take much time to do manually anyway.
I can spend 40 hours trying to figure out how to automate something that will only take me 30 minutes, and that I will only do 10 times a year. Only for nobody to use that same automation method, or for it to become obsolete, or simply need to spend more hours to maintain and update it.
I feel like I'll end up spending way more time updating an automation tool than actually working on the network lol
I'm totally clueless so I donno.