Recovering from Network Automation Failures
This blog post was initially sent to subscribers of my SDN and Network Automation mailing list. Subscribe here.
One of my readers sent me this question:
Would you write about methods for reverting from expected new state to old state in the case automation went wrong due to (un)predictable events that left a node or network in a limbo state betwixt and between.
As always, there's an easy part and a really hard part.
The easy part: Many network devices have configuration rollback functionality. That’s the easiest thing to do when you catch errors in your automation script and want to do a controlled recovery.
You do check error messages received from the devices you’re configuring, right… or at least consider what might happen on error, and how you’d recover from that? There’s a reason we talk about error checking in the network automation online course and about handling errors in Ansible playbooks in the Ansible for Networking Engineers online course.
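To give you an idea, here's a minimal Ansible sketch of catching an error and falling back to a known-good configuration with block/rescue. It assumes Cisco IOS devices and the cisco.ios collection; the host group, the ACL lines and the golden_config.txt file are made up.

```yaml
---
# Minimal sketch: apply a change inside a block, restore a known-good
# configuration if anything fails. Hostnames, ACL lines and the
# golden_config.txt file are hypothetical.
- name: ACL change with a rescue-based rollback
  hosts: routers
  gather_facts: false
  connection: ansible.netcommon.network_cli
  tasks:
    - name: Change the ACL, roll back on any error
      block:
        - name: Back up the running configuration first
          cisco.ios.ios_config:
            backup: true

        - name: Push the (potentially risky) ACL change
          cisco.ios.ios_config:
            lines:
              - permit ip 10.0.0.0 0.0.0.255 any
            parents:
              - ip access-list extended EDGE-IN
      rescue:
        - name: Something failed, reapply the known-good configuration
          # Note: on IOS this merges the saved config back in; it is not
          # a true configuration replace, so it's a simplification.
          cisco.ios.ios_config:
            src: golden_config.txt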
Next, there’s configure confirm, commit confirm or a similar mechanism that automatically rolls back the configuration if your automation script doesn’t confirm the changes within a specified timeframe. Always use it as a safeguard against stupid errors like cutting yourself off with a mangled ACL change.
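Here's roughly what that could look like on Junos with Ansible, as a minimal sketch: it assumes the junipernetworks.junos collection and a NETCONF connection, and the firewall filter is made up.

```yaml
---
# Minimal sketch of commit confirmed: the change reverts on its own
# unless a second task confirms it within the timer.
- name: ACL change guarded by commit confirmed
  hosts: junos_routers
  gather_facts: false
  connection: ansible.netcommon.netconf
  tasks:
    - name: Commit the change, auto-rollback after 5 minutes
      junipernetworks.junos.junos_config:
        lines:
          - set firewall family inet filter EDGE-IN term 10 then accept
        confirm: 5

    - name: Confirm the pending commit
      # If we cut ourselves off, this task never runs and the device
      # rolls back by itself when the timer expires.
      junipernetworks.junos.junos_config:
        confirm_commit: true
```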
Failing that, there’s always reload in 5 or equivalent if you need a bigger hammer.
Two-phase commit. If you want to implement an all-or-nothing change across multiple devices, you need an equivalent of a two-phase commit:
- Prepare and validate changes on all affected devices;
- Execute the change on all devices.
This sounds easy but is really hard to do in real life if you want to deal with Byzantine failures. You might want to add an extra step – using commit confirm and confirming the change on all devices after it has been committed on the last device – but there’s no perfect solution. There’s always a slight chance you’ll lose communication with a device at the wrong time and will have to escalate the recovery to a network operator.
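Here's a minimal sketch of that commit confirm variant spread across two Ansible plays. It assumes Junos devices, the junipernetworks.junos collection, and made-up per-device configuration files.

```yaml
---
# Phase 1: commit with an automatic rollback timer on every device.
# If any device fails, any_errors_fatal aborts the whole run and the
# devices that already committed revert on their own when the timer expires.
- name: Phase 1 -- commit with an automatic rollback timer
  hosts: junos_routers
  gather_facts: false
  connection: ansible.netcommon.netconf
  any_errors_fatal: true
  tasks:
    - name: Load and commit the change, revert in 10 minutes unless confirmed
      junipernetworks.junos.junos_config:
        src: "configs/{{ inventory_hostname }}.set"   # hypothetical per-device file
        confirm: 10

# Phase 2: confirm the commit everywhere after the last device succeeded.
- name: Phase 2 -- confirm the commit on every device
  hosts: junos_routers
  gather_facts: false
  connection: ansible.netcommon.netconf
  tasks:
    - name: Confirm the pending commit
      junipernetworks.junos.junos_config:
        confirm_commit: true
```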
Reduce the chance of failure. Thoroughly validate input data and device state before you start making changes. That’s not something google-and-paste automation scripters usually consider, so I added testing and validation as a separate module in the Building Network Automation Solutions online course.
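For example, a pre-change validation play could look something like this. It's a minimal sketch assuming the cisco.ios collection; the VLAN variable and the software-version expectation are made up.

```yaml
---
# Minimal sketch: check the input data and the current device state
# before touching anything.
- name: Validate before you configure
  hosts: switches
  gather_facts: false
  connection: ansible.netcommon.network_cli
  tasks:
    - name: Sanity-check the input data
      ansible.builtin.assert:
        that:
          - vlan_id is defined
          - vlan_id | int >= 2
          - vlan_id | int <= 4094
        fail_msg: "vlan_id is missing or outside the usable range"

    - name: Collect basic device facts
      cisco.ios.ios_facts:
        gather_subset: min

    - name: Refuse to run against an unexpected software version
      ansible.builtin.assert:
        that:
          - ansible_net_version is version('16.9', '>=')
        fail_msg: "Device runs {{ ansible_net_version }}, expected 16.9 or later"
```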
Finally, there are the big guys – Cisco NSO implements an equivalent of a two-phase commit across almost anything. They deal with all sorts of implementation stupidities and unreachable devices, and can roll back a change or postpone parts of it if needed. Keep in mind it took them years to get there, so if you feel you need that level of robustness, it might be cheaper to buy the product than to reinvent the wheel.
Unattended automation runs over hundreds of devices tend to scare some people :)
Automation runs should be idempotent; aborting on a failure and retrying until the desired state has been reached could be an approach. And if you're not using an out-of-band management network for your automation runs, start doing so. Please don't route your OOB traffic over the production network. That way you can actually roll back or retry your change, even if you've e.g. borked a critical ACL. To some people building a parallel network sounds like a useless investment, until someone brings down the network. But unless you're really running at scale, you will notice a rack or SER going down for a longer period of time (and 15+ minutes is a long time, but not nearly enough for you to fix things). And regarding 'useless investments': a virtual/physical lab doesn't hurt reliability either.
I would even add a third phase: after executing the change on (one) device, verify the actual running state and possibly bail out if this compliance check fails, to avoid a cascading failure (Ansible's serial: 1 / any_errors_fatal: true come to mind).
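Something along these lines, as a minimal sketch: the change template and the (admittedly crude) BGP health check are made up, and it assumes the cisco.ios collection.

```yaml
---
# Minimal sketch of "change one device, verify, then move on":
# serial: 1 walks the devices one at a time, any_errors_fatal stops the
# whole run as soon as a post-change check fails on any device.
- name: Rolling change with a per-device compliance check
  hosts: routers
  gather_facts: false
  connection: ansible.netcommon.network_cli
  serial: 1
  any_errors_fatal: true
  tasks:
    - name: Apply the change
      cisco.ios.ios_config:
        src: "templates/change.j2"   # hypothetical change template

    - name: Collect the post-change BGP state
      cisco.ios.ios_command:
        commands:
          - show ip bgp summary
      register: bgp_summary

    - name: Bail out if the device no longer looks healthy
      # Crude string check for illustration only; a real check would
      # parse the output properly.
      ansible.builtin.assert:
        that:
          - "'Idle' not in bgp_summary.stdout[0]"
          - "'Active' not in bgp_summary.stdout[0]"
        fail_msg: "BGP looks broken after the change, aborting the run"
```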
Try to write your automation scripts for both desired states: 'present' and 'absent'.
(You'll soon realize that this is not always easy; sometimes the order of steps is important. Some automation tools fare better than others.)
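For example, a minimal sketch using the cisco.ios.ios_vlans resource module; the VLAN and the desired_state variable are made up:

```yaml
---
# Minimal sketch: one playbook, two desired states. A single variable
# decides whether the VLAN is created or removed.
- name: Create or remove a VLAN depending on the desired state
  hosts: switches
  gather_facts: false
  connection: ansible.netcommon.network_cli
  vars:
    desired_state: present      # or 'absent'
  tasks:
    - name: Ensure VLAN 100 exists
      cisco.ios.ios_vlans:
        config:
          - vlan_id: 100
            name: CUSTOMER_A
        state: merged
      when: desired_state == 'present'

    - name: Ensure VLAN 100 is gone
      cisco.ios.ios_vlans:
        config:
          - vlan_id: 100
        state: deleted
      when: desired_state == 'absent'
```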
Unless you're storing state, it makes sense to be able to reconstruct or calculate related objects and values based on a naming convention. This kind of assumes you're not intertwining or reusing objects. If you're not running into performance or scalability issues, give it some thought. A machine doesn't care about 'cluttered' lists; it won't even use a list if it can deduce the object (name). Try to think like a machine :)
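For example, a made-up naming convention that hangs everything off a single customer ID, so a later cleanup run can reconstruct the same names without any stored state (just a vars fragment, not a complete playbook):

```yaml
# Hypothetical naming convention: every object name is derived from
# customer_id, so nothing needs to be stored to find it again later.
vars:
  customer_id: "ACME-0042"
  vlan_name: "CUST_{{ customer_id }}"
  vrf_name: "VRF_{{ customer_id }}"
  acl_in_name: "ACL_{{ customer_id }}_IN"
  acl_out_name: "ACL_{{ customer_id }}_OUT"
```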
Of course, these aren't golden rules that have to be adhered to, but they will help you put your mind in a slightly different gear and tackle issues smarter.