This blog post was initially sent to subscribers of my SDN and Network Automation mailing list. Subscribe here.
One of my readers sent me this question:
Would you write about methods for reverting from expected new state to old state in the case automation went wrong due to (un)predictable events that left a node or network in a limbo state betwixt and between.
As always, there’s the easy part and the really hard part.
The easy part: Many network devices have configuration rollback functionality. That’s the easiest thing to do when you catch errors in your automation script and want to do a controlled recovery.
You do check error messages received from the devices you’re configuring, right… or at least consider what might happen on error, and how you’d recover from that? There’s a reason we talk about error checking in the network automation online course and about handling errors in Ansible playbooks in the Ansible for Networking Engineers online course.
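Here is a minimal sketch of that controlled recovery: save a rollback point, apply the commands, and restore the saved configuration as soon as the device reports an error. `FakeDevice`, `DeviceError`, `checkpoint`, `send` and `rollback` are hypothetical names standing in for whatever your device library offers; nothing here talks to a real device.

```python
class DeviceError(Exception):
    """Raised when the (hypothetical) device rejects a command."""


class FakeDevice:
    """In-memory stand-in for a device with configuration rollback."""

    def __init__(self, config):
        self.config = list(config)

    def checkpoint(self):
        # A real device would create a rollback point or archive its config.
        return list(self.config)

    def send(self, command):
        # Assumption for the demo: anything starting with "bogus" is rejected.
        if command.startswith("bogus"):
            raise DeviceError(f"% Invalid input: {command}")
        self.config.append(command)

    def rollback(self, saved):
        self.config = list(saved)


def apply_with_rollback(device, commands):
    """Apply all commands, or restore the saved configuration on any error."""
    saved = device.checkpoint()
    try:
        for command in commands:
            device.send(command)
    except DeviceError as err:
        device.rollback(saved)
        return False, str(err)
    return True, ""


device = FakeDevice(["hostname r1"])
ok, message = apply_with_rollback(device, ["vlan 10", "bogus command"])
print(ok, message, device.config)
```

The point of the sketch is the shape of the error handling, not the device model: the first command succeeds, the second fails, and the device ends up exactly where it started instead of being half-configured.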
Next, there’s configure confirm, commit confirm or similar functionality that automatically rolls back the configuration if your automation script doesn’t confirm the changes within the specified timeframe. Always use that as a safeguard against stupid errors like cutting yourself off with a mangled ACL change.
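The confirmed-commit safeguard could look roughly like this from the automation script's point of view. The `ConfirmDevice` class is a hypothetical in-memory model (including the `tick` method that simulates time passing), and `still_reachable` is an assumed post-change check you would supply yourself; real devices implement the timer on their own.

```python
class ConfirmDevice:
    """In-memory model of a device that supports a confirmed commit."""

    def __init__(self, config):
        self.running = dict(config)
        self._previous = None   # config to restore if nobody confirms
        self._deadline = None   # seconds left before automatic rollback

    def commit_confirmed(self, new_config, timeout):
        self._previous = dict(self.running)
        self.running = dict(new_config)
        self._deadline = timeout

    def confirm(self):
        self._previous = None
        self._deadline = None

    def tick(self, seconds):
        """Simulate time passing; roll back if the timer expires."""
        if self._deadline is None:
            return
        self._deadline -= seconds
        if self._deadline <= 0:
            self.running = self._previous
            self._previous = None
            self._deadline = None


def change_with_safeguard(device, new_config, still_reachable, timeout=300):
    """Commit with a timer and confirm only if the post-change check passes."""
    device.commit_confirmed(new_config, timeout)
    if still_reachable(device):
        device.confirm()
        return True
    # No confirmation: the device restores the old config when the timer expires.
    return False


def mgmt_allowed(device):
    # Assumed check: the management ACL must still permit the automation host.
    return "permit mgmt" in device.running["acl"]


router = ConfirmDevice({"acl": ["permit mgmt", "deny any"]})
confirmed = change_with_safeguard(router, {"acl": ["deny any"]}, mgmt_allowed)
router.tick(300)  # the timer runs out because the script never confirmed
print(confirmed, router.running)
```

The design choice worth noting is that the script does nothing on failure. If the mangled ACL had really cut it off, it could not have sent a rollback command anyway, which is exactly why the timer has to live on the device.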
Failing that, there’s always reload in 5 or equivalent if you need a bigger hammer.
Two-phase commit. If you want to implement an all-or-nothing change across multiple devices, you need an equivalent of a two-phase commit:
- Prepare and validate changes on all affected devices;
- Execute the change on all devices.
This sounds easy, but is really hard to do in real life if you want to deal with byzantine failures. You might want to add an extra step – using commit confirm and confirming the change on all devices after it has been committed on the last device – but there’s no perfect solution. There’s always a slight chance you’ll lose communication with a device at the wrong time, and will have to escalate the recovery to a network operator.
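The two phases plus the extra confirmation step can be sketched as a small coordinator. This is an illustration under stated assumptions, not a robust implementation: `Node` is a hypothetical in-memory device handle, its `fail_on` argument exists only to simulate failures, and the sketch ignores the hard case where a device disappears during the final confirmation loop.

```python
class Node:
    """Hypothetical device handle; a real one would wrap SSH, NETCONF or an API."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on   # "prepare" or "commit" to simulate a failure
        self.running = "old"
        self.candidate = None
        self.pending = False     # True while a commit awaits confirmation
        self._saved = None

    def prepare(self, change):
        if self.fail_on == "prepare":
            raise RuntimeError(f"{self.name}: validation failed")
        self.candidate = change

    def discard(self):
        self.candidate = None

    def commit_confirmed(self):
        if self.fail_on == "commit":
            raise RuntimeError(f"{self.name}: unreachable during commit")
        self._saved, self.running = self.running, self.candidate
        self.pending = True

    def confirm(self):
        self.pending = False
        self.candidate = None

    def rollback(self):
        # Explicit rollback; an unconfirmed commit would also time out by itself.
        if self.pending:
            self.running = self._saved
            self.pending = False
        self.candidate = None


def two_phase_change(nodes, change):
    # Phase 1: prepare and validate everywhere before touching anything.
    try:
        for node in nodes:
            node.prepare(change)
    except RuntimeError as err:
        for node in nodes:
            node.discard()
        return f"aborted in prepare: {err}"

    # Phase 2: commit with a confirmation timer on every device.
    try:
        for node in nodes:
            node.commit_confirmed()
    except RuntimeError as err:
        for node in nodes:
            node.rollback()
        return f"rolled back: {err}"

    # Extra step: confirm only after the last device has committed.
    for node in nodes:
        node.confirm()
    return "committed"


fabric = [Node("leaf1"), Node("leaf2", fail_on="commit"), Node("leaf3")]
print(two_phase_change(fabric, "new"), [n.running for n in fabric])
```

In the example, leaf1 has already committed when leaf2 fails, so the coordinator rolls leaf1 back and leaf3 is never touched. The remaining gap is the one described above: if a device becomes unreachable between its commit and the confirmation, no amount of coordinator logic can fix it, and a human has to take over.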
Reduce the chance of failure. Thoroughly validate input data and device state before you start making changes. That’s not usually considered by google-and-paste automation scripters, so I added testing and validation as a separate module in the Building Network Automation Solutions online course.
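As a small illustration of validating before changing anything, the following sketch checks a request to create a VLAN against both the input data and the state previously collected from the device. The request layout (`vlan_id`, `subnet`) and the `existing_vlans` set are assumptions made up for this example; only the standard-library `ipaddress` module is used.

```python
import ipaddress


def validate_vlan_request(request, existing_vlans):
    """Return a list of problems; an empty list means the request looks sane."""
    errors = []

    # Input data check: the VLAN ID must be a valid 802.1Q VLAN number.
    vlan_id = request.get("vlan_id")
    if not isinstance(vlan_id, int) or not 1 <= vlan_id <= 4094:
        errors.append(f"vlan_id must be an integer between 1 and 4094, got {vlan_id!r}")
    # Device state check: do not overwrite a VLAN that is already configured.
    elif vlan_id in existing_vlans:
        errors.append(f"VLAN {vlan_id} already exists on the device")

    # Input data check: the subnet must be a proper network prefix.
    try:
        ipaddress.ip_network(str(request.get("subnet", "")))
    except ValueError as err:
        errors.append(f"invalid subnet: {err}")

    return errors


problems = validate_vlan_request({"vlan_id": 5000, "subnet": "10.0.10.1/24"}, {1, 20})
for problem in problems:
    print(problem)
```

Returning a list of all problems instead of failing on the first one is a deliberate choice: the person who submitted the request gets to fix everything in one pass, and the automation script refuses to start unless the list is empty.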
Finally, there are the big guys – Cisco NSO implements an equivalent of two-phase commit across almost anything. They deal with all sorts of implementation stupidities and unreachable devices, and can roll back a change or postpone parts of it if needed. Keep in mind it took them years to get there, so if you feel you need that level of robustness, it might be cheaper to buy the product than to reinvent the wheel.