Upgrade Network Device Software with Ansible Playbook

One of the engineers going through my Ansible for Networking Engineers online course sent me this question:

In the Introduction section, you mention a use case of upgrading software. Do you have an example playbook?

Unfortunately, I don’t. Upgrading software is one of those things that’s almost impossible to simulate in a virtual lab.

However, it’s pretty easy to figure out a rough solution using the principles we’re discussing in the Building Network Automation Solutions online course:

  • Figure out how you’d do that manually;
  • Figure out everything that could go wrong, and how to prevent each failure or minimize its impact;
  • Figure out how to validate that things have gone well before proceeding with destructive steps.

Here’s my improvised attempt to upgrade Cisco IOS (without having access to any real-life devices because vacations):

  • Verify that the device is running the expected (old) software version; you don’t want to blindly upgrade from just any software release;
  • Verify that the device has enough resources to run the new software version. Don’t trust the central inventory; it’s easy enough to gather device facts with napalm_get_facts or equivalent, or to run show version and scrape the data you’re interested in;
  • Use scp to copy the new software image to the devices you want to upgrade (there are a few SCP examples in my sample playbooks);
  • Use a show command to inspect what’s in flash, isolate just the file you’re interested in, and verify its MD5 checksum;
  • Find the current boot image (for example, using show run | include boot system);
  • Specify both the new and the current boot image as boot sources (boot system configuration command in Cisco IOS), new image first, so the device can fall back to the old image if the new one fails early in the boot process;
  • Reboot and wait until you can SSH to the box again.
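
The checks in the list above mostly boil down to scraping command output and building a short list of configuration commands. Here is a minimal Python sketch of that part. It is untested against real devices; the function names are hypothetical, and the regular expressions assume the usual shape of Cisco IOS show version, verify /md5 and show run | include boot system output. Getting the text from the device (with Ansible, NAPALM or anything else) is left out.

```python
import re
from typing import List, Optional


def parse_ios_version(show_version: str) -> Optional[str]:
    """Extract the release string from 'show version' output.

    Assumes a line like '... Version 15.2(4)M6, RELEASE SOFTWARE ...'.
    """
    match = re.search(r"Version\s+([^,\s]+)", show_version)
    return match.group(1) if match else None


def md5_matches(verify_output: str, expected_md5: str) -> bool:
    """Compare the hash printed by 'verify /md5 <file>' with the expected one.

    Assumes the output ends with '... = <32 hex digits>'.
    """
    match = re.search(r"=\s*([0-9a-fA-F]{32})\b", verify_output)
    return bool(match) and match.group(1).lower() == expected_md5.lower()


def current_boot_images(boot_lines: str) -> List[str]:
    """Return image names from 'show run | include boot system' output."""
    return re.findall(r"^\s*boot system\s+(\S.*?)\s*$", boot_lines, re.MULTILINE)


def build_boot_commands(new_image: str, old_images: List[str]) -> List[str]:
    """New image first, old images kept as fallback boot sources."""
    commands = ["no boot system", f"boot system {new_image}"]
    commands += [f"boot system {old}" for old in old_images if old != new_image]
    return commands
```

A playbook (or script) would refuse to continue unless parse_ios_version returns the expected old release and md5_matches returns True, and only then push the commands from build_boot_commands and reload the device.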

Obviously, you’d take a few additional steps to make the process a bit safer (or at least easier to recover from failures):

  • If you have redundant devices in your network design, split them into inventory groups, and upgrade just one group at a time;
  • Perform the very first upgrade on a device that’s physically close to you so it’s easy to go over and recover if you manage to brick it;
  • Stop the playbook on any failure; if a single upgrade fails, you don’t want to move forward before investigating the root cause;
  • Perform the reload-and-wait operation with increasingly large batches of devices. Tom Hollingsworth had a fun idea to use the Fibonacci series.
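
The last two ideas (stop on the first failure, grow the batch size along the Fibonacci series) can be sketched in a few lines of Python. This is an illustration only: fibonacci_batches and upgrade_in_batches are hypothetical names, and upgrade_one stands for whatever actually reloads a device and reports success.

```python
from typing import Callable, Iterator, List, Sequence


def fibonacci_batches(devices: Sequence[str]) -> Iterator[List[str]]:
    """Yield batches of 1, 1, 2, 3, 5... devices until the list is exhausted."""
    size, next_size = 1, 1
    start = 0
    while start < len(devices):
        yield list(devices[start:start + size])
        start += size
        size, next_size = next_size, size + next_size


def upgrade_in_batches(
    devices: Sequence[str], upgrade_one: Callable[[str], bool]
) -> List[str]:
    """Upgrade batch by batch; raise on the first failed device."""
    done: List[str] = []
    for batch in fibonacci_batches(devices):
        for device in batch:
            if not upgrade_one(device):
                raise RuntimeError(
                    f"upgrade failed on {device}; upgraded so far: {done}"
                )
            done.append(device)
    return done
```

With ten devices the batches contain 1, 1, 2, 3 and 3 devices. In an Ansible playbook the same effect should be achievable without custom code: the play-level serial keyword accepts a list of batch sizes, and any_errors_fatal stops the play when a host fails.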

You might also want to collect critical information (interface state, LLDP neighbors, OSPF neighbors, number of routing table entries…) before and after the upgrade to verify the upgrade didn’t break something fundamental.
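
Comparing the two snapshots is the easy part once the data is collected. The following Python sketch assumes each snapshot is a dictionary with hypothetical keys (interfaces_up, lldp_neighbors, ospf_neighbors, route_count); how you fill it depends on your tooling. The 5% tolerance on the number of routes is an arbitrary value chosen for illustration.

```python
from typing import Any, Dict, List


def compare_snapshots(
    before: Dict[str, Any], after: Dict[str, Any], route_tolerance: float = 0.05
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    # Anything that was up or adjacent before the upgrade must still be there.
    for key in ("interfaces_up", "lldp_neighbors", "ospf_neighbors"):
        missing = set(before.get(key, [])) - set(after.get(key, []))
        if missing:
            problems.append(f"{key}: missing after upgrade: {sorted(missing)}")

    # Routing table size may fluctuate a bit; flag only large changes.
    routes_before = before.get("route_count", 0)
    routes_after = after.get("route_count", 0)
    if routes_before:
        change = abs(routes_after - routes_before) / routes_before
        if change > route_tolerance:
            problems.append(
                f"route_count changed from {routes_before} to {routes_after}"
            )
    return problems
```

A non-empty result would be a good reason to stop the rollout before touching the next batch of devices.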

Finally, nothing beats running real end-to-end tests after the upgrade; ToDD might be an interesting framework if you’re not doing them already.
