Automation Gone Wild

My “this is why you need automation” blog post triggered numerous comments and tweets. I loved this one:

What if the mistake was embedded into the automation process/tool (designed by humans) in the first place? Now you have a video series titled "Automation Gone Wild".

I guess this tweet is a priceless answer to that question:

On a more serious note, I’ve heard plenty of horror stories when delivering my Network Automation and SDN workshops. Here are just two of them.

Automating Cisco IOS Software Updates

A large financial organization automated IOS software updates – routers across hundreds of branch offices were regularly upgraded to the latest software release overnight.

Everything worked great until someone made a typo and specified an .md5 file instead of a .bin file in the upgrade request, bricking hundreds of remote routers.

The next few days were spent frantically running around and downloading the software to the routers through the console port.

Lesson Learned: check the inputs. Make changes gradually (Tom Hollingsworth recommends Fibonacci sequences).
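
A minimal Python sketch of both lessons, under stated assumptions: `upgrade` and `healthy` are hypothetical callables you would supply yourself (they are not part of any real library), and checking the file extension is only the crudest possible input check, although it would have caught the .md5 typo described above.

```python
from typing import Callable, Iterator, List, Sequence


def validate_image_name(filename: str) -> None:
    """Reject anything that does not look like a .bin software image."""
    if not filename.lower().endswith(".bin"):
        raise ValueError(f"not a .bin software image: {filename!r}")


def fibonacci_batches(devices: Sequence[str]) -> Iterator[List[str]]:
    """Yield batches of size 1, 1, 2, 3, 5, 8, ... until the devices run out."""
    a, b = 1, 1
    start = 0
    while start < len(devices):
        yield list(devices[start:start + a])
        start += a
        a, b = b, a + b


def rollout(
    devices: Sequence[str],
    image: str,
    upgrade: Callable[[str, str], None],   # hypothetical: upgrades one device
    healthy: Callable[[str], bool],        # hypothetical: post-upgrade check
) -> List[str]:
    """Upgrade in growing batches and stop at the first unhealthy batch."""
    validate_image_name(image)  # check the inputs before touching any device
    done: List[str] = []
    for batch in fibonacci_batches(devices):
        for device in batch:
            upgrade(device, image)
        failed = [d for d in batch if not healthy(d)]
        if failed:
            # Abort here so the damage is limited to the current batch.
            raise RuntimeError(f"rollout stopped, unhealthy devices: {failed}")
        done.extend(batch)
    return done
```

With ten devices the batches contain 1, 1, 2, 3 and 3 routers, so a bad image takes down one router instead of hundreds.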

Automating Firewall Policy Deployment

A retail organization used a vendor firewall management tool to manage firewalls on hundreds of small sites around the country. They also used that tool to deploy changes to firewall and VPN policies.

The network administrator assumed that the software might fail, so he split the sites into a dozen groups, and spread the sites in each group around the country, so that a failure of one group wouldn’t bring a whole region down.

Not surprisingly, a failure did happen: some weird glitch in the management software caused the update of one of the groups to fail, leaving those firewalls in some weird inaccessible state.

Now the resilience built into the whole system worked against the poor network admin – he had to travel around the whole country fixing firewalls belonging to that particular group. Fortunately, the country in question wasn’t the US or China ;)

Lesson Learned: things will fail. Always have a Plan B which does not involve driving around. reload in 5 is your friend.
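
On Cisco IOS, reload in 5 schedules a reload five minutes from now and reload cancel calls it off; because the router boots from its saved startup configuration, unsaved changes disappear if the reload fires. A rough Python sketch of using that as a dead-man switch follows. The `send` and `verify` callables are hypothetical placeholders for whatever CLI session and reachability check you already have, and `send` is assumed to answer any confirmation prompts the device raises.

```python
from typing import Callable, Iterable


def change_with_rollback(
    send: Callable[[str], None],    # hypothetical: runs one CLI command
    config_lines: Iterable[str],
    verify: Callable[[], bool],     # hypothetical: is the device still OK?
    minutes: int = 5,
) -> bool:
    """Apply a change under a scheduled reload; keep it only if verify() passes."""
    send(f"reload in {minutes}")    # arm the dead-man switch first
    send("configure terminal")
    for line in config_lines:
        send(line)
    send("end")
    if not verify():
        # Leave the reload armed: the device restarts with the old,
        # still-untouched startup configuration.
        return False
    send("reload cancel")
    send("copy running-config startup-config")  # make the change permanent
    return True
```

If the session dies halfway through, nothing cancels the reload, which is exactly the point: the device comes back on its own with the last saved configuration.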

Now what?

Based on horror stories like these, you might conclude that you don’t want to touch network automation with a 10-foot pole. I’m positive the bank tellers had the same reaction when banks introduced the first computers: how could you possibly trust that thing to balance the books properly?

In the meantime, we made huge progress in software development processes, and if you want to have a better chance of a working solution you should use those processes when developing your network automation solutions.

Just because we know how to develop software properly doesn’t mean that your organization is doing it, or that people aren’t cutting corners. Then there are the expert beginners and software development horror stories.

Summary: You MUST treat network automation like any other mission-critical application and use the same processes in its development (source code/version control, unit tests, integration tests...); implementing it with a series of scripts hacked together by a programmer wannabe (aka networking engineer) over a weekend will get you the results you deserve.

Also, if you want to know more about network automation, check out my workshop and webinars.

7 comments:

  1. 1. Canary changes and test them before...
    2. Doing incremental rollouts starting on some staging environment if possible

    that applies to any configuration change, code deployment, software/OS upgrade, etc you want to perform,

    and remember that automation doesn't mean doing things unattended. If you are doing risky changes you can always monitor the automated rolling update and abort if something smells fishy.
  2. As long as the "Hit and Run" approach is used, nothing will be different.
  3. To the point about the firewall management software....sounds a lot like Panorama from Palo Alto Networks. I love their firewalls, but I'm not aware of any equivalent to "reload in 5" or "commit confirm 5", so if you brick it you brick it. For the engineer in that scenario, had he done the update as one large group, he would just have had more driving.

    Having said that, I'm all for automating certain things, and of course testing this in the lab/pre-production first.
  4. And the automation software should of course detect if the change was successful before it went on to the next device. Even RANCID could do this in 2006.
  5. How to define successful though? Completed or committed doesn't always mean correct.
    Replies
    1. Test after the change. Look at ToDD (on keepingitclassless.net) or test-driven network development (http://blog.ipspace.net/2015/11/test-driven-network-development-with.html)