Automating Cisco ACI Environment with Python and Ansible
This is a guest blog post by Dave Crown, Lead Data Center Engineer at the State of Delaware. He can be found automating things when he's not in meetings or fighting technical debt.
Over the course of the last year or so, I’ve been working on building a solution to deploy and manage Cisco’s ACI using Ansible and Git, with Python to spackle in the cracks. The goal I started with was to take the plain-text description of our network from a Git server, pull in any requirements, use the solution to configure the fabric, and lastly update our IPAM, Netbox. All this without using the GUI or CLI to make changes. Most importantly, I wanted to run it with a simple invocation so that others could run it, and so it could be moved into Ansible Tower when ready.
Keeping the abstracted definition of the network on a Git server means we can see who changed what and when, run a validation script, and require sign-offs on changes before they touch the network. By making Git the source of truth, as well as the only input describing how to build the network, I ended up with a nice formal, yet readable, “this is the network” definition.
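To give a feel for what that abstracted definition can look like, here is a hypothetical fragment of the kind of YAML file that could live in the Git repository; the structure, keys, and names below are illustrative, not the actual data model used in the solution:

```yaml
# fabric_definitions.yml - hypothetical fragment of the abstracted network definition
tenants:
  - name: PROD
    vrfs:
      - name: PROD_VRF
    bridge_domains:
      - name: BD_WEB
        vrf: PROD_VRF
        subnets:
          - 10.1.10.1/24
    epgs:
      - name: EPG_WEB
        bridge_domain: BD_WEB
```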
Ansible was chosen early on due to the readability it brings with it. For better or worse, ACI is driven by a RESTful API. Had I developed the solution in pure Python, I’m sure the end result would have run faster, with fewer hurdles to develop around; the tradeoff would be a solution that’s harder for someone else to follow up on. A second benefit of Ansible is that a lot of time was saved by not building supporting code. I didn’t have to spend time framing my API calls or validating that I got an HTTP 200 back; I spent my time on useful things.
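As a rough illustration of that point, a single task using Ansible’s aci_rest module looks something like the sketch below. The module handles authentication, the HTTP session, and failing the task on a non-2xx response, so none of that has to be written by hand (the hostname, credentials, and tenant payload are placeholders):

```yaml
- name: Ensure tenant exists
  aci_rest:
    host: "{{ apic_host }}"
    username: "{{ apic_user }}"
    password: "{{ apic_password }}"
    validate_certs: false
    path: /api/mo/uni.json
    method: post
    content:
      fvTenant:
        attributes:
          name: "{{ tenant_name }}"
  delegate_to: localhost
```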
Early on, I wanted to modularize the play. From an Ansible perspective, this means lots of roles. Creating roles is crucial, as it lets you reuse code you bang out in other places. Without devolving into a debate on the merits of ACI, for better or worse, one doesn’t just push a candidate config file through a config-replace tool and commit; everything is an API call to configure an element. By grouping each set of API calls and steps into a role, I can reuse it in other plays without having to reinvent the wheel I already invented. It also makes reading plays a lot easier. Modularization isn’t just breaking the process into smaller reusable steps; it’s also leaving room for knobs you don’t plan on, or want, to turn. For example, looping over the creation of policies that manage port-channel protocols. We’ve all had that “I’m not supporting this thing… until I’m told I have to” moment. Mine was LACP modes: “It’s a new data center fabric and I’m doing things the right way, and port channels will only use LACP to minimize the risks from misconfiguration.” Then I had to deal with a device from a networking vendor who shall remain nameless that clearly understands the LACP protocol but only supports channel-group mode on. Adding the support was literally two lines of text in the play: one to create the policy, and one to select an LACP mode, defaulting to active if it’s not defined.
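A sketch of what such a loop might look like; the variable names and the lacpLagPol payload are illustrative rather than the actual role:

```yaml
# Loop over the port-channel policies in the definitions, defaulting the
# LACP mode to 'active' whenever the definition doesn't specify one.
- name: Ensure port-channel (LACP) policies exist
  aci_rest:
    host: "{{ apic_host }}"
    username: "{{ apic_user }}"
    password: "{{ apic_password }}"
    validate_certs: false
    path: /api/mo/uni/infra/lacplagp-{{ item.name }}.json
    method: post
    content:
      lacpLagPol:
        attributes:
          name: "{{ item.name }}"
          mode: "{{ item.mode | default('active') }}"
  loop: "{{ port_channel_policies }}"
  delegate_to: localhost
```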
The workflow is fairly straightforward. The first thing the play does is delete the definitions and supporting Python scripts from the play’s directory; it then pulls in a fresh copy from the master branch and runs a Python script to make sure the data is good before making any changes. Next, it runs through a lot of steps to configure every aspect of the fabric. Luckily, ACI’s APIs are idempotent, so I didn’t have to spend a lot of time making sure I wasn’t stepping on my own toes. When the play completes, a Python script is called to make sure our IPAM is updated and we have good documentation that comes solely from the source of truth.
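In playbook terms, the skeleton of that workflow might look roughly like the following; the paths, role names, and script names are placeholders rather than the actual repository layout:

```yaml
- name: Build the ACI fabric from the Git source of truth
  hosts: localhost
  gather_facts: false
  pre_tasks:
    - name: Remove the previous copy of the definitions
      file:
        path: "{{ playbook_dir }}/definitions"
        state: absent

    - name: Pull a fresh copy of the definitions from the master branch
      git:
        repo: "{{ definitions_repo }}"
        dest: "{{ playbook_dir }}/definitions"
        version: master

    - name: Validate the definitions before touching the fabric
      command: python3 validate_definitions.py definitions/
      changed_when: false

  roles:
    - fabric_access_policies
    - tenants
    - application_profiles

  post_tasks:
    - name: Update the IPAM (Netbox) from the same source of truth
      command: python3 update_netbox.py definitions/
```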
Development of the solution was fairly straightforward. One major con of ACI is that there is no Vagrant box or other virtualized platform, so everything was coded against a physical lab. Building the Ansible side was fairly easy. I was able to use the aci_rest module and ACI’s API Inspector to fill in functionality not provided by Ansible out of the box: it was a matter of using the inspector to capture the call as submitted through the GUI and filling in the values with variables. Data modeling and writing the Python script that validates the definitions took a lot of effort and is still the biggest time consumer. The upside is that the validation script gets used in two places. Besides being an early step in the play, I was able to reuse it as a critical step in the CI/CD pipeline, validating that commits and merges are syntactically correct so the merge approver doesn’t have to worry about whether the change is semantically sound. There is also less reservation about working in this manner, as people can be confident that their changes to the definitions are going to be error-free before anything is done.
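For example, a call captured with the API Inspector while creating a static VLAN pool in the GUI can be dropped into an aci_rest task almost verbatim, with the literal values swapped for variables. The payload below is indicative only; the exact path and class should always be copied from your own inspector output:

```yaml
- name: Ensure static VLAN pool exists
  aci_rest:
    host: "{{ apic_host }}"
    username: "{{ apic_user }}"
    password: "{{ apic_password }}"
    validate_certs: false
    path: /api/mo/uni/infra/vlanns-[{{ vlan_pool_name }}]-static.json
    method: post
    content:
      fvnsVlanInstP:
        attributes:
          name: "{{ vlan_pool_name }}"
          allocMode: static
  delegate_to: localhost
```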
Want to be able to do something similar? Join the Building Network Automation Solutions online course.
Thanks, we use a similar approach in our environment (we also use AWX for auditing and workflow visualization); however, the biggest challenge we face is variable preparation and the peer review process before committing variables to Git. I'd be particularly interested in how you overcame this challenge.
So far we've managed to use the Ansible "assert" module to peer-check for common object naming errors that deviate from our standards (regex statements), but it's obviously not enough. Second, you could have a staging network (a replica of prod) to ensure the change gets accepted by the APIC and delivers the desired outcome; however, there are still limitations depending on how big/complex your network is (it's hard to replicate routing, but perhaps it's easy to ensure an EPG name gets installed into the APIC).
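A minimal sketch of that kind of assert-based naming check (the EPG naming pattern and variable names here are made up for illustration):

```yaml
- name: Check that EPG names follow the naming standard
  assert:
    that:
      - item.name is match('^EPG_[A-Z0-9_]+$')
    fail_msg: "EPG name '{{ item.name }}' does not match the naming standard"
  loop: "{{ epgs }}"
```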