Adjusting System State with Infrastructure as Code

This is the second blog post in the “thinking out loud while preparing the Network Infrastructure as Code presentation for the network automation course” series. If you stumbled upon it, you might want to start here.

An anonymous commenter on my previous blog post on the topic hit the crux of the infrastructure-as-code challenge when he wrote: “It's hard to do a declarative approach with Ansible and the nice network vendor APIs.” Let’s see what he was trying to tell us.

The goal of the infrastructure-as-code approach is to have a system in a state that’s defined by machine-readable (and hopefully human-readable) definition files. The $1M question is “How do we get the system into that state?”

Building from scratch. This is the easiest possible approach assuming you can use it. It’s been used forever by simple installation scripts; Docker aficionados do it every time they build a container image. It also works quite well in environments that don’t patch the servers but rebuild them from scratch. Assuming you’re willing to adjust your architecture and processes, you can even make it work for application deployment in fully-virtualized environments like public IaaS clouds.

You can use almost any scripting tool to build a system from scratch – all you need is something with minimal error detection and looping and branching capabilities (to make your scripts adjustable and/or readable). Bash and Ansible are perfect tools for the job.
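To make the build-from-scratch idea concrete, here is a minimal Python sketch that renders a complete configuration from a definition. The data model, its keys, and the configuration syntax are all made up for illustration; the point is only that looping, branching, and a bit of error detection are enough.

```python
# Hypothetical definition; keys and the generated syntax are illustrative only.
DEFINITION = {
    "hostname": "edge-1",
    "interfaces": [
        {"name": "eth0", "ip": "192.0.2.1/24"},
        {"name": "eth1", "ip": "198.51.100.1/24", "shutdown": True},
    ],
}


def render_config(definition):
    """Build the complete configuration text from the definition."""
    if "hostname" not in definition:  # minimal error detection
        raise ValueError("definition needs a hostname")
    lines = [f"hostname {definition['hostname']}"]
    for intf in definition.get("interfaces", []):  # looping
        lines.append(f"interface {intf['name']}")
        lines.append(f" ip address {intf['ip']}")
        if intf.get("shutdown"):  # branching
            lines.append(" shutdown")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_config(DEFINITION), end="")
```

Because the script never looks at the existing state of the system, it can only be used where starting from nothing is acceptable.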

Restarting the system. Most Linux services use a configuration file that specifies how that service should behave. The configuration file is effectively a build-from-scratch script written in a domain-specific language (DSL).

Implementing infrastructure-as-code for those services is trivial (like I wrote in the original blog post that so upset the above-mentioned commenter: these concepts are old)… but unfortunately, you can’t always restart a system.

However, you can get an amazing number of crazy things done if you did your design right: an ISP used this approach to automate their core network in the early 2000s.

Adjusting the system state. This is the hardest approach to implementing infrastructure-as-code: given a running system, and a definition file specifying the desired state of the system, execute actions that will bring the system into the desired state.
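The core of that problem can be sketched in a few lines of Python. This is a toy model, assuming the state is a simple mapping (here VLAN ID to VLAN name, chosen only for illustration); real systems also have to deal with ordering, dependencies between objects, and failures halfway through.

```python
def plan_actions(current, desired):
    """Return the list of actions that turns `current` into `desired`."""
    actions = []
    # Objects that exist but are not in the definition have to go.
    for key in sorted(current.keys() - desired.keys()):
        actions.append(("delete", key, None))
    # Objects in the definition that do not exist yet have to be created.
    for key in sorted(desired.keys() - current.keys()):
        actions.append(("create", key, desired[key]))
    # Objects present on both sides are touched only if they differ.
    for key in sorted(current.keys() & desired.keys()):
        if current[key] != desired[key]:
            actions.append(("update", key, desired[key]))
    return actions


def apply_actions(state, actions):
    """Apply the planned actions to a copy of `state` and return the copy."""
    new_state = dict(state)
    for verb, key, value in actions:
        if verb == "delete":
            del new_state[key]
        else:  # "create" and "update" both set the value
            new_state[key] = value
    return new_state


if __name__ == "__main__":
    running = {10: "users", 20: "voice", 30: "old-lab"}
    wanted = {10: "users", 20: "voip", 40: "guests"}
    for action in plan_actions(running, wanted):
        print(action)
```

Running the plan a second time against the adjusted state yields no actions, which is the property (idempotency) that the tools mentioned below try to give you.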

There are tons of tools out there that solve this problem, from environment-specific tools like Docker Compose or Amazon CloudFormation to generic frameworks like Terraform, Chef or Puppet.

Many modern Linux services can adjust their state on their own – all you have to do is to change the configuration and tell the service to adjust what it’s doing based on the new configuration file.
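A rough sketch of what such a service does internally might look like this. The `Service` class, the JSON file format, and the setting names are assumptions made for the example, not how any particular daemon works.

```python
import json


class Service:
    """Toy service that re-reads its configuration file on request."""

    def __init__(self, path):
        self.path = path
        self.settings = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        """Re-read the file and return the names of the settings that changed.

        The unused signum/frame parameters let the method double as a
        signal handler.
        """
        with open(self.path) as handle:
            new_settings = json.load(handle)
        all_keys = new_settings.keys() | self.settings.keys()
        changed = sorted(
            key for key in all_keys
            if new_settings.get(key) != self.settings.get(key)
        )
        # A real service would adjust only the affected parts here
        # (resize a worker pool, reopen log files, and so on).
        self.settings = new_settings
        return changed
```

On a POSIX system you could hook it up with `signal.signal(signal.SIGHUP, service.reload)`, which mirrors the common convention of sending SIGHUP to a daemon after editing its configuration file.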

Some network devices can do the same trick. The crudest form of system state adjustment is configure replace; Junos, IOS XR and Arista EOS offer a more granular approach using candidate configuration.
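The difference between the two is easier to see in a toy model. The Python class below is an in-memory illustration only; it does not use or imitate the API of any of the operating systems mentioned above, and the method names are my own.

```python
class Device:
    """Toy model of a device with a running and a candidate configuration."""

    def __init__(self, running):
        self.running = list(running)
        self.candidate = list(running)
        self.previous = None

    def load_replace(self, lines):
        """Replace the whole candidate; the running config is untouched."""
        self.candidate = list(lines)

    def compare(self):
        """Return (removed, added) lines between running and candidate."""
        removed = [line for line in self.running if line not in self.candidate]
        added = [line for line in self.candidate if line not in self.running]
        return removed, added

    def commit(self):
        """Make the candidate the running config and remember the old one."""
        self.previous = self.running
        self.running = list(self.candidate)

    def rollback(self):
        """Return to the configuration that was active before the last commit."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.running = self.previous
        self.candidate = list(self.running)
        self.previous = None
```

You hand the device the complete desired configuration, inspect the difference, and commit; working out which individual changes are needed is the device's problem, not yours.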

Assuming you have a networking device that implements configuration replacement at some reasonably good-enough level, there’s no need to reinvent the adjusting-system-state wheel on your own with tools like vendor APIs and Ansible (or, as I wrote in the past: don’t get obsessed with REST APIs). The problem has been solved; all you have to do is to:

  • Understand what problem you’re trying to solve;
  • Select the best tools for the job;
  • Solve the problem with minimum effort.

More Information

Latest blog posts in Network Infrastructure as Code series


  1. If it were that easy (your last three pieces of advice), we wouldn't know what to do with that much free time.
    1. ... and if you don't know where to start, you keep going around in circles and yammering ;)
    2. We keep going around in circles (spirals) anyway but that's a different story. Let's go back to the topic which you wrote out quite well.
  2. Your assumption ("implements configuration replacement at some reasonably good-enough level") in God's ear. In the end it's just a replace exercise on the device/orchestration system, and it doesn't matter which interface (REST, NETCONF, SSH etc.) someone uses. All that matters is the quality of the replace function. I see a "delete and create new" approach as problematic in the networking field.
    1. "All that matters is quality of the replace function." << I wrote this:

      to make people aware of what they might need. I don't think it saw much use... the root cause is probably that people buying the gear rarely have to make it work. I wonder if they buy their cars the same way ;)
  3. In the 3-step approach there should be a 4th step: in case you cannot work around the issue, ask the vendor for fixes & pray for a fix (most likely you'll learn that it's WAD: works as designed, although undocumented).

    I have 25+ years of experience in the networking field and I spent 10+ years automating systems where hundreds of networking devices are used. Our system has a pre-defined number of options (but many combinations are possible). Adjusting system state, even when following/selecting a pre-defined (certified) system state, is challenging. I cannot imagine that someone can simply 'adjust' state without testing it upfront (as we do) to increase the likelihood of success.

    I am not talking about trivial changes like adding a new VLAN & static routing. I'm talking about mission-critical voice & data networks (VPN, QoS, HA, subsecond failover, etc.).

    Regardless of the vendor you select, there are so many bugs in the software that, for this reason, the ideal 3-step approach must fail.
    It would work if the 'lego' blocks provided by vendors were well tested & documented (no research needed to understand the function) and had no hidden dependencies between features.
    Otherwise you need to test & certify all 'adjustments', particularly where timing is important (fast failure & recovery).
    Maybe I live in a different world, but I am working for one of the major system vendors and using top-brand networking devices.
