Testing Device Configuration Templates
Many network automation solutions generate device configurations from a data model and deploy those configurations. Last week, we focused on “how do we know the device data model is correct?” This time, we’ll go one step further and ask, “how do we know the device configurations work as expected?”
There are four (increasingly complex) questions our tests should answer:
- Are the device configuration templates working or crashing due to syntax errors or missing/misspelled variables?
- Are the device configurations generated by those templates valid? Would an actual network device accept them?
- Is our configuration deployment process working as expected?
- Do the device configurations do what we expect them to do?
Checking configuration templates seems easy:
- Create some sample device data models. While you could create them by hand, using the results of a data model transformation process is a much better approach.
- Run the device data models through configuration templates and report any errors.
While that approach catches most syntax errors¹, it might not catch misspelled variable names or errors in never-executed expressions. To ensure flawless configuration templates (from the template-language perspective), you must identify all potential code paths and create input data that triggers them all.
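Here’s a minimal sketch of such a check in Python, assuming Jinja2 templates (the file names and sample data models are made up for illustration). Rendering with StrictUndefined turns misspelled or missing variables into hard errors instead of silently producing empty strings, but it still only catches problems on the code paths the sample data actually exercises:

```python
# A minimal template-check sketch. File names and sample data models are
# hypothetical. StrictUndefined turns references to missing or misspelled
# variables into hard errors instead of silently rendering empty strings --
# but only on the code paths the sample data actually exercises.
from jinja2 import Environment, FileSystemLoader, StrictUndefined, TemplateError

env = Environment(
    loader=FileSystemLoader("templates"),
    undefined=StrictUndefined,
    trim_blocks=True,
    lstrip_blocks=True,
)

sample_models = {
    "r1": {"hostname": "r1", "interfaces": [{"ifname": "eth1", "ipv4": "10.0.0.1/30"}]},
    "r2": {"hostname": "r2", "interfaces": []},   # edge case: no interfaces at all
}

for node, data in sample_models.items():
    try:
        config = env.get_template("ospf.j2").render(**data)
        print(f"{node}: rendered {len(config.splitlines())} lines")
    except TemplateError as error:                # syntax errors, undefined variables...
        print(f"{node}: template FAILED -- {error}")
```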
We try to get close to that holy grail in netlab integration tests, which cover as many features of a configuration module (or a plugin) as possible. Admittedly, we don’t test all potential feature combinations, and we hope that the templates are simple enough that errors wouldn’t be triggered only by a convoluted combination of unrelated features².
Checking configuration validity is a bit harder. At the very minimum, you must start a network device instance (virtual machine or container), push the templated configuration to it, and check whether it complained.
The above task seems straightforward, assuming you have built an automated system that starts virtual machines or containers on demand (hint: check out netlab). However, the sad reality is that network devices sometimes complain so politely that it’s nearly impossible to catch the errors.
For example, when trying to mix network statements in an OSPF routing process with ip ospf area interface configuration commands in FRRouting, the configuration interface says “I’m sorry, I’m afraid I can’t do that” in a manner that does not resemble a hard error at all:

```
x(config)# router ospf
x(config-router)# network 0.0.0.0/0 area 0
Please remove all ip ospf area x.x.x.x commands first.
```
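If you want to automate the brute-force part of that check, a sketch along these lines might help (it uses Netmiko; the device parameters, file paths, and error patterns are placeholders, and the obvious catch is that you have to know every polite phrasing in advance):

```python
# Push a templated configuration and scan the session output for complaints.
# Device parameters, file paths, and error patterns are illustrative only.
import re
from netmiko import ConnectHandler

ERROR_PATTERNS = [
    r"% ?invalid input",           # classic IOS-style rejection
    r"% ?ambiguous command",
    r"please remove .* first",     # the polite FRRouting refusal shown above
    r"syntax error",
]

device = {
    "device_type": "cisco_ios",    # assumption: an IOS-like device under test
    "host": "192.0.2.1",
    "username": "admin",
    "password": "admin",
}

config_lines = open("configs/r1.cfg").read().splitlines()

with ConnectHandler(**device) as conn:
    output = conn.send_config_set(config_lines)

complaints = [p for p in ERROR_PATTERNS if re.search(p, output, re.IGNORECASE)]
if complaints:
    print("Device complained:", complaints)
else:
    print("No complaints found (which, sadly, is not proof of success)")
```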
You could try to weasel out of this corner by configuring the device and comparing its running configuration with what you tried to configure. Congratulations, you just opened another massive can of worms: default settings that do not appear in device configurations and that might change across software versions.
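The comparison itself is a few lines of Python (the file paths below are placeholders); the trouble is that every default setting the device inserted on its own shows up as a difference, and those defaults like to change between releases:

```python
# Naive intended-vs-running comparison: every default the device added on its
# own shows up as an extra line, and those defaults change across versions.
import difflib

intended = open("configs/r1.cfg").read().splitlines()
running = open("collected/r1-running.cfg").read().splitlines()   # however you collected it

for line in difflib.unified_diff(intended, running, "intended", "running", lineterm=""):
    print(line)
```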
Compared to the above, deploying device configurations seems like a piece of cake until you encounter yet another Ansible quirk³ and either force Ansible to deploy what you want (because you know better) or create a complex choreography that persuades Ansible to deploy what you need. Anyhow, moving on…
The true test of your configuration generation and deployment code is “Does it work in a real network?” That’s hard⁴; here’s what worked for us when developing netlab integration tests:
- Try to test individual features, not a humongous mess. Fortunately, most netlab configuration modules work in a way that aligns well with this approach.
- When dealing with a complex spaghetti mess of features (for example, VXLAN IRB scenario running OSPF in a VRF), build a series of test cases starting with simple tests (example: bridging-over-VXLAN) and slowly adding complexity. This approach will help you spot easy-to-fix errors before troubleshooting complex setups.
- If possible, test control-plane features, not just end-to-end reachability. Inspecting routing protocol results might identify hidden errors⁵ that would not impact packet forwarding.
- Try to test the individual components that must work for the end-to-end test to succeed, not just the final result.
Let me give you an example of the last recommendation.
One of the initial device configuration tests is configuring addresses on interfaces. It’s as simple as it can get:
- Attach two hosts to two interfaces of the device under test.
- Configure IPv4 and IPv6 addresses on those two interfaces.
- Check whether the hosts can ping one another.
Multiple devices failed to provide IPv6 connectivity between the hosts. When I investigated those failures, I found that the hosts did not have an IPv6 default route. The device configuration templates failed to configure IPv6 Router Advertisements on the IPv6-enabled interfaces.
After realizing that, I restructured the tests performed in this simple scenario from a single “ping over IPv4 and IPv6” check into three steps:
- Ping over IPv4. This should never fail unless we’re dealing with a device with a broken data plane.
- Check for the presence of the IPv6 default route on both hosts⁶. This one caused the most failures (all are fixed now).
- Ping over IPv6. This should not fail if the hosts have an IPv6 default route.
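Here’s a sketch of that staged validation, assuming you can execute commands on the attached Linux hosts with netlab connect (the host names, addresses, and the on_host helper are placeholders; the actual netlab integration tests use validation plugins as described in the footnotes):

```python
# Staged validation: fail on the simplest broken building block first.
# Host names, addresses, and the on_host helper are hypothetical; netlab
# integration tests implement these checks with validation plugins instead.
import subprocess

def on_host(host: str, cmd: str) -> str:
    """Run a command on a lab host (here via 'netlab connect') and return stdout."""
    result = subprocess.run(
        ["netlab", "connect", host, *cmd.split()],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

def validate(host: str, peer_v4: str, peer_v6: str) -> None:
    # 1. IPv4 ping: should only fail if the data plane is thoroughly broken
    #    (the leading space avoids matching '100% packet loss')
    assert " 0% packet loss" in on_host(host, f"ping -c 3 {peer_v4}"), "IPv4 ping failed"
    # 2. IPv6 default route: catches missing Router Advertisements
    assert on_host(host, "ip -6 route list default").strip(), "no IPv6 default route"
    # 3. IPv6 ping: should work once the default route is present
    assert " 0% packet loss" in on_host(host, f"ping -6 -c 3 {peer_v6}"), "IPv6 ping failed"

validate("h1", "172.16.0.2", "2001:db8:1::2")
```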
Want to do something similar to test your network automation solution? Please feel free to use netlab integration tests as a starting point. As of late May 2024, we have over a hundred integration tests covering initial device configurations, BGP, DHCP, IS-IS, OSPFv2, OSPFv3, VLANs, VRFs, and VXLAN, with test results available online for the development branch and the latest release.
1. Assuming the tool you’re using parses the whole template before trying to execute it
2. As always, there’s a bit of a gap between wishful thinking and reality, but I’m digressing.
3. Sadly, Ansible remains the only tool with configuration deployment modules for a vast variety of platforms. If you work with a small number of platforms, check out NAPALM, Nornir, or Scrapli.
4. Trust me™, I’ve been there.
5. For example, an incorrect prefix length on loopback prefixes advertised in OSPFv3.
6. Our integration tests use validation plugins; for example, the default6() function executes the ip -6 route list default command and checks that its output is not empty.
I also recommend the j2lint Python package for Jinja2 template testing. It helps you enforce consistent, best-practice formatting across all your templates. It’s more of a format linter (like black), so it’s not a must, but it’s still super useful. Arista developed it to help with their Jinja2-template-heavy AVD Ansible collection.