Beware: Ansible Reorders List Values in Loops
TL&DR: Ansible might decide to reorder list values in a loop parameter, resulting in unexpected order of execution and (in my case) totally borked device configuration.
A bit of background first: I’m using an Ansible playbook within netlab to deploy initial device configurations. Among other things, that playbook deploys configuration snippets for numerous configuration modules, and the order of deployment is absolutely crucial. For example, you cannot activate BGP neighbors in the Labeled Unicast (BGP-LU) address family (mpls module) before configuring the BGP neighbors (bgp module).
To make the ordered deployment of configuration snippets work, every host (Ansible term for managed device) has a list of modules in the module fact (Ansible term for variable) in its host_vars. For example, these are the values of the module fact for all devices in the BGP-LU lab:
pe1:
  module: [ ospf, bgp, mpls ]
pe2:
  module: [ ospf, bgp, mpls ]
p:
  module: [ ospf, mpls ]
rr:
  module: [ ospf, bgp, mpls ]
ce1:
  module: [ bgp, mpls ]
ce2:
  module: [ bgp, mpls ]
The module variable is used in an Ansible play to include a deploy module task list (which then includes device-and-module-specific tasks) for each module used by a network device:
- name: Deploy module-specific configurations
  hosts: all
  tasks:
  - include_tasks: "tasks/deploy-module.yml"
    tags: [ module,test ]
    loop: "{{ module | default([]) }}"
    loop_control:
      loop_var: config_module
    when: module is defined and (not(modlist is defined) or config_module in modlist)
The convoluted when condition is used to:
- Ensure the task is not executed for devices that do not have optional configuration modules
- Deploy a subset of configuration modules specified in an optional modlist fact (set with the -e CLI parameter).
Looking at that code, one would assume the modules will be deployed in the order they are listed in the module variable, right? Tough luck, this is what happens (tested with various Ansible versions between 2.9 and 5.5):
TASK [include_tasks] ***************************************************************************************************************
included: /home/pipi/net101/tools/netsim/ansible/tasks/deploy-module.yml for p, pe1, pe2, rr => (item=ospf)
included: /home/pipi/net101/tools/netsim/ansible/tasks/deploy-module.yml for p, pe1, pe2, rr, ce1 => (item=mpls)
included: /home/pipi/net101/tools/netsim/ansible/tasks/deploy-module.yml for pe1, pe2, rr, ce1 => (item=bgp)
The first device in the batch has module set to [ ospf, mpls ], and it looks like Ansible in its infinite optimization wisdom decides it’s OK to use the same order for all other devices in the same batch. Even though PE1 (for example) has module set to [ ospf, bgp, mpls ], the actual order of execution is ospf, mpls, bgp, and the BGP neighbors are never activated in the BGP-LU address family because the configuration snippets try to activate them before they are defined.
The only workaround I could find within Ansible was to set serial (batch size) to one [1] to deploy configurations on a single device at a time (so Ansible has nothing to optimize). It works, but it also makes lab deployment way slower than it should have been.
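A minimal sketch of that workaround, assuming the play shown earlier is otherwise unchanged. The serial keyword is a standard play-level setting; the exact play used in netlab may look different.

```yaml
# Sketch only: the original play with the batch size limited to one host
- name: Deploy module-specific configurations
  hosts: all
  serial: 1                      # one host per batch, so every batch has a single module list
  tasks:
  - include_tasks: "tasks/deploy-module.yml"
    tags: [ module,test ]
    loop: "{{ module | default([]) }}"
    loop_control:
      loop_var: config_module
    when: module is defined and (not(modlist is defined) or config_module in modlist)
```

With a single host per batch, the play runs to completion on one device before moving on to the next one, which is where the slowdown comes from.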
Maybe I made a wrong choice and shouldn’t use something that thinks a data structure is a programming language for any serious work, but as they say, the road to (automation) hell is paved with good intentions.
Simple Tasks Are Not Affected
I found it hard to believe that something so unexpected would not get noticed and fixed, so I did something similar with a simple debug task:
- hosts: all
  tasks:
  - debug:
      msg: "{{ item }} on {{ inventory_hostname }}"
    loop: "{{ module }}"
    when: module is defined
This time, the items did not get rearranged – the debugging messages were printed in the order the modules were listed in module lists. The interleaving of tasks across multiple devices was interesting, but within a single device the order was correct.
TASK [debug] **************************
ok: [p] => (item=ospf) =>
  msg: ospf on p
ok: [ce1] => (item=bgp) =>
  msg: bgp on ce1
ok: [pe2] => (item=ospf) =>
  msg: ospf on pe2
ok: [ce2] => (item=bgp) =>
  msg: bgp on ce2
ok: [pe1] => (item=ospf) =>
  msg: ospf on pe1
ok: [rr] => (item=ospf) =>
  msg: ospf on rr
ok: [p] => (item=mpls) =>
  msg: mpls on p
ok: [ce1] => (item=mpls) =>
  msg: mpls on ce1
ok: [rr] => (item=bgp) =>
  msg: bgp on rr
ok: [rr] => (item=mpls) =>
  msg: mpls on rr
ok: [pe2] => (item=bgp) =>
  msg: bgp on pe2
ok: [pe2] => (item=mpls) =>
  msg: mpls on pe2
ok: [ce2] => (item=mpls) =>
  msg: mpls on ce2
ok: [pe1] => (item=bgp) =>
  msg: bgp on pe1
ok: [pe1] => (item=mpls) =>
  msg: mpls on pe1
Conclusion: The weird rearranging behavior applies to include_tasks, but not to regular tasks.
But Wait, It Gets Worse
As the Ansible playbook I described above gets used from within netlab, it might be possible to work around the “Ansible optimization strategy 🤪”:
- Create a new group (modules) that will contain only the devices with configuration modules, eliminating the need for a complex when condition and default values. Use this new group (instead of all) in the Ansible play.
- Create a global list of modules (netlab_module) in the correct order and save it as a group variable to make sure all hosts get the same value.
- Iterate over the global list of modules and include the deploy module task list only for those modules that are needed by individual devices (when the loop variable is in the device's module list).
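As a rough illustration of the first two steps, the inventory data could look something like the sketch below. The file layout is an assumption (netlab generates its own inventory, which is not shown in this post); the host names and module lists come from the BGP-LU lab above, and the ospf, bgp, mpls order is inferred from the requirement that bgp is deployed before mpls.

```yaml
# Hypothetical YAML inventory fragment, not the inventory netlab actually generates
all:
  children:
    modules:                     # only devices that use configuration modules
      hosts:
        pe1: { module: [ ospf, bgp, mpls ] }
        pe2: { module: [ ospf, bgp, mpls ] }
        p:   { module: [ ospf, mpls ] }
        rr:  { module: [ ospf, bgp, mpls ] }
        ce1: { module: [ bgp, mpls ] }
        ce2: { module: [ bgp, mpls ] }
      vars:
        # Global deployment order, identical for every host in the group
        netlab_module: [ ospf, bgp, mpls ]
```

In this particular lab every device uses at least one module, so the modules group happens to contain all hosts.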
The next iteration of the Deploy module-specific configurations play was thus something along these lines:
- name: Deploy module-specific configurations
  hosts: modules
  strategy: "{{ netsim_strategy|default('linear') }}"
  tags: [ module,test ]
  tasks:
  - include_tasks: "tasks/deploy-module.yml"
    loop: "{{ netlab_module }}"
    loop_control:
      loop_var: config_module
    when: config_module in module
Guess what… it doesn’t work. The moment there’s a when condition in the include_tasks task, Ansible starts rearranging the loop iterations. In the end, there’s absolutely no difference between the original code (where we iterated over different lists for different devices) and the one above (where the list we iterate over is the same, but the task is not always executed).
The Final Workaround
In the end, I decided that the only possible result of fighting software windmills is damage to one’s sanity, and gave up. I moved the when condition into the included task list – the top-level includes are always executed, but then the tasks within the included task list might be skipped.
The details are in this commit.
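The commit is not reproduced here. The following is only a sketch of the idea, assuming the modules group and the netlab_module list described earlier. The two YAML documents stand for two separate files, and the debug task is a hypothetical placeholder for the real device-and-module-specific tasks.

```yaml
# File 1: the play. The include has no when condition, so it is executed
# for every host and every entry in netlab_module.
- name: Deploy module-specific configurations
  hosts: modules
  tasks:
  - include_tasks: "tasks/deploy-module.yml"
    loop: "{{ netlab_module }}"
    loop_control:
      loop_var: config_module
---
# File 2: tasks/deploy-module.yml (hypothetical content).
# The condition now lives inside the included task list; hosts that do not
# use the module skip the tasks instead of skipping the include.
- name: "Deploy {{ config_module }} configuration"
  when: config_module in module
  block:
  - debug:                       # placeholder for the real deployment tasks
      msg: "would deploy {{ config_module }} on {{ inventory_hostname }}"
```

Putting the condition on a block is one way to apply it to every task in the file at once; repeating the same when on each task would work as well.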
[1] One would expect the free strategy to work as well, but it doesn’t – it behaves in exactly the same way as the linear strategy.
You could use Ansible roles to specify dependencies. So MPLS is dependent on BGP, BGP is dependent on OSPF, OSPF is dependent on some kind of common module.
https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html#using-role-dependencies
Yeah, roles could be a solution for the dependency management, but in 90+% of the cases I'd just like to have a different template for each module, and I don't feel like adding the same task list to every role just to be able to deploy the templates.
Time to go back to the drawing board...
Dynamic inclusion during runtime in Ansible (include, include_tasks, include_role) is like large layer-2 domains/long-distance VM motion to me ;-): it should be avoided as much as possible and quite often is not actually necessary. Replacing dynamic inclusion with static imports (import_playbook, import_role, import_tasks) usually requires rearchitecting though.
PS: The use of include_vars is probably acceptable, although inventory/host_vars/group_vars should be preferred.
Of course you're right (and I love the large L2 domains analogy 🤣)
Going back to first principles, I could have created an Ansible playbook on the fly (after all, it's just a YAML data structure) and executed it, but I was naive enough to think I could push through the idea of using host_vars data structures to drive the flow of execution.
Hello Ivan,
I am surely missing something, but I was not able to reproduce your issue with the following data and playbooks. My Ansible (v 2.7 and 2.9) runs smoothly and includes the tasks in the expected order.
Your test is not exactly the same as Ivan's one. Look closely at the variable "module" in Ivan's example. In your example the "modules" variable of all hosts starts with [ ospf, bgp, mpls ]. In Ivan's example the order is different: [ ospf, mpls ] versus [ ospf, bgp, mpls ] versus [ bgp, mpls ]. Also not all hosts have ospf module.