… updated on Thursday, March 2, 2023 15:13 UTC
Measuring Virtual Network Device Boot Times
A senior engineer at Juniper Networks wasn’t happy with me mentioning resource hogs and Junos platforms in the same statement. Instead of engaging in never-ending angels-dancing-on-pins deliberations comparing the virtues of Junos with other network operating systems, I decided to throw a bit of real-life data into the mix: I created a simple script that measures:
- The time it takes to execute vagrant up to start a single network device.
- The time it takes to deploy simple initial configuration on that device.
Before going into the details, it’s worth acknowledging that the device boot time is not something most customers care deeply about[1], and thus not something that vendors would invest in. It’s just that I get annoyed every single time I have to go and make a sandwich while waiting for my lab to start.
Back to the facts. The following table contains the boot time as measured with netlab up --no-config (which effectively translates into vagrant up). The measured times obviously depend heavily on the underlying hardware, so take them with a grain of salt and consider the relative times (index).
| Device | SW release | Boot time | Index |
| Cisco IOS XR | 7.4.2 | 4:53.08 | 932 |
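The Index column is not defined explicitly in the text. A plausible reading, and it is only an assumption, is that the fastest device gets an index of 100 and every other device is scaled relative to it. Under that assumption, the following sketch converts the table's m:ss.ss notation into seconds and backs out the baseline implied by the Cisco IOS XR row. The function names are hypothetical.

```python
def parse_duration(text: str) -> float:
    """Convert a 'minutes:seconds' string such as '4:53.08' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def relative_index(duration: float, baseline: float) -> int:
    """Express a duration as a percentage of the baseline (baseline = 100)."""
    return round(duration / baseline * 100)


def implied_baseline(duration: float, index: int) -> float:
    """Back out the baseline duration from a measured time and its index."""
    return duration * 100 / index


# Values taken from the Cisco IOS XR row of the boot time table.
xr_boot = parse_duration("4:53.08")
baseline = implied_baseline(xr_boot, 932)

print(f"Cisco IOS XR boot time: {xr_boot:.2f} s")
print(f"Implied fastest boot time: {baseline:.1f} s")
```

If the assumption holds, the fastest device in the boot time measurement started in roughly half a minute.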
- All network devices apart from Cisco IOSv got two vCPU cores plus the recommended minimum amount of memory[2].
- The lab server I was using has 8 cores and 32GB of memory. Nothing else was running on it during the measurement process.
- vagrant up exits once it can log into a device with SSH. The boot time is thus the time from the moment the VM is started to the moment the SSH server accepts an incoming session.
Some devices also need a lot of time to figure out what to do with their interfaces: Cisco NX-OS took over five minutes (5:13.35) to boot when I started it with 32 Ethernet interfaces.
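For readers who want to take similar measurements by hand, here is a minimal sketch of timing a command from start to exit. It is not the measuring script used for this post; the helper names are hypothetical, and the stand-in command only exists so the example runs without a lab.

```python
import subprocess
import sys
import time


def time_command(command: list[str]) -> float:
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def format_duration(seconds: float) -> str:
    """Format seconds as m:ss.ss, the notation used in the tables."""
    # Work in whole hundredths of a second to avoid rounding up to ':60.00'.
    minutes, hundredths = divmod(round(seconds * 100), 6000)
    return f"{minutes}:{hundredths / 100:05.2f}"


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; in a lab directory you
    # would pass ["netlab", "up", "--no-config"] instead.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(format_duration(elapsed))
```

Because the command is timed until it exits, the result includes whatever overhead the tool adds on top of the device boot itself.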
But that’s not all. A network device has to be configured to be useful. The following table lists the time needed to deploy initial device configuration with netlab initial. That command starts an Ansible playbook; a few seconds of the configuration time might be consumed by Ansible, but obviously not more than ~4 seconds (the lowest configuration time).
| Device | SW release | Configuration time | Index |
| Cisco IOS XR | 7.4.2 | 0:24.01 | 555 |
- netlab initial deployed a minimal initial configuration, including a loopback interface and one physical interface with an IPv4 address.
- It looks like the SSH server on vSRX starts working way before the device is in a steady state. The configuration time dropped to 45 seconds when I inserted a 120-second delay between lab initialization and initial configuration.
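The vSRX observation suggests a two-step readiness check: wait until the SSH port answers, then pause for a settle period before pushing configuration. The sketch below illustrates that idea under loud assumptions: it only checks that a TCP connection succeeds (which is weaker than a successful SSH login), the function names are hypothetical, and the 120-second default simply mirrors the delay mentioned above.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 600.0,
                  interval: float = 1.0) -> float:
    """Block until a TCP connection to host:port succeeds.

    Returns the seconds waited; raises TimeoutError if the port never opens.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= timeout:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            time.sleep(interval)


def wait_until_settled(host: str, port: int = 22, settle: float = 120.0) -> None:
    """Wait for the port to open, then pause before configuring the device."""
    wait_for_port(host, port)
    # Assumption: a fixed pause stands in for "device reached a steady state".
    time.sleep(settle)
```

A fixed delay is crude; it trades a longer lab start for a shorter and more predictable configuration run.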
Containers Are Faster
Frustrated by my obnoxious opinions, the Juniper engineer suggested a workaround that should make Junos shine:
If you want to run only routing, Juniper cRPD would be a better choice. And will likely beat the pants off any type of cloud, including cumulus, in boot time.
I would love to try out his idea, but it looks like I’m too stupid to be able to download Juniper cRPD. First I was asked to contact customer care after logging in with my Juniper account, then I was asked to update my account. I did that, and was told that the compliance team has to look at my data. The next day my account stopped working. I tried to reset the password; the new password was accepted but didn’t work when I tried to log in. A few days later the new password still didn’t work and the password reset page produced a very helpful error message: "Invalid User Status. Please contact customer care for further assistance". At that point I gave up; if a vendor web portal team can’t get their act together, I have better things to do with my life.
Anyway, the last time I was able to test cRPD it had minimal data plane awareness, making it impossible to configure it with Ansible. That made it completely useless as a potential netlab network device.
It’s worth noting that all other container solutions I tried out have a configurable data plane, and can be configured in exactly the same way, using the same tools, as virtual machines or physical devices. While Arista’s implementation has a few quirks, the Cumulus Linux container works surprisingly well (although it cannot handle MLAG), and the FRR container managed to run MPLS and L3VPN out of the box.
Not surprisingly, the container start times are much lower than the VM start times. Here are the results for the three containers I have installed on my lab server:
| Device | SW release | Boot time | Configuration |
Somehow I doubt that cRPD (if I ever manage to download it) would beat the pants off the 1-second start time of the FRR or Cumulus Linux container.
Reproducing the Results
It’s trivial to reproduce the results if you disagree with my measurements:
- Install netlab
- Build Vagrant boxes for the networking devices you want to test
- Download the measuring script into an empty directory and execute it
If you just want to check the initial device configurations:
- Install netlab
- Download the topology file used by the measurement script into an empty directory
- Run netlab create -d <device> followed by netlab initial -o, and inspect the generated device configurations
- Documented drastic increase in boot time for Nexus OS VM with many interfaces.
Why do vendors still try to gatekeep their software so much? I still remember that in the early 2000s we were all expecting Juniper to be the liberator of the industry with their BSD-related JunOS. Finally, something whose workings you could somewhat explain to a sysadmin.
Anyway, I would be interested in how Nokia’s SR Linux stacks up. They’ve been making quite some buzz lately. I’m also interested in whether it makes a difference if you configure with Ansible through SSH or via YANG.
... because their marketing departments still live in the 1990s.
No idea how well Nokia SR Linux works, but they got the memo (and read it). You can download the container from a container registry and run it, and if you need help getting started use netlab (https://netsim-tools.readthedocs.io/en/latest/labs/clab.html)
Not that the exercise makes any sense to me in general; this is the kind of fake data science that got the field a bad reputation, but I wonder why you chose vSRX, which is a fully fledged security appliance, to compete against routers and switches?
"Not that the exercise makes any sense to me in general; this is the kind of fake data science that got the field a bad reputation"
You don't care about boot times, I do. That's perfectly OK, you could stop reading right there ;)
"but I wonder why you chose vSRX, which is a fully fledged security appliance, to compete against routers and switches"
Because it's the only Junos box one can download from their web site (assuming one's account eventually works, which is a totally different story) and run without having to figure out the plumbing between control-plane and data-plane VMs or how to start a VM with multiple disks in Vagrant.
As I wrote, please feel free to repeat the tests with vMX, vQFX, and some other VM (to establish the baseline) and post the results.