How Many Lab Devices Can netlab Handle?
TL&DR: Over 3000
A few weeks ago, Christian opened an issue describing how netlab breaks when the lab topology has more than 250 devices. We fixed that, only to get into another morass: some of the code has worse-than-linear complexity, that is, higher than O(n) (meaning that going from 100 to 200 devices makes things more than twice as slow). Christian is working on one of those problems at the moment (his ginormous labs do start, it just takes a long time), and I decided it’s time to polish a few other bits of the code.
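To make the complexity remark concrete, here is a small Python sketch. It is not netlab code; `count_pair_checks` is a hypothetical stand-in for any step that compares every device with every other device, which is one common way to end up above O(n).

```python
def count_pair_checks(devices: int) -> int:
    """Count the comparisons made when every device is checked against every other one."""
    checks = 0
    for a in range(devices):
        for b in range(a + 1, devices):
            checks += 1  # one comparison per unordered device pair
    return checks


small = count_pair_checks(100)  # 100 * 99 / 2 = 4950 comparisons
large = count_pair_checks(200)  # 200 * 199 / 2 = 19900 comparisons

# Doubling the device count roughly quadruples the work instead of doubling it.
print(small, large, round(large / small, 2))
```

With a step like this, a lab that is ten times larger needs roughly a hundred times as many comparisons, which is why such code only becomes noticeable in very large topologies.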
Another annoying problem Christian encountered was the slow execution of netlab commands after the lab was started. For example, netlab connect (which is nothing more than a wrapper around the ssh or docker exec command) took 14 seconds to connect to a lab device. The culprit was the netlab snapshot file, which was stored in YAML format; parsing the file describing a 400-device lab took around 15 seconds on Christian’s server.
I know YAML parsing is slow, so I wanted to change the encoding of the snapshot file to JSON. Fortunately, Stefano Sasso suggested pickling the transformed data. I got that code working, and it reduced the time needed to read the transformed data (and recreate the objects) from over five seconds to less than half a second.
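The pickling idea can be sketched with nothing but the Python standard library. This is a minimal illustration, not the actual netlab implementation: the `build_lab`, `save_snapshot`, and `load_snapshot` functions, the data layout, and the `lab.pickle` file name are all assumptions made up for this example.

```python
import pickle
import tempfile
from pathlib import Path


def build_lab(devices: int) -> dict:
    """Build a made-up, already-transformed lab description (not netlab's real data model)."""
    return {
        f"r{i}": {
            "id": i,
            "mgmt_ip": f"10.0.{i // 256}.{i % 256}",
            "interfaces": [f"eth{j}" for j in range(1, 4)],
        }
        for i in range(1, devices + 1)
    }


def save_snapshot(lab: dict, path: Path) -> None:
    """Serialize the lab data to a binary pickle file."""
    with path.open("wb") as snapshot:
        pickle.dump(lab, snapshot, protocol=pickle.HIGHEST_PROTOCOL)


def load_snapshot(path: Path) -> dict:
    """Read the lab data back. Only unpickle files you created yourself."""
    with path.open("rb") as snapshot:
        return pickle.load(snapshot)


lab = build_lab(400)
with tempfile.TemporaryDirectory() as tmp:
    snapshot_file = Path(tmp) / "lab.pickle"  # hypothetical file name
    save_snapshot(lab, snapshot_file)
    restored = load_snapshot(snapshot_file)

print(len(restored), restored["r400"]["mgmt_ip"])
```

The pickle format stores Python objects directly, so loading it skips the text parsing step and can also restore instances of custom classes, as long as those classes can be imported when the file is read. The trade-off is that a pickle file is not human-readable and must never be loaded from an untrusted source.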
In the meantime, Christian reported having a running lab with over 3000 FRR containers, while Seb has one with ~60 virtual machines.
I don’t think anyone has built a larger lab so far, and when we ship the fixed code (release 25.09 will be out in a week or two), the limiting factor for your labs will be the amount of RAM and CPU you have in your server, not netlab.