Follow-up: Nexus-OS Dropping Configuration Commands
Not long after I published the let’s drop some configuration commands rant I got a very nice email from Nicolas Delecroix, Technical Marketing Engineer in Cisco INSBU, effectively saying “Would you have time for a short WebEx call to discuss the root cause of the problem and what we did to fix it?”
Of course I agreed and here’s what they told me:
- They were able to reproduce the problem.
- The drops were caused by a very old bug in the Linux TTY device driver, introduced in 2009, discovered in Ubuntu ~4 years ago, and present in all Linux distributions with kernels between 2.6.31 and 3.11.0.
- The bug had been sitting in NX-OS for years, but became more visible due to the shift to a model-based device configuration architecture that added some delay in the configuration path.
- They couldn’t upgrade the Linux kernel used by Nexus-OS (currently 3.4.91), but backported the bug fix into the TTY device driver used by Nexus-OS.
- The fixed TTY driver will ship with Nexus-OS releases 7.0(3)I6(1) and 7.0(3)I4(7). Nicolas told me they’re aiming to ship both releases before the end of May.
Now that we know what the problem is, it’s easy to figure out the workarounds. They recommended:
- Copy the configuration file to the device and then use copy file running-config
- Use NX-API
These two should also work:
- Use scp file router:running-config
- Use an expect script that waits for prompt before sending the next command.
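The last workaround can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not anyone's actual tool: `send_line` and `read_chunk` are hypothetical callables standing in for whatever transport you use (SSH channel, serial console), and the prompt regular expression is a guess at a typical prompt shape that you would have to adapt.

```python
import re
from typing import Callable, Iterable, List

# Assumed prompt shape: hostname, optional (config...) mode, trailing '#'.
PROMPT = re.compile(r"[\w.-]+(\([\w-]+\))?#\s*$")


def paced_send(
    lines: Iterable[str],
    send_line: Callable[[str], None],
    read_chunk: Callable[[], str],
    prompt: re.Pattern = PROMPT,
    max_reads: int = 1000,
) -> List[str]:
    """Send one command at a time, waiting for the prompt after each.

    send_line and read_chunk are hypothetical transport hooks; read_chunk
    returns whatever output has arrived since the previous call.
    Returns the output collected for each command.
    """
    outputs = []
    for line in lines:
        send_line(line)
        buf = ""
        for _ in range(max_reads):
            buf += read_chunk()
            # Only move on once the device shows it is ready for more input.
            if prompt.search(buf):
                break
        else:
            raise TimeoutError(f"no prompt after {line!r}")
        outputs.append(buf)
    return outputs
```

Because the next command is sent only after the prompt reappears, the device never receives a large burst of input, which is the condition that triggers the drops.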
Of course I had to snoop around a bit and found that:
- The bug is easy to reproduce in bash and has nothing to do with router configuration.
- The bug causes large pastes (5K or more) to fail in any program that uses readline (the library that handles line editing) or anything similar, and is thus present on any server or network device running an affected Linux kernel.
- Unless a device vendor backported the fix into the Linux TTY driver they’re using (it seems Ubuntu developers decided to do this as well), every device running an affected Linux kernel might experience the same behavior.
If you’re running a network device that runs on top of a Linux kernel, it’s relatively easy to get the kernel version: go into the shell, type uname -a… and let me know what you find out ;)
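To compare what uname reports against the range quoted above, something like the following Python sketch could help. The function names are made up for illustration, and treating both ends of the 2.6.31 to 3.11.0 range as inclusive is an assumption. The check looks only at the version number, so it cannot tell whether a vendor has backported the fix.

```python
import re
from typing import Tuple

# Range quoted in the post; treating both ends as inclusive is an assumption.
FIRST_AFFECTED = (2, 6, 31)
LAST_AFFECTED = (3, 11, 0)


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (e.g. "3.12") is treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def in_affected_range(release: str) -> bool:
    """True if the version number falls in the range quoted in the post.

    This only compares version numbers; it cannot detect a backported fix.
    """
    return FIRST_AFFECTED <= parse_kernel_release(release) <= LAST_AFFECTED
```

With these assumptions, the 3.4.91 kernel mentioned above falls inside the range, so on such a device the answer depends entirely on whether the vendor patched the TTY driver.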
Finally, I’d like to thank Nicolas and the Cisco INSBU engineers again for an extremely professional approach to this problem.
Jonathan, hopefully through our action, you can see we are continuing to be focused on the user community.
Thank you.
:~$ uname -r
4.1.0-cl-5-amd64
One frequently needs to make a choice about buffer sizes when coding.
Someone made a choice of 4k for the kernel buffer allocated for reading from console.
Someone who paid no attention to any of this made it easy to "buffer overflow" the read-from-console buffer. (Used OS for special-purpose devices)
OK, over 4 years ago awareness of the 4k buffer size grew, per Ubuntu's records.
Cisco chose to do nothing.
This seems to me to be working as designed: an ID10T (ID Ten T) problem.
So to refer to it as a BUG is not nice and disrespects the work of those who have gone before us.
Labelling a limit as a bug makes it someone else’s responsibility instead of accepting responsibility for misusing a well-defined, robust, and documented resource.