Follow-up: Nexus-OS Dropping Configuration Commands

Monday, May 8, 2017 09:30 +0200

Not long after I published the “let’s drop some configuration commands” rant, I got a very nice email from Nicolas Delecroix, Technical Marketing Engineer in Cisco INSBU, effectively saying “Would you have time for a short WebEx call to discuss the root cause of the problem and what we did to fix it?”

Of course I agreed and here’s what they told me:

They were able to reproduce the problem;
The drops were caused by a very old bug in the Linux TTY device driver, introduced in 2009, discovered in Ubuntu ~4 years ago, and present in all Linux distributions with kernels between 2.6.31 and 3.11.0.

On Linux-based platforms the router configuration process is usually run as a regular process within a login shell, which means that your data has to pass through the SSH server, the kernel TTY driver (which makes the SSH connection appear as just another VT100 terminal), and finally the user process.

The bug was sitting in NX-OS for years, but became more visible due to the shift to a model-based device configuration architecture that added some delay in the configuration path.
They couldn’t upgrade the Linux kernel used by Nexus-OS (currently 3.4.91), but backported the bug fix into the TTY device driver used by Nexus-OS.
The fixed TTY driver will ship with Nexus OS releases 7.0(3)I6(1) and 7.0(3)I4(7). Nicolas told me they’re aiming to ship both releases before the end of May.
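
To make the data path described above a bit more tangible, here is a minimal Python sketch, assuming a Linux (or other Unix-like) host. It does not try to reproduce the bug. It only shows the same plumbing in miniature: bytes written to the master side of a pseudo-terminal (roughly what the SSH server does) pass through the kernel TTY layer before a process reading the slave side (roughly what the CLI process does) can see them. The function name `through_pty` is made up for this illustration.

```python
import os


def through_pty(data: bytes) -> bytes:
    """Write bytes into the master side of a pseudo-terminal and read
    them back from the slave side, the way a CLI process would.

    The data must end with a newline: a fresh pty is in canonical
    (line-at-a-time) mode, so the read only returns complete lines.
    """
    master_fd, slave_fd = os.openpty()
    try:
        os.write(master_fd, data)        # the "SSH server" side
        return os.read(slave_fd, 4096)   # the "CLI process" side
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

Calling `through_pty(b"hostname demo\n")` hands back the same line, but only after the kernel TTY layer has had its say, which is exactly where the dropped characters were coming from.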

Now that we know what the problem is, it’s easy to figure out the workarounds. They recommended:

Copy configuration file to the device and then use copy file running-config
Use NX-API

These two should also work:

Use scp file router:running-config
Use an expect script that waits for prompt before sending the next command.
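
The idea behind the expect-style workaround can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a ready-made tool: `send_paced`, `write` and `read` are hypothetical names, and the two callables stand in for whatever transport you use (for example thin wrappers around an SSH channel). The default prompt suffix `"# "` is an assumption as well.

```python
from typing import Callable, Iterable, List


def send_paced(
    commands: Iterable[str],
    write: Callable[[str], None],
    read: Callable[[], str],
    prompt: str = "# ",
    max_reads: int = 1000,
) -> List[str]:
    """Send one command at a time and wait for the prompt after each one.

    `write` and `read` are hypothetical transport hooks supplied by the
    caller. Returns the text collected after each command.
    """
    outputs = []
    for command in commands:
        write(command + "\n")
        buffer = ""
        for _ in range(max_reads):
            buffer += read()
            if buffer.endswith(prompt):
                break  # the device is ready for the next command
        else:
            raise TimeoutError(f"no prompt after {command!r}")
        outputs.append(buffer)
    return outputs
```

Because the next command is sent only after the prompt comes back, the device never sees more than one line of unread input at a time, so a large paste never piles up in the TTY buffer. A real script would also want a proper time-based timeout instead of a read counter.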

Of course I had to snoop around a bit and found that:

The bug is easy to reproduce in bash and has nothing to do with router configuration.
The bug causes large pastes (5K or more) to fail in any program that uses readline (the library that handles line editing) or anything similar, and is thus present on any server or network device running Linux with an affected Linux kernel.
Unless a device vendor backported the fix into the Linux TTY driver they’re using (it seems Ubuntu developers decided to do this as well), every device running an affected Linux kernel might experience the same behavior.

If you’re running a network device that runs on top of a Linux kernel, it’s relatively easy to get the kernel version: go into the shell, type uname -a ... and let me know what you find out ;)
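
If you'd rather not compare version numbers by eye, a small Python sketch along these lines could do it for you. It takes the release string printed by uname -r and checks it against the 2.6.31 to 3.11.0 range quoted above. Treating both ends of the range as inclusive is an assumption, and the function names are made up for this example.

```python
import platform
import re

# Range quoted in the post; inclusive bounds are an assumption.
AFFECTED_MIN = (2, 6, 31)
AFFECTED_MAX = (3, 11, 0)


def parse_kernel_release(release: str) -> tuple:
    """Turn a release string such as '3.4.91' or '4.1.0-cl-5-amd64'
    into a (major, minor, patch) tuple, ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def in_affected_range(release: str) -> bool:
    """True if the kernel version falls inside the quoted range.

    This says nothing about vendor backports: a device in the range
    may already carry the fixed TTY driver.
    """
    return AFFECTED_MIN <= parse_kernel_release(release) <= AFFECTED_MAX


def local_kernel_in_affected_range() -> bool:
    """Check the kernel of the machine this script runs on."""
    return in_affected_range(platform.release())
```

For example, `in_affected_range("3.4.91")` flags the kernel version Nexus-OS was using at the time, but the result only tells you the version is in the range, not whether the vendor has already patched the driver.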

Finally, I’d like to thank Nicolas and the Cisco INSBU engineers again for an extremely professional approach to this problem.

automation

8 comments:

Unknown 08 May 2017 13:37

Thank you Ivan for following up on this. Good to see that Cisco is paying attention to the user community! This bug has bit me in the butt once or twice already, thankfully with no operational effect.

jsicuran 08 May 2017 15:28

This is very helpful. As always, thanks....

Murali 08 May 2017 19:47

Thank you Ivan for keeping us honest and also following through in publishing our response and how we fixed it!

Jonathan, hopefully through our action, you can see we are continuing to be focused on the user community.

Thank you.

Anonymous 10 May 2017 02:47

Just wondering how long this would have taken to fix going through the usual channels.

DixieWrecked 11 May 2017 21:12

Not having that problem with Cumulus.

:~$ uname -r
4.1.0-cl-5-amd64

Anonymous 28 November 2017 17:13

Let's see if I have this story straight:
One frequently needs to make a choice about buffer sizes when coding.
Someone made a choice of 4k for the kernel buffer allocated for reading from console.
Someone who paid no attention to any of this facilitated the ability to easily "buffer overflow" the read from console buffer. (Used OS for special purpose devices)
OK over 4 years ago the awareness of the buffer size of 4k grew per Ubuntu's records.
Cisco chose to do nothing.
This seems to me to be working as designed: an ID10T (ID Ten T) problem.
So to refer to it as a BUG is not nice and disrespects the work of those who have gone before us.

Anonymous 28 November 2017 18:22

Shorter version
Labelling a limit as a bug makes it someone else’s responsibility instead of accepting responsibility for misusing a well-defined, robust, and documented resource.

Ivan Pepelnjak 28 November 2017 20:50

Something is dropping random characters received over a reliable (TCP) session. I call that a bug. So did everyone else - that's why they opened a bug report and fixed it. Putting lipstick on this pig won't make it nicer.
