Follow-up: Nexus-OS Dropping Configuration Commands

Not long after I published the “let’s drop some configuration commands” rant, I got a very nice email from Nicolas Delecroix, Technical Marketing Engineer in Cisco INSBU, effectively saying “Would you have time for a short WebEx call to discuss the root cause of the problem and what we did to fix it?”

Of course I agreed and here’s what they told me:

  • On Linux-based platforms, the router configuration process usually runs as a regular process within a login shell, which means that your data has to pass through the SSH server, the kernel TTY driver (which makes the SSH connection appear as just another VT100 terminal), and finally the user process.
  • The bug had been sitting in NX-OS for years, but became more visible with the shift to a model-based device configuration architecture that added some delay in the configuration path.
  • They couldn’t upgrade the Linux kernel used by Nexus-OS (currently 3.4.91), but backported the bug fix into the TTY device driver used by Nexus-OS.
  • The fixed TTY driver will ship with Nexus-OS releases 7.0(3)I6(1) and 7.0(3)I4(7). Nicolas told me they’re aiming to ship both releases before the end of May.

Now that we know what the problem is, it’s easy to figure out the workarounds. They recommended:

  • Copy configuration file to the device and then use copy file running-config
  • Use NX-API

These two should also work:

  • Use scp file router:running-config
  • Use an expect script that waits for the prompt before sending the next command.
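
The last workaround can be sketched in a few lines of Python. This is only an illustration of the idea (send one command, wait for the prompt, then send the next one), not a tool Cisco or I recommended. The names send_paced, read_until_prompt and PROMPT_RE are made up for this example, the prompt pattern is an assumption about what an NX-OS prompt looks like, and send/recv stand for whatever your SSH library gives you to write to and read from an interactive session that is already sitting at a prompt.

```python
import re
from typing import Callable, Iterable, List

# Assumed prompt shape, e.g. b"switch(config)# "; adjust for your device.
PROMPT_RE = re.compile(rb"[\w.()/-]+[#>] ?$")


def read_until_prompt(recv: Callable[[int], bytes], prompt=PROMPT_RE) -> bytes:
    """Accumulate bytes from recv() until the buffer ends with a prompt."""
    buf = b""
    while not prompt.search(buf):
        chunk = recv(4096)
        if not chunk:
            raise ConnectionError("channel closed before the prompt appeared")
        buf += chunk
    return buf


def send_paced(
    commands: Iterable[str],
    send: Callable[[bytes], object],
    recv: Callable[[int], bytes],
    prompt=PROMPT_RE,
) -> List[bytes]:
    """Send commands one at a time, waiting for the prompt after each one.

    Only one line is ever in flight, so the TTY input buffer on the
    device never has to absorb a large paste.
    """
    replies = []
    for cmd in commands:
        send(cmd.encode() + b"\n")
        replies.append(read_until_prompt(recv, prompt))
    return replies
```

A real script would also want a timeout and a stricter prompt match (output that happens to end with # or > at a read boundary would fool this one), but the pacing logic stays the same.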

Of course I had to snoop around a bit and found that:

  • The bug is easy to reproduce in bash and has nothing to do with router configuration.
  • The bug causes large pastes (5K or more) to fail in any program that uses readline (the library that handles line editing) or anything similar, and is thus present on any server or network device running an affected Linux kernel.
  • Unless a device vendor backported the fix into the Linux TTY driver they’re using (it seems Ubuntu developers decided to do this as well), every device running an affected Linux kernel might experience the same behavior.
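
If you want to check your own boxes, numbered lines make the damage easy to spot. The sketch below is not the test I used; it is a minimal helper, with hypothetical function names, that builds a paste larger than the 5K mentioned above (the 8192-byte default is an arbitrary choice) and then tells you which lines did not survive intact in whatever you captured on the other side.

```python
from typing import List


def make_paste_payload(min_bytes: int = 8192) -> List[str]:
    """Build numbered lines whose total size (with newlines) is at least min_bytes."""
    lines: List[str] = []
    size = 0
    n = 0
    while size < min_bytes:
        line = f"echo line-{n:05d}"
        lines.append(line)
        size += len(line) + 1  # +1 for the newline that joins the lines
        n += 1
    return lines


def find_damaged(sent: List[str], received: List[str]) -> List[str]:
    """Return the sent lines that do not appear intact among the received lines."""
    seen = set(received)
    return [line for line in sent if line not in seen]
```

Paste "\n".join(make_paste_payload()) into the session you want to test, capture what actually arrived, and feed both lists to find_damaged. An empty result means nothing was dropped in that run.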

If you’re running a network device that runs on top of a Linux kernel, it’s relatively easy to get the kernel version: go into the shell, type uname -a… and let me know what you find out ;)

Finally, I’d like to thank Nicolas and the Cisco INSBU engineers again for an extremely professional approach to this problem.

8 comments:

  1. Thank you Ivan on following up on this. Good to see that Cisco is paying attention to the user community! This bug has bit me in the butt once or twice already, thankfully with no operational effect.
  2. This is very helpful. As always, thanks....
  3. Thank you Ivan for keeping us honest and also following through in publishing our response and how we fixed it!

    Jonathan, hopefully through our action, you can see we are continuing to be focused on the user community.

    Thank you.
  4. Just wondering how long this would have taken to fix going through the usual channels.
  5. Not having that problem with Cumulus.

    :~$ uname -r
    4.1.0-cl-5-amd64

  6. Let's see if I have this story straight:
    One frequently needs to make a choice about buffer sizes when coding.
    Someone made a choice of 4k for the kernel buffer allocated for reading from console.
    Someone who paid no attention to any of this facilitated the ability to easily "buffer overflow" the read from console buffer. (Used OS for special purpose devices)
    OK over 4 years ago the awareness of the buffer size of 4k grew per Ubuntu's records.
    Cisco chose to do nothing.
    This seems to me to be working as designed. An ID10T (ID Ten T problem).
    So to refer to it as a BUG is not nice and disrespects the work of those who have gone before us.
    Replies
    1. Shorter version
      labelling a limit as a bug makes it someone else’s responsibility instead of accepting responsibility for misusing a well-defined, robust, and documented resource.
    2. Something is dropping random characters received over a reliable (TCP) session. I call that a bug. So did everyone else - that's why they opened a bug report and fixed it. Putting a lipstick on this pig won't make it nicer.