Let’s Drop Some Random Commands, Shall We?

One of my readers sent me a link to CCO documentation containing (at that time) this gem:

Beginning with Cisco NX-OS Release 7.0(3)I2(1), Cisco Nexus 9000 Series switches handle the CLI configuration actions in a different way than before the introduction of NX-API and DME. The NX-API and DME architecture introduces a delay in the communication between Cisco Nexus 9000 Series switches and the end host terminal sessions, for example SSH terminal sessions.

So far so good. We can probably tolerate some delay. However, the next sentence is a killer…

2017-05-08: The behavior is caused by an old bug in the Linux TTY driver. Fixed NX-OS versions are planned to ship in late May 2017. More details here.

2017-04-05: The wonderful information disappeared from Cisco's documentation within 24 hours with no explanation whatsoever. However, I expected that and took a snapshot of that page before publishing the blog post ;)

This delay causes the configuration lines to be dropped randomly when pasting the configurations to the switches. In most cases, the severity of the issue is directly proportional to the length of the configurations that are pasted into the terminal sessions. For example, pasting an ACL with greater than 600 lines often results in more lines getting dropped than pasting an ACL with only 100 lines.

Wait, WHAT? Your latest software release is randomly dropping configuration commands and you find it appropriate to document the behavior in some obscure section of the documentation instead of fixing it? What happened to the company I liked to work with for decades? This approach literally makes me sick.

I can’t possibly fathom how someone could get the idea that it’s perfectly fine to take commands received over a reliable communication channel (SSH sessions run over TCP the last time I checked) and randomly drop a few of them for convenience reasons. Would it be so hard to wait for the previous command to finish and then read the next line from the TCP buffer? Or use NX-API internally to execute CLI commands if that’s the only reliable way to talk to the box?
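To make the "wait for the previous command" argument concrete, here's a toy Python model. It's hypothetical and says nothing about how NX-OS or the Linux TTY driver actually work inside: it contrasts a reader that throws input away when its buffer is full with one that applies backpressure and waits until the previous command has been consumed. The function name `run_device`, the buffer size, and the 1 ms per-command processing time are made-up values for illustration.

```python
import queue
import threading
import time


def run_device(commands, blocking, capacity=4):
    """Hypothetical model: CLI lines go into a bounded input buffer that a
    slow executor drains. Not a model of any real NX-OS internals."""
    buf = queue.Queue(maxsize=capacity)
    executed = []
    done = object()  # sentinel marking the end of input

    def executor():
        while True:
            cmd = buf.get()
            if cmd is done:
                return
            time.sleep(0.001)  # assumed (made-up) per-command processing time
            executed.append(cmd)

    worker = threading.Thread(target=executor)
    worker.start()
    for cmd in commands:
        if blocking:
            buf.put(cmd)  # wait until there's room in the buffer: nothing is lost
        else:
            try:
                buf.put_nowait(cmd)
            except queue.Full:
                pass  # buffer full: the line is silently dropped
    buf.put(done)  # always a blocking put, so the sentinel gets through
    worker.join()
    return executed


config = [f"permit ip 10.0.{i}.0/24 any" for i in range(100)]
print(len(run_device(config, blocking=True)))   # all 100 lines
print(len(run_device(config, blocking=False)))  # usually far fewer
```

In the blocking variant the sender simply stalls while the buffer is full. Over SSH the same stall propagates back to whoever is pasting through TCP flow control, which is exactly why a device never needs to drop anything.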

Not only does this make any CLI-based automation totally unreliable (not that it ever was completely reliable); as the documentation succinctly explains, even cut-and-paste is no longer guaranteed to work. The only “reliable” mechanism might be scp file device:running-config, unless they broke that one as well.
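If you're stuck pushing configuration through the CLI anyway, it's worth checking that every line actually made it. Below is a minimal Python sketch that assumes you've already captured the intended configuration and the device's running configuration as plain text (how you collect them is out of scope). It also assumes ACL entries might show up in the running configuration with a leading sequence number you never typed. The helper names are hypothetical. The exact-string comparison produces false positives whenever the device rewrites a command (abbreviations, defaults), and it can't catch a dropped line that also appears elsewhere in the configuration (for example, `no shutdown` under another interface).

```python
import re

# Assumption: ACL entries may be displayed with a leading sequence number
# (e.g. "10 permit ip any any") that wasn't part of the pasted text.
_SEQ = re.compile(r"^\d+\s+")


def _normalize(line: str) -> str:
    """Strip surrounding whitespace and any leading sequence number."""
    return _SEQ.sub("", line.strip())


def missing_lines(intended: str, running: str) -> list[str]:
    """Return intended lines that don't appear anywhere in the running config."""
    present = {_normalize(l) for l in running.splitlines() if l.strip()}
    missing = []
    for line in intended.splitlines():
        norm = _normalize(line)
        if norm and norm not in present:
            missing.append(norm)
    return missing


intended = """ip access-list demo
  permit ip 10.0.0.0/8 any
  permit ip 192.168.0.0/16 any
  deny ip any any"""
running = """ip access-list demo
  10 permit ip 10.0.0.0/8 any
  20 deny ip any any"""
print(missing_lines(intended, running))  # ['permit ip 192.168.0.0/16 any']
```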

On a totally unrelated note, I had to switch from NX-API to CLI during my Ansible for Networking Engineers webinar because NX-API got less reliable with every software update, returning random 404 (page not found) errors. Admittedly I was running an NX-OS image in VIRL, but I got similar reports from engineers running real-life networks.

Even though my calendar claims it’s 2017, it seems like I’ll have to add another line to the Network Automation RFP Requirement: the device should not drop random commands received over any management-plane communication channel. Being big-time into Model Driven Manageability doesn’t help much if you can’t get the fundamentals right.


21 comments:

  1. Might be a "not cleared up" early April Fools' hoax?

    //Updated:Mar 31, 2017

    Greetings, Matthew
  2. Good post and good to know about 9k. The SCP approach is spot on. I had some projects with a similar issue on other platforms, with certain limits/bugs with cut/paste. So I would upload my config snippet files and use alias commands or EEM applets to add or remove configuration items from the running config.
  3. I find it annoying enough if platforms have issues keeping up when copy-pasting over serial console (some 1RU Cisco switches like to do that), but over SSH? This is madness. Sheer and utter madness.

    But then, it's NX-BU, not "the company you liked to work with". *That* is probably outsourced to nowhereland today, while the rest of the company enjoys the BU infighting.
  4. Cisco 9k isn't the only platform that suffers from this. I see this all the time with multiple vendors...enough so that I've tuned my terminal app to paste much more slowly than it's capable of doing.
  5. So happy to run a real Linux-based network OS when I read stuff like this. As even Ivan points out, though, CLI was always prone to random unknowns for automation. Anyone who has written a complex TCL script can tell you that. Perhaps these kinds of blatant problems will help to usher in proper (non-CLI-driven) automation.
    Replies
    1. With next step in the automation saga being that devices cannot be configured at all with anything other than the "Prime" management system (not free, of course). Nice.

      Sometimes you just need CLI, and it is reasonable to say that it can be for more than just a few lines.
  6. Naaach - you thought you had encountered everything in the past 15 years, from malicious redundant CMM modules leaving switches in ashes while switching mastership, to isolated routing engines during a "nonstop" (!) software upgrade in virtual chassis deployments, to IPv6 stacks left nonfunctional after a software update in "rock solid" routers - and it gets even worse! I feel like proven technology that was the base of reliable IT infrastructures for decades has been caught in the vortex of home-user-fashioned banana engineering. So let's get ready to be surprised by upcoming errors in the future.
    Replies
    1. In totally unrelated news I stumbled upon this:

      https://www.quora.com/Are-Cisco-and-Juniper-still-good-companies-to-work-for-on-engineering-roles/answer/Tony-Li-19?srid=umIx
  7. Don't worry, $VENDOR stuffed up the copy via SCP method, turns out they had two different config parsers, one for loading from CLI, one from file.

    Guess how we discovered this?
    Replies
    1. I don't think I want to know the details ;)) Two config parsers... mind blown. No wonder that code is so bloated I can only fit one of them into my VIRL VM.
  8. Well that is why N9K has API, so you don't have to cut and paste any more long snippets of CLIs, enjoy the power of APIs :)
    Replies
    1. You probably failed to read the part of my blog post where I explained how NX-API returned 404 errors... Enjoy the power of randomly-failing API ;)
    2. I sat in an NX sales preso today where a slide said "API for critical features". I guess critical sounded better than "some".
  9. Maybe we can get them to improve this new "Configuration Random Early Drop" (CRED) feature by preferentially dropping "accept" vs. "deny" lines when you paste long ACLs: Weighted CRED! Would be much more secure, no?
  10. Big management changes, basic business strategy problems, flushing of talent for some cost sparing...
    Do you remember Ascend, Lucent, Nortel, Digital Equipment, Alcatel, etc.?
    The same is coming to more and more companies.
    Huawei and others are eating up their businesses... Game is over... It is time to become vendor-independent and rely more and more on open source. Even in hardware.
  11. This and other great stories recently made me write up another story about how Cisco simply does not seem to care: https://mirceaulinic.net/2017-04-14-cisco-xr-xml-agent-fun/
  12. This goes beyond the nx9k. I've experienced this behavior in nx7k running older versions of nxos than you'd expect.

    An attempt to copy and paste the configuration from a production switch to a test nx7k failed miserably, as the config would seemingly randomly get completely borked up. I ended up pasting the config into Word, then pasting it in two pages at a time. The only viable way to load large configs is copy ftp start and reload.

    Pretty sure I've seen this on IOS XE as well
  13. I think Ivan will post an update on this soon. It turns out it is a bug in the Linux kernel and affects ANY switch vendor using that Linux kernel version. Check Ubuntu Bug #1208740 for details. Hope this helps anyone who experienced this issue on any vendor's gear.