
Let’s Drop Some Random Commands, Shall We?

One of my readers sent me a link to CCO documentation containing this gem:

Beginning with Cisco NX-OS Release 7.0(3)I2(1), Cisco Nexus 9000 Series switches handle the CLI configuration actions in a different way than before the introduction of NX-API and DME. The NX-API and DME architecture introduces a delay in the communication between Cisco Nexus 9000 Series switches and the end host terminal sessions, for example SSH terminal sessions.

So far so good. We can probably tolerate some delay. However, the next sentence is a killer…

2017-04-05: The wonderful information disappeared from Cisco's documentation within 24 hours with no explanation whatsoever. However, I expected that and took a snapshot of that page before publishing the blog post ;)

This delay causes the configuration lines to be dropped randomly when pasting the configurations to the switches. In most cases, the severity of the issue is directly proportional to the length of the configurations that are pasted into the terminal sessions. For example, pasting an ACL with greater than 600 lines often results in more lines getting dropped than pasting an ACL with only 100 lines.

Wait, WHAT? Your latest software release is randomly dropping configuration commands and you find it appropriate to document the behavior in some obscure section of the documentation instead of fixing it? What happened to the company I liked to work with for decades? This approach literally makes me sick.

I can’t possibly fathom how someone could get the idea that it’s perfectly fine to take commands received over a reliable communication channel (SSH sessions run over TCP the last time I checked) and randomly drop a few of them for convenience reasons. Would it be so hard to wait for the previous command to finish and then read the next line from the TCP buffer? Or use NX-API internally to execute CLI commands if that’s the only reliable way to talk to the box?
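
For illustration, here is a minimal Python sketch of the "wait for the previous command to finish, then send the next line" idea, seen from the client side. Everything in it is hypothetical: `send_config_paced`, the `send` and `read_until_prompt` callables, and the `FakeSwitch` class are made-up names. `FakeSwitch` is a toy that drops input while it is busy. It is not a model of how NX-OS actually behaves.

```python
from typing import Callable, Iterable, List, Optional


def send_config_paced(
    lines: Iterable[str],
    send: Callable[[str], None],
    read_until_prompt: Callable[[], str],
) -> List[str]:
    """Send one configuration line at a time.

    After each line, block until the device prompt comes back before
    sending the next one. Blank lines are skipped.
    """
    outputs = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        send(line + "\n")
        outputs.append(read_until_prompt())
    return outputs


class FakeSwitch:
    """Toy device (an assumption for this sketch, not real NX-OS behavior).

    It can hold only one unprocessed line. Anything sent while a line
    is still pending is silently dropped.
    """

    def __init__(self) -> None:
        self.received: List[str] = []
        self._pending: Optional[str] = None

    def send(self, data: str) -> None:
        if self._pending is not None:
            return  # busy: the new line is lost
        self._pending = data.rstrip("\n")

    def read_until_prompt(self) -> str:
        # "Processing" the pending line frees the device for the next one.
        if self._pending is not None:
            self.received.append(self._pending)
            self._pending = None
        return "switch(config)# "


ACL = [
    "ip access-list TEST",
    "10 permit tcp any any eq 22",
    "20 deny ip any any",
]

if __name__ == "__main__":
    switch = FakeSwitch()
    send_config_paced(ACL, switch.send, switch.read_until_prompt)
    print(switch.received)
```

With a real device, the two callables would wrap whatever SSH channel you use: one writes to it, the other reads until the prompt shows up again. The point is only that pacing on the prompt, instead of pacing on a timer, never gives the toy device a chance to be busy when the next line arrives.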

This makes any CLI-based automation totally unreliable (not that it was ever completely reliable), and as the documentation succinctly explains, even cut-and-paste is no longer guaranteed to work. The only “reliable” mechanism might be scp file device:running-config, unless they broke that one as well.
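
If pasted lines can silently disappear, the least you can do is check afterwards. The following is a rough Python sketch of such a check, assuming you have the intended snippet and a copy of the resulting running configuration as plain text. The function name `find_missing_lines` and the sample ACL are made up for this example. It only tests whether each intended line appears somewhere in the running configuration, so it ignores configuration context, and it will report false positives on any device that rewrites commands when displaying them.

```python
from typing import List


def normalize(line: str) -> str:
    """Collapse runs of whitespace so indentation differences don't matter."""
    return " ".join(line.split())


def find_missing_lines(intended: str, running: str) -> List[str]:
    """Return intended lines (normalized) that do not appear in running.

    Blank lines and lines starting with '!' (comments) are ignored.
    The order of the intended snippet is preserved.
    """
    present = {normalize(line) for line in running.splitlines()}
    missing = []
    for line in intended.splitlines():
        candidate = normalize(line)
        if not candidate or candidate.startswith("!"):
            continue
        if candidate not in present:
            missing.append(candidate)
    return missing


INTENDED = """\
ip access-list TEST
  10 permit tcp any any eq 22
  20 permit tcp any any eq 443
  30 deny ip any any
"""

# Hypothetical result of a paste in which sequence 20 was dropped.
RUNNING = """\
ip access-list TEST
  10 permit tcp any any eq 22
  30 deny ip any any
"""

if __name__ == "__main__":
    for line in find_missing_lines(INTENDED, RUNNING):
        print("missing:", line)
```

A check like this does not fix anything, but it turns a silent drop into a visible failure that an automation script can retry or report.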

On a totally unrelated note, I had to switch from NX-API to CLI during my Ansible for Networking Engineers webinar because the NX-API got less reliable with every software update, returning random 404 (page not found) errors. Admittedly I was running an NX-OS image in VIRL, but I got similar reports from engineers running real-life networks.

Even though my calendar claims it’s 2017, it seems like I’ll have to add another line to the Network Automation RFP Requirement: device should not drop random commands received over any management-plane communication channel. Being big-time into Model Driven Manageability doesn’t help much if you can’t get the fundamentals right.

19 comments:

  1. Might be a "not cleared up" early april fool hoax?

    //Updated:Mar 31, 2017

    greeting, Matthew

  2. Delay 10 on CRT is standard practice for Junos CLI

    Replies
    1. ah... you know about 'load ... terminal', e.g. 'load set terminal' in configuration mode?

  3. Good post and good to know about 9k. the SCP approach is spot on. I had some projects with a similar issue on other platforms with certain limits/bugs with cut/paste. So I would upload my config snippet files and use alias commands or EEM applets to add or remove configuration items from the running config.

  4. I find it annoying enough if platforms have issues keeping up when copy-pasting over serial console (some 1RU Cisco switches like to do that), but over SSH? This is madness. Sheer and utter madness.

    But then, it's NX-BU, not "the company you liked to work with". *That* is probably outsourced to nowhereland today, while the rest of the company enjoys the BU infighting.

  5. Cisco 9k isn't the only platform that suffers from this. I see this all the time with multiple vendors...enough so that I've tuned my terminal app to paste much more slowly than it's capable of doing.

  6. So happy to run a real Linux-based network OS when I read stuff like this. Even Ivan points out though.... CLI was always prone to random unknowns for automation. Anyone who has written a complex TCL script can tell you that. Perhaps these kinds of blatant problems will help to usher in proper (non-CLI-driven) automation.

    Replies
    1. With next step in the automation saga being that devices cannot be configured at all with anything other than the "Prime" management system (not free, of course). Nice.

      Sometimes you just need CLI, and it is reasonable to say that it can be for more than just a few lines.

  7. Naaach - you thought you encountered everything in the past 15 years, from malicious redundant CMM modules in switches left in ashes while switching mastership, over isolated routing engines while doing a "nonstop" (!) software upgrade in virtual chassis deployments, to IPv6 stacks left without function after a software update in "rock solid" routers - it even gets worse! I feel like proven technology that was the base for reliable IT infrastructures for decades has got into the vortex of home-user-fashioned banana engineering. So let's get surprised by upcoming errors in the future.

    Replies
    1. In totally unrelated news I stumbled upon this:

      https://www.quora.com/Are-Cisco-and-Juniper-still-good-companies-to-work-for-on-engineering-roles/answer/Tony-Li-19?srid=umIx

  8. Don't worry, $VENDOR stuffed up the copy via SCP method, turns out they had two different config parsers, one for loading from CLI, one from file.

    Guess how we discovered this?

    Replies
    1. I don't think I want to know the details ;)) Two config parsers... mind blown. No wonder that code is so bloated I can only fit one of them into my VIRL VM.

  9. Well that is why N9K has API, so you don't have to cut and paste any more long snippets of CLIs, enjoy the power of APIs :)

    Replies
    1. You probably failed to read the part of my blog post where I explained how NX-API returned 404 errors... Enjoy the power of randomly-failing API ;)

    2. I sat in NX sales preso today where slide said "API for critical features". I guess critical sounded better than "some".

  10. Maybe we can get them to improve this new "Configuration Random Early Drop" (CRED) feature by preferentially dropping "accept" vs. "deny" lines when you paste long ACLs: Weighted CRED! Would be much more secure, no?

  11. Big management changes, basic business strategy problems, flushing of talent for some cost sparing...
    Do you remember Ascend, Lucent, Nortel, Digital Equipment, Alcatel, etc.?
    Similar is coming to more and more companies.
    Huawei and others are eating up their businesses... Game is over... It is time to become vendor independent and rely more and more on open source. Even in hardware.

  12. This and other great stories recently made me write up another story about how Cisco simply does not seem to care: https://mirceaulinic.net/2017-04-14-cisco-xr-xml-agent-fun/
