Let’s Drop Some Random Commands, Shall We?

Tuesday, April 4, 2017 10:37 +0200


One of my readers sent me a link to CCO documentation containing (at that time) this gem:

Beginning with Cisco NX-OS Release 7.0(3)I2(1), Cisco Nexus 9000 Series switches handle the CLI configuration actions in a different way than before the introduction of NX-API and DME. The NX-API and DME architecture introduces a delay in the communication between Cisco Nexus 9000 Series switches and the end host terminal sessions, for example SSH terminal sessions.

So far so good. We can probably tolerate some delay. However, the next sentence is a killer…

2017-05-08: The behavior is caused by an old bug in the Linux TTY driver. Fixed NX-OS versions are planned to ship in late May 2017. More details here.

2017-04-05: The wonderful information disappeared from Cisco's documentation within 24 hours with no explanation whatsoever. However, I expected that and took a snapshot of that page before publishing the blog post ;)

This delay causes the configuration lines to be dropped randomly when pasting the configurations to the switches. In most cases, the severity of the issue is directly proportional to the length of the configurations that are pasted into the terminal sessions. For example, pasting an ACL with greater than 600 lines often results in more lines getting dropped than pasting an ACL with only 100 lines.

Wait, WHAT? Your latest software release is randomly dropping configuration commands and you find it appropriate to document the behavior in some obscure section of the documentation instead of fixing it? What happened to the company I liked to work with for decades? This approach literally makes me sick.

I can’t possibly fathom how someone could get the idea that it’s perfectly fine to take commands received over a reliable communication channel (SSH sessions ran over TCP the last time I checked) and randomly drop a few of them for convenience reasons. Would it be so hard to wait for the previous command to finish and then read the next line from the TCP buffer? Or use NX-API internally to execute CLI commands if that’s the only reliable way to talk to the box?
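To make the "wait for the previous command to finish" idea concrete, here is a minimal client-side sketch in Python. It is a workaround on the sending side, not a description of how NX-OS handles input. The `send` and `read_until_prompt` callbacks are hypothetical wrappers around whatever session library you use, and `FakeSession` is an in-memory stand-in that exists only to exercise the loop.

```python
from typing import Callable, Iterable, List


def send_lines_paced(
    lines: Iterable[str],
    send: Callable[[str], None],
    read_until_prompt: Callable[[], str],
) -> List[str]:
    """Send one line at a time and wait for the prompt before the next one.

    `send` and `read_until_prompt` are hypothetical callbacks supplied by the
    caller. Returns the device output collected after each line.
    """
    outputs: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # nothing to send for blank lines
        send(line + "\n")
        # Block here until the device has echoed its prompt again.
        outputs.append(read_until_prompt())
    return outputs


class FakeSession:
    """In-memory stand-in for a terminal session (illustration only)."""

    def __init__(self) -> None:
        self.received: List[str] = []
        self._pending = ""

    def send(self, data: str) -> None:
        self.received.append(data.rstrip("\n"))
        # Pretend the device echoes the line followed by a prompt.
        self._pending = data + "switch(config)# "

    def read_until_prompt(self) -> str:
        out, self._pending = self._pending, ""
        return out
```

The loop never has more than one command in flight, which trades speed for not depending on how much input the far end can buffer.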

Not only does this make any CLI-based automation totally unreliable (not that it ever was completely reliable), as the documentation succinctly explains, even cut-and-paste is no longer guaranteed to work. The only “reliable” mechanism might be scp file device:running-config unless they broke that one as well.
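If pasted lines can vanish, the least you can do is check afterwards. The following Python sketch compares the lines you meant to configure with the lines read back from the device and reports what is missing. The function names are made up for illustration, and the comparison is deliberately naive: it only normalizes whitespace, so a device that rewrites commands into a canonical form would show up as false positives.

```python
from typing import Iterable, List


def normalize(line: str) -> str:
    """Collapse runs of whitespace so indentation differences are ignored."""
    return " ".join(line.split())


def find_dropped_lines(intended: Iterable[str], actual: Iterable[str]) -> List[str]:
    """Return intended lines (normalized) that do not appear in `actual`.

    `actual` is assumed to be the relevant part of the configuration as read
    back from the device after the paste.
    """
    present = {normalize(line) for line in actual}
    missing: List[str] = []
    for line in intended:
        wanted = normalize(line)
        if wanted and wanted not in present:
            missing.append(wanted)
    return missing
```

An empty result means every intended line was found; anything else is a list of candidates to re-send and inspect by hand.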

On a totally unrelated note, I had to switch from NX-API to CLI during my Ansible for Networking Engineers webinar because the NX-API got less reliable with every software update, returning random 404 (page not found) errors. Admittedly, I was running an NX-OS image in VIRL, but I got similar reports from engineers running real-life networks.

Even though my calendar claims it’s 2017 it seems like I’ll have to add another line to the Network Automation RFP Requirement: device should not drop random commands received over any management-plane communication channel. Being big-time into Model Driven Manageability doesn’t help much if you can’t get the fundamentals right.

automation


21 comments:

Anonymous 04 April 2017 13:27

Might be a "not cleared up" early april fool hoax?

//Updated:Mar 31, 2017

greeting, Matthew

jsicuran 04 April 2017 19:05

Good post and good to know about 9k. The SCP approach is spot on. I had some projects with a similar issue on other platforms with certain limits/bugs with cut/paste. So I would upload my config snippet files and use alias commands or EEM applets to add or remove configuration items from the running config.

Unknown 04 April 2017 19:37

I find it annoying enough if platforms have issues keeping up when copy-pasting over serial console (some 1RU Cisco switches like to do that), but over SSH? This is madness. Sheer and utter madness.

But then, it's NX-BU, not "the company you liked to work with". *That* is probably outsourced to nowhereland today, while the rest of the company enjoys the BU infighting.

Anonymous 04 April 2017 19:50

Cisco 9k isn't the only platform that suffers from this. I see this all the time with multiple vendors...enough so that I've tuned my terminal app to paste much more slowly than it's capable of doing.

Eric Pulvino 05 April 2017 03:49

So happy to run a real linux-based network OS when I read stuff like this. Even Ivan points out though.... CLI was always prone to random unknowns for automation. Anyone who has written a complex TCL script can tell you that. Perhaps these kinds of blatant problems will help to usher in proper (non-cli-driven) automation.

Replies

R.-Adrian F. 09 April 2017 18:59

With next step in the automation saga being that devices cannot be configured at all with anything other than the "Prime" management system (not free, of course). Nice.

Sometimes you just need CLI, and it is reasonable to say that it can be for more than just a few lines.

Der Peh 05 April 2017 09:00

Naaach - you thought you encountered everything in the past 15 years from malicious redundant CMM Modules in Switches left in Ashes while switching mastership, over isolated routing engines while doing a "nonstop" (!) software upgrade in virtual chassis deployments to ipv6 stacks left without function after software update in "rock solid" Routers - it even gets worse! I feel like proven technology that was the base for reliable IT infrastructures for decades has got in the vortex of home user fashioned banana engineering. So let's get surprised from upcoming errors in future.

Replies

Ivan Pepelnjak 05 April 2017 15:58

In totally unrelated news I stumbled upon this:

https://www.quora.com/Are-Cisco-and-Juniper-still-good-companies-to-work-for-on-engineering-roles/answer/Tony-Li-19?srid=umIx

Julien Goodwin 05 April 2017 13:07

Don't worry, $VENDOR stuffed up the copy via SCP method, turns out they had two different config parsers, one for loading from CLI, one from file.

Guess how we discovered this?

Replies

Ivan Pepelnjak 05 April 2017 15:55

I don't think I want to know the details ;)) Two config parsers... mind blown. No wonder that code is so bloated I can only fit one of them into my VIRL VM.

Victor Zakharyev 06 April 2017 10:49

Majestic!

Anonymous 06 April 2017 15:38

Well that is why N9K has API, so you don't have to cut and paste any more long snippets of CLIs, enjoy the power of APIs :)

Replies

Ivan Pepelnjak 06 April 2017 20:19

You probably failed to read the part of my blog post where I explained how NX-API returned 404 errors... Enjoy the power of randomly-failing API ;)

Anonymous 07 April 2017 18:45

I sat in NX sales preso today where slide said "API for critical features". I guess critical sounded better than "some".

Simon Leinen 08 April 2017 10:48

Maybe we can get them to improve this new "Configuration Random Early Drop" (CRED) feature by preferentially dropping "accept" vs. "deny" lines when you paste long ACLs: Weighted CRED! Would be much more secure, no?

Bela 12 April 2017 13:00

Big management changes, basic business strategy problems, flushing of talent for some cost sparing...
Do you remember Ascend, Lucent, Nortel, Digital Equipment, Alcatel, etc.?
Similar is coming to more and more companies.
Huawei and others are eating up their businesses... Game is over... It is time to become vendor independent and rely more and more on open source. Even in hardware.

Mircea Ulinic 14 April 2017 15:31

This and other great stories recently made me write up another story about how Cisco simply does not seem to care: https://mirceaulinic.net/2017-04-14-cisco-xr-xml-agent-fun/

Josh S 27 April 2017 06:03

This goes beyond the nx9k. I've experienced this behavior in nx7k running older versions of nxos than you'd expect.

Attempted to copy and paste configuration from production switch to testing nx7k failed miserably as the config would seemingly randomly get completely borked up. Ended up pasting the config into word then pasting in two pages at a time. The only way to viably load large configs is copy ftp start and reload.

Pretty sure I've seen this on IOS XE as well

Anonymous 02 May 2017 11:43

I think Ivan will post an update on this soon. It turns out it is a bug in the Linux kernel and affects ANY switch vendor using that Linux kernel version. Check Ubuntu bug #1208740 for details. Hope this helps anyone who experienced this issue from any vendor's gear.
