Response: Upgrading Network Device Software

I got numerous responses to the “Why Does It Take So Long to Upgrade Network Devices” blog post, the best ones coming from Béla Várkonyi and Frederic Cuiller.

Béla is sick and tired of the stuff vendors are shipping:

An important aspect is the terrible quality of network device software. You install a new version and a lot of things are broken. You have a perfect configuration in software simulation, but it does not work on hardware-optimized devices, since the porting is never finished and barely tested by the vendor. The documentation is just a bad joke, and the support forums are full of unanswered questions.

There are so many bad experiences that most network engineers are reluctant to touch a working system, for good reasons. You should change to a better environment with fully automated regression testing. However, when you look at the new SDN projects, the same problems come back. Everyone is rushing to include new features, and the automated tests have less than 10-20% coverage.

No one wants to pay for quality. Good enough is the target... :-)
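Béla's point about fully automated regression testing is worth illustrating. Here's a minimal sketch of what such a test could look like, written in pytest style; the collect_state() helper is a made-up stand-in for whatever you'd use to pull operational state off a device (NETCONF, gNMI, NAPALM getters...), and the expected values are obviously fictional:

```python
# Minimal regression-test sketch (run with pytest). The device interaction is
# stubbed out; a real implementation would query the device or a simulation.

EXPECTED_STATE = {
    "bgp_neighbors_established": 4,
    "ospf_neighbors_full": 2,
    "core_uplinks_up": 2,
}

def collect_state(device: str) -> dict:
    """Hypothetical stand-in for a real state-collection call."""
    # Hard-coded values so the sketch is self-contained and runnable.
    return {
        "bgp_neighbors_established": 4,
        "ospf_neighbors_full": 2,
        "core_uplinks_up": 2,
    }

def test_core_router_state():
    state = collect_state("core-rtr-1")
    for key, expected in EXPECTED_STATE.items():
        assert state[key] == expected, f"{key}: got {state[key]}, want {expected}"
```

Run the same suite against the software simulation and against the real hardware, and you'll quickly find out how finished the porting really is.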

Fred touched upon several SP-related aspects:

I see different aspects of this problem from my SP window:

  • Technical: you covered it in the article. Tools for network devices did not exist in the past; this is being addressed (or already is addressed) on most platforms.
  • Risk culture: despite all these fancy new features to ease and speed up software upgrades, it still takes ages to upgrade a network (whether it's 10 devices for a small enterprise or 1000+ devices for an SP backbone). The human factor is important. Telcos, for example, have a strong quality culture and a long history of risk control, and it's hard for them to imagine upgrading several devices in parallel, in one shot, by something totally automated. What if it goes wrong? What if this loop is not secured? What if the NOC did not restore an IGP metric during the last maintenance window?
  • Checks at scale: Upgrading is easy. Checking the state of the device or the traffic before and after, and comparing the two at scale, is a challenge. You need those checks to validate that your software upgrade is a success, or to trigger a rollback. Worse: it's not an apples-to-apples comparison: you have churn in your IGP routes (so you end up with a small difference), and you might have traffic shifting somewhere else because of BGP.
  • Worse still: most routers' deep health checks are vendor-dependent, platform-dependent, and software-dependent. It's a tough one to manage. From past experience, you add more and more checks and end up with a crazy list. Again, automation could help. I also think telemetry could help: you should be able to confidently say your router is capable of taking production traffic by looking at a dashboard containing key metrics.

Finally, if your network is correctly designed, if your architecture redundancy is regularly tested (or you're confident it's working), and you have the right tools, you don't have any technical excuse not to regularly roll out new software releases at scale. Now, I think as network professionals we should also address the human factor and education around risk.
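
Fred's “checks at scale” point is where most of the real work hides. As a rough illustration of why it's not an apples-to-apples comparison, here's a small Python sketch that compares pre- and post-upgrade snapshots while tolerating a bit of routing churn; the metric names and thresholds are invented for illustration, and a real checklist would be much longer and partly platform-specific:

```python
# Sketch of a pre/post upgrade comparison that tolerates "normal" churn
# (a few IGP routes or BGP paths coming and going) instead of demanding
# an exact match. All numbers and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class Snapshot:
    igp_routes: int
    bgp_paths: int
    interface_errors: int

def upgrade_problems(pre: Snapshot, post: Snapshot,
                     route_tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Allow small relative drift in route counts (churn is expected).
    if abs(post.igp_routes - pre.igp_routes) > pre.igp_routes * route_tolerance:
        problems.append(f"IGP routes drifted: {pre.igp_routes} -> {post.igp_routes}")
    if abs(post.bgp_paths - pre.bgp_paths) > pre.bgp_paths * route_tolerance:
        problems.append(f"BGP paths drifted: {pre.bgp_paths} -> {post.bgp_paths}")
    # Error counters should not increase across a maintenance window.
    if post.interface_errors > pre.interface_errors:
        problems.append("interface error counters increased")
    return problems

if __name__ == "__main__":
    pre = Snapshot(igp_routes=12000, bgp_paths=850000, interface_errors=3)
    post = Snapshot(igp_routes=12040, bgp_paths=851200, interface_errors=3)
    issues = upgrade_problems(pre, post)
    print("rollback needed:" if issues else "upgrade validated", issues)
```

Returning the list of failed checks (instead of a plain pass/fail) matters in practice: it's what the NOC will look at when deciding whether to roll back.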

5 comments:

  1. Recently had the joy of a mass software upgrade via Cisco Meraki to fix a security issue.
    The outcome was that half the access points went into repeater mode and had to be manually reset to gateway mode; they kept jumping back into repeater mode, so the access switches had to be upgraded as well.
    Still, the work experience guys at Meraki must be learning fast about blast radius.
  2. When you deliver systems, particularly to government customers, you need your firmware to be compliant with all security requirements (features & security bug fixes). Usually vendors integrate security fixes into the latest & greatest release, so the firmware stability often suffers.

    From my perspective this is the main driver to upgrade the firmware - security, not feature richness.
  3. I worked at Cisco for a little while a few years ago. The process for recommending IOS versions to customers was based on the most stable and most deployed version (across the customer base). That version was typically 6-12 months old, sometimes even older. OS version upgrades for customers who had an Advanced Services contract with Cisco were processed through a risk analysis tool, which looked at the features being used by the customer, the features required, and the known bugs associated with the version being proposed. All in all, it was quite a process! Now, as a customer, things have been slightly different with NX-OS, but generally, unless a critical feature is required, we stick to the "Recommended" version (varies with platform) - which tends to be 4-12 months old as well.
  4. Thanks for the insider info, Salman. I do have a question though: what about engineering releases for one-off issues - was it still the same process?
  5. Regarding engineering releases, I can only speak from my experience:
    They go through a Cisco-internal process where the decision is made _if_ an engineering special will be created (criteria are most likely the customer, the magnitude of the issue, the ETA of the regular bugfix release, the effort for the engineering special, ...)
    If you get one, it will come with a "big sticker" telling you basically that it will fix your specific issue but might break "everything else", and that you'll have to move to a regular release ASAP.
    Engineering releases are more or less completely untested except for the bugfix, at least as far as I know...