Why Does It Take So Long to Upgrade Network Devices?

One of my readers sent me a question about his favorite annoyance:

During my long practice, I've never seen an enterprise successfully manage the network device software upgrade/patching cycle. It seems like nothing has changed in the last 20 years - despite technical progress, it still takes years (not months) to refresh the software in your network.

There are two aspects to this:

  • Why does it take so long to validate a new software release, why are releases still monolithic blobs, and why are they always full of fresh bugs? Let’s postpone this one for another blog post ;) – in the meantime, read this blog post by Tom Hollingsworth (in particular the I Feel the Need for Speed section).
  • Why does it take so long to roll out new software?

The second aspect is entirely our fault. We were so keen on being CLI heroes that we ignored what was happening around us. The leading $vendor having ancient software that can barely spell API and never got configuration commit capability didn’t help either – the code base for their enterprise routing platform is almost four decades old, and their switching platform got configure replace in 2017.

Well, not everyone made the same mistake. I know people who roll out software upgrades with automated scripts (here are a few ideas to get you started). I’ve also heard of people who bricked hundreds of routers because of insufficient checks and a lack of gradual rollout.
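Just to make the idea more concrete, here’s roughly what such a script could look like – a minimal Python/Netmiko sketch, with made-up hostnames, credentials and health checks, and the platform-specific upgrade step deliberately left out. Treat it as a starting point, not a recipe for your platform:

```python
# Minimal sketch of a checked, gradual rollout -- not production code.
# Assumes Netmiko is installed and the devices speak an IOS-style CLI;
# hostnames, credentials and the health checks are all invented.
from netmiko import ConnectHandler

DEVICES = ["edge-01", "edge-02", "edge-03"]
CREDS = {"device_type": "cisco_ios", "username": "admin", "password": "secret"}

def healthy(conn) -> bool:
    """Crude health check -- replace with whatever 'working' means in your network."""
    version = conn.send_command("show version")
    neighbors = conn.send_command("show ip ospf neighbor")
    return "uptime" in version.lower() and "FULL" in neighbors

for host in DEVICES:                       # one device at a time = gradual rollout
    conn = ConnectHandler(host=host, **CREDS)
    if not healthy(conn):
        print(f"{host}: pre-check failed, stopping the rollout")
        conn.disconnect()
        break

    # ... copy the image, set the boot variable, reload, wait for the box to
    #     come back -- all platform-specific and intentionally left out ...
    conn.disconnect()
    conn = ConnectHandler(host=host, **CREDS)   # fresh session after the reload

    if not healthy(conn):
        print(f"{host}: post-check failed, fix this device before touching the next one")
        conn.disconnect()
        break

    print(f"{host}: upgraded and verified, moving on")
    conn.disconnect()
```

The interesting part isn’t the few lines of code – it’s deciding what healthy() should actually check, and having the discipline to stop the rollout the moment a check fails.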

I know people who vote with their wallets and buy products that support automation. These days you can buy a router, a switch, a firewall, and (probably) a load balancer that has had full automation support for years if not decades, from a major vendor, at a reasonable price… but as long as everyone keeps buying what they’ve been buying for decades, we won’t move anywhere. It’s as if we kept buying hierarchical databases like IBM IMS instead of using MySQL or any other modern relational database (but wait… there are people talking about migrating mainframe applications to AWS).

Anyway, until we can automate our stuff and prove that it works, we won’t move forward. Imagine bank tellers having to execute transactions by typing SQL INSERT/UPDATE queries directly into a production database with no commit/rollback capability. I doubt they would move any faster than we do.
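If that analogy sounds abstract, here’s the safety net bank tellers get for free from any transactional database – a tiny Python/sqlite3 sketch with an invented accounts table. Most networking CLIs give you nothing comparable:

```python
# Illustration of the commit/rollback safety net the analogy refers to --
# the table and amounts are invented for the example.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 100)")
db.commit()

try:
    with db:  # everything inside either commits as a whole or rolls back
        db.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (balance,) = db.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("overdraft -- abort the whole transaction")
        db.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError as err:
    print("rolled back:", err)

print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 100)]  -- nothing was left half-applied
```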

Unfortunately, most networking engineers don’t even want to admit they have a problem. I never cease to be amazed at how uninterested some enterprise networking engineers are in network automation. It looks like they’ve barely entered the denial phase of grief while everyone else is passing them by left and right.

If you’re not one of them, but simply don’t know where to start, check out ipSpace.net Network Automation webinars. If you’re focused more on solutions and architectures, go for the Building Network Automation Solutions course.

However, before yammering about the sad state of networking, let’s see what everyone else is doing. There are medical systems running on Windows XP, Equifax needed months to patch a critical vulnerability, and there are tons of environments where servers are never patched because they run mission-critical apps. That doesn’t mean we can’t or shouldn’t do better (we definitely should), only that we’re not the only losers in IT.

8 comments:

  1. CLI or API, in either case most of the time is wasted waiting for the operations framework to give the network team a green light for the upgrade. Too many sign-offs, conf calls, political posturing, etc.
    Replies
    1. And that's because we can't treat "software upgrade" as business-as-usual, because too many times something fails. People who deploy stuff weekly (or even multiple times per day) don't have that problem ;)
  2. I think that many of us have used the blunt instruments at our disposal (Expect etc.) to automate the upgrade of the unfriendly products that we have to work with, with varying degrees of success and reliability. So in my experience much of the challenge of deploying new code is that our testing approach is usually still back in the dark ages, and not sufficiently rigorous to convince the change management processes and other stakeholders that we should go ahead.

    Roll on the day when we actually create Ansible playbooks (or the equivalent in any other tool) that test all of the features we’re actually using in a switch/router. Even better: being able to spin up a virtual copy of your environment, prove the actual software or configuration change in situ, and automatically test and validate the operation of the system afterwards. Perhaps then we’d get closer to being on a par with our application development colleagues with their CI/CD pipelines, and put ourselves in a position where we are trusted to make these sorts of changes quickly and frequently.
  3. The other side is keeping track of new updates; that can be a full-time job. Vendors have also done a poor job of this. We need something like WSUS to handle updates easily. It should be part of regular operations, like it is for the server people.
  4. Often there is no reason to upgrade: the release contains no fixes for bugs/vulnerabilities that apply to your network, and no new features are required.
    So, after studying the release notes, if there is no practical advantage it is often better to stay with a proven stable release (and keep rolling out new devices on the same software release).

    Of course, when an update is required, there should be an efficient automated process that minimizes the risk, and the goal should be to get back to a consistent SW release across all devices within a reasonable time frame.

    So, as I see it, network devices often run older releases because there is nothing to be gained from an update.
  5. An important aspect is the terrible quality of network device software. You install a new version and a lot of things are broken. You have a perfect configuration in software simulation, but it does not work on hardware-optimized devices, since the porting is never finished and barely tested by the vendor. The documentation is just a bad joke, and the support forums are full of unanswered questions.

    There are so many bad experiences that most network engineers are reluctant to touch a working system, for good reasons. You should move to a better environment with fully automated regression testing. However, when you look at the new SDN projects, the same problems come back: everyone is rushing to include new features, and the automated tests have less than 10-20% coverage.

    No one wants to pay for quality. Good enough is the target... :-)
  6. Hi.

    Thanks for the article.

    I see different aspects of this problem from my SP window:

    1 - Technical: you covered it in the article. The tools did not exist in the past for network devices, and this is being addressed/is already addressed on most platforms.

    2 - Risk culture: despite all these fancy new features to ease and speed up software upgrades, it still takes ages to upgrade a network (whether it’s 10 devices for a small enterprise or 1000+ devices for an SP backbone). The human factor is important. Telcos, for example, have a strong quality culture and a history of risk control, and it’s hard for them to imagine upgrading several devices (even 5) in parallel, in one shot, by something totally automated. What if it goes wrong? What if this loop is not secured? What if the NOC did not restore an IGP metric during the last maintenance window?

    3 - Checks at scale: upgrading is easy. Checking the state of the device or the traffic before and after, and comparing it at scale, is the challenge. You need those checks to validate that your software upgrade is a success, or to trigger a rollback. Worse: it’s not an apples-to-apples comparison: you have churn in your IGP routes (so you end up with a small difference), and you might have traffic shifting somewhere else because of BGP.
    Worse: most routers’ deep health checks are vendor-dependent, platform-dependent and software-dependent. It’s a tough one to manage. From past experience, you add more and more checks and end up with a crazy list. Again, automation could help. I also think telemetry could help: you should be able to confidently say your router is capable of taking production traffic by looking at a dashboard containing key metrics.

    Finally, if your network is correctly designed, if your architecture’s redundancy is regularly tested (or you are confident it works), and if you have the right tools, you have no technical excuse not to roll out new software releases at scale on a regular basis. Now, I think that as network professionals we should also address the human factor and education about risk.

    For the first point you mentioned (why it takes so long to validate a software release and why new releases are so buggy), I will come back in the comments, but with my vendor hat on this time ;-)

    Fred
  7. There is only so much automation can do for a poor monolithic architecture. To realize CD for network software, remember we are crossing the vendor-operator divide which is greater than the Dev-Ops divide, but I think it is eventually possible by taking a page from the architecture that set DevOps and general software CD on fire over the past few years: microservices. I've recently written about it in fact: https://www.linkedin.com/pulse/micro-services-knock-knockin-devnetops-door-james-kelly/
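To illustrate the “checks at scale” point raised in the comments: once you collect device state as structured data, comparing before and after becomes mostly a diffing problem with some tolerance for expected churn. Here’s a minimal sketch – the data structures, numbers and tolerance are invented; real collection would come from NETCONF/gNMI telemetry or parsed CLI output:

```python
# Sketch of the "compare before/after state" idea -- the snapshots below are
# hand-made examples, not real device output.
def snapshot_diff(pre: dict, post: dict, route_churn_tolerance: int = 10) -> list[str]:
    """Return human-readable problems; an empty list means 'looks the same'."""
    problems = []
    if pre["bgp_neighbors_established"] != post["bgp_neighbors_established"]:
        problems.append("BGP neighbor count changed")
    missing = pre["ospf_neighbors"] - post["ospf_neighbors"]
    if missing:
        problems.append(f"OSPF neighbors lost: {sorted(missing)}")
    # routing tables never match exactly -- allow a little churn, flag a big swing
    if abs(pre["igp_routes"] - post["igp_routes"]) > route_churn_tolerance:
        problems.append("IGP route count moved more than the allowed churn")
    return problems

pre = {"bgp_neighbors_established": 4, "ospf_neighbors": {"10.0.0.1", "10.0.0.2"},
       "igp_routes": 1250}
post = {"bgp_neighbors_established": 4, "ospf_neighbors": {"10.0.0.1"},
        "igp_routes": 1248}

for problem in snapshot_diff(pre, post):
    print("post-upgrade check failed:", problem)
# -> post-upgrade check failed: OSPF neighbors lost: ['10.0.0.2']
```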