I got numerous responses to the “Why Does It Take So Long to Upgrade Network Devices” blog post, the best ones coming from Béla Várkonyi and Frederic Cuiller.
Béla is sick-and-tired of the stuff vendors are shipping:
An important aspect is the terrible quality of network device software. You install a new version and a lot of things are broken. You have a perfect configuration in software simulation, but it does not work with hardware-optimized devices, since the porting is never finished and barely tested by the vendor. The documentation is just a bad joke, and the support forums are full of unanswered questions.
There are so many bad experiences that most network engineers are reluctant to touch a working system, for good reasons. You should change to a better environment with fully automated regression testing. However, when you look at the new SDN projects, the same problems come back. Everyone is rushing to include new features, and the automated testing has less than 10-20% coverage.
No one wants to pay for quality. Good enough is the target... :-)
Fred touched upon several SP-related aspects:
I see different aspects of this problem from my SP window:
- Technical: you covered it in the article. Tools did not exist in the past for network devices, and this is being addressed/is already addressed on most platforms.
- Risk culture: despite all these fancy new features to ease and speed up software upgrades, it still takes ages to upgrade a network (whether it's 10 devices for a small enterprise or 1000+ devices for an SP backbone). The human factor is important. Telcos, for example, have a strong quality culture and risk control history, and it's hard for them to imagine upgrading several devices in parallel, in one shot, by something totally automated. What if it goes wrong? What if this loop is not secured? What if the NOC did not restore an IGP metric during the last maintenance window?
- Checks at scale: Upgrading is easy. Checking the state of the device or the traffic before and after the upgrade, and comparing the two at scale, is a challenge. You need those checks to confirm that your software upgrade succeeded or to trigger a rollback. Worse: it's not an apples-to-apples comparison: you have churn in your IGP routes (so you end up with a small difference), you might have traffic shifting somewhere else because BGP
- Worse: most deep router health checks are vendor-dependent, platform-dependent and software-dependent. It's a tough one to manage. From past experience, you add more and more checks, and end up with a crazy list. Again, automation could help. I also think telemetry could help: you should be able to confidently say your router is capable of taking production traffic by looking at a dashboard containing key metrics.
Finally, if your network is correctly designed, if your architecture redundancy is regularly tested (or you are confident it's working) and you have the correct tools, you don't have any technical excuse not to regularly roll out new software releases at scale. Now, I think as network professionals we should also address the human factor/education for risk.
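To make the "checks at scale" point a bit more concrete, here is a minimal sketch of a before/after comparison that tolerates normal churn instead of demanding identical counters. Everything in it is an assumption made for illustration: the `Snapshot` fields, the tolerance values, and the idea that some collector (not shown) has already gathered the data from each device. Real checks would be vendor- and platform-specific, as Fred points out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-device state captured before or after an upgrade."""
    igp_routes: int
    bgp_prefixes: int
    up_interfaces: frozenset
    traffic_mbps: float


# Hypothetical tolerances: relative drift accepted as normal churn.
TOLERANCES = {"igp_routes": 0.02, "bgp_prefixes": 0.05, "traffic_mbps": 0.30}


def within(before: float, after: float, tolerance: float) -> bool:
    """True when `after` is within `tolerance` (relative) of `before`."""
    if before == 0:
        return after == 0
    return abs(after - before) / abs(before) <= tolerance


def compare(before: Snapshot, after: Snapshot) -> list[str]:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    for name, tolerance in TOLERANCES.items():
        b, a = getattr(before, name), getattr(after, name)
        if not within(b, a, tolerance):
            failures.append(f"{name}: {b} -> {a} exceeds {tolerance:.0%} drift")
    # Interfaces are compared exactly: whatever was up must still be up.
    missing = before.up_interfaces - after.up_interfaces
    if missing:
        failures.append(f"interfaces down after upgrade: {sorted(missing)}")
    return failures


def devices_to_roll_back(pre: dict, post: dict) -> dict:
    """Map each device that failed its checks to the list of failures."""
    result = {}
    for device, before in pre.items():
        after = post.get(device)
        if after is None:
            # No post-upgrade data is treated as a failure, not as a pass.
            result[device] = ["no post-upgrade snapshot"]
            continue
        failures = compare(before, after)
        if failures:
            result[device] = failures
    return result
```

Feeding it two dictionaries keyed by device name (pre-upgrade and post-upgrade snapshots) returns only the devices that need attention, which is the kind of short list a NOC can act on. The tolerances are the hard part: too tight and every upgrade looks like a failure, too loose and a real problem slips through.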
- Read the original blog post
- Consider these ideas to automate the upgrade process
- Watch network automation webinars on ipSpace.net
- Enroll in the Ansible for Networking Engineers course if you want to master Ansible
- Check out the Building Network Automation Solutions online course if you’d like to get fluent in network automation