Microsoft engineers published an analysis of switch failures in 130 Azure regions (review of the article, The Next Platform summary):
- A data center switch has a 2% chance of failing in 3 months (= less than 10% per year);
- ~60% of the failures are caused by hardware faults or power failures, another 17% are software bugs;
- 50% of failures lasted less than 6 minutes (obviously crashes or power glitches followed by a reboot).
- Switches running SONiC had lower failure rate than switches running vendor NOS on the same hardware. Looks like bloatware results in more bugs, and taking months to fix bugs results in more crashes. Who would have thought…