Worth Reading: Azure Datacenter Switch Failures
Microsoft engineers published an analysis of switch failures across 130 Azure datacenter locations (review of the article, The Next Platform summary):
- A data center switch has a 2% chance of failing in 3 months, which works out to less than 10% per year (see the quick check after this list);
- ~60% of the failures are caused by hardware faults or power failures; another 17% are software bugs;
- 50% of the failures lasted less than 6 minutes (obviously crashes or power glitches followed by a reboot);
- Switches running SONiC had a lower failure rate than switches running a vendor NOS on the same hardware. Looks like bloatware results in more bugs, and taking months to fix bugs results in more crashes. Who would have thought…
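A quick sanity check of the annualized figure, as a minimal Python sketch (assuming failures in successive quarters are independent, which is my simplification, not something the paper claims):

```python
# Minimal sanity check of "2% per quarter = less than 10% per year".
# Assumes failures in successive quarters are independent (my simplification,
# not something the paper states).
quarterly_failure_probability = 0.02

annual_failure_probability = 1 - (1 - quarterly_failure_probability) ** 4
print(f"annualized failure probability: {annual_failure_probability:.1%}")
# prints: annualized failure probability: 7.8%
```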
"data from over 180,000 switches running in its datacenters, which spanned 130 different geographical locations"
So that's an average of ~1,400 switches per geographical location. Can we conclude that the average fabric size of the Azure network is ~1,400 routers? I assume not all geographical locations have fabrics of the same size, so what's the largest fabric at Azure? 10k routers maybe? Definitely smaller than 100k routers.
I read the paper, and from what section 2 describes, 180k looks like the total number of switches across 130 DCs, so indeed there are some 1,400 switches per DC on average. That makes more sense, as I've never believed you need even 10k switches for a 100k-server DC, even with commodity switches. Thanks a lot Ivan for bringing this data to light and putting an end to this question!!
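As a rough cross-check of that intuition, here's a back-of-the-envelope sketch based on the textbook three-tier fat-tree formulas (a k-ary fat-tree built from k-port switches supports k^3/4 servers with 5k^2/4 switches). The switch radixes are illustrative assumptions, not numbers from the paper:

```python
# Textbook k-ary fat-tree: k pods of k/2 leaf + k/2 aggregation switches,
# plus (k/2)^2 spine switches, for 5*k^2/4 switches serving k^3/4 servers.
# The radixes below are illustrative assumptions, not Azure data.
def fat_tree_size(k: int) -> tuple[int, int]:
    servers = k ** 3 // 4
    switches = 5 * k ** 2 // 4
    return servers, switches

for k in (48, 64, 80):
    servers, switches = fat_tree_size(k)
    print(f"k={k}: ~{servers:,} servers, {switches:,} switches")
# k=48: ~27,648 servers, 2,880 switches
# k=64: ~65,536 servers, 5,120 switches
# k=80: ~128,000 servers, 8,000 switches
```

Even the 128k-server case stays well under 10k switches, which lines up with the ~1,400-switch per-location average.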
Also Ivan, I don't think these cloud-scale guys actually run flat networks at 10k switches or higher. Previous studies from SPs show that they subdivide their networks into smaller routing domains and run redistribution between them. I have my suspicion cloud providers don't do any better. What goes into presentations and corporate PR releases often doesn't seem to match what happens at ground zero. If you ever find any such info, please share :).
Re hardware problems, one would have thought that after over 3 decades of building high-end hardware, this art had been refined into hard science. But it looks like even for standard features like ECMP, hardware faults are still quite common. Some of the problems brought up by Fastly in their software load-balancing paper include:
Uneven hashing. Some switches under evaluation were incapable of hashing evenly. For this particular switch model, the most and least heavily loaded of the 256 ECMP nexthops differ in allocated traffic share by a factor of approximately six.
Unusable nexthops. Some switches also have odd restrictions on the number of usable ECMP nexthops for any given destination. In particular, one model we tested appears to only support ECMP nexthop set sizes that are of the form {1, 2, …, 15} × 2^n, presumably because of hardware limitations.
You can read more about them in section 6.3 of their paper:
https://www.usenix.org/system/files/conference/nsdi18/nsdi18-araujo.pdf
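To make the second restriction concrete, here's a minimal sketch of what set sizes of the form {1, 2, …, 15} × 2^n would mean in practice. The 256-nexthop limit and the round-down-to-the-nearest-supported-size behaviour are my illustrative assumptions, not vendor-documented rules:

```python
# ECMP next-hop group sizes of the form m * 2**n with m in 1..15, capped at a
# hypothetical hardware limit of 256 next-hops per group. The round-down
# behaviour is an illustration, not a vendor-documented rule.
def supported_sizes(limit: int = 256) -> list[int]:
    sizes = {m * 2 ** n for m in range(1, 16) for n in range(9) if m * 2 ** n <= limit}
    return sorted(sizes)

def usable_nexthops(requested: int, limit: int = 256) -> int:
    # Largest supported group size not exceeding the requested one.
    return max(s for s in supported_sizes(limit) if s <= requested)

for requested in (17, 100, 250):
    print(f"{requested} next-hops -> {usable_nexthops(requested)} usable")
# 17 next-hops -> 16 usable
# 100 next-hops -> 96 usable
# 250 next-hops -> 240 usable
```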
Why would vendors spend so much time adding bullshit features that hardly anyone uses, incl. shitty AI/ML capabilities, when they can't get their basics to work without a hiccup??? And they expect us to believe in breathtaking technological progress. Wow!
@Minh Ha: I can easily answer the last question: "Why would vendors spend so much time adding bullshit features that hardly anyone uses, incl. shitty AI/ML capabilities, when they can't get their basics to work without a hiccup?" - because bullshit sells, and fixing bugs doesn't.
When was the last time a major organization made code quality one of the important decision criteria in a public RFP?
100% Ivan! And we have even more BS now than, say, 15-20 years ago, because after all the low-hanging fruit in R&D has been picked, coming up with better products is getting more and more painful for vendors.
Ethan Banks wrote a funny piece about his tech disillusionment over the years, which you might find entertaining ;) :
https://packetpushers.net/i-used-to-think-now-i-think/