During one of my SDN workshops, an attendee asked me “How do you build carrier-grade (5 nines) cloud infrastructure with VMware NSX?”
Short answer: You don’t… and it’s a wrong question anyway.
Before Delving into Details (aka Disclaimer)
This is not an NSX-related blog post. It just happened that the attendee tried to accomplish the Mission Impossible with NSX. He could have chosen Juniper Contrail or Nuage VSP or anything else while facing the same pointless task.
I’ve encountered two compute infrastructure products that were probably close to what people call carrier-grade in my day – IBM mainframes and Tandem minicomputers. Both were incredibly complex and expensive, and ran short user-written transactions on top of fully redundant software and hardware infrastructure.
It’s impossible to reproduce the same feat in an Infrastructure-as-a-Service cloud environment because the workload isn’t composed of short ACID transactions but of servers of unknown quality. You might be able to build a cloud infrastructure with 5-nine reliability, but it would be a totally wasted effort if the workload running on top of it crashes (or is brought down for patching). See also High Availability Fallacies for more details.
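A back-of-the-envelope sketch shows why the effort is wasted. It assumes that the infrastructure and the workload fail independently and that both must be up for the service to work, so their availabilities multiply. The 99.5% workload figure is a made-up illustration, not a measurement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*layers: float) -> float:
    """Availability of a stack in which every layer must be up.

    Assumes the layers fail independently of each other.
    """
    result = 1.0
    for availability in layers:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


infrastructure = 0.99999  # five nines
workload = 0.995          # hypothetical server that crashes or gets patched

combined = serial_availability(infrastructure, workload)

print(f"infrastructure alone:      {downtime_minutes_per_year(infrastructure):8.1f} min/year")
print(f"workload alone:            {downtime_minutes_per_year(workload):8.1f} min/year")
print(f"infrastructure + workload: {downtime_minutes_per_year(combined):8.1f} min/year")
```

Under these assumptions the five-nines infrastructure contributes roughly five minutes of downtime per year, while the combined stack is down for roughly 44 hours. The weakest layer dominates the result.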
The only way to build a solution with more than 99.9% availability is (according to James Hamilton) to build an application-layer solution running in multiple availability zones, and once you do that, you don’t care that much about the availability of individual zones as long as it’s reasonably high.
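The same kind of arithmetic illustrates the multi-zone argument. The sketch below assumes that zones fail independently and that the application fails over instantly and perfectly, which is optimistic (real deployments have correlated failures and imperfect failover), so treat the results as an upper bound.

```python
def parallel_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that is up as long as at least one zone is up.

    Assumes independent zone failures and perfect application-layer failover.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    return 1.0 - (1.0 - zone_availability) ** zones


# A "reasonably high" 99.9% per zone, deployed in one, two, or three zones
for zones in (1, 2, 3):
    print(f"{zones} zone(s): {parallel_availability(0.999, zones):.9f}")
```

With these idealized assumptions, two three-nines zones already exceed five nines on paper, which is why the availability of an individual zone matters much less once the application spans several of them.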
Building Carrier-Grade Infrastructure
Twenty-five years ago we had simple routers and switches, and we knew how to build resilient networks with redundant boxes and routing protocols. Then the traditional service providers learned how to spell IP and wanted to implement their existing operational practices in this brave new world… prompting the networking vendors to build increasingly complex infrastructure products like redundant supervisors, non-stop forwarding, and in-service software upgrade.
Guess what – complex products tend to be expensive to build and operate. The carriers complaining about the high cost of networking gear and lustfully looking at what Google, Facebook, Amazon and Azure are doing should stop yammering and admit that they got what they asked for.
Obviously some people never learn, and now that the carriers turn their attention toward the new fad – Network Function Virtualization – they want to repeat the same mistake, and want cloud architects to build carrier-grade infrastructure on which they’ll run unreliable workloads.
Definitely not Einstein
The Way Forward
The more I look at what various organizations are doing (and succeeding or failing along the way), the more I’m convinced that there’s only one way to reduce the overall costs of running your IT infrastructure:
- Set realistic goals based on actual business needs;
- Build good enough infrastructure that is easy to operate at reasonable costs;
- Build the few applications that actually need very high availability (not everything needs five nines) using modern design-for-failure architectural principles. See also Cloud Native Applications for Dummies.
Numerous large-scale companies have proven that this approach works, but of course it requires a major change in the way your company develops and deploys applications.
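To make the design-for-failure idea a bit more tangible, here is a minimal sketch of a client that treats the failure of any single zone as a normal event and simply tries the next one. Everything in it is hypothetical: `call_with_failover`, `zone_a`, and `zone_b` are illustrative names, and the zones are plain Python functions standing in for real service endpoints.

```python
from typing import Callable, Sequence


class AllZonesFailed(Exception):
    """Raised when no zone could serve the request."""


def call_with_failover(zones: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each zone in order and return the first successful response."""
    errors = []
    for zone in zones:
        try:
            return zone(request)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise AllZonesFailed(errors)


def zone_a(request: str) -> str:
    # Hypothetical zone that is currently unavailable
    raise ConnectionError("zone A is down for patching")


def zone_b(request: str) -> str:
    # Hypothetical healthy zone
    return f"zone B handled {request}"


print(call_with_failover([zone_a, zone_b], "order-42"))
```

The point of the sketch is where the resilience lives: in the application logic, not in the infrastructure underneath any single zone. A production implementation would also need timeouts, health checks, and state replication, none of which are shown here.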
Want to know more?
You might find some answers in my Designing Private Cloud Infrastructure or Designing Active-Active and Disaster Recovery Data Centers webinars, or maybe what you need is a more comprehensive overview of data center networking.