Building Carrier-Grade Cloud Infrastructure

During one of my SDN workshops, an attendee asked me: “How do you build carrier-grade (5 nines) cloud infrastructure with VMware NSX?”

Short answer: You don’t… and it’s the wrong question anyway.

Before Delving into Details (aka Disclaimer)

This is not an NSX-related blog post. It just happened that the attendee tried to accomplish the Mission Impossible with NSX. He could have chosen Juniper Contrail or Nuage VSP or anything else while facing the same pointless task.

The Problem

I’ve encountered two compute infrastructure products that were probably close to what people call carrier-grade in my days – IBM mainframes and Tandem minicomputers. Both were incredibly complex and expensive, and ran short user-written transactions on top of fully redundant software and hardware infrastructure.

It’s impossible to reproduce the same feat in an Infrastructure-as-a-Service cloud environment because the workload isn’t composed of short ACID transactions but of servers of unknown quality. You might be able to build a cloud infrastructure with 5-nine reliability, but it would be a totally wasted effort if the workload running on top of it crashes (or is brought down for patching). See also High Availability Fallacies for more details.
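To put rough numbers on that argument, here is a minimal sketch. It assumes the infrastructure and the workload fail independently and that the service is up only when both are up. The 99.9% workload figure is a made-up illustration, not a measurement, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def serial_availability(*layers):
    """Availability of a stack in which every layer must be up.

    Assumes the layers fail independently, so the availabilities multiply.
    """
    result = 1.0
    for availability in layers:
        result *= availability
    return result


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


infrastructure = 0.99999  # the "five nines" infrastructure goal
workload = 0.999          # hypothetical server that crashes or gets patched

combined = serial_availability(infrastructure, workload)

# Infrastructure alone: roughly 5.3 minutes of downtime per year.
print(f"infrastructure: {downtime_minutes_per_year(infrastructure):.1f} min/year")
# Workload alone: roughly 526 minutes per year.
print(f"workload:       {downtime_minutes_per_year(workload):.1f} min/year")
# Combined: roughly 531 minutes per year, almost all of it from the workload.
print(f"combined:       {downtime_minutes_per_year(combined):.1f} min/year")
```

Under these assumptions the combined figure can never exceed that of the weakest layer, so the extra nines in the infrastructure buy only a few minutes per year.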

The only way to build a solution with more than 99.9% availability is (according to James Hamilton) to build an application-layer solution running in multiple availability zones, and once you do that, you don’t care that much about the availability of individual zones as long as it’s reasonably high.
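The arithmetic behind the multi-zone argument can be sketched the same way. This is an idealized model with loudly stated assumptions: zones fail independently, the application is available as long as at least one zone is up, and failover is instantaneous. Real zones share some failure modes, so treat the results as an upper bound. The 99.5% per-zone figure and the function name are hypothetical.

```python
def parallel_availability(zone_availability, zones):
    """Availability of an application that survives while any one zone is up.

    Assumes independent zone failures and instantaneous failover, so the
    application is down only when every zone is down at the same time.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    return 1.0 - (1.0 - zone_availability) ** zones


zone = 0.995  # hypothetical "reasonably high" availability of a single zone

for count in (1, 2, 3):
    print(f"{count} zone(s): {parallel_availability(zone, count):.7f}")
```

In this model two 99.5% zones already exceed four nines, which is why the availability of an individual zone matters much less once the application itself handles zone failure.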

Building Carrier-Grade Infrastructure

Twenty-five years ago we had simple routers and switches, and we knew how to build resilient networks with redundant boxes and routing protocols. Then the traditional service providers learned how to spell IP and wanted to implement their existing operational practices in this brave new world… prompting the networking vendors to build increasingly complex infrastructure products like redundant supervisors, non-stop forwarding, and in-service software upgrade.

Guess what – complex products tend to be expensive to build and operate. The carriers complaining about the high cost of networking gear and lustfully looking at what Google, Facebook, Amazon and Azure are doing should stop yammering and admit that they got what they asked for.

Randy Bush talked about this problem more than two decades ago, but of course nobody listened.

Obviously some people never learn, and now that the carriers turn their attention toward the new fad – Network Function Virtualization – they want to repeat the same mistake, and want cloud architects to build carrier-grade infrastructure on which they’ll run unreliable workloads.

Insanity: doing the same thing over and over again and expecting different results.

Definitely not Einstein

The Way Forward

The more I look at what various organizations are doing (and succeeding or failing along the way), the more I’m convinced that there’s only one way to reduce the overall costs of running your IT infrastructure:

  • Set realistic goals based on actual business needs;
  • Build good enough infrastructure that is easy to operate at reasonable costs;
  • Build the few applications that actually need very high availability (not everything needs five nines) using modern design-for-failure architectural principles. See also Cloud Native Applications for Dummies.

Numerous large-scale companies have proven that this approach works, but of course it requires a major change in the way your company develops and deploys applications.

You could also decide to ignore this trend and continue building ever more complex infrastructure, and get the results you deserve.

Want to know more?

You might find some answers in my Designing Private Cloud Infrastructure or Designing Active-Active and Disaster Recovery Data Centers webinars, or maybe what you need is a more comprehensive overview of data center networking.



  1. The Rome talk is at the same time as OpenNebulaconf, where I'll try to have a little SDN lab time.
    I hope it works out next year, I wanna come listen... definitely!
  2. I totally agree that carriers are getting what they've asked for but are whinging about the complexity. They always go against the KISS principle.

    In terms of Carrier-Grade Cloud, will OPNFV end up the same? I believe their target is to build a Carrier-Grade Cloud or a Carrier-Grade NFV infrastructure using open source software.
    1. The last time I looked they tried to glue the various incompatible bits-and-pieces together. I don't know whether the long-term plan is to build a 5-9 infrastructure on which you could run unreliable workload... in which case I wish them luck ;)
  3. Totally agree. The starting point is application and not the infrastructure. Unfortunately, it has been the other way round.

    Just virtualizing a network function as-is is not going to work; things have to change to make it available. Of course there needs to be support from the underlying infra.
  4. HP Helion OpenStack Carrier Grade is a carrier-grade distribution of OpenStack the leading open source cloud computing platform.

    HP Helion OpenStack Carrier Grade enables carrier grade network functions virtualization (NFV) capabilities to provide communications service providers (CSPs) with an open source based cloud platform that meets their demanding reliability requirements and accelerate their transition to NFV deployments.
    1. Dear Anonymous,

      Do you honestly believe the BS you wrote or is it a distressingly lame attempt at marketing?

      The next time you might want to use your real name and disclose your vendor affiliation or I'll send your comment straight to /dev/null. For now, I'll leave your comment online for everyone to see how $Vendor marketing works.

      Finally, I spoke with a few pilot users of said product, and (without going into any details) I'd _strongly_ suggest you focus on making the product better instead of posting anonymous blog comments - the long-term benefit for your company will be much higher.