Worth Reading: Building an OpenStack Private Cloud

It’s uncommon to find an organization that succeeds in building a private OpenStack-based cloud. It’s extremely rare to find one that documented and published the whole process like Paddy Power Betfair did with their OpenStack Reference Architecture whitepaper.

I was delighted to see they decided to do a lot of things I’ve been preaching for ages in blog posts, webinars, and lately in my Next Generation Data Center online course.

Highlights include:

  • Don’t reinvent the wheel – use a commercial distribution (I know half a dozen organizations that tried to build OpenStack from the ground up and failed). They used the Red Hat OpenStack distribution and Nuage VSP instead of vanilla Neutron.

One has to wonder whether the default Neutron implementation shipping with OpenStack is a networking vendor conspiracy to sell more warez ;)

  • They implemented multiple independent OpenStack instances, reducing the size of each failure domain.
  • They use L3 routing between OpenStack instances and DNS-based load balancing across multiple data centers; there’s no L2 DCI (see the DNS sketch after this list).
  • The physical network is a simple leaf-and-spine fabric providing a stable routed infrastructure; layer-2 domains terminate at the rack layer.
  • Provisioning and configuration of hypervisor hosts and the physical network are fully automated: they use Ironic (an OpenStack component) to provision the servers and Arista’s ZTP to provision the switches (see the provisioning sketch after this list).
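
To illustrate the DNS-based load balancing they use between data centers: a global load balancer is, at its core, a DNS service that answers queries only with the addresses of sites that pass a health check. Here’s a minimal Python sketch of that logic; the site addresses (documentation ranges) and the /healthz URL are made-up placeholders, not anything from the whitepaper.

    # Minimal sketch of health-checked DNS answers, the core of DNS-based
    # load balancing across data centers. Addresses and URL are hypothetical.
    import http.client

    SITES = {                        # public VIP of the application in each DC
        "dc1": "192.0.2.10",
        "dc2": "198.51.100.10",
    }

    def healthy(addr, timeout=2):
        """Return True if the site answers its HTTP health check."""
        try:
            conn = http.client.HTTPConnection(addr, 80, timeout=timeout)
            conn.request("GET", "/healthz")
            return conn.getresponse().status == 200
        except OSError:
            return False

    # The A records a health-checking DNS service would hand out right now;
    # an unhealthy data center simply disappears from the answer.
    answers = [ip for ip in SITES.values() if healthy(ip)]
    print("A records to serve:", answers)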
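
And to illustrate the provisioning side: once servers are enrolled in Ironic, pushing them through deployment is a couple of API calls. The sketch below uses the openstacksdk bare-metal API and assumes a clouds.yaml profile named "lab" and nodes with their deploy image already configured; it shows the general flow, not Betfair’s actual tooling.

    # Deploy every available bare-metal node known to Ironic (sketch).
    import openstack

    conn = openstack.connect(cloud="lab")   # assumed clouds.yaml profile

    for node in conn.baremetal.nodes(details=True):
        print(node.name, node.provision_state)
        if node.provision_state == "available":
            # Ironic PXE-boots the server and lays down the image
            # configured in the node's instance_info.
            conn.baremetal.set_node_provision_state(node, "active", wait=True)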

Interestingly, they didn’t use distributed storage (Ceph) but an all-flash storage array (Pure Storage) integrated into OpenStack; a short volume-creation sketch follows below. It was also nice to see they use Ansible to deploy applications, as well as many of the other tools we discussed in the Building Network Automation Solutions online course.
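
From the API side the array integration is invisible: Cinder hides the backend behind a volume type, so consuming Pure Storage looks exactly like consuming any other block storage. A minimal openstacksdk sketch (the cloud profile and volume-type name are hypothetical):

    # Create a 100 GiB volume on a specific backend via its volume type.
    import openstack

    conn = openstack.connect(cloud="lab")        # assumed clouds.yaml profile

    vol = conn.block_storage.create_volume(
        name="app-data-01",
        size=100,                    # GiB
        volume_type="pure-flash",    # hypothetical type mapped to the array
    )
    conn.block_storage.wait_for_status(vol, status="available")
    print(vol.id, vol.status)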

Timing is also interesting:

  • 6 weeks for PoC
  • 8 months for a first-generation production-ready pilot
  • 10 weeks to migrate test applications to pilot (in parallel with pilot deployment)
  • 12-18 months to migrate the rest of the workload
  • 18 months to decommission the old hardware (in parallel with the migration)

Not totally unrelated

Want to become a better data center architect? Learn from the leading industry experts? Explore the Building Next-Generation Data Center online course and register ASAP – there are only a few tickets left for the April 2017 session.

8 comments:

  1. Hi Ivan - thanks for the post. It's been a great project to work on. When we started we found it very hard to find references at the scale we were planning. I hope the White Paper will help others in this position and give something back to the community.
  2. Hi Ivan,
    Thanks for posting this.
    It's great to know this platform is following what you've been teaching on your courses.
  3. Regarding Neutron's default implementation - things are not so bad as they used to be (e.g. see https://www.mirantis.com/blog/openstack-neutron-performance-and-scalability-testing-summary/).
    Replies
    1. "not as bad as they used to be" << that's a nice summary. Anywhere near line rate only with latest NICs, latest kernel, and 9K MTU. Seems 1Gbps is still ludicrous speed in OpenStack world ;))

      http://blog.ipspace.net/2014/11/open-vswitch-performance-revisited.html
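
      For context, this is the back-of-the-envelope math behind the 9K MTU caveat: at a fixed line rate, jumbo frames cut the packet rate the vSwitch must sustain roughly six-fold, which flatters any per-packet-limited data path.

        # Packets per second needed to fill a 10 Gbps pipe at two MTUs.
        LINE_RATE = 10e9                  # bits per second

        for mtu in (1500, 9000):
            wire_bytes = mtu + 38         # +18 Ethernet header/FCS, +20 preamble/IFG
            pps = LINE_RATE / (wire_bytes * 8)
            print(f"MTU {mtu}: {pps / 1e6:.2f} Mpps to reach line rate")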
  4. "Don’t reinvent the wheel – use a commercial distribution " -- been there, tried to move forward and find quite difficult, a commercial distro not only help you in the deployment also allow you to move on and be able to meet deadlines (internals and for your personal interest :) ), doing by hand is possible and you learn a lot but is not the path for all organizations for sure.
  5. "Don’t reinvent the wheel – use a commercial distribution"
    As usual: it really depends. Going with a commercial distribution has its fair share of drawbacks: financial scalability, vendor lock-in, bloat, debatable architectural choices & decisions, and the required in-house knowledge or rather the lack thereof. Don’t fool yourself: if you choose to run a private cloud in the first place and depend on your vendor of choice to solve all or even most of the issues that will pop up, especially the high-priority, more complicated, critical ones occurring at undesirable times... good luck to you. To rephrase an earlier post: been there.
    And be sure to research (continuous) upgrades; a nice installer and .ppt promises are worthless. Granted, recent commercial distributions are (getting) better, but if you already employ the required subject-savvy people and automation tools & processes, rolling and maintaining your own deployment isn’t all that complicated. You can still fall back on support from your vendor, but you’re not completely dependent on it. Until vendors are willing to move from license pushers to actually offering services and building a symbiotic relationship with their customers where interests are aligned, I’d seriously consider all requirements and pros & cons, and not only the technical ones.
    Maybe your private cloud is a smaller setup for testing and development purposes only, or there are other reasons why using a commercial distribution makes perfect sense. But I certainly wouldn’t claim choosing a commercial OpenStack distribution is an obvious no-brainer.

    And as pointed out by Matjaz, now that it’s 2017, even bare-bones OVS Neutron has moved on. It might not yet be able to saturate a 100GbE NIC in a single compute/network node, but I’m not sure which scalable cloud workload would need that. It’s nice to know Neutron/OVS could drive 100GbE at line rate, but it’s all about scaling out, and I’d say you’d be hitting leaf-spine oversubscription at the ToR before running into compute-node network I/O limits (see the quick math below). Ceph is a valid use case for e.g. 25/40GbE NICs and low latency (although there’s still the code path), but as Ceph doesn’t need Neutron/OVS that’s not an issue.
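
    To put rough numbers on that (hypothetical figures, pick your own):

      # Leaf-spine oversubscription for one rack (all figures hypothetical).
      nodes_per_rack = 40
      node_nic_gbps = 10          # per-node connectivity
      uplinks = 4                 # leaf-to-spine uplinks
      uplink_gbps = 40

      downlink = nodes_per_rack * node_nic_gbps
      uplink = uplinks * uplink_gbps
      print(f"{downlink}G down, {uplink}G up -> {downlink / uplink:.1f}:1 oversubscription")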
  6. "Going with a commercial distribution has its fair share of drawbacks" << Couldn't agree more. I'm also positive people had exactly the same sentiment regarding Linux distros a decade ago. I don't know too many people building Linux from sources these days.

    Also, I know teams that spent a year setting up OpenStack from sources. I know other teams that tried (because it would be cheaper than buying something) and failed miserably wasting months and man-years.

    I think a fair summary would be "if you never tried it before, start with a distro; if you know what you're doing, you're not listening to my rambling anyway." Agreed?

    As for OVS performance, do read the details, and don't be misled by 9K MTU measurements. The performance documented in that report is still not anywhere close to where VMware vSwitch was years ago.
    Replies
    1. I knew the Linux distro argument would pop up, and of course it's totally valid :-) I should have stated that I didn't necessarily mean building OpenStack from upstream sources, but rather using distro OpenStack packages, with(out) vendor support if you like, and then deploying OpenStack with your own orchestration tools, i.e. "yum/apt install & configuration files". This is what we did, using SaltStack. There's even quite a number of downloadable deployment projects on e.g. GitHub by now.
      Ask if $vendor is willing to support their own packages when rolled out with your own orchestration tooling, and/or see if $vendor will be using that fact to cover up a support organization lacking sufficient OpenStack know-how (sadly, that last addition isn't purely drivel from a cynical mind).

      I might come across as a grumpy rambler; I just wanted to share a perspective from some folks who have gone through all this in reality. YMMV of course. I appreciate your blog, and indeed it's awesome to see someone taking the effort to share their implementation. I hope we're allowed (time-wise as well) to do the same sometime soon.

      Starting out experiments or a PoC with an installer is certainly a good idea, even if that installer is only devstack/RDO. And using a commercial installer all the way will make and keep some people happy, sure of that. The point is, after the PoC you need to choose a path for your production private cloud. We've experienced the pain of migrating production clouds away from a commercial installer to our own orchestration & deployment (but we're still glad we did, and had very valid reasons for that move to begin with).

      Regarding OVS: your VMware vSwitch comment is obviously valid, and I did read the details (we did some benchmarking ourselves; after all, we had to choose an option). Neutron/OVS is more like NSX, though (multi-tenant, self-service, e.g. VXLAN overlay). As we're not using NSX I haven't looked into actual NSX benchmarking.
      The Mirantis link fails to mention whether they're using the native OVS firewall driver, which is already cleaner & faster (no more veth pairs/Linux bridges/ebtables & iptables mess); see the config sketch below.
      Also see e.g.:
      https://software.intel.com/en-us/articles/implementing-an-openstack-security-group-firewall-driver-using-ovs-learn-actions
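
      For the record, switching the Neutron OVS agent to the native firewall driver is a one-line configuration change. A sketch using Python's configparser; the file path is the usual default, adjust for your distro:

        # Flip the OVS agent from the legacy hybrid firewall to the native one.
        import configparser

        CONF = "/etc/neutron/plugins/ml2/openvswitch_agent.ini"  # usual default

        cfg = configparser.ConfigParser()
        cfg.read(CONF)
        if not cfg.has_section("securitygroup"):
            cfg.add_section("securitygroup")
        # "openvswitch" = native flow-based firewall; the legacy value is
        # "iptables_hybrid" (veth pairs + Linux bridges + iptables).
        cfg.set("securitygroup", "firewall_driver", "openvswitch")
        with open(CONF, "w") as f:
            cfg.write(f)
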
      And I still stand by my comment regarding VMware (ESX) enterprise versus private-cloudy workloads :-) I'd rather scale out my cloud/deep learning/big data workloads over e.g. N racks with say 30-40 compute nodes per rack. Even with the current abominable OVS performance that could mean up to 40Gbps of throughput per rack with a single ToR, or up to double that if you're ECMP-ing over two ToRs (which is what we do). As Ceph is a valid and popular storage option, your design needs to include that traffic on the 10(/25?)GbE NIC(s) in your compute nodes as well (or sure, you could technically & financially emulate a SAN by dedicating network infrastructure to Ceph IP storage).
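
      Restating that arithmetic as a snippet (numbers from the paragraph above):

        # Per-rack OVS throughput at ~1 Gbps per node, single vs. dual ToR.
        nodes_per_rack = 40
        ovs_gbps_per_node = 1       # the "abominable" per-node figure
        per_rack = nodes_per_rack * ovs_gbps_per_node
        print(f"{per_rack} Gbps per rack with one ToR, "
              f"{per_rack * 2} Gbps when ECMP-ing over two ToRs")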