Designing Active-Active and Disaster Recovery Data Centers

A year ago I was a firm believer in the unlimited powers of Software-Defined Data Centers and their ability to simplify workload migrations. After all, if you can use an API to create any data center object, what’s stopping you from moving a workload running in one data center to another location?

As always, there’s a huge difference between theory and reality.

Reality Distortion Field Has Failed

Being a slightly skeptical eternal optimist, I created a workshop description for Interop Las Vegas 2015 that still sounded pretty positive and mentioned SDDC as a potential solution.

In December 2014, the reality hit… hard. I was running a workshop for a global organization that was sold on a simple idea: using SDDC (from the vendor that created the acronym) it’s easy to pick up your toys (= application workload), pack them in a large bag, walk away to a different sandbox (= public cloud), drop them out of the bag and continue playing.

During the workshop we identified numerous obstacles and missing orchestration components, and concluded that it was simply impossible to achieve what they had planned. The best they could do at that time was to manually recreate the network infrastructure (= subnets) and services (= firewalls and load balancers) in a second virtualized environment (disaster recovery data center or public cloud), and afterwards restart the VMs from the failed data center in that environment.
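
To make the gap concrete, here’s what that manual runbook looks like when expressed as an orchestration skeleton. This is a hedged sketch, not anyone’s actual tooling: the RecoverySite class and the plan format are invented for illustration, and a real implementation would call the SDK of the target virtualization platform or cloud. The point is the ordering: subnets first, then firewalls and load balancers, then VMs.

```python
# Rough orchestration skeleton for the "recreate, then restart" approach
# described above. RecoverySite is an invented stand-in for a real
# virtualization/cloud SDK; only the ordering of the steps is the point:
# network infrastructure first, then services, then workloads.

recovery_plan = {
    "subnets": ["10.1.1.0/24", "10.1.2.0/24"],
    "firewall_rules": [("10.1.2.0/24", "10.1.1.0/24", "tcp/3306")],
    "load_balancers": [("web-vip", ["10.1.1.11", "10.1.1.12"])],
    "vms": ["web-01", "web-02", "db-01"],
}

class RecoverySite:
    """Stub for the recovery environment (DR data center or public cloud)."""

    def create_subnet(self, cidr):
        print(f"create subnet {cidr}")

    def create_firewall_rule(self, src, dst, port):
        print(f"permit {src} -> {dst} on {port}")

    def create_load_balancer(self, name, pool):
        print(f"create load balancer {name} with pool {pool}")

    def restart_vm(self, name):
        print(f"restart {name} from replicated storage")

def recover(site, plan):
    # Subnets must exist before the services that reference them ...
    for cidr in plan["subnets"]:
        site.create_subnet(cidr)
    for src, dst, port in plan["firewall_rules"]:
        site.create_firewall_rule(src, dst, port)
    for name, pool in plan["load_balancers"]:
        site.create_load_balancer(name, pool)
    # ... and only then can the failed data center's VMs be restarted.
    for name in plan["vms"]:
        site.restart_vm(name)

recover(RecoverySite(), recovery_plan)
```

Even this toy version shows why the approach is a last resort: every object has to be described somewhere, and if your inventory of subnets, rules, and VIPs lives in people’s heads, there’s nothing for the orchestration to consume.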

The only approach that would do what my customer wanted at that time was automated application deployment using tools like Cloudify, but that solution was further from their grasp than Alpha Centauri – they were a traditional enterprise IT shop with manual, non-repeatable server creation and application deployment processes.

After three days we had to conclude that there was nothing SDDC could do to solve their immediate workload migration problems, and that they should focus on automating their application development and deployment processes (yeah, I know I sound like Captain Obvious).

Combining NSX-T and SRM might be a step in the right direction, but I never read the documentation to find out about the potential “minor” details.

Adjusting to Reality

Based on that traumatic experience, I decided to refocus my Interop presentation on what works in real life today, and not surprisingly, the best answer is “proper application architecture”.

Anyway, the Interop workshop documented numerous challenges you might encounter on your journey (including finite bandwidth, non-zero latency, unpredictable failures, bad application architectures, and vendors promoting obviously-stupid things), and was a fantastic experience for the attendees, even though it was scheduled just before the evening party and I ran way overtime.

An updated version of that workshop is now available as a webinar with a more appropriate title: Designing Active-Active and Disaster Recovery Data Centers.

12 comments:

  1. "In December 2015, the reality hit…hard"? Welcome back to the present, what is the future like? not so bright for SDDC I take it.
    Replies
    1. Well, obviously I don't own a DeLorean yet, but unfortunately I don't expect the results to be much different in December 2015.

      Fixed, thank you!
  2. Non-zero latency? Theory vs. reality, or just stating the obvious?
    Replies
    1. You wouldn't believe how non-obvious that is to most people.
    2. Unfortunately I do. I've seen and worked with many of those people.
    3. Ivan, do you think there's a way to talk or write about glaring but nonetheless common technical misconceptions (like non-zero latency) in a way that's both not condescending, and will reach the right people?

      I've noticed that a lot of blog posts I've done targeted at beginners don't get a lot of views (like https://jayswan.github.io/2013/10/16/java-is-to-javascript-as-car-is-to/).

      Do you think this is because blog audiences aren't the ones who need those topics? I've noticed that your posts have gotten steadily more advanced over the years, and wondered if that was part of the reason.
    4. I just wanted to add that I don't think anything in Ivan's post is condescending -- I'm asking a more generalized question about how to help educate the IT population as a whole about common technical misunderstandings.
  3. Will these webinar sessions be recorded? I will be able to attend only one session live.
    Replies
    1. All live webinars are recorded, and the recordings (in the form of downloadable MP4 videos) are made available within 48 hours of the live session.
  4. Here's a short video demo on a related topic: Overlay network design and route optimization for multi-DC applications in the context of vMotion. https://www.youtube.com/watch?v=CFBm3EFFdCY. Ivan: Any thoughts on the principle of associating subnets with particular DC sites, and only announcing /32 host routes for VMs that are "away from home"?
    Replies
    1. Obviously it works... as long as someone is willing to accept /32s from the data center. The fundamental question remains, though: WHY do we need to move VMs between data centers, and WHY does someone claim that broken applications deserve this level of complexity and perpetual technical debt?
    2. I agree, and in particular WHY web front-end VMs, which are supposedly stateless. Still, I believe it is worthwhile to have a solution in place that can support vMotion between DCs when needed, with optimized routing even for individual VMs (see the sketch below). But just because you can doesn't mean you should.
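
For readers wondering what “only announcing /32 host routes for VMs away from home” means in practice, here is a minimal sketch of the decision logic. The topology and inventory data are made up for illustration (this is not the design from the video): every subnet stays anchored to a home site, which keeps advertising the aggregate prefix, and a site generates /32 host routes only for the VMs it currently hosts that belong to a subnet homed elsewhere.

```python
# Minimal sketch of the "/32 only when away from home" logic from the
# thread above. The data and names are invented; a real deployment would
# learn VM locations from the virtualization platform and inject the
# resulting host routes into BGP.
from ipaddress import ip_address, ip_network

# Each subnet is anchored to a home site that advertises the aggregate prefix.
subnet_home = {
    ip_network("10.1.1.0/24"): "DC1",
    ip_network("10.2.2.0/24"): "DC2",
}

# Where each VM is currently running.
vm_location = {
    ip_address("10.1.1.10"): "DC1",  # at home: covered by the /24, no /32 needed
    ip_address("10.1.1.20"): "DC2",  # moved away: needs a host route
}

def host_routes(site):
    """Return the /32 routes a site should advertise: host routes for
    locally-running VMs whose subnet is homed at another site."""
    routes = []
    for vm, location in vm_location.items():
        if location != site:
            continue
        home_subnet = next(s for s in subnet_home if vm in s)
        if subnet_home[home_subnet] != site:
            routes.append(ip_network(f"{vm}/32"))
    return routes

print(host_routes("DC1"))  # [] -- every VM in DC1 is at home
print(host_routes("DC2"))  # [IPv4Network('10.1.1.20/32')]
```

Note that the number of host routes grows with the number of “away from home” VMs, which is exactly the complexity-versus-benefit tradeoff discussed in the replies above.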