Long-Distance vMotion, Stretched HA Clusters and Business Needs

During a recent vMotion-over-VXLAN discussion Chris Saunders made a very good point: “Folks should be asking a better question, like: can I use VXLAN and vMotion together to meet my business requirements?”

Yeah, it’s always worth exploring the actual business needs.

Based on a true story ...

A while ago I was sitting in a roomful of extremely intelligent engineers working for a large data center company. Unfortunately they had been listening to the wrong group of virtualization consultants and ended up with the picture-perfect disaster-in-waiting: two data centers bridged together to support a stretched VMware HA cluster.

Actually, the disaster was no longer “in-waiting”: they had already experienced a perfect bridging storm that took down both data centers.

During the discussion I tried not to be the prejudiced grumpy L3 guru I’m known to be (at least in vendor marketing circles) and focused on figuring out the actual business needs that had triggered that oh-so-popular design.

Q: “So what loads would you typically vMotion between the data centers?”

A: “We don’t use long-distance vMotion, that wouldn’t work well… the VM would have to access the disk data residing on the LUN in the other data center”

Before you write a comment telling me how you could use storage vMotion to move the data and vMotion to move the VM, do me a favor and do some math.
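If you want to do that math yourself, here’s a back-of-the-envelope sketch; every number in it is an assumption, so plug in your own:

```python
# Back-of-the-envelope: how long would storage vMotion of a single LUN
# take over a dedicated inter-DC link? All figures are assumptions.
lun_size_gb = 2_000     # assumed 2 TB LUN
link_gbps   = 10        # assumed dedicated 10GE DCI link
efficiency  = 0.7       # assumed effective throughput after TCP/replication overhead

hours = lun_size_gb * 8 / (link_gbps * efficiency) / 3600
print(f"~{hours:.1f} hours per LUN")   # ~0.6 hours under these rosy assumptions
```

Multiply that by the number of LUNs you’d have to evacuate, and keep in mind that until the copy completes the VM keeps accessing its LUN in the other data center across the DCI link.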

Q: “So why do you have a stretched HA cluster?”

A: “It’s purely for disaster recovery – when one data center fails, the VMs of the customers (or applications) that pay for disaster recovery get restarted in the second data center.”

Q: “And how do you prevent HA or DRS from accidentally moving a VM to the other data center?”

A: “Let’s see…”

At this point you’d usually get one of these answers: (A) we use affinity rules… and hope nobody has a fat-finger day or (B) we have the hosts in the second data center in maintenance mode.
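For the record, here’s roughly what answer (A) looks like in practice: a minimal pyVmomi sketch (all names and addresses are hypothetical) that creates a mandatory VM-to-host affinity rule pinning the protected VMs to the DC-A hosts, so that neither DRS nor a fat-finger day moves them across the DCI:

```python
# Hypothetical sketch: pin VMs to DC-A hosts with a mandatory DRS
# VM-Host affinity rule. All names and addresses are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='***', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the stretched cluster (the name is an assumption)
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'StretchedCluster')

dc_a_hosts = [h for h in cluster.host if h.name.startswith('esx-dca')]
dc_a_vms = list(cluster.resourcePool.vm)   # simplification: pin every VM in the root pool

spec = vim.cluster.ConfigSpecEx(
    groupSpec=[
        vim.cluster.GroupSpec(operation='add',
            info=vim.cluster.HostGroup(name='dc-a-hosts', host=dc_a_hosts)),
        vim.cluster.GroupSpec(operation='add',
            info=vim.cluster.VmGroup(name='dc-a-vms', vm=dc_a_vms)),
    ],
    rulesSpec=[vim.cluster.RuleSpec(operation='add',
        info=vim.cluster.VmHostRuleInfo(
            name='vms-must-stay-in-dc-a', enabled=True, mandatory=True,
            vmGroupName='dc-a-vms', affineHostGroupName='dc-a-hosts'))])

cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```

Note that HA respects mandatory (“must run”) rules, so with this in place the VMs would not be restarted in DC-B without operator intervention anyway, which rather undermines the stated reason for the stretched cluster.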

Q: “OK, so when the first data center fails, everything gets automatically restarted in the second data center. Was that your goal?”

A: (from the storage admin) “Actually, I have to change the primary disk array first – I wouldn’t want a split-brain situation to destroy all the data, so the disk array failover is controlled by the operator”

Q: “Let’s see – you have a network infrastructure that exposes you to significant risk, and all you actually want is the ability to restart the VMs of some customers in the other data center using an operator-triggered process?”

A: “Actually, yes”

Q: “And is the move back to the original data center automatic after it gets back online?”

A: “No, that would be too risky”

Q: “So the VMs in subnet X would never be active in both data centers at the same time?”

A: “No.”

Q: “And so it would be OK to move subnet X from DC-A to DC-B during the disaster recovery process?”

A: “Yeah, that could work…”

Q: “OK, one more question – how quickly do you have to perform the disaster recovery?”

A: “Well, we’d have to look into our contracts ...”

Q: “But what would the usual contractual time be?”

A: “Somewhere around four hours”

Q: “Let’s summarize – you need a disaster recovery process that has to complete in four hours and is triggered manually. Why don’t you reconfigure the data center switches as part of that process and move the IP subnets from the failed data center to the backup one? After all, you have switches from vendor (C|J|A|D|...) that could be reconfigured from a DR script using NETCONF.”

A: (from the network admin) “Yeah, that’s probably a good idea.”
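Since I mentioned NETCONF: here’s a minimal sketch of what one step of such a DR script could look like using the Python ncclient library. The hostname, credentials, and XML payload are all made up; the real payload depends on your switch vendor’s data model, and the sketch assumes the device supports the candidate datastore.

```python
# Hypothetical DR step: activate subnet X's gateway in DC-B by pushing
# a configuration change over NETCONF. All names and the XML payload
# are placeholders; substitute your platform's actual YANG/XML model.
from ncclient import manager

ENABLE_SVI = """
<config>
  <interfaces xmlns="urn:example:interfaces">
    <interface>
      <name>Vlan100</name>          <!-- SVI for subnet X -->
      <enabled>true</enabled>
    </interface>
  </interfaces>
</config>
"""

with manager.connect(host='dcb-core-1.example.com', port=830,
                     username='dr-script', password='***',
                     hostkey_verify=False) as m:
    m.edit_config(target='candidate', config=ENABLE_SVI)
    m.commit()   # assumes the device supports the candidate datastore
```

With a four-hour, operator-triggered recovery window there’s plenty of time to run (and sanity-check) a script like this instead of keeping a permanent L2 DCI around.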

Does this picture look familiar (particularly the business consultant part)?

Source: www.projectcartoon.com



11 comments:

  1. This is a common problem in the world of IT today. People have little or no knowledge and trust people who merely think they have it. ;) With a well-implemented VMware SRM solution you could solve the problem very comfortably. SRM is even able to run scripts (e.g. to change a switch configuration) or wait for user interaction before continuing the disaster recovery process. You don't need L2 links and vMotion. Replicate the storage, build and test an SRM action plan. That's it... Why make life complicated? ;)
  2. SRM is nice, but... try to imagine a system that consists of a database server and an app server. This is standard. Now imagine that the DB server's IP address is hardcoded into the application source code. This is why sysadmins don't even want to think about changing IP addresses, even with SRM.
    In such a situation, I think the only way is not to focus on automatic switchover but on manual DR procedures (like Ivan described). The probability of a DC failure is low; such failures don't happen often (unlike single-device failures), so you can prepare clear procedures for manually switching to the second DC.
    Also, if you have to have L2 connectivity and can't convince the PM or sysadmins to use L3 or manual procedures, please read this document: http://www.juniper.net/us/en/local/pdf/implementation-guides/8010050-en.pdf (focus especially on Figure 7). For me it's the one and only way to implement L3 connectivity for a stretched L2 DC.
    To be clear: as a network engineer I prefer clean L3 connectivity :)
    Replies
    1. My VMware expert told me that changing the IP address of a VM is not necessary. The action plan can run scripts that prepare the infrastructure on the backup site, and it can wait for user interaction. I think it's a good choice, because you can optimize and speed up the DR process while keeping enough room for manual interaction to ensure a proper site move. A time window of four hours is comfortable, but if you run into trouble it may not be enough. That's why I recommend implementing as much automation as possible.
  3. Not that it matters to the point you're making, but can you comment on how they implemented L2 between the DCs? OTV, VPLS, EoMPLS, dark fiber, etc.?
    Replies
    1. Bridging over dark fiber, I think. Definitely not OTV.
  4. Awesome post, Ivan. This is something that could possibly be optimized with network virtualization and edge gateway nodes that speak routing protocols.
  5. How exactly is "Long Distance" defined?

    Say you have a primary data center and a secondary data center a mile apart with plenty of dark fiber between the two.

    It would be great if we could do a layer 3 interconnect, but is a layer 2 DCI still a bad idea here?

    When exactly does a layer 2 DCI stop being acceptable?
    Replies
    1. While distances do matter (bridging across the Atlantic clearly doesn't make sense), the most important fact remains that an L2 broadcast domain represents a single failure domain.

      If you think you need more than one failure domain (or availability zone), then you should be very careful with bridging (it usually makes sense to have more than one L2 domain within a single facility).

  6. Good post. Best to get the facts of what a customer requires and the facts of the apps and infrastructure available to them, and let those dictate the solution. We find that in a lot of cases a stretched HA infrastructure brings more complexity than the business requirements dictate, but the reverse is also true: attempting to meet some business requirements with SRM is more complex or outright incompatible (you cannot use SRM with vCD) and increases operational effort. In these cases it's best to lay out all the caveats and risks and, working with the customer, choose the solution that meets the needs with the least amount of risk. One size fits all does not exist.
  7. Excellent post. I'm facing these questions very often because I work for a vendor selling servers, storage, and networking. I'm a long-time VMware expert, but with networking and storage overlap. I'm always trying to explain the difference among local HA, geo-HA, and DR, and the concept of failure domains (availability zones), to my customers when delivering vSphere designs. And I can tell you it is hard, really hard. IT staff are siloed and influenced by the marketing teams of particular vendors. It is pretty difficult to explain all these cross-domain topics to them and disprove some product-marketing "simplifications". On top of that, they don't have the right requirements and SLAs from the business.

    I personally also prefer a good and fast enough DR over stretched metro clusters (geo-HA). You can test it and know deterministically what you will get in case of disaster. I believe it can cover 95% of business workload requirements.

    However, if there are real geo-HA requirements, it seems to me that it can be designed correctly nowadays. But you have to use the newest technologies and products (this is the first risk) and you have to carefully consider all possible failure scenarios. Even if you use Cisco OTV or a similar technology for an L2-over-L3 network overlay, even if you use geo-distributed storage like a pair of EMC VPLEX, NetApp MetroCluster, HP 3PAR stretched volumes, IBM SVC/Storwize, or Dell Compellent Live Volume, which appears as a single storage system distributed across data centers and ensures optimal DCI usage, and even if you have a third site acting as a split-brain arbiter that allows automatic failover of the geo-distributed storage nodes... the vSphere stretched HA cluster and the distributed storage will still be a single failure zone. It is not a bug; it is a feature, by design!!! If you need it and want it, go for it. If you prefer separate availability zones, don't do it. Period. Just my $0.02, and your mileage may vary.
  8. An easy method: GSLB (a.k.a. DNS) pointing to an SLB (HAProxy or Nginx), where the SLB routes to site-local hosts!

    It solves the entire issue of spanning sites over the WAN and syncing a massive dataset, as the only state that gets synced is the state of the SLB fronting the hosts on each site.
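A minimal sketch of what the commenter describes (all names, addresses, and keys are made up): health-check the primary site's SLB and, when it fails, repoint the service record at the secondary site's SLB with a dynamic DNS update (RFC 2136) using dnspython:

```python
# Hypothetical GSLB failover: if the SLB in the primary site stops
# answering, send a DNS UPDATE (RFC 2136) repointing the service
# record at the secondary site's SLB. Names, addresses, and the TSIG
# key below are placeholders.
import requests
import dns.update
import dns.query
import dns.tsigkeyring

PRIMARY_SLB_HEALTH = 'https://192.0.2.10/health'   # SLB in DC-A
SECONDARY_SLB_IP = '198.51.100.10'                 # SLB in DC-B
DNS_SERVER = '203.0.113.53'                        # authoritative server for the zone

def primary_is_healthy() -> bool:
    try:
        return requests.get(PRIMARY_SLB_HEALTH, timeout=3, verify=False).ok
    except requests.RequestException:
        return False

if not primary_is_healthy():
    keyring = dns.tsigkeyring.from_text({'gslb-key.': 'c2VjcmV0LWtleQ=='})
    update = dns.update.Update('example.com', keyring=keyring)
    update.replace('app', 60, 'A', SECONDARY_SLB_IP)   # short TTL for fast failover
    dns.query.tcp(update, DNS_SERVER)
```

The short TTL on the service record is what makes this workable: clients re-resolve within a minute or so of the failover instead of caching the dead site's address.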
