I’ve already written about the stupidities of risking the stability of two data centers to enable live migration of “mission critical” VMs between them. Now let’s take the discussion a step further – once the server or application team has told you how critical the VM they want to migrate is, you might be tempted to ask “and how do you ensure its high availability the rest of the time?” The response will likely be along the lines of “We’re using VMware High Availability” or, even more proudly, “We’re using VMware Fault Tolerance to ensure even a hardware failure can’t bring it down.”
I have some bad news for the true believers in virtualization-supported high availability – quite a few of them probably don’t understand how it works. VMware HA is a great solution, but the best it can do is restart a VM after it crashes or after the hypervisor host fails (and because it works at the VM level, it usually can’t detect a hung service). The VM has to go through a full power-up process, and all the services it runs have to complete whatever recovery procedures they need before the VM and its services are fully operational.
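To make the restart argument concrete, here is a minimal back-of-the-envelope sketch in Python. The phase names and all the numbers are my assumptions for illustration, not VMware figures; the only point is that the outage after an HA-triggered restart is the sum of detection, boot and service recovery times.

```python
from dataclasses import dataclass


@dataclass
class RestartTimeline:
    """Hypothetical phases of an HA-triggered VM restart, in seconds."""
    failure_detection: float  # time until the failure is noticed (assumed value)
    vm_boot: float            # full guest power-up (assumed value)
    service_recovery: float   # e.g. log replay or cache warm-up (assumed value)

    def outage_seconds(self) -> float:
        # The phases happen one after another, so the outage is their sum.
        return self.failure_detection + self.vm_boot + self.service_recovery


def availability(outage_seconds: float, failures_per_year: float) -> float:
    """Fraction of a year the service is up, given the outage per failure."""
    seconds_per_year = 365 * 24 * 3600
    downtime = outage_seconds * failures_per_year
    return 1.0 - downtime / seconds_per_year


# Illustrative numbers only: 30 s detection, 90 s boot, 180 s service recovery.
timeline = RestartTimeline(failure_detection=30, vm_boot=90, service_recovery=180)
print(timeline.outage_seconds())
print(availability(timeline.outage_seconds(), failures_per_year=4))
```

Plug in your own measurements; service recovery is often the largest term, and it is the one the hypervisor cannot shorten.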
VMware FT is an even more interesting case. It runs two parallel copies of the same VM (and ensures they’re continuously synchronized) – a perfect solution if you’re running a very lengthy procedure and don’t want a hardware failure to interrupt it. Unfortunately, software failures happen more often than hardware ones ... and if the VM crashes, both copies (running in sync) will crash simultaneously. Likewise, if the application service running in the VM crashes (or hangs), it will do so in both copies of the VM.
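The difference between hardware and software failures in a synchronized pair can be shown with a toy model. This is only a sketch under loud assumptions: `LockstepPair` and `handle` are hypothetical names, and the code does not model how VMware FT actually keeps the copies in sync. It only captures the consequence described above, namely that both copies see the same input and therefore hit the same bug.

```python
def handle(request: str) -> str:
    """Hypothetical service with a deterministic bug: one input crashes it."""
    if request == "malformed":
        raise RuntimeError("service crashed")
    return f"ok:{request}"


class LockstepPair:
    """Toy model of two synchronized VM copies executing identical input."""

    def __init__(self) -> None:
        self.primary_up = True
        self.secondary_up = True

    def hardware_failure(self) -> None:
        # Losing the primary's host leaves the secondary copy running.
        self.primary_up = False

    def process(self, request: str) -> str:
        results = []
        for copy in ("primary_up", "secondary_up"):
            if not getattr(self, copy):
                continue  # this copy is already down
            try:
                results.append(handle(request))
            except RuntimeError:
                # The same input triggers the same bug in every running copy.
                setattr(self, copy, False)
        if not results:
            raise RuntimeError("no surviving copy")
        return results[0]

    def available(self) -> bool:
        return self.primary_up or self.secondary_up
```

Calling `hardware_failure()` and then `process("hello")` still returns a result from the surviving copy, while `process("malformed")` on a healthy pair takes both copies down at once.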
Update 2011-08-09: As expected, an interesting Twitter discussion followed this blog post. Among other interesting remarks, Duncan (Yellow Bricks) Epping rightfully pointed out that the VMware HA/FT products function exactly as described. That’s absolutely true – VMware’s documentation is extremely precise in describing how HA and FT work.
You can read more about high availability fallacies in an article I wrote for SearchNetworking (the title is a bit misleading) ... and remember: scale-out application architecture combined with load balancers is still the only way to reach true high availability.
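As a rough illustration of why scale-out helps, the sketch below computes the probability that at least one of several instances behind a load balancer is up. It assumes the instances fail independently and that the load balancer itself never fails; both are simplifications, and the function name and the 99% figure are mine, not taken from any product documentation.

```python
def scale_out_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one of N independent instances is up."""
    if instances < 1:
        raise ValueError("need at least one instance")
    # All instances must be down at the same time for the service to be down.
    return 1.0 - (1.0 - instance_availability) ** instances


# Illustrative only: each instance is assumed to be up 99% of the time.
for n in (1, 2, 3):
    print(n, scale_out_availability(0.99, n))
```

The independence assumption is the important caveat: a bug that every instance hits on the same input defeats this arithmetic just as it defeats a synchronized VM pair.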
Even more information
You’ll find in-depth discussions of high-availability architectures, impacts of vMotion and various types of data center interconnects in my webinars: Data Center 3.0 for Networking Engineers (recording), Data Center Interconnects (recording) and VMware Networking Deep Dive (recording or live session). All three webinars are also available as part of the yearly subscription.