High Availability Fallacies
I’ve already written about the stupidity of risking the stability of two data centers to enable live migration of “mission critical” VMs between them. Now let’s take the discussion a step further – after hearing how critical the VM that the server or application team wants to migrate is, you might be tempted to ask “and how do you ensure its high availability the rest of the time?” The response will likely be along the lines of “We’re using VMware High Availability” or, even more proudly, “We’re using VMware Fault Tolerance to ensure even a hardware failure can’t bring it down.”
I have some bad news for the true believers in virtualization-supported high availability – quite a few of them probably don’t understand how it works. Let’s see what HA products can do … and keep in mind that hardware causes just a few percent of failures; most of them are caused by software failures or operator errors.
VMware High Availability (or any equivalent product) is a great solution, but the best it can do is restart a VM after it crashes or after the hypervisor host fails. Even assuming you can reliably detect a VM OS or application service (for example, database software) failure, the VM still needs to be restarted. VM-level high availability is thus dangerous, as it gives application developers and server administrators false hopes – they start to believe a magical product can bring high availability to any hodgepodge of enterprise spaghetti code. In reality, the VM has to go through its full power-up process, and all the services the VM runs have to perform whatever recovery procedures they need before the VM (and its services) are fully operational.
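The restart penalty is easy to model. A minimal sketch (the function name and the example numbers are illustrative assumptions, not measurements from any real deployment):

```python
def vm_restart_recovery_time(boot_seconds: float, service_recovery_seconds: float) -> float:
    """Total outage after an HA-triggered restart.

    VM-level HA can only power the VM back on; the outage lasts until
    the guest OS has booted AND every service has finished its own
    recovery (log replay, cache warm-up, ...). HA shortens neither.
    """
    return boot_seconds + service_recovery_seconds

# Hypothetical example: 90 s OS boot plus a 10-minute database crash recovery
outage = vm_restart_recovery_time(90, 600)
print(f"Outage after HA restart: {outage / 60:.1f} minutes")  # → 11.5 minutes
```

The point of the sketch: the dominant term is usually the application-level recovery, which no hypervisor feature can make shorter.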
VMware Fault Tolerance is an even more interesting case. It runs two parallel copies of the same VM (and ensures they're continuously synchronized) – a perfect solution if you’re running a very lengthy procedure and don’t want a hardware failure to interrupt it. Unfortunately, software failures happen more often than hardware ones ... and if the VM crashes, both copies (running in sync) will crash simultaneously. Likewise, if the application service running in the VM crashes (or hangs), it will do so in both copies of the VM.
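The correlated-crash behavior is easy to demonstrate with a toy model (all names here are hypothetical; real FT keeps the replicas in lockstep at the instruction level, but the effect on a deterministic software bug is the same):

```python
# Toy illustration of why lockstep fault tolerance can't mask software
# bugs: both replicas process the same input stream, so a deterministic
# crash hits both of them at exactly the same point.

def service(request: str) -> str:
    # hypothetical application bug: crashes on a particular input
    if request == "poison":
        raise RuntimeError("application bug")
    return f"ok: {request}"

def run_replicas(requests):
    """Feed the same requests to a primary and a synchronized secondary."""
    primary_log, secondary_log = [], []
    for r in requests:
        for log in (primary_log, secondary_log):
            try:
                log.append(service(r))
            except RuntimeError:
                log.append("CRASHED")
    return primary_log, secondary_log

p, s = run_replicas(["a", "poison"])
print(p == s)  # → True: both replicas fail identically
```

Redundant hardware protects you from independent failures; two copies of the same buggy software are not independent.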
High-availability clusters like Windows Server Failover Clustering restart a failed service (for example, the SQL server) on the same or on another server. The restart can take from a few seconds to a few minutes (or sometimes even longer if the database has to do extensive recovery). A nine of availability lost.
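The “nine lost” remark is simple arithmetic: every extra nine of availability shrinks your yearly downtime budget tenfold, so a few minutes of restart time per failure quickly burns through it. A quick back-of-the-envelope calculation:

```python
# Maximum tolerable downtime per year for a given number of "nines"
# of availability -- e.g. nines=4 means 99.99% availability.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def max_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for the given number of nines."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(f"{n} nines: {max_downtime_minutes(n):6.1f} minutes/year")
```

Three nines allow roughly 525 minutes of downtime per year, four nines about 53, five nines about 5 – so a single multi-minute database recovery can consume an entire year’s budget at the four-nines level.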
Bridging between data centers (the typical design recommended by VMware-focused consultants) might cause long-distance forwarding loops, or you might see the flood of traffic caused by a forwarding loop spilled over the WAN link into the other data center, killing all other inter-DC traffic (including cluster heartbeats if you’re brave enough to use long-distance clusters, and storage replication).
Want a data point? We experienced a forwarding loop caused by an intra-site STP failure. Recovery time: close to 30 minutes, with the NMS noticing the problem immediately and an operator available on site. Admittedly, some of that time was spent collecting evidence for post-mortem analysis.
Are you really willing to risk your whole IT infrastructure to support an application that was never designed to run on more than one instance? After all, one would hope your server admins do patch the servers … and patches do require an occasional restart, don’t they?
Moral of the story: the “magic” products give you a false sense of security; good application architecture and the use of truly highly available products (like a MySQL database cluster) combined with load balancing technologies are the only robust solution to the high availability challenge.
Even More Information
You’ll find in-depth discussions of high-availability architectures in the Designing Active-Active and Disaster Recovery Data Centers webinar.
Want to dive deep into the underlying infrastructure technologies? Watch the Data Center 3.0 for Networking Engineers, Data Center Interconnects and vSphere 6 Networking Deep Dive webinars.
The problem is providing an HA solution for anything that has to do with persistent local data. This may include the database in a (relatively) modern 3-tier app, but it also includes more traditional enterprise applications (Exchange being an example).
It is not even worth discussing how to provide resiliency for the front end. It's a solved problem. Focus your energies on the back-end.
However, both SQL Server and MySQL offer a redundant server configuration, where the second server can take over immediately when the first one fails. High-end MySQL offers an even better distributed solution. So the problems can be solved ... but it's easier to offload them to someone else and believe in unicorn tears.
Having said this, there is clearly a trend toward making this back-end more "scale-out" friendly ... but there is a long way to go.
My 2 cents.
SQL Server provides database mirroring (which can be synchronous if you want to retain total consistency).
And we (yet again) agree that the backend has a long way to go ;)
In most enterprise organizations I have been in, at least 80% of the applications that are essential to line-of-business day-to-day operations don't support this kind of setup. This is one of the reasons HA is so widely adopted today. On top of that, there is a substantial cost associated with load balancers and a shared database configuration (yes, it needs to be clustered / distributed as well), which might be more than the SLA requires. In those cases vSphere HA / FT / VM and App Monitoring are the way to go. Five clicks and it is configured, no need for special skills to enable it... just point and click.
Once again, I agree that using a vFabric load-balanced setup (shameless plug :)) would be ideal, but there are far too many legacy apps out there. Even in the largest enterprise orgs the IT department cannot control this; even the line-of-business cannot control it... the main reason being suppliers not taking the time to invest.
Go vSphere HA
Unfortunately, critical doesn't equal a current or mature application architecture.
We could go for NetScaler's traditional cluster setup, but that would require us to buy 2x licenses. With our existing FT license we get just as much reliability at no extra cost.
If the software inside that VM were to die, we would be in exactly the same situation as if we were running it on a dedicated box.