Repost: VMware Fault Tolerance Woes
I always claimed that VMware Fault Tolerance makes no sense. After all, the only thing it does is protect a VM against a server hardware failure… in a world where software crashes are way more common and fat fingers cause most of the outages.
But wait, it gets worse: the whole thing is incredibly complex – you might like this description Minh Ha left as a comment to my Fifty Shades of High Availability blog post.
Ivan, in one of the linked posts you mentioned VMware FT. FT is a high-end feature of VMware’s HA portfolio, and it also happens to be a resource hog. Because it needs to log all input and non-deterministic events on the primary VM, send them over to the backup VM and replay them there, things that normally bypass the hypervisor, like Rx/Tx, now have to go through it for logging purposes and get delayed until an acknowledgement is received, so I/O-bound workloads incur a big performance hit. And multicore VMs rub salt into the wound, because the order in which CPUs access shared memory needs to be tracked and retained to preserve semantics and correctness. Basically, the slowdown grows superlinearly with the number of CPU cores. And that’s why, even though they claim here – at question 18 – that FT doesn’t cause degradation, their corresponding white paper shows that the slowdown is indeed superlinear.
And that’s just with synthetic workloads. And yes, I/O-bound workloads – Rx more so than Tx due to the different ways FT deals with each of them – suffer non-trivial degradation. Some of their customers also reported similar issues with I/O.
Essentially, it looks like HA solutions normally come with performance trade-offs, sometimes considerable ones, and they always cost a hell of a lot more.
Also, I remember that earlier this year you were blogging about some guy demonstrating a lossless vMotion failover. Frankly, what does it prove? The vMotion process is inherently lossy due to the repeated iterations of memory copy and the final freezing of the VM, especially for memory-intensive workloads. That you can successfully execute a lossless migration just means you’re lucky, thanks to probabilities, or have a workload that doesn’t stress vMotion to its limit, or both. Or, to put it in a semi-formal way: just because you manage to achieve result-level correctness doesn’t mean you have process-level correctness :p. Think gambling. That’s one classic example of (sometimes) great result, horrible process.
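To make the "repeated iterations of memory copy" point a bit more concrete, here is a toy model of iterative pre-copy migration. It is only a sketch: the function name, the parameters, and the constant-dirty-rate assumption are all made up for illustration and do not describe VMware's actual algorithm. Each round copies whatever is still dirty; while that copy runs, the workload dirties more memory, and the final freeze has to transfer whatever is left.

```python
def precopy_downtime(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                     stop_threshold_mb, max_rounds=30):
    """Toy pre-copy model (hypothetical, not VMware's algorithm).

    Returns (rounds, downtime_seconds), where downtime is the time needed
    to copy the memory still dirty when the VM is finally frozen.
    """
    remaining = float(memory_mb)  # first round copies all memory
    rounds = 0
    while remaining > stop_threshold_mb and rounds < max_rounds:
        copy_time = remaining / bandwidth_mb_s
        # Memory dirtied while this round was being copied; it can never
        # exceed the total memory of the VM.
        remaining = min(float(memory_mb), dirty_rate_mb_s * copy_time)
        rounds += 1
    return rounds, remaining / bandwidth_mb_s


# A mostly idle VM converges quickly...
print(precopy_downtime(8192, 1000, 100, 100))
# ...while a VM dirtying memory as fast as the link can copy never does.
print(precopy_downtime(8192, 1000, 1000, 100))
```

In this model the freeze window stays short only while the dirty rate is well below the copy bandwidth, which is the "workload that doesn’t stress it to its limit" case.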
As to end-to-end HA, I agree 100% with you that the right way to do it is via the applications, as it follows the same principle of complexity belonging at the edge and simplicity at the core, smart edge dumb core :)) . DNS exemplifies that application-level-HA paradigm. It’s simple and rock solid.
Another great example worth mentioning is good ol’ Active Directory. MS did get it right with their DS. AD is a distributed DB application, and a multi-master replication one at that. Given that it was designed with this specific model in mind from day one, and that it came to be more than 20 yrs ago, before this scale-out movement was even a thing, one has to give it to MS on this one.
AD is among the most mission-critical parts of just about any company’s infrastructure, and it’s self-contained: it doesn’t need any overpriced and overrated HA device to look after its heartbeat. AD’s HA is completely built into its mechanics, with its eventually-consistent DB. Within a site, replication is super quick, and inter-site replication is done using a distance-vector paradigm to ensure high scalability. On a side note, MS started inter-site replication in AD with their proprietary algorithm, most likely a link-state one because MS Exchange at that time used LS to route emails between its servers. That one made AD fall apart at 250 sites or so in Windows 2000, so MS gave up on it and went with a simpler BGP-like replication model between sites.
In AD, if any server goes down, the client just locates another one using DNS SRV records, first within the site and then globally if all servers within the site fail. It’s scaled-out, simple, and effective, and it works so well that people don’t bother talking about AD these days, and actually haven’t done so in a long time.
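The site-first, then-global lookup can be sketched roughly like this. It's a minimal illustration, not the real Windows DC locator: the `resolve` and `is_alive` callables are hypothetical stand-ins for a DNS SRV query and a reachability probe, the record names follow the commonly documented `_msdcs` naming, and the ordering ignores the weighted random selection a real SRV client would do.

```python
from typing import Callable, List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def srv_names(domain: str, site: str) -> List[str]:
    """SRV names to try, most specific (own site) first."""
    return [
        f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}",  # DCs in the client's site
        f"_ldap._tcp.dc._msdcs.{domain}",                # any DC in the domain
    ]


def locate_dc(domain: str, site: str,
              resolve: Callable[[str], List[SrvRecord]],
              is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable DC, preferring the local site.

    `resolve` and `is_alive` are hypothetical hooks supplied by the caller.
    """
    for name in srv_names(domain, site):
        # Simplification: lowest priority first, then highest weight.
        records = sorted(resolve(name), key=lambda r: (r.priority, -r.weight))
        for record in records:
            if is_alive(record.target):
                return record.target
    return None
```

All the failover logic lives in the client and in DNS; no server needs to watch any other server's heartbeat.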
Ivan,
All you said is true. Where it makes sense is software with a strict licensing policy, where you pay per installed server. That's not so common today, but there are still use cases.
Hi Ivan,
Thx a lot for the repost! While VMware FT indeed has its use in cases where hardware failure is a concern, we need to make clear what kind of hardware protection it can provide. FT is meant to provide protection against fail-stop CPU failures, that is, cases where the CPU failure hasn't managed to cause externally visible damage. If the failing CPU has managed to spread its damage to output events, causing say session corruption, FT provides no protection against it.
So yes, even wrt hardware, it's always good to know the boundary. This is not nit-picking on the tool, as after all, no man-made product is flawless -- if someone claims otherwise, they're SELLING -- it's simply to provide a clear view of its capability.
Re vMotion, the final pause period, when the machine is frozen and control is transferred to the migrated machine, is meant to stay under 1s. Let's say it's 300ms, which is quite good. If the server has a 10GE interface and hosts an I/O-bound workload, that would potentially translate to more than 300MB of data lost. So yes, the process is inherently lossy.
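As a back-of-the-envelope check of that figure, here is the arithmetic spelled out. It assumes the link runs at full line rate for the whole pause and ignores framing overhead; the function name is just for illustration.

```python
def megabytes_during_pause(link_gbps: float, pause_ms: float) -> float:
    """Data (in MB, 10^6 bytes) a link at full line rate carries during a pause."""
    bits = link_gbps * 1e9 * pause_ms / 1000  # bits transferred during the pause
    return bits / 8 / 1e6                     # bits -> bytes -> megabytes


# 10GE link, 300 ms freeze: 10e9 bit/s * 0.3 s = 3e9 bits = 375 MB
print(megabytes_during_pause(10, 300))
```

So "more than 300MB" is the worst case for a saturated 10GE link; a less busy interface would be exposed to proportionally less.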
Speaking of data center technologies like FT and vMotion, this also reminds me of another issue -- I'm digressing as always :p. Some people have suggested using RDMA to implement checkpoint-based alternatives to VMware FT, to overcome the multi-core obstacle of the deterministic-replay approach by taking advantage of RDMA's high-performance transfer capability.
Even without this use case, RDMA/RoCE is already widely used in many DCs these days, esp. big ones like those of hyperscalers, and the use of this technology requires PFC to provide a lossless network similar to that of InfiniBand. But PFC is well known for HOL blocking and can also cause cyclic buffer dependency in the fabric, resulting in deadlock. It's not even a theoretical debate as it's already happened in MS DCs.
One easy and scalable way to guard against deadlock in a PFC-enabled network is to ensure strict adherence to valley-free routing (VFR). A VFR violation can create a cyclic buffer dependency, resulting in potential deadlock. So people who think VFR is just for show might want to reconsider their position :)).
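For a leaf-and-spine style fabric, checking whether a path is valley-free is simple enough to sketch. The representation is an assumption made for illustration: each switch on the path is described by its tier number (0 for leaf, 1 for spine, 2 for super-spine, and so on), and a path is valley-free if it never goes back up after it has started going down.

```python
from typing import Sequence


def is_valley_free(tiers: Sequence[int]) -> bool:
    """True if the path climbs, then descends, and never climbs again.

    `tiers` lists the (assumed) tier number of every switch on the path.
    """
    going_down = False
    for prev, cur in zip(tiers, tiers[1:]):
        if cur < prev:
            going_down = True       # the path has started descending
        elif cur > prev and going_down:
            return False            # up after down: that's a valley
    return True


print(is_valley_free([0, 1, 2, 1, 0]))  # leaf -> spine -> super-spine -> spine -> leaf
print(is_valley_free([0, 1, 0, 1, 0]))  # bounces off an intermediate leaf
```

The second path is the kind that typically shows up after link failures or misconfigured route leaking, when traffic bounces through a leaf on its way to another spine.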