Fifty Shades of High Availability

A while ago we had an interesting exchange of ideas around inserting high-availability network appliance into a public cloud environment (TL&DR: it was really hard until AWS introduced Gateway Load Balancing), and someone quickly pointed out we’re solving the wrong challenge because…

Azure Firewall […] is a fully stateful firewall-as-a-service with built-in high-availability.

Somehow he wasn’t too happy when I pointed out that there’s more to high availability than vendor marketing ;)

WARNING: Before reading the rest of this blog post, try answering this simple question:

What do I really need, and what is good enough?

Still here? Are you sure your application is so important it needs special tricks to increase its availability?

OK, let’s start with the oldest marketing trick: high availability means restarting a VM when it fails (aka VMware High Availability). While it’s true that the service will be eventually restored, and it’s better than running a single-instance server with no safeguards, keep these minor details in mind:

  • The VM is restarted when some agent discovers it’s time to restart it, which is perfectly fine for handling hardware failures, but not exactly stellar when dealing with software failures or degraded performance. Your service might be gone way before that agent gets nervous.
  • A restarted VM will reboot. That takes time (more so if you’re still using a Windows server).
  • The disks might need file system check. That takes some more time.
  • If you’re running a service that uses an internal database, that database needs to be recovered. Add some more waiting time.

So yeah, after a while the service will be back. In the meantime, the unhappy customers will be twiddling their thumbs.

Regardless of how unimportant my application is, I would always use this approach when available. After all, it can’t hurt to have your service restarted after a failure… but please don’t call a single VM instance that is rebooted when it fails a high availability solution. It’s 2020, and what you’re doing is nothing more than basic hygiene.

Moving forward with beloved virtualization solutions, I will not waste any time on crazy schemes involving moving running server instances around the globe. They work best in PowerPoint and rarely in practice (unless you need an audit tick-in-the-box).

High-availability server clusters are just a tad better. Instead of an operating system restart, a service is restarted when the primary server or service fails. The clients won’t have to wait for the OS reboot, but might still wait for service recovery.

OK, so we’ll run multiple server instances, put a load balancer in front of them, and restart failed instances as needed (this is probably how Azure Firewall works). Much better, but still far from perfect:

  • All sessions going to the failed server will be lost on server failure. Not a big deal in a typical web application (although in most cases the end-user will get a garbled page and will reload in frustration), but it might be a significant annoyance when a 10GB upload fails after 15 minutes.
Web browsers typically don’t retry requests that fail, so it’s up to the end-user to get annoyed and press the RELOAD button.
  • If your application team asks you to implement any sort of session stickiness, it’s usually an indication that the application stores the client session state within individual server instances (as opposed to back-end database or key-value store). Not only will clients get garbled web pages on server failure, they will also lose session state, which might include the contents of their shopping cart. Awesome.

But wait, there’s something much much better (or so they claim): clusters sharing session state across all nodes. It looks picture-perfect in PowerPoint – whatever happens, someone will be able to keep going without losing anything (including client TCP sessions) – until you realize that tight coupling of devices that should provide failure resiliency probably isn’t the greatest idea out there. After all, if one node goes totally bonkers, it might bring down all other nodes (just ask anyone who had to deal with switch stack challenges).

There’s also the minor detail of nodes wasting increasingly more time synchronizing session state if you want to grow the cluster size to deal with increased load.

It’s obviously much better if you can implement high availability in the application code instead of relying on generic infrastructure high availability.

Example: When using active/standby database instances, the database clients can be made to switch from active to standby instance on failure, giving you much better results than any generic load balancing concoction:

  • There is no data loss if you use 2-phase commit (but then you tightly coupled the database instances into a single failure domain);
  • You might lose some data with log shipping but the database instances are decoupled, which means that the primary instance keeps working even if the standby instance fails.
  • Failover to standby instance is almost instantaneous as opposed to load balancer health probes. When a database client detects connection error, it automatically switches over to the backup instance.
  • You could even use the standby instance for read-only database access, increasing the performance of applications that mostly read data like most e-commerce applications; in some environment 99% of the transactions are read-only queries.

Finally, keep in mind that modern application stack contain more than one service, and that a single rotten apple anywhere in the stack can bring down the performance of the whole stack (see How Not to Measure Latency). The only way to work around this problem is to have independent copies of the application stack (swimlanes) to keep the performance challenges within a single lane which can then be shut down if needed.

Sample swimlane design

Sample swimlane design

However, even within a perfectly designed system with multiple swimlanes you might get a total system failure if you’re using synchronous data exchange between database instances (example: 2-phase commit). The only way to make swimlanes completely independent from one another is to have eventual consistency between data stores… which means that the application code has to be able to deal with that.

More to Explore

Latest blog posts in Distributed Systems series


  1. End-to-end high-availability is always the best, I fully agree. That is why in safety critical networks we have ED-137 linked session, or the 4-times-simulcasted ARTAS radar streams.

    What is funny that even in-chassis redundancy in network devices still does not work well after three decades of development. We have daily problems that this feature or that feature is broken when there is an automated failover from one processor to the other. The background synchronization is difficult and they did no want to create a full transaction journalling, since that is expensive.

    If you have to implement high-availability in the infrastructure, then it is a sign that the applications are not designed very well for this purpose. Keep in mind that your infrastructure HA solution is just a compensation control for the weaknesses of the application.

  2. How to solve then the TCP session state problem on load balancer or server if they go down so that the end-user doesn't have to press the reload button? (Ignoring the question for a moment if it's worth solving this problem or not.)

  3. @Anonymous: Nobody has managed to migrate an existing TCP session to another server (at least I'm not aware of that). It's not just the state of various TCP counters and windows, the second server would have to know what the first server already did with the request(s).

    The only solution is to have an application-level library that understands that TCP is not a 100% reliable transport protocol and acts accordingly.

    @Bela: I love the "infrastructure HA is just a compensation control for application weaknesses". Thanks a million!

  4. In one of the linked posts you mentioned VMware FT Ivan. FT is a high-end feature of VMware's HA portfolios, and it also happens to be a resource hog. Due to the way it needs to log all input and non-deterministic events on the primary VM, send it over to and replay it on the backup VM, things that normally bypass the hypervisor like Rx/Tx now have to go through it for logging purpose and, get delayed until acknowledgement is received, so I/O-bound workloads will incur big performance hit. And multicore VM rubs salt to the wound, because the order of CPUs accessing shared memory needs to be tracked and retained for semantics preservation/correctness purpose. Basically the slowdown is superlinear with the increase in CPU cores. And that's why even though they claim here -- at question 18 -- that FT doesn't cause degradation, looking at their corresponding white paper, the slowdown is indeed superlinear:

    And that's just with synthetic workloads. And yes, I/O-bound workloads -- Rx more so than Tx due to the different ways FT deals with each of them -- suffer non-trivial downgrade. Some of their customers also reported similar issues with I/O:

    Essentially, looks like HA solutions normally come with performance trade-offs, sometimes considerable ones, and they always cost a hell lot more.

    Also, I remember earlier this year, you were blogging about some guy demonstrating a lossless Vmotion failover. Frankly, what does it prove anything? The Vmotion process is inherently lossy, due to the repeated iterations of memory copy and the freezing of the VM, esp. for memory-intensive workload, and that you can successfully execute a lossless migration, just means you're lucky, thanks to probabilities, or have a workload that doesn't stress Vmotion capability to its limit, or both. Or put it in a semi-formal way, just because you manage to achieve result-level correctness, doesn't mean you have process-level correctness :p. Think gambling. That's one classic example of (sometimes) great result, horrible process.

    As to end-to-end HA, I agree 100% with you that the right way to do it is via the applications, as it goes along the same line of complexity belonging at the edge and simplicity at the core, smart edge dumb core :)) . DNS exemplifies that application-level-HA paradigm. It's simple and rock solid.

    Another great example worth mentioning, is good ol Active Directory. MS did get it right with their DS. AD is a distributed DB application, and a multi-master replication one at that. Given it was designed with this specific model in mind from day one and it came to be more than 20 yrs ago, before this scale-out movement was even a thing, one has to give it to MS on this one.

    AD is among the most mission-critical part of just about any company's infrastructure, and it's whole by itself, doesn't need any overpriced and overrated HA device to look after its heartbeat. AD's HA is completely built into its mechanics, with its eventually-consistent DB. Within a site, replication is super quick, and inter-site replication is done using distance-vector paradigm to ensure high scalability. On a side note, MS started Intersite replication in AD with their proprietary algorithm, most likely a link-state one because MS Exchange at that time used LS to route emails between its servers. That one made AD fall apart at 250 sites or so in windows 2000, so MS gave up on it and went with a simpler BGP-like replication model between sites.

    in AD, if any server goes down, the client just locates another one using DNS SRV record, first within site and then globally if all servers within a site fail. It's scaled-out, simple, and effective, and it works so well people don't bothers talking about AD these days anymore, and actually haven't done so in a long time.

  5. Also, pls keep these jam-packed, heavyweight posts coming Ivan!! They're deep and heavy, yet highly enjoyable at the same time :)).

  6. Hi Ivan. At least in Azure VM can subscribe for about upcoming maintenance events:

    Then it can let Load Balancer know (by turning to unhealty state) that it will no longer accept new connections. Existing connections will still be kept on this VM until one of the sides (VM or remote endpoint) close it.

    If application really cares, it has at least 10 minutes to gracefully drain all the connections, let client application reconnect smoothly.

    Yes, AZFW works exactly this way:

    It's a little harder for Active/Standby apps like databases. Many Azure Native services do care about that and utilize these features.

    Of course in case of BSOD on VM or node failure all your statements are completely reasonable, but for planned maintenances it's completely manageable almost without even customer's notice.

Add comment