On March 30th 2022, AWS announced automatic recovery of EC2 instances. Does that mean that AWS got feature-parity with VMware High Availability, or that VMware got it right from the very start? No and No.
Automatic Instance Recover Is Not High Availability
Reading the AWS documentation (as opposed to the feature announcement) quickly reveals a caveat or two. The automatic recovery is performed if an instance becomes impaired because of an underlying hardware failure or a problem that requires AWS involvement to repair.
In simpler terms:
- If AWS experiences a (hypervisor) server failure or NIC failure, they’ll recover your instance.
- If your instance crashes, or if an application server process hangs in your instance, they’ll do nothing. It’s still your responsibility.
VMware HA does all that, but it also includes VM and Application Monitoring that uses VMware Tools to detect VM operating system failures. You can even use VMware SDK to generate application-specific heartbeats and have VMware HA restart the virtual machine if the application stops responding.
VMware HA Clusters Still Don’t Scale
How could AWS implement something similar to VMware HA clusters (which are limited to 64-96 hosts per cluster) in a region with (supposedly) millions of servers. Hint: they used a scalable architecture ;)
For decades, VMware treated vCenter as a GUI add-on that you could turn off when you go home. The high availability clusters and DRS were thus implemented as a standalone peer-to-peer mechanism independent of vCenter. The end result: a nasty conglomerate of byzantine failure scenarios (tons of blog posts and whole books were written about them) and “interesting” synchronization challenges when VMware started adding layers of network abstractions with VMware NSX2 on top of that pile of complexity.
AWS won’t tell you how they implemented automatic recovery, but as they already had server monitoring and instance status checks for a decade, all they had to do was to add a behind-the-scenes CloudWatch action to restart an instance when
StatusCheckFailed_System changes to one. Instance recovery is thus not a byzantine system with its own mindset and opinions but a simple add-on to the existing thoroughly tested orchestration system. One does have to wonder why it took a decade to implement though ;)
Did you notice I had to use three links in a single sentence to describe what’s going on? I love AWS documentation, but sometimes the granularity/nesting level goes through the roof. ↩︎
LDP-IGP synchronization issues are a kindergarten-level topic compared to HA/DRS-NSX ones. For example, in early versions of VMware NSX you could lose connectivity to your VMs if DRS moved them while the vCenter SOAP service was broken. ↩︎