AWS Automatic EC2 Instance Recovery
On March 30th 2022, AWS announced automatic recovery of EC2 instances. Does that mean that AWS got feature-parity with VMware High Availability, or that VMware got it right from the very start? No and No.
Automatic Instance Recovery Is Not High Availability
Reading the AWS documentation (as opposed to the feature announcement) quickly reveals a caveat or two. The automatic recovery is performed if an instance becomes impaired because of an underlying hardware failure or a problem that requires AWS involvement to repair.
In simpler terms:
- If AWS experiences a (hypervisor) server failure or NIC failure, they’ll recover your instance.
- If your instance crashes, or if an application server process hangs in your instance, they’ll do nothing. It’s still your responsibility.
VMware HA does all that, but it also includes VM and Application Monitoring that uses VMware Tools to detect VM operating system failures. You can even use VMware SDK to generate application-specific heartbeats and have VMware HA restart the virtual machine if the application stops responding.
AWS EC2 does something similar to what VMware Tools are doing with instance status checks, but you have to define an Amazon CloudWatch alarm action to reboot your instance [1].
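To make that CloudWatch alarm action a bit more concrete, here is a minimal sketch of what such an alarm could look like when created with boto3. The helper names, the alarm name, and the one-minute/three-period timing are my assumptions for illustration; the metric name (StatusCheckFailed_Instance) and the reboot action ARN format come from the AWS documentation. Treat it as a starting point, not a recommended production setup.

```python
def reboot_alarm_params(instance_id: str, region: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the per-instance status check keeps failing and
    asks EC2 to reboot the instance. Alarm name and timing values are
    illustrative assumptions, not AWS defaults.
    """
    return {
        "AlarmName": f"reboot-on-instance-check-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_Instance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # evaluate the metric once per minute
        "EvaluationPeriods": 3,    # three failing periods in a row
        "Threshold": 1.0,          # the metric is 1 while the check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


def create_reboot_alarm(instance_id: str, region: str) -> None:
    """Create the alarm in your account (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the helper above works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**reboot_alarm_params(instance_id, region))
```

Calling create_reboot_alarm("i-0123456789abcdef0", "eu-west-1") would create the alarm for that (made-up) instance ID. Splitting parameter construction from the API call keeps the interesting part inspectable without touching AWS.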
VMware HA Clusters Still Don’t Scale
How could AWS implement something similar to VMware HA clusters (which are limited to 64-96 hosts per cluster) in a region with (supposedly) millions of servers? Hint: they used a scalable architecture ;)
For decades, VMware treated vCenter as a GUI add-on that you could turn off when you go home. The high availability clusters and DRS were thus implemented as a standalone peer-to-peer mechanism independent of vCenter. The end result: a nasty conglomerate of byzantine failure scenarios (tons of blog posts and whole books were written about them) and “interesting” synchronization challenges when VMware started adding layers of network abstractions with VMware NSX [2] on top of that pile of complexity.
AWS won’t tell you how they implemented automatic recovery, but as they already had server monitoring and instance status checks for a decade, all they had to do was to add a behind-the-scenes CloudWatch action to restart an instance when StatusCheckFailed_System changes to one. Instance recovery is thus not a byzantine system with its own mindset and opinions but a simple add-on to the existing thoroughly tested orchestration system. One does have to wonder why it took a decade to implement though ;)
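The division of labor described above fits in a few lines of Python. This is a toy model of my reading of the documentation, not AWS's actual implementation (which they don't publish); the function, enum, and parameter names are all made up for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"        # nobody does anything; it's still your problem
    RECOVER = "recover"  # AWS moves the instance to healthy hardware
    REBOOT = "reboot"    # happens only if you defined a CloudWatch alarm action


def pick_action(system_check_failed: bool,
                instance_check_failed: bool,
                user_reboot_alarm: bool = False) -> Action:
    """Toy decision logic for the two EC2 status checks (hypothetical)."""
    if system_check_failed:
        # StatusCheckFailed_System is one: hardware or hypervisor trouble
        # that needs AWS involvement, so automatic recovery kicks in.
        return Action.RECOVER
    if instance_check_failed and user_reboot_alarm:
        # StatusCheckFailed_Instance is one: guest-side trouble, handled
        # only when the customer configured a reboot alarm action.
        return Action.REBOOT
    return Action.NONE
```

In this model pick_action(False, True) returns Action.NONE: a crashed guest without a customer-defined alarm stays crashed, which is exactly the gap between automatic instance recovery and VMware HA with VM monitoring.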
[1] Did you notice I had to use three links in a single sentence to describe what’s going on? I love AWS documentation, but sometimes the granularity/nesting level goes through the roof.
[2] LDP-IGP synchronization issues are a kindergarten-level topic compared to HA/DRS-NSX ones. For example, in early versions of VMware NSX you could lose connectivity to your VMs if DRS moved them while the vCenter SOAP service was broken.
I work on large-scale workload migration projects, regularly migrating systems from a legacy data center to an end-state data center. On almost every project I have to tell the customers why I won't agree to a design that requires permanent layer 2 stretching. Some customers can be very persistent, and besides listing all the problems that can occur with permanent layer 2 extensions, I also casually remind them that Amazon won't let you do it either, not even between two of their own data centers, let alone with your legacy data center (as a subnet is confined to a single AZ). Fun stuff! 😁