High-Availability Solutions
ChatGPT explaining application high availability to a high school kid
Before going into the details, it’s worth figuring out what the application (or system) users need as opposed to what they think they need:
Not surprisingly, IT vendors sell magic infrastructure solutions as the high-availability panacea based on the assumption that redundant infrastructure cannot fail. Nothing could be further from the truth:
High Availability Concepts, Technologies, and Solutions
You can use a plethora of approaches depending on your availability targets:
- Disaster recovery is the right tool for the job if you’re OK with the system being down for a few hours.
- Automatic restart of application instances combined with disaster recovery is acceptable if you can tolerate your system being down ~0.1% of the time (99.9% availability).
- Availability targets higher than 99.9% can only be reached reliably with proper application design supported by well-designed infrastructure.
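To make those percentages tangible, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration; 99.9% availability works out to roughly 8.76 hours of downtime per year.

```python
# Assumption: a non-leap year of 365 days; adjust period_hours for other windows.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget (in hours) for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    # Downtime is the fraction of the period the system may be unavailable.
    return period_hours * (100 - availability_percent) / 100


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each additional "nine" cuts the budget by a factor of ten, which is why a restart-and-recover approach that takes hours stops being viable beyond 99.9%.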
I wrote over 130 blog posts on these topics. It would be impossible to list all of them on a single page; major high-availability technologies or concepts thus have dedicated pages:
- Disaster recovery and avoidance
- High availability clusters
- Public and private cloud deployments
- Global and local load balancing with IP anycast
One of the prerequisites for highly available services is also redundant networking infrastructure:
Regardless of your approach, the only sustainable way to get highly available services is the correct design of the application stack. For more details, watch the Designing Active-Active and Disaster Recovery Data Centers webinar; I also wrote a few blog posts on the topic:
Notable Outages
Finally, here are a few notable outages. TL;DR: it can happen to the big guys, and it will eventually happen to you.
Other High Availability Blog Posts
- 2024
- 2023
- 2022
- 2021
- Optimizing the Time-to-First-Byte
- Non-Stop Routing (NSR)
- Where Would You Need DNS Anycast?
- Big Picture: BFD, Non-Stop Forwarding, and Graceful Restart
- Interactions Between BFD and Graceful Restart
- Circular Dependencies Considered Harmful
- Graceful Restart and BFD
- Graceful Restart and Routing Protocol Convergence
- Graceful Restart and Other Control Plane Protocols
- Graceful Restart (GR)
- Stateful Switchover (SSO)
- Non-Stop Forwarding (NSF)
- Stretched VLANs: What Problem Are You Trying to Solve?
- When Stretching Layer Two, Separate Your Fate
- Understand Your Single Points of Failure
- Worth Reading: Fail-Fast is Failing... Fast
- Impact of Azure Subnets on High Availability Designs
- Virtual Networks and Subnets in AWS, Azure, and GCP
- Designing a Simple Disaster Recovery Solution
- Availability Zones and Regions in AWS, Azure and GCP
- State Consistency in Distributed SDN Controller Clusters
- Fast and Simple Disaster Recovery Solution
- Repost: VMware Fault Tolerance Woes
- 2020
- Differential Availability
- Fifty Shades of High Availability
- Are Business Needs Just Excuses for Vendor Shenanigans?
- Disaster Recovery: a Vendor Marketing Tale
- Bridging Loops in Disaster Recovery Designs
- Meaningful Availability
- The Myth of Scaling From On-Premises Data Center into a Public Cloud
- Live vMotion into VMware-on-AWS Cloud
- You're Responsible for Resiliency of Your Public Cloud Deployment
- Public Cloud Cannot Change the Laws of Physics
- 2019
- You Don't Need IP Renumbering for Disaster Recovery
- Disaster Recovery and Failure Domains
- Tuning BGP Convergence in High-Availability Firewall Cluster Design
- Using BGP for Firewall High Availability: Design and Software Upgrades
- Stretched VLANs and Failing Firewall Clusters
- Disaster Recovery Faking, Take Two
- Disaster Recovery Test Faking: Another Use Case for Stretched VLANs
- Impact of Controller Failures in Software-Defined Networks
- Real-Life Data Center Meltdown
- How Common Are Data Center Meltdowns?
- Decide How Badly You Want to Fail
- To Centralize or not to Centralize, That’s the Question
- BGP as High Availability Protocol
- Large Layer-2 Domains Strike Again…
- 2018
- 2017
- 2016
- Ingress Traffic Flow in Multi-Data Center Deployments
- Reliability of Clustered Solutions: Another Data Point
- The Network Is Reliable and Other Stories
- Do I Need Redundant Firewalls?
- Stretched ACI Fabric Is Sometimes the Least Horrible Solution
- Unexpected Recovery Might Kill Your Data Center
- Some People Don’t Get It: It Will Eventually Fail
- Host-to-Network Multihoming Kludges
- High Availability Planning: Identify the Weakest Link
- How Hard Is It to Think about Failures?
- VLANs and Failure Domains Revisited
- Running BGP on Servers
- 2015
- The Grumpy Old Network Architects and Facebook
- Can You Afford to Reformat Your Data Center?
- Stretched Firewalls across Layer-3 DCI? Will the Madness Ever Stop?
- Is Anyone Using Long-Distance VM Mobility in Production?
- Sometimes You Have to Decide How You Want to Fail
- Building Carrier-Grade Cloud Infrastructure
- Designing Active-Active and Disaster Recovery Data Centers
- What Happens When a Data Center Fabric Switch Fails?
- VMware VSAN Can Stretch – Should It?
- Can You Avoid Networking Software Bugs?
- Another Spectacular Layer-2 Failure
- So You Need ISSU on Your ToR switch? Really?
- Design Challenge: Multiple Data Centers Connected with Slow Links
- Availability Zones in Overlay Virtual Networks
- Cisco ACI – a Stretched Fabric That Actually Works
- Before Talking about vMotion across Continents, Read This
- Is Controller-Based Networking More Reliable than Traditional Networking?
- Latency: the Killer of Spread-Out Application Stack Ideas
- 2014
- Coping with Byzantine Routing Failures
- Use a Disaster Recovery Project to Build Your New Cloud
- Workload Mobility and Reality: Bandwidth Constraints
- Controller Cluster Is a Single Failure Domain
- All Operations Engineers Should Have Firefighting Training
- Should We Use Redundant Supervisors?
- Can We Use IPv6 Router Advertisements for Fast Failover?
- Whose Failure Domain Is It?
- Combine Physical and Virtual Appliances in a Private Cloud
- Keep Your Failure Domains Small
- Disasters and Recoveries, Part 2
- Disasters and Recoveries, Part 1
- 2013
- Are Your Applications Cloud-Friendly?
- Sooner or Later, Someone Will Pay for the Complexity of the Kludges You Use
- 50 Shades of Statefulness
- Dynamic Routing with Virtual Appliances
- Simplify Your Disaster Recovery with Virtual Appliances
- Resiliency of VM NIC firewalls
- This Is What Makes Networking So Complex
- Does dedicated iSCSI infrastructure make sense?
- Long-Distance vMotion, Stretched HA Clusters and Business Needs
- Redundant Data Center Internet Connectivity – High-Level Design
- Redundant Data Center Internet Connectivity – Problem Overview
- 2012
- 2011
- IPv6 Multihoming Without NAT: the Problem
- Busting Layer-2 Data Center Interconnect Myths
- Follow-the-Sun Workload Mobility? Get Lost!
- Reliable or Unreliable Cloud Services?
- Long-distance vMotion for Disaster Avoidance? Do the Math First
- Long-distance IRF Fabric: Works Best in PowerPoint
- High Availability Fallacies
- Disaster Recovery: Lessons Learned
- Disasters Happen ... It’s the Recovery that Matters
- Stretched Clusters: Almost as Good as Heptagonal Wheels
- Distributed Firewalls: a Ticking Bomb
- 2010
- 2009