High Availability Planning: Identify the Weakest Link

Everyone loves to talk about business-critical applications that require extremely high availability, but it’s rare to see someone analyze the whole application stack and identify the weakest link.

If you start mapping out the major components of an application stack, you’ll probably arrive at this list (bottom-to-top):

  • Network links and devices;
  • Network services;
  • Servers and storage;
  • Virtualization platforms and operating systems;
  • Databases, message queues…;
  • Applications.
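
One way to quantify the weakest-link argument is to treat this stack as a serial dependency chain: the application is only up when every layer underneath it is up, so the component availabilities multiply. Here’s a back-of-the-envelope sketch in Python; the per-component availability figures are made-up assumptions for illustration, not measurements.

    # Serial dependency chain: the application works only if every layer works,
    # so end-to-end availability is the product of component availabilities.
    # All figures below are illustrative guesses -- plug in your own numbers.
    component_availability = {
        "network links and devices": 0.9995,
        "network services":          0.9999,
        "servers and storage":       0.9999,
        "virtualization and OS":     0.999,
        "database / message queue":  0.999,
        "application":               0.99,   # the usual weakest link
    }

    end_to_end = 1.0
    for layer, availability in component_availability.items():
        end_to_end *= availability

    minutes_per_year = 365 * 24 * 60
    print(f"End-to-end availability: {end_to_end:.4%}")
    print(f"Expected downtime: {(1 - end_to_end) * minutes_per_year:.0f} minutes/year")

With these (assumed) numbers the whole stack lands around 98.7% availability, and the 99% application dominates the result; making any other layer more resilient barely moves the needle.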

Each one of these components can fail due to hardware failure, software error, or human mistake (operator error).

Next, estimate the likelihood of individual failures. Hardware failures (apart from link failures) are less common than software failures or operator errors these days, and in most cases infrastructure failures tend to be less common than application problems.

Finally, figure out how to increase the resilience of each component – redundant links and network devices, network services clusters, dual-homed servers, hypervisor-based high-availability solutions, database replicas, and, at the top of the stack, a scale-out application architecture.
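
To get a feel for what redundancy buys you, the minimal sketch below compares a single component with a redundant pair, assuming the two copies fail independently (which is optimistic: correlated failures and operator errors tend to take out both halves of a redundant pair at once). The availability figures are again just assumptions.

    # Redundant copies: the service is down only when ALL copies are down.
    # Assumes independent failures, which real-world correlated failures
    # and operator errors routinely violate.
    def parallel(availability: float, copies: int = 2) -> float:
        return 1 - (1 - availability) ** copies

    single = 0.999                    # one device at "three nines"
    redundant = parallel(single, 2)   # a redundant pair of the same device

    minutes_per_year = 365 * 24 * 60
    print(f"Single device:  {single:.3%}  ~{(1 - single) * minutes_per_year:.0f} min/year of downtime")
    print(f"Redundant pair: {redundant:.4%} ~{(1 - redundant) * minutes_per_year:.1f} min/year of downtime")

On paper that’s roughly three orders of magnitude less downtime; in practice the independence assumption is the part worth challenging in the discussion below.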

Now you’re ready to start the discussion:

  • Which parts of the whole stack are currently resilient to failures and which ones represent a single point of failure?
  • Which parts could be made more resilient?
  • How will your organization handle the remaining SPOFs?
  • What is the downtime caused by a failure of a non-redundant component?
  • How often can you expect to see those failures? (See the back-of-the-envelope sketch below.)
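
For the last two questions, a rough estimate is often good enough: guess how often a component breaks (MTBF) and how long it takes to restore service (MTTR), and multiply the two into expected annual downtime. The sketch below uses made-up numbers that mirror the comparison in the next paragraph: an application that needs a restart roughly every month versus a disaster that strikes once a decade.

    # Expected annual downtime from guessed MTBF (time between failures)
    # and MTTR (time to restore service). All inputs are assumptions.
    def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
        failures_per_year = (365 * 24) / mtbf_hours
        return failures_per_year * mttr_hours

    # Fragile application restarted roughly once a month, 30 minutes per restart
    app_restarts = annual_downtime_hours(mtbf_hours=30 * 24, mttr_hours=0.5)
    # Data-center-level disaster once every 10 years, two days to recover
    dc_disaster = annual_downtime_hours(mtbf_hours=10 * 365 * 24, mttr_hours=48)

    print(f"Monthly application restarts: {app_restarts:.1f} hours/year")
    print(f"Once-a-decade DC disaster:    {dc_disaster:.1f} hours/year")

With these assumptions the fragile application causes about 6 hours of downtime per year, slightly more than the once-a-decade disaster averages out to (about 4.8 hours per year), which is exactly the kind of comparison that makes the next conversation easier.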

Getting answers to those questions (good luck ;) might make it easier to persuade the CIO that your company doesn’t need an L2 DCI to protect against a disaster that might happen once every 10 years when the non-redundant applications need a restart every month or remain unpatched for years because nobody wants to touch them… and if everything else fails, you can still quote Gartner.

2 comments:

  1. Identifying the weakest link is like playing whack-a-mole: just when you think you have all your ducks in a row, another appliance, piece of software or other dependency is unknowingly introduced, creating a possible SPOF. That's why I find it important to regularly review these questions with the right group of people (management, network, server and storage teams, etc.). I also agree that there are far cheaper ways than jumping on the L2 DCI bandwagon. Many applications these days have their own replication/syncing technology built in; in the Microsoft world, for example, SQL Server mirroring has been around since SQL Server 2005 and Exchange 2007 introduced its own mailbox replication, and all of these have only improved since. If possible, I would try to virtualize as many workloads as possible so that legacy workloads can at least be replicated at the VM level with software like Veeam or some other form of replication. I understand there will always be some super-legacy equipment that would require a forklift upgrade; I can think of a legacy IVR system using Dialogic cards (can't virtualize) tied to a bunch of analog telco lines.
    Replies
    1. I think another measure should be added: the effect of the removal of the weakest link on the overall network.
      Or, at a further (more complicated) level, the effect of its removal on other services (i.e. the dependencies on that system or service).