Keep Your Failure Domains Small

A week after the disastrous sleet that knocked whole regions of Slovenia off the power grid, the servicemen of the local power distribution company (working literally day and night) managed to restore electricity to the closest town … but it might still take days or even weeks before everyone gets it. One of the reasons: huge failure domains.

The 10 kV power lines that bring electricity to the transformer near my house (luckily I have an underground cable) are hardwired together in a multidrop fashion that would have made SDLC old-timers either immensely proud or scared to death (because they knew how much havoc a single misbehaving modem could wreak).

For those of you who never experienced the beauties of multidrop SDLC, here’s how the power distribution works around my place:

The crucial problem: there’s no disconnector at the junctions, making the whole distribution tree a single failure domain. A single tree branch short-circuiting the wires at the remotest point could cut off hundreds of customers.
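To make the effect of missing disconnectors concrete, here is a minimal sketch that computes which nodes of a distribution tree are hit by a single fault. The topology, the node names (`substation`, `J1`, `J2`, `A`, `B`, `C`) and the function `failure_domain` are all made up for illustration; real grids and their protection schemes are far more involved.

```python
from collections import defaultdict


def failure_domain(edges, disconnectors, fault_node):
    """Return the set of nodes affected when fault_node short-circuits.

    edges: iterable of (a, b) pairs describing line segments
    disconnectors: set of frozenset({a, b}) segments that can be opened
                   to isolate a fault (hypothetical model)
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    affected = {fault_node}
    stack = [fault_node]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if frozenset((node, neighbor)) in disconnectors:
                continue  # this segment can be opened, so the fault stops here
            if neighbor not in affected:
                affected.add(neighbor)
                stack.append(neighbor)
    return affected


# Toy tree: one substation, two junctions, three customers (A, B, C).
EDGES = [
    ("substation", "J1"),
    ("J1", "A"),
    ("J1", "J2"),
    ("J2", "B"),
    ("J2", "C"),
]

no_isolation = failure_domain(EDGES, set(), "C")
with_isolation = failure_domain(EDGES, {frozenset(("J1", "J2"))}, "C")
```

Without any disconnector, a fault at `C` reaches every node in the tree. With a single disconnector between `J1` and `J2`, the same fault is confined to `J2`, `B` and `C`, and customer `A` stays online.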

Back to Bits and Bytes

Wondering what this blog post has to do with networking? You do remember that every bridged network (aka layer-2 network) is also a single failure domain, right? A forwarding loop might bring down the whole domain (which some people enthusiastically extend across multiple data centers).
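Part of the reason a forwarding loop is so damaging is that Ethernet frames carry no hop limit, while IP packets carry a TTL. The toy model below is a sketch under heavy assumptions: naive flooding only, no MAC learning, no spanning tree, and a made-up three-switch triangle. The function `flood` and the switch names are hypothetical.

```python
def flood(links, source, max_steps, hop_limit=None):
    """Count frame copies in flight after each step of naive flooding.

    links: dict mapping each switch to a list of its neighbors
    hop_limit: None models a frame with no TTL; an integer models a
               TTL-like limit that is decremented at every hop
    """
    # Each in-flight copy is (current_switch, previous_switch, hops_left).
    in_flight = [(source, None, hop_limit)]
    history = []
    for _ in range(max_steps):
        next_round = []
        for switch, came_from, hops in in_flight:
            if hops is not None:
                if hops == 0:
                    continue  # hop limit exhausted, the copy is dropped
                hops -= 1
            for neighbor in links[switch]:
                if neighbor != came_from:  # flood out of every port but the ingress
                    next_round.append((neighbor, switch, hops))
        in_flight = next_round
        history.append(len(in_flight))
    return history


# Three switches wired in a triangle, i.e. a layer-2 loop.
TRIANGLE = {
    "S1": ["S2", "S3"],
    "S2": ["S1", "S3"],
    "S3": ["S1", "S2"],
}

without_ttl = flood(TRIANGLE, "S1", max_steps=5)
with_ttl = flood(TRIANGLE, "S1", max_steps=5, hop_limit=3)
```

In this model a single broadcast keeps circulating for as long as the simulation runs when there is no hop limit, while the hop-limited copies die out after three hops. Denser topologies would multiply the copies instead of merely keeping them alive.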

What Can We Do?

Here are a few things to keep in mind:

  • Keep your failure domains as small as possible. Terminate bridging as soon as possible;
  • Insert as much failure isolation as you can. Overlay virtual networks nicely isolate the single failure domain of a layer-2 virtual network from the robust layer-3 transport infrastructure;
  • Use technologies that reduce the size of a failure domain. Layer-3 hypervisor switching eliminates layer-2 failure domains altogether (other failure domains, like a single cluster of management systems, are obviously still an issue);
  • Build a hierarchy of failure domains. Availability zones in your private cloud are the necessary first step;
  • Analyze the structure of mission-critical applications (covered in more detail in the fantastic Scalability Rules book).


1 comment:

  1. I am all for limiting failure domains as much as possible. But what would you have a small ISP do, to limit failure domains?

    Metro Ethernet and MPLS Virtual Private LAN Service are all the rage, and offer customers the promise of connecting all their branch offices together and using the same set of VLANs with free Layer 2 connectivity between their sites.

    It seems that the failure domains can't be limited, because it's either: extend the failure domains, or lose out in selling the service, b/c the customer will buy from another ISP, if we are not "buzzword compliant".