Controller Cluster Is a Single Failure Domain

Some OpenFlow-focused startups are desperately trying to tell you how redundant their architecture is. Unfortunately, all the whitepapers (and the prancing unicorns) cannot change a simple fact: an SDN controller (OpenFlow-based or otherwise) is in some respects a single failure domain.

But the Controller Cluster Will Save the Day

No, it won’t. A controller cluster will protect you against hardware failures, which are probably the last 1% of all failures you’ll encounter (if that’s not the case, change the hardware). A cluster will not protect you against software failures (= bugs) or operator mistakes (= fat fingers).

An active/standby controller cluster might be less sensitive than an active/active one. If the active controller crashes, the standby controller takes over (similar to supervisor failover in most high-end switches) and starts with a fresh copy of the data structures. Controllers in an active-active cluster might share the data structures and thus be affected by the same bug at the same time.

A controller crash might also be triggered by a malformed packet, or even a perfectly valid one – decades ago one of my hosts generated a legitimate ARP packet that consistently crashed the next-hop Cisco router. In this case, it’s reasonable to expect the backup controller to crash as soon as it takes over and receives the same packet from the same host.
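
To make the failure mode concrete, here is a minimal toy simulation of an active/standby pair hit by a "poison" packet. Everything in it is hypothetical: the Controller class, the run_cluster helper and the zero-length hardware address bug are made up for illustration and do not model any real controller or clustering product. The only point it tries to show is that a standby running the same code hits the same bug, even though it starts with fresh state.

```python
from dataclasses import dataclass, field


class ControllerCrash(Exception):
    """Raised when a (hypothetical) parsing bug takes the controller down."""


@dataclass
class Controller:
    name: str
    flows: dict = field(default_factory=dict)  # fresh state on every start

    def handle_packet(self, packet: dict) -> None:
        # Hypothetical bug: one specific, unusual ARP packet triggers a crash.
        if packet.get("type") == "arp" and packet.get("hw_len") == 0:
            raise ControllerCrash(f"{self.name} crashed on {packet}")
        self.flows[packet["src"]] = packet["type"]


def run_cluster(controllers, packets, retransmits=3):
    """Feed packets to the active controller and fail over on a crash.

    Returns the names of the controllers that crashed, in order.
    """
    standby = list(controllers)
    active = standby.pop(0)
    crashed = []
    for packet in packets:
        for _ in range(retransmits):  # the host keeps resending the packet
            if active is None:  # no controller left to take over
                return crashed
            try:
                active.handle_packet(packet)
                break
            except ControllerCrash:
                crashed.append(active.name)
                active = standby.pop(0) if standby else None
    return crashed


if __name__ == "__main__":
    cluster = [Controller("ctrl-1"), Controller("ctrl-2")]
    traffic = [
        {"type": "ip", "src": "host-a"},
        {"type": "arp", "src": "host-b", "hw_len": 0},  # the poison packet
    ]
    print(run_cluster(cluster, traffic))  # ['ctrl-1', 'ctrl-2']
```

In this sketch the standby takes over with an empty flow table, receives the retransmitted packet and crashes exactly like the first controller did. Adding more cluster members only lengthens the list of crashed controllers.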

Finally, there’s the complexity of the clustering software. I haven’t heard of a clustering solution that would provably work under all possible weird conditions (and it’s pretty hard to test all of them); failovers between supervisor modules are no exception.

Obviously, if there’s a perfect clustering solution out there, I’d love to hear about it. Please write a comment.

What Can We Do?

The solutions to this challenge are well known:

Not surprisingly, scalable SDN solutions from Google, Microsoft and (supposedly) Facebook, as well as some network virtualization solutions use most or all of these principles.

Need a Bigger Picture?

Check out SDN and cloud networking resources on ipSpace.net.

1 comment:

  1. Maybe not perfect, but a step in the right direction: The Nuage SDN controller (we call it 'VSC') is based on our service router code and protocols (BGP, OSPF), which have been running for years in hundreds of thousands of devices in networks worldwide. Using standards-based protocols and field-proven code is a good recipe for building a solid controller. Also, data flows will continue for a programmable interval if both the active and the standby controller fail, so the impact is a customer-configurable trade-off.