Controller Cluster Is a Single Failure Domain

Some OpenFlow-focused startups are desperately trying to tell you how redundant their architecture is. Unfortunately, all the whitepapers (and the prancing unicorns) cannot change a simple fact: an SDN controller (OpenFlow-based or otherwise) is in some respects a single failure domain.

But the controller cluster will save the day

No, it won’t. A controller cluster will protect you against hardware failures, which are probably the last 1% of all failures you’ll encounter (if that’s not the case, change the hardware). A cluster will not protect you against software failures (= bugs) or operator mistakes (= fat fingers).

An active/standby controller cluster might be less sensitive than an active/active one. If the active controller crashes, the standby controller takes over (similar to supervisor failover in most high-end switches) and starts with a fresh copy of the data structures. Controllers in an active/active cluster might share the data structures and thus be affected by the same bug at the same time.

A controller crash might also be triggered by a malformed packet, or even a perfectly valid one – decades ago one of my hosts generated a legitimate ARP packet that consistently crashed the next-hop Cisco router. In this case, it’s reasonable to expect the backup controller to crash as soon as it takes over and receives the same packet from the same host.
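The failover scenario above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `Controller` class, the `run_cluster` helper, and the "odd-but-legal" opcode are all hypothetical names invented for the example, and the "bug" is a hard-coded stand-in for whatever code path a real packet might trip. It models no actual controller product.

```python
class ControllerCrash(Exception):
    """Raised when a simulated controller hits its (hypothetical) software bug."""


class Controller:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.flows = {}  # every instance starts with fresh data structures

    def handle(self, packet):
        # Assumed bug: a valid but unusual packet trips the same code path
        # in every instance, because they all run the same software.
        if packet["opcode"] == "odd-but-legal":
            self.alive = False
            raise ControllerCrash(f"{self.name} crashed on packet from {packet['src']}")
        self.flows[packet["src"]] = packet["opcode"]


def run_cluster(controllers, packets):
    """Hand each packet to the first live controller; fail over on a crash.

    After a crash the next live controller takes over and receives the same
    packet. Returns the names of the controllers that crashed, in order.
    """
    crashed = []
    for packet in packets:
        for controller in controllers:
            if not controller.alive:
                continue
            try:
                controller.handle(packet)
                break  # packet processed, move on to the next one
            except ControllerCrash:
                crashed.append(controller.name)
    return crashed


cluster = [Controller("active"), Controller("standby")]
packets = [
    {"src": "h1", "opcode": "request"},
    {"src": "h2", "opcode": "odd-but-legal"},
    {"src": "h3", "opcode": "request"},
]
print(run_cluster(cluster, packets))  # ['active', 'standby']
```

The standby starts with clean state, yet it fails on the very first packet it sees after taking over, because redundancy in hardware does nothing about a deterministic software bug triggered by the same input.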

Finally, there’s the complexity of the clustering software. I haven’t heard of a clustering solution that would provably work under all possible weird conditions (and it’s pretty hard to test all of them); failovers between supervisor modules are no exception.

Obviously, if there’s a perfect clustering solution out there, please write a comment.

What can we do?

The solutions to this challenge are well known:

Not surprisingly, scalable SDN solutions from Google, Microsoft and (supposedly) Facebook, as well as some network virtualization solutions use most or all of these principles.

1 comment:

  1. Maybe not perfect, but a step in the right direction: The Nuage SDN controller - we call it 'VSC' - is based on our service router code and protocols (BGP, OSPF), which have been running for years in hundreds of thousands of devices in networks worldwide. Using standards-based protocols and field-proven code is a good recipe for building a solid controller. Also, data flows will continue for a programmable interval in case of a dual active/standby controller failure - so the impact is a customer-configurable trade-off.

