Controller Cluster Is a Single Failure Domain

Monday, September 8, 2014 18:19 +0200

Some OpenFlow-focused startups are desperately trying to tell you how redundant their architecture is. Unfortunately, all the whitepapers (and the prancing unicorns) cannot change a simple fact: an SDN controller (OpenFlow-based or otherwise) is in some respects a single failure domain.

But the Controller Cluster Will Save the Day

No, it won’t. A controller cluster will protect you against hardware failures, which are probably the last 1% of all failures you’ll encounter (if that’s not the case, change the hardware). A cluster will not protect you against software failures (= bugs) or operator mistakes (= fat fingers).

An active/standby controller cluster might be less sensitive than an active/active one. If the active controller crashes, the standby controller takes over (similar to supervisor failover in most high-end switches) and starts with a fresh copy of the data structures. Controllers in an active/active cluster might share the data structures and thus be affected by the same bug at the same time.

A controller crash might also be triggered by a malformed packet, or even a perfectly valid one – decades ago one of my hosts generated a legitimate ARP packet that consistently crashed the next-hop Cisco router. In this case, it’s reasonable to expect the backup controller to crash as soon as it takes over and receives the same packet from the same host.
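The failover scenario above can be illustrated with a toy simulation. This is a minimal sketch, not the code of any real controller: the `Controller` class, its `handle_packet` method, and the `poison` packet flag are hypothetical names invented for this example. The only assumption it encodes is that both cluster members run the same software and therefore share the same bug.

```python
class ControllerCrash(Exception):
    """Raised when the (hypothetical) controller software hits its bug."""


class Controller:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle_packet(self, packet):
        # Hypothetical bug: every cluster member runs identical code,
        # so every member fails on the same input.
        if packet.get("poison"):
            self.alive = False
            raise ControllerCrash(f"{self.name} crashed on packet from {packet['src']}")
        return f"{self.name} processed packet from {packet['src']}"


def deliver(cluster, packet):
    """Hand the packet to the first live controller, failing over on a crash.

    Returns the list of events that occurred, in order.
    """
    events = []
    for controller in cluster:
        if not controller.alive:
            continue
        try:
            events.append(controller.handle_packet(packet))
            return events
        except ControllerCrash as exc:
            events.append(str(exc))
            # The next live controller takes over and sees the same packet.
    events.append("no controllers left")
    return events


cluster = [Controller("active"), Controller("standby")]
ok_events = deliver(cluster, {"src": "host-a", "poison": False})
bad_events = deliver(cluster, {"src": "host-b", "poison": True})
survivors = [c.name for c in cluster if c.alive]
```

In this sketch the standby controller protects against nothing: it takes over, receives the same packet from the same host, and fails in exactly the same way, so `survivors` ends up empty.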

Finally, there’s the complexity of the clustering software. I haven’t heard of a clustering solution that would provably work under all possible weird conditions (and it’s pretty hard to test all of them); failovers between supervisor modules are no exception.

Obviously, if there’s a perfect clustering solution out there, I’d love to hear about it. Please write a comment.

What Can We Do?

The solutions to this challenge are well known:

Distributed systems are more resilient than centralized ones;
Loosely coupled systems (example: BGP SDN) are more resilient than tightly coupled ones (example: OpenFlow controller);
Network infrastructure enhanced by a controller is more resilient than one that relies on a controller to operate;
Complexity at the edge of the network scales better than centralized complexity.
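The third principle in the list above (a network enhanced by a controller versus one that relies on it) can be sketched in a few lines. This is a deliberately simplified model under stated assumptions: the `Switch` class, the `requires_controller` flag, and the port names are hypothetical, the controller is modeled as a plain dictionary of forwarding overrides (or `None` when the whole cluster is down), and the controller-dependent switch is assumed to need the controller for every forwarding decision.

```python
class Switch:
    def __init__(self, local_routes, requires_controller):
        # local_routes: forwarding state built without the controller,
        # for example by a distributed routing protocol (assumption).
        self.local_routes = dict(local_routes)
        self.requires_controller = requires_controller

    def forward(self, destination, controller):
        """Return the output port for destination, or None if the packet is dropped."""
        if controller is not None:
            override = controller.get(destination)
            if override is not None:
                return override  # controller-supplied path wins while it is reachable
        if self.requires_controller:
            return None  # no controller answer, no forwarding
        return self.local_routes.get(destination)  # fall back to local state


prefix = "10.0.0.0/24"
controller = {prefix: "port2"}  # hypothetical traffic-engineering override

enhanced = Switch({prefix: "port1"}, requires_controller=False)
dependent = Switch({}, requires_controller=True)

with_controller = (enhanced.forward(prefix, controller), dependent.forward(prefix, controller))
without_controller = (enhanced.forward(prefix, None), dependent.forward(prefix, None))
```

While the controller is reachable, both switches use its path. Once it fails, the enhanced switch loses only the optimization and keeps forwarding on its locally computed path, while the dependent switch drops the traffic.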

Not surprisingly, scalable SDN solutions from Google, Microsoft and (supposedly) Facebook, as well as some network virtualization solutions, use most or all of these principles.

Need a Bigger Picture?

Check out SDN and cloud networking resources on ipSpace.net.

1 comment:

Jeroen van Bemmel 16 September 2014 05:36

Maybe not perfect, but a step in the right direction: the Nuage SDN controller (we call it 'VSC') is based on our service router code and protocols (BGP, OSPF), which have been running for years in hundreds of thousands of devices in networks worldwide. Using standards-based protocols and field-proven code is a good recipe for building a solid controller. Also, data flows will continue for a programmable interval in case of a dual active/standby controller failure, so the impact is a customer-configurable trade-off.
