Some OpenFlow-focused startups are desperately trying to tell you how redundant their architecture is. Unfortunately all the whitepapers (and the prancing unicorns) cannot change a simple fact: an SDN controller (OpenFlow-based or otherwise) is in some aspects a single failure domain.
But the controller cluster will save the day
No, it won’t. A controller cluster will protect you against hardware failures, which are probably the last 1% of all failures you’ll encounter (if that’s not the case, change the hardware). A cluster will not protect you against software failures (= bugs) or operator mistakes (= fat fingers).
An active/standby controller cluster might be less sensitive than an active/active one. If the active controller crashes, the standby controller takes over (similar to supervisor failover in most high-end switches) and starts with a fresh copy of the data structures. Controllers in an active-active cluster might share the data structures and thus be affected by the same bug at the same time.
A controller crash might also be triggered by a malformed packet, or even a perfectly valid one – decades ago one of my hosts generated a legitimate ARP packet that consistently crashed next-hop Cisco router. In this case, it’s reasonable to expect the backup controller to crash as soon as it takes over and receives the same packet from the same host.
Finally there’s the complexity of the clustering software. I haven’t heard of a clustering solution that would provably work under all possible weird conditions (and it’s pretty hard to test all of them); failovers between supervisor modules are no exceptions.
Obviously, if there’s a perfect clustering solution out there, please write a comment.
What can we do?
The solutions to this challenge are well known:
- Distributed systems are more resilient than centralized ones;
- Loosely coupled systems (example: BGP SDN) are more resilient than tightly coupled ones (example: OpenFlow controller);
- Network infrastructure enhanced by a controller is more resilient than one that relies on a controller to operate;
- Complexity at the edge of the network scales better than centralized complexity.