Impact of Controller Failures in Software-Defined Networks

Christoph Jaggi sent me this observation during one of our SD-WAN discussions:

The centralized controller is another shortcoming of SD-WAN that hasn’t been really addressed yet. In a global WAN it can and does happen that a region might be cut off due to a cut cable or an attack. Without connection to the central SD-WAN controller the part that is cut off cannot even communicate within itself as there is no control plane…

A controller (or management/provisioning) system is obviously the central point of failure in any network, but we have to go beyond that and ask a simple question: “What happens when the controller cluster fails and/or when nodes lose connectivity to the controller?”

The worst-case scenario is the orthodox SDN architecture with the centralized control plane residing in the controller. While packet forwarding might continue to work until the flows time out, even ARP stops working, because ARP requests are punted to the (now unreachable) controller.
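To picture what “headless” means in that model, here’s a minimal Python sketch of a strict-OpenFlow-style switch that has lost its controller (the class, the timeout value and the behavior details are illustrative assumptions, not any specific product’s implementation):

```python
class HeadlessSwitch:
    """Toy model of a switch whose control plane lives in an unreachable controller."""

    def __init__(self, idle_timeout=30):          # made-up timeout value
        self.idle_timeout = idle_timeout
        self.flow_table = {}                      # flow match -> last-seen timestamp
        self.controller_reachable = False         # the failure scenario we're discussing

    def packet_in(self, match, now):
        if match in self.flow_table:
            self.flow_table[match] = now          # refresh idle timer, keep forwarding
            return "forwarded"
        if self.controller_reachable:
            return "punted to controller"         # normal table-miss handling
        return "dropped"                          # no controller: new flows (and ARP) die here

    def age_out(self, now):
        expired = [m for m, seen in self.flow_table.items()
                   if now - seen > self.idle_timeout]
        for m in expired:
            del self.flow_table[m]                # established traffic dies once entries expire
        return expired
```

Existing conversations survive only as long as their flow entries keep getting refreshed; anything that needs a table miss - a new flow, or an ARP request for a new neighbor - is dead the moment the controller becomes unreachable.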

Architectures built with a bit more operational experience, like the Big Switch fabric, can deal with short-term failures. Big Switch claims ARP entries reside in the edge switches, so ARP keeps working even when the controller fails. It might also be possible to pre-provision backup paths in the network (see also: SONET/SDH) so the headless fabric can deal with link failures (but not with link recoveries, because those require path recalculation). Dealing with external topology changes like VM migration is obviously mission impossible.
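The backup-path idea is conceptually similar to OpenFlow fast-failover groups or SONET/SDH protection switching. Here’s a hedged sketch of the logic and of its limitation (the names and structure are mine, not Big Switch’s):

```python
class ProtectedPath:
    """Primary/backup output ports pre-provisioned by the controller."""

    def __init__(self, primary_port, backup_port):
        self.ports = [primary_port, backup_port]            # ordered preference list
        self.link_up = {primary_port: True, backup_port: True}

    def output_port(self):
        # Pick the first pre-provisioned port whose link is still up --
        # this works entirely in the data plane, with no controller involved.
        for port in self.ports:
            if self.link_up[port]:
                return port
        return None   # all pre-provisioned paths down: only a controller could compute a new one

    def link_state_change(self, port, is_up):
        self.link_up[port] = is_up
        # Failing over to the pre-provisioned backup is local and headless-safe.
        # Taking advantage of a link that recovered elsewhere in the fabric, or
        # reacting to a VM moving to another edge switch, requires recomputing
        # paths - i.e. a live controller.
```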

Some architectures deal with controller failure by falling back to traditional behavior. For example, ESXi hosts that lose connectivity to the NSX-V controller cluster enter controller-disconnected mode, in which they flood every BUM packet on every segment to every ESXi host in the domain. While this approach obviously works, try to figure out how much overhead (and how many wasted CPU cycles) it generates.
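To get a feel for the numbers, here’s a back-of-the-envelope sketch based on the behavior described above (the figures are made up; plug in your own host, segment and BUM-rate counts):

```python
def bum_copies_per_second(hosts, segments_per_host, bum_pps_per_segment):
    """Unicast copies a single host must generate per second when every BUM
    packet on every segment is replicated to every other host in the domain."""
    return segments_per_host * bum_pps_per_segment * (hosts - 1)

# Made-up example: 100 hosts, 50 segments per host, 10 BUM packets/second per segment
print(bum_copies_per_second(100, 50, 10))   # 49500 copies per second, per host
```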

At the other end of the spectrum are systems with a traditional distributed control plane that use the SDN controller purely for management tasks. Cisco ACI immediately comes to mind - as I usually joke during my “NSX or ACI” workshops, you could turn off the APIC controller cluster when going home for the weekend and the ACI fabric would continue to work just fine.

Where are SD-WAN systems in this spectrum? We don’t know, because the vendors are not telling us how their secret sauce works. However, at least some vendors claim their magic SD-WAN controller replaces routing protocols, which means that controller failure might prevent edge topology changes from propagating across the network.
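If the controller really is the only distributor of reachability information (something we can’t confirm without proper vendor documentation), the degradation would look roughly like this hypothetical sketch:

```python
class ControllerMediatedOverlay:
    """Hypothetical SD-WAN design in which the controller replaces the routing
    protocol and is the only distributor of reachability information."""

    def __init__(self, edges):
        self.rib = {edge: {} for edge in edges}   # per-edge routing table
        self.controller_up = True

    def advertise(self, origin_edge, prefix, next_hop):
        if not self.controller_up:
            # The topology change stays local: other sites keep forwarding on
            # stale routes (or black-hole traffic) until the controller is back.
            return False
        for edge, table in self.rib.items():
            if edge != origin_edge:
                table[prefix] = next_hop          # controller pushes the update
        return True
```

With a distributed routing protocol, the edges would flood the change directly to their neighbors, and the controller outage would be a non-event for the data plane.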

There’s also the nasty question of key distribution. In traditional systems like DMVPN, edge nodes negotiate point-to-point keys with IKE and use shared secrets or pre-provisioned certificates to prevent man-in-the-middle attacks. In an SD-WAN system the controller might do the key distribution, in which case I wish you luck when you face a nasty WAN partition (or an AWS region failure if the controller runs in the cloud).
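The failure mode is easiest to see around rekeying. A simplistic sketch, assuming controller-distributed keys with a fixed lifetime (real products may behave differently, and the lifetime is made up):

```python
class ControllerKeyedTunnel:
    """Data-plane tunnel whose encryption keys are handed out by the controller."""

    def __init__(self, key, issued_at, lifetime=3600):     # made-up key lifetime
        self.key = key
        self.expires_at = issued_at + lifetime

    def rekey(self, controller_reachable, now, lifetime=3600):
        if controller_reachable:
            self.key = "fresh-key-material"                 # controller issues a new key
            self.expires_at = now + lifetime
            return True
        return False                                        # partitioned: keep the old key...

    def can_encrypt(self, now):
        # ...which only helps until it expires; after that the tunnel is down even
        # though both edges (and the WAN path between them) are perfectly healthy.
        return now < self.expires_at
```

A DMVPN-style design would simply run another IKE exchange directly between the two edge nodes and never notice the partition.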

Summary: Things are never as rosy as they appear in PowerPoint presentations and demos. Figure out everything that could potentially go wrong (like WAN partitioning), try to find out from the product documentation what actually happens, and ask some really hard questions (or change the vendor) if the documentation is not useful. Finally, verify every claim a $vendor makes in a lab.

4 comments:

  1. The only way to get around this is a proof of concept (due diligence) and to test extensively. Test node failures, link failures and grey failures. Also test multiple failure scenarios at the same time (here's where it gets interesting).
  2. This is a complex topic. Regardless of the type of distributed system:

    - collect requirements related to Failure & Recovery ('what if' scenarios covering failures of links / devices / controllers). Customers tend to expect "miracles" - the laws of physics (plus your imagination of what may happen) are the limit. Think about two failures at the same time (and define what "the same time" means, i.e. the gap between failures)

    - put redundant controllers wherever required (especially in isolated parts of the system) and DEFINE & implement the system logic that handles failure scenarios (e.g. when the isolated part of the system takes over responsibility for controlling the routing path: how the failure is detected, what is considered a failure, how split-brain scenarios are handled, etc.)

    - PUT ALL the constraints coming from the above into the contract to avoid being sued for not handling a failure (so be VERY SPECIFIC about what is possible and supported!!)

    - TEST, TEST, and TEST again, and be ready for defects from the field (you will be surprised how different the real-life issues are from those in the lab)

    Easy, isn't it? Joking. If you work in this kind of field you know how complex it is, and how big the GREY area is (undefined failure scenarios, how difficult it is to define a failure, etc.).
  3. Perhaps not the same scale, but even Google got it (somewhat) wrong:


    « Google's resilience strategy relies on the principle of defense in depth. Specifically, despite the network control infrastructure being designed to be highly resilient, the network is designed to 'fail static' and run for a period of time without the control plane being present as an additional line of defense against failure. The network ran normally for a short period - several minutes - after the control plane had been descheduled. After this period, BGP routing between specific impacted physical locations was withdrawn, resulting in the significant reduction in network capacity observed by our services and users, and the inaccessibility of some Google Cloud regions. End-user impact began to be seen in the period 11:47-11:49 US/Pacific. »

    https://status.cloud.google.com/incident/cloud-networking/19009


  4. SD-WAN vendors like Versa Networks provide controller redundancy with multiple controllers across different geographical locations. There are also mechanisms in place so that a branch that loses connectivity to all controllers can still use local information to route packets to other branch devices.