
Does Centralized Control Plane Make Sense?

A friend of mine sent me a challenging question:

You've stated a couple of times that you don't favor the OpenFlow version of SDN due to a variety of problems like scaling and latency. What model/mechanism do you like? Hybrid? Something else?

Before answering the question, let’s step back and ask another one: “Does a centralized control plane, as evangelized by ONF, make sense?”

A bit of history

As always, let’s start with one of the greatest teachers: history. We’ve had centralized architectures for decades, from SNA to various WAN technologies (SDH/SONET, Frame Relay and ATM). They all share a common problem: when the network partitions, the nodes cut off from the central intelligence either stop functioning (in the SNA case) or remain in a frozen state (in the WAN technologies).

One might be tempted to conclude that the ONF version of SDN won’t fare any better than the switched WAN technologies. Reality is far worse:

  • WAN technologies had little control-plane interaction with the outside world (example: Frame Relay LMI), and those interactions were run by the local devices, not from the centralized control plane;
  • WAN devices (SONET/SDH multiplexers, or ATM and Frame Relay switches) had local OAM functionality that allowed them to detect link or node failures and reroute around them using preconfigured backup paths. One could argue that those devices had a local control plane, although it was never as independent as the control planes used in today’s routers.

Interestingly, MPLS-TP wants to reinvent the glorious past and re-introduce centralized path management, yet again proving RFC 1925 section 2.11.

The last architecture (that I remember) that used a truly centralized control plane was SNA, and if you’re old enough you know how well that ended.

Would a central control plane make sense in limited deployments?

A central control plane is obviously a single point of failure, and network partitioning is a nightmare if you have a central point of control. Large-scale deployments of the ONF variant of SDN are thus out of the question. But does it make sense to deploy a centralized control plane in smaller independent islands (campus networks, data center availability zones)?

Interestingly, numerous data center architectures already use a centralized control plane, so we can analyze how well they perform:

  • Juniper XRE can control up to four EX8200 switches, or a total of 512 10GE ports;
  • Nexus 7700 can control 64 fabric extenders with 3072 ports, plus a few hundred directly attached 10GE ports;
  • HP IRF can bind together two 12916 switches for a total of 1536 10GE ports;
  • QFabric Network Node Group could control eight nodes, for a total of 384 10GE ports.

NEC ProgrammableFlow seems to be an outlier – they can control up to 200 switches, for a total of over 9000 GE (not 10GE) ports… but they don’t run any control-plane protocol (apart from ARP and dynamic MAC learning) with the outside world. No STP, LACP, LLDP, BFD or routing protocols.

One could argue that we could get an order of magnitude beyond those numbers if only we were using proper control plane hardware (Xeon CPUs, for example). I don’t buy that argument till I actually see a production deployment, and do keep in mind that NEC ProgrammableFlow Controller uses decent Intel-based hardware. Real-time distributed systems with fast feedback loops are way more complex than most people looking from the outside realize (see also RFC 1925, section 2.4).

Does a central control plane make sense?

It does in certain smaller-scale environments (see above)… as long as you can guarantee redundant connectivity between the controller and the controlled devices, or don’t care what happens after a link loss (see also: wireless access points). Does it make sense to generate a huge hoopla while reinventing this particular wheel? I would rather spend my energy doing something else.

I absolutely understand why NEC went down this path – they did something extraordinary to differentiate themselves in a very crowded market. I also understand why Google decided to use this approach, and why they evangelize it as much as they do. I’m just saying that it doesn’t make that much sense for the rest of us.

Finally, do keep in mind that the whole world of IT is moving toward scale-out architectures. Netflix & Co are already there, and the enterprise world is grudgingly taking its first steps. In the meantime, OpenFlow evangelists keep talking about the immeasurable revolutionary merits of a centralized scale-up architecture. They must be living on a different planet.

More on SDN and OpenFlow

To learn more about the realities of OpenFlow and SDN, watch my SDN webinars, or attend my SDN workshop.

6 comments:

  1. OpenFlow does not make sense. The L2 and most L3 functions of switching don't need to be centrally processed at line rate, and centralizing them doesn't align with any real need or demand.

    NFV makes much more sense in conjunction with programmatic switch control. I would be very happy if a network vendor built a management platform to which all switches could be registered, and through which they could be managed with both a graphical and a programmatic interface.

    Open the UI, create a data flow with VLAN, route, and QoS parameters, and click Go. The management platform issues the commands that configure the switches to support the designed flow, then tests the flow and reports that it has been set up successfully. Of course, the management platform also monitors and reports on switch performance and activity.

    This management platform can then be integrated with hypervisors to allow provisioning of workloads and networks through the same wizard. The network doesn't need to be intelligent, it needs to be obedient.
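    A minimal sketch of what such a programmatic interface could look like, assuming a hypothetical REST API (every endpoint, field, and parameter name below is invented for illustration, not taken from any vendor's product):

        import requests  # third-party HTTP client (pip install requests)

        # Hypothetical management-platform API; endpoints and field names
        # are illustrative only.
        PLATFORM = "https://netmgmt.example.com/api/v1"

        def provision_flow(vlan_id, route_prefix, next_hop, qos_class):
            """Ask the platform to configure a data flow end to end."""
            flow = {
                "vlan": vlan_id,
                "route": {"prefix": route_prefix, "next_hop": next_hop},
                "qos": {"class": qos_class},
            }
            # The platform translates this intent into per-switch commands.
            resp = requests.post(f"{PLATFORM}/flows", json=flow, timeout=10)
            resp.raise_for_status()
            flow_id = resp.json()["id"]

            # The platform then tests the flow and reports the result.
            status = requests.get(f"{PLATFORM}/flows/{flow_id}/status",
                                  timeout=10)
            status.raise_for_status()
            return flow_id, status.json()["state"]

        flow_id, state = provision_flow(100, "10.1.0.0/24", "10.0.0.1", "gold")
        print(f"Flow {flow_id} provisioned, state: {state}")

    The switches stay obedient: all the intent lives in the platform, and the devices just receive configuration commands.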

  2. Of course, enterprise wireless networks have a centralized controller and can support thousands of access points.

    1. At what speeds and aggregate bandwidth? And don't forget that in most cases all the traffic gets hauled back to the controller. See also

      http://blog.ipspace.net/2013/09/openflow-fabric-controllers-are-light.html

  3. Doesn't a big chassis have a construct similar to the centralized controller model you talk about here? The RP is the centralized controller, and the line cards are dumb switches programmed by the RP.

  4. You're absolutely right. And how many networks have you seen built with a single big chassis?

    1. Not one, but at least two for redundancy purposes, where each chassis would typically have two RPMs (primary & standby), and both chassis talk to each other using some protocol (federation) or using vPC. Similar mechanisms can be applied to the centralized OpenFlow model as well, right? One could have two or more controllers in the network, each controller supporting HA, and throw in federation among the controllers.

      My point here is that the SPOF and network-partitioning problems of the centralized controller model could be solved by borrowing ideas from the chassis world.
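      As a toy sketch of that primary/standby idea, here is heartbeat-driven takeover (purely illustrative; real controller HA also needs state synchronization, which is omitted here):

          import time

          HEARTBEAT_DEAD_INTERVAL = 3.0  # seconds of silence before takeover

          class StandbyController:
              """Standby that promotes itself when the primary goes silent."""

              def __init__(self):
                  self.last_heartbeat = time.monotonic()
                  self.active = False

              def on_heartbeat(self):
                  # Called whenever a heartbeat arrives from the primary.
                  self.last_heartbeat = time.monotonic()

              def tick(self):
                  # Periodic check: promote ourselves once the primary
                  # has been silent for longer than the dead interval.
                  silence = time.monotonic() - self.last_heartbeat
                  if not self.active and silence > HEARTBEAT_DEAD_INTERVAL:
                      self.active = True
                      print(f"Primary silent for {silence:.1f}s, taking over")

      The obvious caveat: the standby can't tell a dead primary from a partitioned network, so a scheme like this trades the SPOF for a split-brain risk, which is exactly the partitioning problem described in the post.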

