To Centralize or not to Centralize, That’s the Question

One of the attendees of the Building Next-Generation Data Center online course solved the "build a small data center fabric" challenge with Virtual Chassis Fabric (VCF). I pointed out that I would prefer not to use VCF because it uses a centralized control plane and is thus a single failure domain.

In case you’re interested in data center fabric architecture options, check out this section in the Data Center Fabric Architectures webinar.

Here are his arguments for using VCF:

As for the architecture, VCF is a simple design for a small to medium DC. It is a centralized architecture but has the L2 and L3 simplicity to provide scalability for legacy applications while also using L3. There are redundant route engines to take over if the master route engine fails. Features like GRES (graceful route engine switchover) and NSR/NSB (non-stop routing/bridging) also assist in quick RE failovers and help protocols converge faster.

They are all valid arguments, but in practice I dislike centralized control/management plane architectures because they’re really hard to get right… and if you hit a Byzantine control-plane failure, you lose the whole fabric.

Also, there are occasional software upgrade challenges that you don’t get with independent boxes, and everyone who’s been in networking long enough has a horror story about a failed stackable switch upgrade.

An obvious alternative to VCF would be a traditional leaf-and-spine fabric with VXLAN using either EVPN control plane or statically-configured ingress BUM replication with dynamic MAC learning. More robust, less complex software, smaller blast radius… but harder to design and configure.
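To make the "harder to configure" part concrete: with statically-configured ingress replication, every leaf needs a per-VNI flood list containing all the other VTEPs that carry the same VNI, and those lists have to be updated whenever a leaf or a VNI is added. Here's a minimal Python sketch of how such lists could be derived from an inventory. The leaf names, loopback addresses, VNI numbers, and the build_flood_lists helper are all made up for illustration; this is not how any particular vendor or tool does it.

```python
import ipaddress
from collections import defaultdict

# Hypothetical inventory: leaf name -> (VTEP loopback IP, VNIs configured on that leaf)
LEAVES = {
    "leaf1": ("10.0.0.1", {10100, 10200}),
    "leaf2": ("10.0.0.2", {10100}),
    "leaf3": ("10.0.0.3", {10100, 10200}),
}


def build_flood_lists(leaves):
    """Return {leaf: {vni: [remote VTEP IPs]}} for static ingress replication."""
    # First pass: collect every VTEP that participates in each VNI.
    vni_members = defaultdict(set)
    for vtep_ip, vnis in leaves.values():
        for vni in vnis:
            vni_members[vni].add(vtep_ip)

    # Second pass: a leaf floods BUM traffic for a VNI to all *other* members.
    flood_lists = {}
    for name, (vtep_ip, vnis) in leaves.items():
        flood_lists[name] = {
            vni: sorted(vni_members[vni] - {vtep_ip}, key=ipaddress.ip_address)
            for vni in sorted(vnis)
        }
    return flood_lists


if __name__ == "__main__":
    for leaf, per_vni in build_flood_lists(LEAVES).items():
        for vni, remotes in per_vni.items():
            print(f"{leaf} vni {vni} flood-list {' '.join(remotes)}")
```

The complexity is explicit: the flood lists are data you generate and push to every leaf, but each leaf still runs its own independent control plane, so a bad list on one box doesn't take down the others.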

As always, it’s a question of explicit versus hidden complexity, and you have to choose which one is better for you. I have no problem with that; it’s just that the customers going for hidden complexity aren’t always aware of the risks they’re taking.

4 comments:

  1. Very good point Ivan! We run centralized DC solution similar to VCF and every software upgrade we think about how awesome it would have been to have separate control plane. Even vendor TAC after stumbling upon those Byzantine failures just basically recommends rebooting whole DC at once during software rollouts.
    Speaking about hidden complexities - article of "Leaky Abstractions" posted earlier is a great read. Thanks!
  2. +1 for distributed control plane and explicit complexity. Dealing with implicit complexity is desirable sometimes but imo at the end of the day implicit complexity is equivalent to unknown complexity resulting in design and operations headaches down the road. The best example is probably any MLAG implementation. Looks good on paper, but has a lot of caveats that cause a lot more complexity with little benefit. There is a reason Cisco's vPC design guide is 129 pages long. Magic has its price.
  3. I know the question is asked within the context of data center, but what about sd-wan then? with centralized controller and its magic happening under the hood?
    Replies
    1. Reading my mind :) Already working on that one...