Facebook Next-Generation Fabric

Facebook published their next-generation data center architecture a few weeks ago, resulting in the expected “revolutionary approach to data center fabrics” echoes from the industry press and blogosphere.

In reality, they did a great engineering job, using an interesting twist on a pretty traditional multi-stage leaf-and-spine (or folded Clos) architecture.

They split the data center into standard pods. No surprise there; anyone aiming for an easy-to-manage scale-out architecture (i.e. not so many people) is doing that. We discussed it in Episode 8 of Software Gone Wild, and I described it in one of the data center design case studies. The second part of this video should give you a few additional ideas along the same lines.

Inside each pod they use leaf-and-spine architecture, almost identical to what Brad Hedlund described in my Leaf-and-Spine Fabric Architectures webinar… including the now-standard 3:1 oversubscription on the leaf switches (48 server-facing ports, four 40GE uplinks).


[Figure: Facebook's data center fabric topology. Source: code.facebook.com]
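
Here's a quick back-of-the-envelope check of that oversubscription ratio (a minimal Python sketch; the 10GE speed of the server-facing ports is an assumption, as the post doesn't state it explicitly):

```python
# Leaf switch oversubscription check.
# Assumption (not stated explicitly above): server-facing ports are 10GE.
server_ports = 48
server_port_speed_gbps = 10
uplinks = 4
uplink_speed_gbps = 40

downstream_gbps = server_ports * server_port_speed_gbps   # 480 Gbps toward the servers
upstream_gbps = uplinks * uplink_speed_gbps                # 160 Gbps toward the fabric

print(f"Leaf oversubscription: {downstream_gbps / upstream_gbps:.0f}:1")   # 3:1
```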

Note that every fabric switch needs 48 leaf-facing 40GE ports. Adding the necessary pod-to-spine uplinks, they need 96-port 40GE switches to implement this design. I wouldn't be too surprised to see Arista launch a switch meeting these specs at the next Interop ;)
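
The same sort of arithmetic gives the fabric switch port count (the 1:1 downlink-to-uplink split is an assumption based on a non-oversubscribed fabric layer):

```python
# Fabric switch port budget in a pod.
# Assumption: the fabric layer is not oversubscribed, so spine-facing uplinks
# equal leaf-facing downlinks.
leaf_facing_40ge_ports = 48    # one 40GE link to each leaf switch in the pod
spine_facing_40ge_ports = 48   # assumed 1:1 with the downlinks

print(f"40GE ports per fabric switch: {leaf_facing_40ge_ports + spine_facing_40ge_ports}")  # 96
```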

The interesting twist is the inter-pod connectivity. Instead of building a single non-oversubscribed core fabric and connecting the leaf nodes to it (the traditional way of building multi-stage leaf-and-spine fabrics), they treat each pod fabric switch as a leaf node in another, orthogonal leaf-and-spine fabric (for a total of four core fabrics). The result is a data center fabric that can potentially support over 100,000 server ports; the limiting factor is the number of ports on the spine switches.
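
To see where the "over 100,000 server ports" figure comes from, here's a minimal capacity model. The per-pod numbers and the 48 pod-facing ports per spine switch are assumptions derived from the figures quoted above, not from the Facebook announcement itself:

```python
# Back-of-the-envelope capacity model for the overall fabric.
# Assumptions based on the numbers quoted above: 48 server ports per leaf,
# 48 leaves per pod, and 48 pod-facing ports per spine switch (so the spine
# switch radix caps the number of pods).
server_ports_per_leaf = 48
leaves_per_pod = 48
spine_pod_facing_ports = 48   # assumed; each pod-facing spine port serves one pod

servers_per_pod = server_ports_per_leaf * leaves_per_pod    # 2,304
max_pods = spine_pod_facing_ports                           # limited by the spine switches
total_server_ports = servers_per_pod * max_pods             # 110,592

print(f"Servers per pod:    {servers_per_pod}")
print(f"Maximum pods:       {max_pods}")
print(f"Total server ports: {total_server_ports}")
```

With those assumptions the fabric tops out at 110,592 server ports, consistent with the "over 100,000" claim.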


7 comments:

  1. Whoever made the 3D diagram didn't account for color blind folks. It's hard to follow those lines.
  2. It is very interesting to see this physical topology being used in an actual datacenter deployment. While a traditional leaf-and-spine network employs a 2-level folded-Clos topology, this is a 3-level folded-Clos topology. It resembles an undirected butterfly network, which mathematically is denoted as a k-ary n-fly network, with k the switch semi-radix (half the port count, i.e. the number of uplinks or downlinks) and n the number of stages. In the case of the Facebook proposal, they design for n=3 and k=48, although they admit variations to increase capacity (starting from a smaller design). Interestingly, the path diversity of this topology equals the number of core switches (up to 48 in Facebook's design), so it tolerates multiple switch losses with graceful performance degradation (although ECMP load-balancing mechanisms typically work best with a power-of-2 number of paths, so they could result in larger degradation due to load imbalance).

    This type of topology has been used before in HPC systems. For example, the BBN Butterfly system already employed a multistage interconnection network in the early '80s (using 4x4 switches and Motorola 68000 processors!!). In these multi-stage interconnection networks (MINs) the maximum number of endpoints with full bisection bandwidth (i.e. statistically non-blocking) grows exponentially with the number of stages, something like k^n, which is larger than 100,000 for k=48 and n=3 (see the quick check after this comment thread). Considering today's large switches, such arrangements will only be required for very large systems from very large companies, such as this example from Facebook.

    There is plenty of academic information about Clos networks in the literature; the book by Bill Dally (Chief Scientist at Nvidia) and Brian Towles, Principles and Practices of Interconnection Networks, covers MINs and has a section about multi-level Clos networks and butterflies.
  3. Great stuff here!!
  4. You can see the same 3-level Clos design in this 2009 presentation by Google's Bikash Koley: http://conference.vde.com/ecoc-2009/programs/documents/ecoc09-100g-ws-google-koley.pdf
  5. This comment has been removed by the author.
  6. Oops, accidentally deleted my post. Thanks for the Google presentation; I've seen something like this multiple times, really. It's not new in the HPC world at all.

    Doug Hanks from Juniper did a NANOG presentation, and there is a good section in the new QFX5100 book about doing something similar, except using Juniper's VCF to make one of the tiers look like a single switch. Of course, you are limited by what the VCF protocols/management can do, whereas this design is just based on the number of available ports.
    Replies
    1. Thanks. The new QFX5100 Series book also has a chapter just on IP Fabrics, showing different options such as 3-stage and 5-stage topologies.

      I also posted a white paper on how to build these networks using BGP.

      http://www.juniper.net/us/en/local/pdf/whitepapers/2000565-en.pdf

      Juniper also has an open source project called OpenClos which automates the creation of these networks.

      https://github.com/Juniper/OpenClos
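
As a quick numeric check of the k-ary n-fly figures from the second comment (a minimal sketch, using k = 48 and n = 3 as stated there):

```python
# Quick check of the figures from the second comment: the endpoint count grows
# roughly as k^n for a k-ary n-fly, with k the switch semi-radix and n the
# number of stages.
k = 48   # semi-radix (half of a 96-port switch)
n = 3    # stages

max_endpoints = k ** n   # 110,592 -- indeed larger than 100,000
print(f"Endpoints supported at full bisection bandwidth: {max_endpoints}")
```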