Facebook Next-Generation Fabric
Facebook published their next-generation data center architecture a few weeks ago, resulting in the expected “revolutionary approach to data center fabrics” echoes from the industry press and blogosphere.
In reality, they did a great engineering job with an interesting twist on a pretty traditional multi-stage leaf-and-spine (or folded Clos) architecture.
They split the data center into standard pods. No surprise there: anyone aiming for an easy-to-manage scale-out architecture (admittedly, not that many people) is doing the same thing – we discussed it in Episode 8 of Software Gone Wild, and I described it in one of the data center design case studies. The second part of this video should give you a few additional ideas along the same lines.
Inside each pod they use a leaf-and-spine architecture almost identical to what Brad Hedlund described in my Leaf-and-Spine Fabric Architectures webinar, including the now-standard 3:1 oversubscription on the leaf switches (48 server-facing ports and four 40GE uplinks).
Note that every fabric switch needs 48 leaf-facing 40GE ports. Adding the necessary pod-to-spine uplinks, they need 96-port 40GE switches to implement this design. I wouldn't be too surprised to see Arista launch a switch meeting these specs at the next Interop ;)
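The port math above is easy to verify with a back-of-the-envelope calculation; a minimal sketch, assuming 10GE server-facing ports (my assumption – the server port speed isn't spelled out above):

```python
# Leaf switch math. Assumption (not stated above): server ports are 10GE.
server_ports = 48
server_speed_gbps = 10
uplinks = 4
uplink_speed_gbps = 40

downstream_gbps = server_ports * server_speed_gbps   # 480 Gbps toward servers
upstream_gbps = uplinks * uplink_speed_gbps          # 160 Gbps toward fabric switches
print(downstream_gbps / upstream_gbps)               # 3.0 -> the 3:1 oversubscription

# Pod fabric switch math: one 40GE port per leaf, plus an equal number of
# pod-to-spine uplinks to keep the fabric layer non-oversubscribed.
leaves_per_pod = 48
fabric_switch_ports = leaves_per_pod + leaves_per_pod
print(fabric_switch_ports)                           # 96 -> a 96-port 40GE switch
```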
The interesting twist is the inter-pod connectivity. Instead of building a single non-oversubscribed core fabric and connecting the leaf nodes to it (the traditional way of building multi-stage leaf-and-spine fabrics), they treat each pod fabric switch as a leaf node in another, orthogonal leaf-and-spine fabric, for a total of four core fabric planes. The result is a data center fabric that can potentially support over 100,000 server ports; the limiting factor is the number of ports on the spine switches.
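Here's a rough sketch of how the spine port count caps the fabric size. The per-pod numbers are my assumptions extrapolated from the pod design described above (48 leaves per pod, 48 server ports per leaf), not Facebook's published figures:

```python
# Each spine switch in a plane connects to one fabric switch in every pod,
# so the spine switch port count caps the number of pods the fabric can hold.
def max_server_ports(spine_ports, leaves_per_pod=48, server_ports_per_leaf=48):
    max_pods = spine_ports  # one spine-switch port per pod
    return max_pods * leaves_per_pod * server_ports_per_leaf

# Even modest 48-port spine switches get you past the 100,000 mark:
print(max_server_ports(48))   # 110592
```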
- Jason Edelman created a nice 2D diagram that makes the multiple layers of leaf-and-spine fabrics more evident;
- Gary Berger wrote a long blog post analyzing the new Facebook fabric including a deep-dive into the port count limitations;
- You’ll find a bit more down-to-earth designs in my Leaf-and-Spine Fabric Architectures and Designing Private Cloud Infrastructure webinars, and I’m usually available for short consulting engagements.
This type of topology has been used before in HPC systems; for example, the BBN Butterfly already employed a multistage interconnection network in the early 1980s (using 4x4 switches and Motorola 68000 processors!). In these multi-stage interconnection networks (MINs), the maximum number of endpoints with full-bisection bandwidth (i.e. statistically non-blocking) grows exponentially with the number of stages – roughly k^n, which exceeds 100,000 for k=48 and n=3. Given the size of today's switches, such arrangements are only required for very large systems from very large companies, such as this Facebook example.
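A quick sanity check of the k^n claim, using the approximation from the comment above (treating k as the switch radix and n as the number of stages):

```python
# Approximate endpoint count of a full-bisection MIN built from
# k-port switches with n stages, per the k**n rule of thumb above.
def min_endpoints(k, n):
    return k ** n

print(min_endpoints(48, 3))   # 110592 -> larger than 100,000, as claimed
print(min_endpoints(4, 3))    # 64 -> three stages of 4x4 Butterfly-style switches
```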
There is plenty of academic literature on Clos networks; the book by Bill Dally (Chief Scientist at Nvidia) covers MINs and has a section on multi-level Clos and Butterfly networks.
Doug Hanks from Juniper did a NANOG presentation, and there is a good section in the new QFX5100 book about doing something similar, except using Juniper's VCF to make one of the tiers look like a single switch. Of course, with VCF you are limited by what its protocols and management can do, whereas this design is limited only by the number of available ports.
I also posted a white paper on how to build these networks using BGP.
Juniper also has an open-source project called OpenClos that automates the creation of these networks.