One of the answers you get from some of the vendors selling you data center fabrics is “you can use any topology you wish” and then they start to rattle off an impressive list of buzzword-bingo-winning terms like full mesh, hypercube and Clos fabric. While full mesh sounds like a great idea (after all, what could possibly go wrong if every switch can talk directly to any other switch), it’s actually the worst possible architecture (apart from the fully randomized Monkey Design).
Before reading the rest of this post, you might want to visit Derick Winkworth’s The Sad State of Data Center Networking to get in the proper mood.
Imagine you just bought a bunch (let’s say eight, to keep the math simple) of typical data center switches. Most of them use the same chipset today, so you probably got 48 x 10GE ports and 4 x 40GE ports that you can use as 16 x 10GE ports. Using the pretty standard 1:3 oversubscription, you dedicate 48 ports for server connections and 16 ports for intra-fabric connectivity … and you wire them in a full mesh (next diagram), giving you 20 Gbps of bandwidth between any two switches.
Guess what … unless you have a perfectly symmetrical traffic pattern, you’ve just wasted most of the intra-fabric bandwidth. For example, if you’re doing a lot of vMotion between servers attached to switches A and B, the maximum throughput you can get is 20 Gbps (even though you have 140 Gbps of uplink bandwidth on every single switch).
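The port arithmetic behind the full-mesh numbers above can be sketched in a few lines (a back-of-the-envelope calculation using the port counts from this post, nothing vendor-specific):

```python
# Full mesh of 8 switches, each with 48 x 10GE server ports and
# 16 x 10GE uplink ports (the 4 x 40GE ports split into 4 x 10GE).
switches = 8
server_ports, uplink_ports, port_speed = 48, 16, 10  # port_speed in Gbps

# 480 Gbps of server-facing bandwidth over 160 Gbps of uplinks = 3:1
oversubscription = (server_ports * port_speed) / (uplink_ports * port_speed)
print(oversubscription)  # 3.0

# In a full mesh every switch peers with the other 7; 16 uplink ports
# spread across 7 neighbors leaves at most 2 ports per neighbor.
neighbors = switches - 1
ports_per_neighbor = uplink_ports // neighbors
pair_bandwidth = ports_per_neighbor * port_speed
print(pair_bandwidth)  # 20 Gbps between any two switches (2 ports left idle)

# Total mesh bandwidth installed on each switch -- almost all of it
# unusable for a single hot pair like the vMotion example.
mesh_bandwidth = neighbors * pair_bandwidth
print(mesh_bandwidth)  # 140 Gbps
```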
Remember that all data center fabric technologies use Equal Cost Multipath because vendors think that MPLS Traffic Engineering and virtual circuits fry the brains of data center engineers.
Now let’s burn a bit more of our budget: buy two more switches, and wire all ten of them in a Clos fabric. You have just enough ports on the two “core” switches to connect 80 Gbps of uplink bandwidth from each “edge” switch to each of them (sometimes directly, sometimes splitting a 40GE port on the edge switch into 4 x 10GE ports and using a port channel across eight 10GE uplinks).
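The Clos wiring works out exactly, as a quick sketch shows (assuming the two core switches use the same chipset as the edge switches, i.e. 64 usable 10GE ports each):

```python
# Two "core" switches, each with 48 x 10GE + 4 x 40GE split into
# 4 x 10GE = 64 usable 10GE ports.
edge_switches, core_switches = 8, 2
core_ports = 48 + 4 * 4

# Each edge switch splits its 16 x 10GE uplinks evenly across the cores.
uplinks_per_edge = 16
ports_per_core = uplinks_per_edge // core_switches
print(ports_per_core * 10)  # 80 Gbps from every edge switch to every core

# Do the cores have enough ports? 8 edge switches x 8 ports = 64,
# which fills each core switch exactly.
print(edge_switches * ports_per_core == core_ports)  # True
```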
Having core switches with plenty of 40GE ports makes your life even simpler.
Have we gained anything (apart from some goodwill from our friendly vendor/system integrator)?
Actually, we did – with the Clos fabric you get a maximum of 160 Gbps between any pair of edge switches. Obviously the maximum amount of traffic any edge switch can send into the fabric or receive from it is still limited to 160 Gbps (unless you use unicorn-based OpenFlow you can’t really violate the laws of physics), but we’re no longer limiting our ability to use all available bandwidth.
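The edge-to-edge improvement is easy to quantify (again just the arithmetic from this post, assuming ECMP spreads flows across both core switches):

```python
# In the Clos fabric a single pair of edge switches can use all 16
# uplinks on each side, because ECMP spreads flows across both cores.
uplink_bw = 16 * 10                      # Gbps of uplinks per edge switch
clos_pair_max = min(uplink_bw, uplink_bw)
print(clos_pair_max)                     # 160 Gbps between any edge pair

# Compare with the full mesh, where the same pair was stuck with the
# 2 x 10GE direct link between them.
mesh_pair_max = 20
print(clos_pair_max // mesh_pair_max)    # 8x more pair bandwidth
```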
Now imagine you use slightly bigger core switches and attach uplinks directly to them. Shake the whole thing a bit, let the edge switches “fall to the floor” and you have the traditional 2-tier data center network design (or a 3-tier design if you use the Nexus 5000/FEX combo). Maybe we weren’t so stupid after all …
The 10-switch example was taken straight from the Guide to Network Fabrics webinar sponsored by Juniper (you see, even though the title contains “Server Guy’s Guide …”, the webinar does contain some useful tidbits for networking geeks).
I’ll talk about Clos fabrics in the webinar on Thursday (see previous paragraph), as well as during my RIPE 64 talks tomorrow, and the upcoming Clos Fabrics Explained webinar, where I just might mention a few alternatives (disclaimer: this is a forward-looking statement and does not represent a commitment on the part of the author).
Finally, some of the improvements vendors made in the last 6 months (described in the mid-May Data Center Fabric Architectures update session) do help you make better or more scalable data center designs (including the ability to build larger Clos fabrics).