One of the answers you get from some of the vendors selling you data center fabrics is “you can use any topology you wish” and then they start to rattle off an impressive list of buzzword-bingo-winning terms like full mesh, hypercube and Clos fabric. While full mesh sounds like a great idea (after all, what could possibly go wrong if every switch can talk directly to any other switch), it’s actually the worst possible architecture (apart from the fully randomized Monkey Design).
Imagine you just bought a bunch (let’s say eight, to keep the math simple) of typical data center switches. Most of them use the same chipset today, so you probably got 48x10GE ports and 4x40GE ports that you can split into 16x10GE ports. Using the pretty standard 3:1 oversubscription, you dedicate 48 ports for server connections and 16 ports for intra-fabric connectivity … and you wire them in a full mesh (next diagram). With seven other switches to reach, that works out to two 10GE links per neighbor (and two ports to spare), giving you 20 Gbps of bandwidth between any two switches.
Guess what … unless you have a perfectly symmetrical traffic pattern, you’ve just wasted most of the intra-fabric bandwidth. For example, if you’re doing a lot of vMotion between servers attached to switches A and B, the maximum throughput you can get is 20 Gbps (even though you have 160 Gbps of uplink bandwidth on every single switch).
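If you want to check the arithmetic yourself, here is a minimal sketch of the full-mesh numbers. It assumes traffic between two switches uses only the direct links between them (no detours through a third switch), and the function and variable names are made up for this illustration.

```python
# Full-mesh bandwidth arithmetic for the eight-switch example.
# Assumption: traffic between two switches uses only their direct links.

def full_mesh_pair_gbps(switches: int, fabric_ports: int, port_gbps: int) -> int:
    """Bandwidth between any two switches in a full mesh."""
    peers = switches - 1
    links_per_peer = fabric_ports // peers  # leftover ports stay unused
    return links_per_peer * port_gbps

SWITCHES = 8
SERVER_PORTS = 48
FABRIC_PORTS = 16  # 4x40GE split into 16x10GE
PORT_GBPS = 10

pair_gbps = full_mesh_pair_gbps(SWITCHES, FABRIC_PORTS, PORT_GBPS)
uplink_gbps = FABRIC_PORTS * PORT_GBPS
oversubscription = (SERVER_PORTS * PORT_GBPS) / uplink_gbps
usable_share = pair_gbps / uplink_gbps

print(f"Between any two switches: {pair_gbps} Gbps")
print(f"Total uplink per switch:  {uplink_gbps} Gbps")
print(f"Oversubscription:         {oversubscription:.0f}:1")
print(f"Usable for A-to-B traffic: {usable_share:.1%} of the uplinks")
```

In other words, a traffic flow between switches A and B can use only one eighth of the uplink bandwidth you paid for.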
Now let’s burn a bit more budget: buy two more switches with 40GE ports, and wire all ten of them in a Clos fabric. You’ll get 80 Gbps of uplink bandwidth from each edge switch to each core switch.
Have we gained anything (apart from some goodwill from our friendly vendor/system integrator)?
Actually, we have: using the Clos fabric you get a maximum of 160 Gbps between any pair of edge switches. Obviously the maximum amount of traffic any edge switch can send into the fabric or receive from it is still limited to 160 Gbps, but we’re no longer limiting our ability to use all available bandwidth.
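The same back-of-the-envelope math for the Clos fabric looks like this. It is a rough sketch that assumes the four 40GE uplinks of every edge switch are spread evenly across the two core switches and that traffic is load-balanced over all of them; the names are invented for the example.

```python
# Clos (leaf-and-spine) bandwidth arithmetic for the ten-switch example.
# Assumptions: uplinks are spread evenly across the core switches and
# traffic is balanced across all of them.

def edge_to_core_gbps(uplinks: int, cores: int, uplink_gbps: int) -> int:
    """Bandwidth from one edge switch to one core switch."""
    return (uplinks // cores) * uplink_gbps

def edge_to_edge_gbps(uplinks: int, uplink_gbps: int) -> int:
    """Maximum bandwidth between two edge switches.

    Every core switch reaches every edge switch, so a pair of edge
    switches can use all of their uplinks at the same time.
    """
    return uplinks * uplink_gbps

UPLINKS = 4        # 4x40GE ports per edge switch
CORES = 2
UPLINK_GBPS = 40
FULL_MESH_PAIR_GBPS = 20  # per-pair figure from the full-mesh design

per_core = edge_to_core_gbps(UPLINKS, CORES, UPLINK_GBPS)
per_pair = edge_to_edge_gbps(UPLINKS, UPLINK_GBPS)

print(f"Edge to each core switch: {per_core} Gbps")
print(f"Between two edge switches: {per_pair} Gbps")
print(f"Improvement over full mesh: {per_pair // FULL_MESH_PAIR_GBPS}x")
```

The total uplink capacity per edge switch hasn't changed; what changed is that any pair of edge switches can now use all of it.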
Now shake the whole thing a bit, let the edge switches “fall to the floor” and you have the traditional 2-tier data center network design. Maybe we weren’t so stupid after all …
- Mentioned 25GE/100GE ports and cleaned up the wording.