Full mesh is the worst possible fabric architecture

One of the answers you get from some of the vendors selling you data center fabrics is “you can use any topology you wish” and then they start to rattle off an impressive list of buzzword-bingo-winning terms like full mesh, hypercube and Clos fabric. While full mesh sounds like a great idea (after all, what could possibly go wrong if every switch can talk directly to any other switch), it’s actually the worst possible architecture (apart from the fully randomized Monkey Design).

Before reading the rest of this post, you might want to visit Derick Winkworth’s The Sad State of Data Center Networking to get in the proper mood.

Imagine you just bought a bunch (let’s say eight, to keep the math simple) of typical data center switches. Most of them use the same chipset today, so you probably get 48 x 10GE ports and 4 x 40GE ports that you can use as 16 x 10GE ports. Using the pretty standard 3:1 oversubscription, you dedicate 48 ports to server connections and 16 ports to intra-fabric connectivity … and you wire them in a full mesh (next diagram), giving you 20 Gbps of bandwidth between any two switches.

Guess what … unless you have a perfectly symmetrical traffic pattern, you’ve just wasted most of the intra-fabric bandwidth. For example, if you’re doing a lot of vMotion between servers attached to switches A and B, the maximum throughput you can get is 20 Gbps (even though you have 140 Gbps of uplink bandwidth on every single switch).
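
Here’s a quick back-of-the-envelope check of the full-mesh numbers – a minimal sketch, where only the port counts from the example above are assumed and the rest is plain arithmetic:

```python
# Full mesh of 8 switches, each with 16 x 10GE ports reserved for intra-fabric links
# (the port counts from the example above).
SWITCHES = 8
FABRIC_PORTS = 16
PORT_SPEED_GBPS = 10

neighbors = SWITCHES - 1                        # 7 other switches in a full mesh
ports_per_neighbor = FABRIC_PORTS // neighbors  # 2 x 10GE toward each neighbor

pair_bw = ports_per_neighbor * PORT_SPEED_GBPS  # 20 Gbps between any two switches
wired_uplink_bw = neighbors * pair_bw           # 140 Gbps of uplinks used per switch

print(f"Bandwidth between any two switches: {pair_bw} Gbps")
print(f"Uplink bandwidth wired per switch:  {wired_uplink_bw} Gbps")
# A single hot traffic pair (e.g. heavy vMotion between switches A and B) can use
# only 20 of those 140 Gbps with plain ECMP forwarding.
```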

Remember that all data center fabric technologies use Equal Cost Multipath because vendors think that MPLS Traffic Engineering and virtual circuits fry the brains of data center engineers.

Now let’s burn a bit more of our budget: buy two more switches, and wire all ten of them in a Clos fabric. You have just enough ports on the two “core” switches to connect 80 Gbps of uplink bandwidth from each “edge” switch to each of them (sometimes directly, sometimes splitting the edge switch’s 40GE ports into 4 x 10GE each and using a port channel across eight 10GE uplinks).

Having core switches with plenty of 40GE ports makes your life even simpler.

Have we gained anything (apart from some goodwill from our friendly vendor/system integrator)?

Actually, we have – using the Clos fabric you get a maximum of 160 Gbps between any pair of edge switches. Obviously the maximum amount of traffic any edge switch can send into the fabric or receive from it is still limited to 160 Gbps (unless you use unicorn-based OpenFlow, you can’t really violate the laws of physics), but we’re no longer limiting our ability to use all the available bandwidth.
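
The same back-of-the-envelope arithmetic for the 10-switch Clos fabric – again a minimal sketch, assuming only the port counts mentioned in the text:

```python
# Two "core" and eight "edge" switches, all with 48 x 10GE + 4 x 40GE ports
# (the same switch model as in the full-mesh example).
EDGE_SWITCHES = 8
CORE_SWITCHES = 2
CORE_CAPACITY_GBPS = 48 * 10 + 4 * 40             # 640 Gbps per core switch

edge_uplink_gbps = 4 * 40                          # 160 Gbps of uplinks per edge switch
per_core_gbps = edge_uplink_gbps / CORE_SWITCHES   # 80 Gbps from each edge to each core

core_gbps_needed = EDGE_SWITCHES * per_core_gbps   # 8 x 80 = 640 Gbps -- an exact fit
print(f"Core switch capacity needed: {core_gbps_needed:.0f} of {CORE_CAPACITY_GBPS} Gbps")

# With ECMP across both core switches, traffic between any pair of edge switches
# is limited only by the edge uplink capacity ...
clos_pair_bw = edge_uplink_gbps                    # 160 Gbps
mesh_pair_bw = 20                                  # from the full-mesh example above
print(f"Edge-to-edge bandwidth: {clos_pair_bw} Gbps (full mesh: {mesh_pair_bw} Gbps)")
```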

Now imagine you use slightly bigger core switches and attach uplinks directly to them. Shake the whole thing a bit, let the edge switches “fall to the floor” and you have the traditional 2-tier data center network design (or a 3-tier design if you use the Nexus 5000/FEX combo). Maybe we weren’t so stupid after all …

More information

The 10-switch example was taken straight from the Guide to Network Fabrics webinar sponsored by Juniper (you see, even though the title contains “Server Guy’s Guide …”, the webinar does contain some useful tidbits for networking geeks).

I’ll talk about Clos fabrics in the webinar on Thursday (see the previous paragraph), as well as during my RIPE 64 talks tomorrow and in the upcoming Clos Fabrics Explained webinar, where I just might mention a few alternatives (disclaimer: this is a forward-looking statement and does not represent a commitment on the part of the author).

Finally, some of the improvements vendors made in the last 6 months (described in the mid-May Data Center Fabric Architectures update session) do help you make better or more scalable data center designs (including the ability to build larger Clos fabrics).

10 comments:

  1. THANK YOU! You just put into words the idea that has been bouncing around in my mind for the past few months. Ever since I read about Brocade (Foundry) VCS, I’ve been struggling with the thought that “something isn’t right” with full mesh. As far as I can tell, Brocade is advocating full mesh because they have no large VCS-capable switches. Their answer seems to be to just mesh the ToR switches and argue that you don’t need a large switch for aggregation. Since I’ve been seeing that same argument in several places, I thought that I must either be dumber than I thought or just missing something very obvious!

  2. A very simple change to the forwarding model, known as Valiant Load Balancing (VLB), makes the full-mesh topology (aka k-ary 1-flat butterfly) fully load-balanced (see the sketch after the comments). VLB is fairly easy to implement on a proprietary fabric, as long as the underlying chipset supports encap/decap functionality (in fact I did that even in dynamips lol). There is a tradeoff, of course – the throughput is reduced by 50%, though this is still way better than minimal routing.

    It is also possible to utilize practically the full capacity of a fully meshed fabric by using adaptive routing methods, which are, however, harder to implement. For instance, you can run a full mesh of MPLS-TE tunnels ;) though this model has way too slow a response time to be used in a data center.

    Overall, adaptive routing was a big area of research for HPC systems in the 70s-90s, and a lot of interesting results could be borrowed from there. It’s just that modern DCs are built from commodity gear, which normally has pretty limited functionality.

  3. Jamie,
    In a proper Clos leaf/spine topology you can scale the fabric pretty large using existing 1RU and 2RU switches.

  4. First off, great write-up as usual.

    And I will in fairness say that there could be applications written in the future that take advantage of full-mesh designs. It also depends on your design goals – bandwidth vs latency vs consistency vs resiliency, etc. As Lapukhov stated above, there are ways to make full mesh work to one’s benefit today. That is just not how we do things in data centers today, and my guess is we won’t any time soon.

    But for the most part, ever since Charles Clos published his work in 1953 (http://en.wikipedia.org/wiki/Clos_network), communications networks have proven time and time again that a Clos-based architecture gives the best performance.

    This is illustrated (somewhat humorously) by the fact that if you build a full-mesh fabric today, most if not all of the switches you would be using to do so have ASICs inside them arranged in a Clos configuration.

    For most applications today, Clos ROCKS.

  5. Hmm... I’m not sure what you were told; all of the VCS reference designs that I know of are NOT full all-to-all mesh networks. They are usually 2-tier Clos networks.

    I personally do mention in some of my presentations that things like TRILL mean you can more safely look at new topologies like full mesh and others, and that we may see more of that in the future. But I know of no data center application today that would be helped by a full-mesh topology.

    If you can point me to where you saw this, I’m more than happy to look into it. It’s very possible that it’s simply a misuse of the phrase “full mesh”.

  6. I’m assuming that you’re saving the more interesting components of a Clos fabric for your webinar (sorry, I won’t be able to make it), but at least a drive-by reference to some of the merits of a fat-tree Clos and the enhancements in ASICs to facilitate respectable ECMP might put a little more balance on the topic. Jumping from full mesh to a 2-node spine seems to leave out some of the more interesting capabilities of the topology. ;)

  7. Ivan Pepelnjak 17 April, 2012 08:21

    Hi Steve!

    I wasn't actually holding back - I just wanted to make a quick comparison between two topologies. There will be more blog posts on Clos fabrics.

    Also, if you do have something on ASICs you can share, that would be most appreciated. It's getting harder and harder to get access to HW documentation without NDA (or similar restrictions).

  8. I honestly cannot point to a specific reference to full mesh, so that may have been an idea that started in my mind and got reinforced as I read several articles about full mesh (or even partial mesh) recently.

    Not to go too far off the topic of this blog, the main issue that jumped out at me was the limitation of only 24 switches in a VCS fabric. If 2+ switches are used as spine nodes, that would leave 22 or fewer for leaf nodes. I guess that led me to the thought that you would want to be able to use all available nodes as leaf nodes, and my mind constructed the idea of a full mesh to accomplish this.

    Given that their largest switch supporting VCS (that I know of) is the VDX6730-76 with 60 Ethernet ports, I would assume 4 spine nodes to allow full bandwidth within the fabric (2 spine nodes wouldn’t have enough ports for 4 uplinks each from all 22 leaf nodes). That leaves only 20 leaf nodes with 52 ports each as your maximum access port count, and even then it’s at 6.5:1 oversubscription.

  9. Ivan Pepelnjak 18 April, 2012 18:44

    Here's what you can do: 4 spine switches, 15 edge switches (all VDX 6720 with 60 10GE ports). Use 16 ports for uplinks on edge switches, 44 ports for server connectivity --> 660 10GE ports = 330 servers with dual 10GE uplinks. Very close to the VMware vDS limit. Accidental perfect sizing, I would say :D

  10. Try looking at the Plexxi solution. Plexxi does not care about equal costs, or shortest paths, and creates a full mesh in which all bandwidth is usable (2:1) using very minimal cabling. Very interesting technology
    http://www.plexxi.com/

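The Valiant Load Balancing trick mentioned in comment #2 deserves a quick illustration: every flow is first forwarded to a randomly chosen intermediate switch and only then to its real egress switch, which spreads any traffic matrix across all mesh links at the cost of (at most) halving the usable throughput. Here is a minimal sketch of the two-phase path selection – the switch names and the per-flow randomization are illustrative assumptions, not anyone’s actual implementation:

```python
import random

# Two-phase (Valiant) load balancing over a full mesh of eight switches:
# phase 1 bounces the flow off a randomly chosen intermediate switch,
# phase 2 delivers it to the real egress switch. Names are illustrative only.
SWITCHES = ["A", "B", "C", "D", "E", "F", "G", "H"]

def vlb_path(ingress, egress, flow_id):
    """Return the switch-level path a flow takes under VLB."""
    rng = random.Random(flow_id)          # per-flow choice keeps packets in order
    intermediate = rng.choice(SWITCHES)   # if it picks ingress/egress, skip the detour
    path = [ingress]
    if intermediate not in (ingress, egress):
        path.append(intermediate)         # phase 1: detour via the random hop
    path.append(egress)                   # phase 2: deliver to the egress switch
    return path

# Heavy A-to-B traffic no longer squeezes through the single pair of A-B links;
# different flows are spread across (almost) all switches and mesh links.
for flow in range(5):
    print(flow, vlb_path("A", "B", flow))
```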

