Monkey Design Still Doesn’t Work Well

We’ve seen several interesting data center fabric solutions during the Networking Tech Field Day presentations, each time hearing how the new fabric technologies (actually, the shortest path bridging part of those technologies) allow us to shed the yoke of the Spanning Tree monster (see Understanding Switch Fabrics by Brandon Carroll for more details). Not surprisingly, we wanted to know more and asked the obvious question: “And how would you connect the switches within the fabric?”

The vendors were quick to assure us that “we can use any topology we want.” We also heard buzzwords like hypercube, Clos, daisy chain and ring, and promises like “you just plug it in ... it just works!” What they usually forgot to mention was that removing the rigid requirements of the Spanning Tree Protocol doesn’t magically remove the need for proper network design.

Brandon has graciously allowed me to use a picture from his blog post to illustrate the problem. Imagine you build the network shown in the following diagram. Because you’re using a fabric technology (be it TRILL, SPB, FabricPath or something else), no ports are blocked and you should be able to use all the bandwidth in the network ... but that simply won’t happen.

You see, the shortest path bridging technologies behave almost exactly like routing and, as their name indicates, give you shortest path bridging. All the traffic between A and E will still go over the B-C link because that’s the shortest path. The path A-B-D-C-E is longer and won’t be used.

The B-D and C-D links would be used if other devices were attached to D, but I hope you get my point: when it comes to using all the links in the network, shortest path bridging technologies are no better than routing.
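To make the argument concrete, here is a minimal Python sketch of minimum-hop forwarding on that network. The link list is inferred from the text (A hangs off B, E hangs off C, and B, C and D form a triangle), all links are assumed to have equal cost, and host F is a hypothetical device added only to show what happens once something is attached to D. It is an illustration of shortest-path behavior in general, not of how any particular fabric computes its paths.

```python
from collections import deque

# Topology inferred from the text: A attaches to B, E attaches to C,
# and switches B, C and D are wired as a triangle.
LINKS = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"), ("C", "E")]


def build_adjacency(links):
    """Turn a list of undirected links into an adjacency map."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_paths(adj, src, dst):
    """Return every minimum-hop path from src to dst (BFS, equal link costs)."""
    dist = {src: 0}
    preds = {src: []}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                preds[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                preds[nbr].append(node)  # another equal-cost predecessor
    if dst not in dist:
        return []

    def walk(node):
        # Rebuild all paths by following predecessors back to src.
        if node == src:
            return [[src]]
        return [path + [node] for pred in preds[node] for path in walk(pred)]

    return walk(dst)


def links_used(paths):
    """Collect the links (as sorted tuples) touched by the given paths."""
    used = set()
    for path in paths:
        for u, v in zip(path, path[1:]):
            used.add(tuple(sorted((u, v))))
    return used


adj = build_adjacency(LINKS)
paths = shortest_paths(adj, "A", "E")
idle = {tuple(sorted(link)) for link in LINKS} - links_used(paths)
print(paths)         # [['A', 'B', 'C', 'E']]
print(sorted(idle))  # [('B', 'D'), ('C', 'D')]

# Hypothetical host F attached to D: now the links toward D carry traffic.
adj_with_f = build_adjacency(LINKS + [("D", "F")])
print(shortest_paths(adj_with_f, "A", "F"))  # [['A', 'B', 'D', 'F']]
```

With only A and E attached, the two links toward D sit idle no matter how much traffic A and E exchange.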


Just because the shortest path bridging technologies provide routing-like behavior at the MAC layer doesn’t mean that you can wire your network haphazardly. Fortunately, you can fall back to the age-old rules of properly designed routed networks ... and guess what: they usually prescribe a hierarchical structure with edge, aggregation and core. Maybe the shiny new world isn’t so different from the old one after all.
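A rough way to see why structure helps: a link can carry traffic between two endpoints only if it lies on some shortest path between them. The Python sketch below checks that for a hypothetical two-tier design (two edge switches, each dual-homed to two core switches; the names `edge1`, `core1` and so on are made up for illustration). It assumes equal link costs, a connected topology, and a fabric that load-balances across equal-cost paths.

```python
from collections import deque


def hop_counts(adj, src):
    """Minimum hop count from src to every reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def links_on_shortest_paths(links, src, dst):
    """Return the links that lie on at least one shortest src-dst path.

    Assumes the topology is connected and all links have equal cost.
    """
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    from_src = hop_counts(adj, src)
    from_dst = hop_counts(adj, dst)
    total = from_src[dst]
    used = []
    for u, v in links:
        # The link is usable if crossing it (in either direction)
        # keeps the end-to-end hop count at the minimum.
        if (from_src[u] + 1 + from_dst[v] == total
                or from_src[v] + 1 + from_dst[u] == total):
            used.append((u, v))
    return used


# Hypothetical two-tier design: every edge switch connects to every core switch.
TWO_TIER = [
    ("edge1", "core1"), ("edge1", "core2"),
    ("edge2", "core1"), ("edge2", "core2"),
]
used = links_on_shortest_paths(TWO_TIER, "edge1", "edge2")
print(f"{len(used)} of {len(TWO_TIER)} links can carry edge1-edge2 traffic")
```

In this layout both paths between the edge switches have the same length, so every uplink is a candidate for traffic, which is exactly what the haphazardly wired network fails to deliver.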


  1. But it doesn't make sense to use all of the bandwidth in this design... maybe if you link D to E... no?
  2. We could try to improve this design, but that was not the point of the post. What really matters is that you should consider all the implications of your design, not just wire the switches willy-nilly and hope for the best.
  3. But the "traditional" two- or three-layer design (access-layer switches connected to two distribution-layer switches connected to X core switches, just like in the Cisco SWITCH books I'm reading now :) ) will definitely benefit from the absence of STP, too. The question is whether you need new devices to replace the existing ones, when your next hardware upgrade cycle will be, and, if these technologies are currently sold as big boys' datacenter toys, how long it will take until they appear in mid-range and smaller-market gear.

    Or, obviously, you can stick to routing to the access layer, but then you won't have big layer-2 domains...
  4. You're on a good path - you need some __structure__ in the DC design. It obviously should:

    * make sense
    * be as simple as possible
    * match your needs/goals (example: equal-cost load balancing or optimum bandwidth utilization or equidistant endpoints or something else).

    Unfortunately, this simple fact is oft ignored.
  5. At least with routing it is well known how to modify metrics to steer traffic down certain paths, and maybe you do in fact want unequal-cost paths. With L2, it is truly ECMP with shortest "hops" winning, right? And is it possible to modify metrics, or do you need to get into IS-IS for that?

    Are we at the RIP stages of L2MP?
  6. BTW, I was partially being sarcastic. And as for IS-IS, it's not possible to manipulate, and if it were, would that ruin the simplicity of FP being plug and play!!! =-O
  7. I do not think anybody clueful was expecting something besides shortest-path routing from TRILL. And not shortest path in terms of hops, either, but rather shortest path using interface bandwidth as a metric (which I believe TRILL does).

    That said though, what is the topology of choice? Obviously, a Clos network will have the minimal number of hops but the highest cost in terms of inter-switch links. 2/3/4-dimensional tori or hypercubes may also work well (you trade having fewer long links between switches for more hops, but still get massive bisection bandwidth at lower cost). Both of these topologies are widely used in the HPC space.

    What won't work with TRILL right now are fancy-pants topologies like Dragonfly, Flattened Butterfly, etc. that require adaptive or non-minimal routing. But those don't work with any other mainstream routing protocol either. Maybe someday.

    It seems as though fabric vendors are just assuming a 2/3 stage Clos network is going to be the design of choice, cost be damned.

    One question though: What the hell is the difference between a two-stage Clos network and a "leaf and spine" network? Is it just marketing language?
  8. You had me at flattened butterfly.
  9. In fact, there is one commonly used adaptive "routing" algorithm known as MPLS-TE ;) The problem is that it's not adaptive enough to work with a rapidly changing traffic matrix in a data center...

    Also, not sure if you can just go with shortest-path routing on a hypercube/torus either - you still need some sort of adaptive routing variant, or just plain old VLB (sacrificing half the throughput).

    As for the cost factor, it does matter in smaller deployments - large-scale data centers are not really sensitive to network cost, as it's about an order of magnitude lower than server costs, hence the preference for Clos.

    The "irregular" topologies are still interesting for small-to-med networks, though the requirement of adaptive routing makes it tough to implement in commodity hardware/software, and calls for "naturally load-balanced" Clos once again...
  10. Y'all are so behind; we say HyperX now. But you need linear programming to design it. ;)
  11. change abounds ... how to validate with real line rate traffic ? @bwolmarans
    1. Obviously using Spirent gear ;) See today's blog post (coming in a few hours).
  12. Essentially this is correct, but the other link does get used when pinning technology is in use, with either UCS or Nexus in the picture.