Stackable Data Center Switches? Do the Math!

Imagine you have a typical 2-tier data center network (because 3-tier is so last millennium): layer-2 top-of-rack switches redundantly connected to a pair of core switches running MLAG (to get around spanning tree limitations) and IP forwarding between VLANs.

Next thing you know, a rep from your favorite vendor comes along and says: “Did you know you could connect all ToR switches into a virtual fabric and manage them as a single entity?” Is that a good idea?


Typical layer-2 leaf-and-spine design

Did you know your network design is actually a pretty popular leaf-and-spine architecture, one of the many variants of Clos fabrics? Learn more about them in my Clos Fabrics Explained webinar.

Assuming you have typical 64-port 10GE ToR switches and use pretty safe 3:1 oversubscription ratios, you have 48 10GE server-facing ports and 16 10GE (or 4 40GE) uplinks. If the core switch is non-blocking (please kick yourself if you bought an oversubscribed core switch in the last year or two), every server has equidistant bandwidth to every other server (servers connected to the same ToR switch are an obvious exception).

Unless you experience some nasty load balancing issues where numerous elephant flows hash onto the same physical link, every server gets ~3.33 Gbps of bandwidth toward any other server (side note: you might experience more load balancing problems with NVGRE than with VXLAN. The proof is left as an exercise for the reader).
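To make the arithmetic explicit, here’s the same calculation as a few lines of Python (using the port counts quoted above):

```python
# Back-of-the-envelope math for the ToR switch described above
# (figures from the text: 64-port 10GE switch, 48 server ports, 16 uplinks).
server_ports = 48
uplink_ports = 16
port_speed_gbps = 10

server_bw = server_ports * port_speed_gbps   # 480 Gbps toward the servers
uplink_bw = uplink_ports * port_speed_gbps   # 160 Gbps toward the spine

oversubscription = server_bw / uplink_bw     # 3.0 -> the 3:1 ratio
per_server_gbps = uplink_bw / server_ports   # ~3.33 Gbps per server

print(f"{oversubscription:.0f}:1 oversubscription, "
      f"~{per_server_gbps:.2f} Gbps per server")
```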

The problem of elephant flows hashing onto the same link in a bundle or the same path across the network can be solved with multipath TCP or by buying Brocade VDX switches.
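Here’s a toy sketch of why hash-based load balancing misbehaves with elephant flows. It uses a simple CRC over the flow tuple (not any vendor’s actual hashing algorithm) to map eight flows onto a four-link bundle; with so few flows, an even spread is far from guaranteed:

```python
import zlib

# Toy per-flow hashing: 8 flows spread over a 4-link bundle using a CRC
# of the flow tuple. Every packet of a flow always lands on the same link.
links = 4
flows = [(f"10.0.0.{i}", f"10.0.1.{i % 3}", 49152 + i, 80) for i in range(8)]

link_load = [0] * links
for src, dst, sport, dport in flows:
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    link_load[zlib.crc32(key) % links] += 1   # static per-flow hash -> one link

print("flows per link:", link_load)
```

If each of those flows were an elephant flow, any link carrying more than its fair share would congest while other links sat half-idle.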

Now let’s connect the top-of-rack switches into a stack. Some vendors (example: Juniper) use special cables to connect them; others (HP, but also Juniper) connect them via regular 10GE links. In most scenarios I’ve seen so far you connect the switches in a ring or a daisy chain.


Leaf-and-spine design with stackable ToR switches

You might be able to connect EX4500s in a leaf-and-spine Clos fabric and configure them as a virtual chassis, but I got mixed responses to this idea, ranging from “this is unsupported” to “I can’t see why it wouldn’t work.”

What happens next depends on the traffic profile in your data center. If most of your server traffic is northbound (example: VDI, simple web hosting), you probably won’t notice a difference, but if most of your traffic is going between servers (east-west traffic), the stacking penalty will be huge.

The moment you merge the ToR switches into a stack (regardless of whether it’s called Virtual Chassis or Intelligent Resilient Framework) the traffic between servers connected to the same stack stays within the stack, and the only paths it can use are the links connecting the ToR switches.

Remember that we’re usually talking about a ring/daisy chain here – a packet might have to traverse multiple hops across the ring to get to the destination ToR switch.

Now do the math. Let’s assume we have four HP 5900 ToR switches in a stack, each with 48 server-facing 10GE ports and four 40GE uplinks. The total uplink bandwidth is 640 Gbps (160 Gbps per switch times four ToR switches). If half of your traffic is server-to-server traffic (it could be more, particularly if you use Ethernet for storage or backup connectivity), the total amount of server-to-server bandwidth in your network is 320 Gbps (which means you can push 640 Gbps between the servers using marketing math).
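The same numbers in Python (HP 5900 figures as quoted above; the 50% east-west share is the assumption from the text):

```python
# Aggregate bandwidth for a stack of four ToR switches, each with
# four 40GE uplinks (the HP 5900 example from the text).
switches = 4
uplink_gbps_per_switch = 4 * 40                           # 160 Gbps per switch

total_uplink_gbps = switches * uplink_gbps_per_switch     # 640 Gbps
east_west_share = 0.5                                     # assumed traffic split
east_west_gbps = total_uplink_gbps * east_west_share      # 320 Gbps

print(f"total uplink: {total_uplink_gbps} Gbps, "
      f"server-to-server demand: {east_west_gbps:.0f} Gbps")
```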

Figuring out the available bandwidth between ToR switches is a bit trickier. Juniper’s stacking cables work at 128 Gbps. HP’s 5900s can use up to four physical ports in an IRF link; that would be 40 Gbps per IRF link if you use 10GE ports, or 80 Gbps if you use all four 40GE ports on an HP 5900 for two IRF links.

I’m not picking on HP in particular; they just happen to have a handy switch model. Doing the math with Juniper’s EX4500, which has 48 10GE ports, is more cumbersome.

You might get lucky with traffic distribution and utilize multiple segments in the ring/chain simultaneously, but regardless of how lucky you are, you’ll never get close to the bandwidth you had before you stacked the switches together, unless you started with a large oversubscription ratio.
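A quick sanity check on “how lucky can you get”: traffic crossing between the two halves of a four-switch ring can use at most two ring segments (the bisection), no matter how well the flows are spread. Using the 80 Gbps IRF-link figure from the HP example above:

```python
# Best-case east-west capacity of a 4-switch ring: any cut through the ring
# crosses exactly two segments, so the bisection bandwidth caps the traffic
# between the two halves regardless of how evenly the flows are distributed.
segment_gbps = 80                    # two 40GE ports per IRF link (from the text)
bisection_gbps = 2 * segment_gbps    # 160 Gbps across any cut

east_west_demand_gbps = 320          # server-to-server demand computed earlier
print(f"bisection: {bisection_gbps} Gbps; "
      f"demand is {east_west_demand_gbps / bisection_gbps:.0f}x the capacity")
```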

Furthermore, by using 10GE or 40GE ports to connect the ToR switches in a ring or daisy chain, you’ve split the available uplink ports into two groups: inter-server ports (within the stack) and northbound ports (uplinks to the core switch). In a traditional leaf-and-spine architecture you can fully utilize all the uplinks regardless of the traffic profile; the utilization of links in a stacked ToR switch design depends heavily on the east-west versus northbound traffic ratio (the pathological case being known as Monkey Design).

Conclusion: daisy-chained stackable switches with 100+ Gbps stacking cables were probably a great idea in the 1GE world; be careful when using switch stacks in the 10GE world. You might have to look elsewhere if you want to reduce the management overhead of your ToR switches.

More information

Leaf-and-spine architecture is just the simplest example of the Clos architecture. You’ll find fabric design guidelines (including numerous L2 and L3 designs) in the Clos Fabrics Explained webinar.

You’ll find the port density and fabric behavior of almost all data center switches available from nine major vendors (yes, I actually read all those data sheets) in the Data Center Fabrics webinar.

Both webinars are available as part of the yearly subscription, and you can always ask me for a second opinion or a design review.

12 comments:

  1. Juniper's cables are actually 32 Gbps x two cables; add "Cisco math" and there's the 128 Gbps.

    ReplyDelete
  2. Hello,

    I am going to disagree, since I believe there are scenarios where you will benefit from stacking at the ToR in HP's case.

    Suppose all your servers have 4x 10G ports and you bundle them into an LACP NIC team. You connect those ports to four 5900s in an IRF. HP allows you to change the LAG hashing algorithm to "local first" – that means that if there is a connection local to the switch, that one is preferred and used. When one server talks to another, the server will hash the flow and, let's say, it lands on the first 5900. This 5900 will prefer the local connection to the second server, since there is a direct link to it. With this, the stacking link is not going to be used for your inter-server traffic if all servers have active connections to all nodes of your ToR stack.

    In this case, inside your 5900 IRF pod, any server is always just one switch away from any other.

    Uplinks to the core in such a case need to be on every 5900 – agreed on that.

    Tomas

    ReplyDelete
    Replies
    1. We're actually in perfect agreement - as long as you have port channels spanning all switches in IRF/VC/VCS fabric, no traffic goes over stacking links (the setup is identical to Cisco's VSS/vPC, only with a higher number of switches).

      However, my scenario was a bit different - I have a running network (thus no server-side port channel) and stack the switches.

      Will write a follow-up blog post ;)
      Ivan

      Delete
  3. "the management overhead of your ToR switches"

    When I recently heard a Juniper presentation, I couldn't help but think: is management overhead really that much of a concern? It's machines that do the management (config archiving, etc.), not humans, so I'd think there is little cost associated once the management system has been bought, and therefore little savings.
    To me, Virtual Chassis seems like a solution looking for a problem.

    ReplyDelete
  4. Very interesting post, thanks a lot. Just my own comment which is meant to be more humorous than anything --- who the heck has 16 uplinks?

    ReplyDelete
  5. HP's IRF merges not only the management plane but also the control plane, and HP marketing says it is a "superior" design. Another HP marketing snafu: they claimed Cisco's Nexus 7K VDCs decrease fabric performance, and now they are rolling out their own "VDC" in the 12500 line with a 1 GHz control-plane CPU.

    ReplyDelete
  6. Hey, when did 'elephant' in reference to packet flows stop referring to high latency, high bandwidth operations (Long Fat Network -> LFN -> elephant -- see RFC 1072), and start being a reference to any big TCP flows as claimed by wikipedia?
    http://en.wikipedia.org/wiki/Elephant_flow

    I think the wikipedia entry is bogus: "It is not clear who coined 'elephant flow', but the term began occurring in published Internet network research in 2001..."

    RFC 1072 dates back to 1988

    ReplyDelete
  7. With a leaf-spine 2-tier architecture, isn't multipath a no-brainer -- spray packets randomly? This is what switches do internally. It avoids persistent congestion at intermediate hops and reduces packet reordering (so TCP won't kill itself).

    ReplyDelete
    Replies
    1. The packets cannot be sprayed randomly; you have to preserve the order of packets within a single flow.

      Delete
    2. Yes, wouldn't TCP preserve the order of packets within a single flow? I was wondering if reordering packets at the edge just above TCP would actually lead to significant performance improvements over conventional ECMP.

      Delete
    3. TCP would deliver the data stream in proper order to the application, the question is whether out-of-order packets affect performance (not sure whether LRO can handle them). Never got a good answer to this question.

      Non-TCP applications, on the other hand, might break (or slow down significantly) when receiving out-of-order packets.

      Delete
  8. About the EX4500: the cables working @ 128 Gbps is not the case. From the data sheet:
    • 128 Gbps Virtual Chassis module with 2 x 64 Gbps ports.

    And as far as I know, even this is still marketing. When you display the VC ports:

    show virtual-chassis vc-port (EX4200 Virtual Chassis)

    user@switch> show virtual-chassis vc-port
    fpc0:
    --------------------------------------------------------------------------
    Interface   Type        Trunk  Status   Speed   Neighbor
    or                       ID             (mbps)  ID  Interface
    PIC / Port
    vcp-0       Dedicated    1     Up       32000    1  vcp-1
    vcp-1       Dedicated    2     Up       32000    0  vcp-0

    This means "line speed" is just 32 Gbps.

    So this is less than the 80 Gbps you get with an HP switch.

    Just a detail. In a real design I would never virtualize more than two ToR switches as one (IRF, VSS, VC, or whatever...).

    ReplyDelete

You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.