Who the **** needs 16 uplinks? Welcome to the 10GE world!

Will made an interesting comment on my Stackable Data Center Switches article: “Who the heck has 16 uplinks?” Most of us do in the brave new 10GE world.

A bit of background

Most data centers have a hierarchical design, with top-of-rack (or leaf) switches connected via a set of uplinks to aggregation (or core or spine) switches, and the performance of your data center network depends heavily on the oversubscription ratio – the ratio between server-facing bandwidth and uplink bandwidth (assuming most server traffic traverses the uplinks).

Alternatives include full mesh design, monkey-see-monkey-do design, and novel approaches I can’t discuss yet.

Going from GE to 10GE

Most ToR switches we were using to connect Gigabit Ethernet server NICs to the network had 10GE uplinks, and the oversubscription ratios were reasonably low, ranging from 1:1.2 (various Nexus 2000 fabric extenders) to 1:2.4 (Juniper EX4200, HP 5830-48).

Some 10GE ToR switches have only 10GE ports (Brocade VDX 67xx, Cisco Nexus 5500, Juniper EX4500, HP 5920), and the Trident-based ones have a mixture of 10GE and 40GE ports (and you can use 40GE ports as 4x10GE ports with a breakout cable).

To maintain a reasonable oversubscription ratio, you have to use a quarter of the switch ports as uplinks (resulting in a 1:3 oversubscription) – sixteen 10GE ports in a 64-port 10GE switch or four 40GE ports in an equivalent 10/40GE switch. Regardless of the switch model you use, the number of fiber strands you need remains the same; a 40GE link needs four fiber pairs.

Conclusion: if you want a 1:3 oversubscription ratio, you need 16 fiber pairs to connect a 64-port 10GE ToR switch (or a 48x10GE+4x40GE switch, or a 16-port 40GE switch) to the network core.
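
If you like back-of-the-envelope checks, here’s a quick sketch of that arithmetic in Python. The helper names and the fiber-pair table are mine (assuming 10GE-SR and 40GE-SR4 over multimode fiber), not output of any vendor tool:

```python
import math

# 10GE-SR uses one multimode fiber pair, 40GE-SR4 uses four (breakout or not).
FIBER_PAIRS_PER_LINK = {10: 1, 40: 4}

def split_ports(total_ports, oversub):
    """Split a fixed-port switch into server ports and uplinks for a target
    oversubscription ratio, assuming all ports run at the same speed."""
    uplinks = math.ceil(total_ports / (oversub + 1))
    return total_ports - uplinks, uplinks

def fiber_pairs(uplinks, uplink_speed_gbps):
    return uplinks * FIBER_PAIRS_PER_LINK[uplink_speed_gbps]

# 64-port 10GE switch at 1:3 -> 48 server ports, 16 uplinks, 16 fiber pairs
server_ports, uplinks = split_ports(64, 3)
print(server_ports, uplinks, fiber_pairs(uplinks, 10))   # -> 48 16 16

# 48x10GE + 4x40GE variant: 480G of server bandwidth over 160G of uplinks,
# still the same 16 fiber pairs.
print(480 / 160, fiber_pairs(4, 40))                     # -> 3.0 16
```

Either way you slice it, the cabling requirement toward the core doesn’t change.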

Higher oversubscription ratios?

Do you really have to keep the oversubscription ratio low? Is 1:3 a good number? How about 1:7? As always, the answer is “it depends.” You have to know your traffic profile and workload characteristics, and plan for the foreseeable future.

Don’t forget that you can easily fit ~130 M1 Medium EC2 instances in a single physical server with 512 GB of RAM. Assuming the server has two 10GE uplinks and you use a 1:3 oversubscription ratio, that’s 51 Mbps per instance (ignoring storage and vMotion traffic). Is that good enough? You tell me.
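
The same kind of napkin math, using the assumptions from the previous paragraph (130 instances per host, two 10GE uplinks, 1:3 end-to-end oversubscription):

```python
# Per-instance bandwidth napkin math; numbers come straight from the paragraph above.
host_uplinks_gbps = 2 * 10          # two 10GE NICs
oversubscription = 3                # 1:3 end-to-end
instances_per_host = 130            # ~130 M1 Medium instances in 512 GB of RAM

effective_gbps = host_uplinks_gbps / oversubscription
per_instance_mbps = effective_gbps * 1000 / instances_per_host
print(round(per_instance_mbps))     # -> 51 Mbps per instance
```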

More information

You’ll find numerous fabric design guidelines in the Clos Fabrics Explained webinar. Port densities and fabric behavior of almost all data center switches available from nine major vendors are described in the Data Center Fabrics webinar.

Both webinars are available as part of the yearly subscription, and you can always ask me for a second opinion or a design review.

10 comments:

  1. I think people definitely should reconsider their oversubscription ratios for 10G server connectivity. At least in our environment (enterprise), we're finding that we can have a much higher oversubscription ratio for 10G server access. Most of our internal customers that are using 10G on their servers really don't "need" it based on the amount of traffic they generate. (Granted - we have separate uplinks to connect our access switches together for vMotion, and not much in the way of IP storage). Hopefully by the time they start pushing the 10G connections more, 40/100G connectivity will be more commonly available on access switches.
    Replies
    1. As I wrote, "it depends". However, having 10GE server links and not fully utilizing them feels like throwing money away to me ... and the number of multimode fiber pairs won't change regardless of whether you have 10GE, 40GE or 100GE ports.
    2. A significant fraction (perhaps a majority?) of new "private cloud" virtualization clusters I've seen are being implemented with some form of IP-based storage (iSCSI, NFS, or now even SMB3 I suppose). This is usually done with just two 10G links to each virtualization host, with VLANs and perhaps QoS policies differentiating the storage traffic from the application and VMotion traffic.

      In some cases, the clustered storage is actually inside the virtualization hosts themselves (see Gluster or HP StoreVirtual VSA).

      The capex and opex savings that come with eliminating FC are compelling, but the use of IP storage does require lower oversubscription ratios. Fortunately Ethernet ports and links are (comparatively) cheap, especially when the outrageous costs of "vendor-blessed" HBAs, FC optics, FC switches, and monolithic FC SANs are included in a TCO model.

      To my knowledge, approximately zero public/hybrid cloud service providers are using FC storage. Some might still offer FC with their managed/dedicated hosting offerings, but everything new is cloudy and IP-based.
    3. //outrageous costs

      I know what you mean: two fully equipped Brocade Fibre Channel switches – SFPs, licenses, and CarePacks included – for €80,000 is simply too much nowadays!
  2. Thanks for that info. When I first installed the 5020s/5010s I was concerned about the number of uplinks to use, but was swayed by firefly that it wouldn't be a big deal (mainly because Windows in combination with lower-tiered storage has trouble pushing more than 2 Gbps, much less 10 Gbps). Two years later I'm still running OK on 12:1 oversubscription on non-FCoE and NFS networks, although I'd say I'm half 1 Gbps and half 10 Gbps servers.

    I'm glad I have read this prior to a large 5548 and UCS implementation. Thanks a lot!
  3. For fabric extenders, 1:1.2 is not really oversubscription, because port-to-port traffic also goes through the uplinks, which is not the case for other switches.
  4. 'the number of fiber strands you need remains the same' - this is entirely true in the case of multimode fibre and 40GBASE-SR4. However, if you use single-mode fibre you can do 40GBASE-LR4 over a single pair. Not suggesting this is a particularly good idea at present - LR4 optics are expensive and running more fibre if needed is probably more cost effective. Perhaps in the future though the economics will change.
  5. What's powerful is if you don't have to think about OSR as being fixed the day you wire your network, or as being uniform across the fabric. It should be variable across the network and dynamic - higher in pockets that don't need it, and lower where it is needed, and continuously updated based on current conditions. But to do this, you have to move beyond traditional leaf/spine designs.
    Replies
    1. Would this be an undercover Plexxi plug ;)) In theory I agree with you, in practice I'd prefer building smaller single-purpose fabrics.
    2. Not meant to be undercover, just not overtly commercial :)

      Yes, another way is smaller single-purpose fabrics if that's what you prefer, but that's not necessarily practical for converged private cloud infrastructure where the goal is ultimately "any workload" on a single infrastructure. Most problems have multiple solutions; ours is just one.