The Saga of Oversubscriptions

Matt Thompson provided a really good answer to the question “what’s an acceptable oversubscription ratio in a ToR switch?” when he wrote “I’m expecting a ‘how long is a piece of string’ answer” (note: do watch the BBC video answering that one).

There’s the 3:1 rule-of-thumb recipe, with a more realistic answer being “it depends”. Now let’s see if we can go beyond that without a deep dive into scholastic waters.

Know Thy Traffic

You can provide the best answer to this question by monitoring your data center traffic. Figure out the amount of network and storage traffic, the ratio between outbound (north-south) and internal (east-west) traffic, the amount of overhead traffic (vMotion, data replication and backups could represent a large percentage of your traffic), and do some back-of-napkin math (speaking of napkins, this book is REALLY good).

I have to mention that this article describes data center environments. Campus networks are totally different; you might easily get away with 100:1 oversubscription, particularly if the user connections run at 1 Gbps. Also, small data centers where the majority of the traffic stays within a ToR switch or blade server enclosure could use significantly higher oversubscription ratios.

If you have a larger data center and plenty of budget, consider proper analysis and simulation tools. Cariden has a great tool that will definitely fit the bill (they were recently acquired by Cisco but still seem pretty independent).

It’s harder (but still feasible) to use this approach if you’re deploying a totally new architecture, for example migrating from FC to iSCSI, or moving from 1GE to 10GE while radically changing the virtualization ratio.

On the other hand, if you have no reliable network usage statistics, you’re facing a Mission Impossible situation, so here are a few more things to consider.

Expect the Worst

You might have a perfectly designed network that performs flawlessly … until one of the links fails. A link failure reduces the available bandwidth, but might also significantly impair the ECMP load balancing algorithm (which is not perfect anyway), regardless of whether it’s based on L2/L3 forwarding tables or port channels.

The dirty details are obviously hardware-specific (check with your vendor and do your own lab tests), but you might lose way more than a quarter of the uplink bandwidth if you lose one of four uplinks.
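To see how losing one of four uplinks can cost more than a quarter of the bandwidth, here is a minimal Python sketch. It assumes a hypothetical switch that spreads traffic evenly across a fixed number of hash buckets (8 here) and deals those buckets round-robin to the surviving links; the bucket count and the function name are illustration-only assumptions, not a description of any particular hardware.

```python
from fractions import Fraction


def usable_uplink_capacity(link_gbps, links_up, hash_buckets=8):
    """Aggregate load (Gbps) at which the busiest uplink saturates.

    Assumption: traffic is spread evenly over `hash_buckets` buckets and
    the buckets are dealt round-robin to the surviving links.
    """
    if links_up < 1 or hash_buckets < links_up:
        raise ValueError("need at least one link and one bucket per link")
    # Buckets landing on the busiest link: ceil(hash_buckets / links_up).
    busiest = -(-hash_buckets // links_up)
    busiest_share = Fraction(busiest, hash_buckets)
    # The busiest link fills up when total_load * busiest_share == link_gbps.
    return float(Fraction(link_gbps) / busiest_share)


before = usable_uplink_capacity(10, links_up=4)  # 8 buckets -> 2, 2, 2, 2
after = usable_uplink_capacity(10, links_up=3)   # 8 buckets -> 3, 3, 2
print(f"4 uplinks: {before:.1f} Gbps, 3 uplinks: {after:.1f} Gbps")
print(f"capacity lost: {100 * (1 - after / before):.0f}%")
```

Under these assumptions the three surviving links carry 3, 3 and 2 buckets, so the busiest link saturates at roughly 26.7 Gbps of total load instead of 40 Gbps: a loss of about a third, not a quarter. Real hardware may behave better or worse, which is why the lab tests matter.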

BTW, Cariden tools allow you to simulate link, node and transmission group (bundle of links) failures, iterating over every possible failure (or multiple failures), and documenting worst-case performance.

The Hogs

In many data centers that use the network to transfer data generated by backup jobs running on the servers (physical or virtual), backup traffic represents the majority of the overall traffic, sometimes overloading the network to the extent that daily backups take more than 24 hours.

If this sounds familiar, remember that you won’t solve the problem by upgrading the weakest link; you’ll just move the problem around. To solve the problem, you have to:

  • Estimate the minimum backup bandwidth you need from each server or group of servers;
  • Figure out the paths the backup traffic will take across the network;
  • Overprovision those paths to ensure the backup traffic doesn’t interfere with the regular use traffic.
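The first two steps above boil down to back-of-napkin math, sketched here in Python. The server groups, backup volumes, path element names and the 8-hour window are all made-up assumptions for illustration; plug in your own inventory.

```python
def min_backup_mbps(data_gb, window_hours):
    """Sustained rate (Mbps) needed to move data_gb gigabytes in window_hours."""
    if window_hours <= 0:
        raise ValueError("backup window must be positive")
    bits = data_gb * 8 * 1000**3  # decimal gigabytes -> bits
    return bits / (window_hours * 3600) / 1e6


def backup_load_per_element(groups, window_hours):
    """Sum the required backup rate on every path element it crosses."""
    load = {}
    for data_gb, path in groups.values():
        rate = min_backup_mbps(data_gb, window_hours)
        for element in path:
            load[element] = load.get(element, 0.0) + rate
    return load


# Hypothetical inventory: group -> (nightly backup volume in GB, path taken)
groups = {
    "rack-a": (2000, ["tor-a", "spine-1", "tor-backup"]),
    "rack-b": (500, ["tor-b", "spine-1", "tor-backup"]),
}

for element, mbps in sorted(backup_load_per_element(groups, 8).items()):
    print(f"{element}: {mbps:.0f} Mbps of backup traffic")
```

The per-element totals are the minimum you need; the third step (overprovisioning) means sizing each of those elements well above that number so regular traffic still fits.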

Obviously there are loads of tricks you can use to alleviate this problem, from deploying a dedicated backup network, to moving backup storage closer to the servers or using virtual disk or storage-based backup solutions (example: Veeam).

vMotion might be a similar hog, only less predictable. At least we know when the backups start each night, but we never know when a clickety-click-happy operator decides it’s a good idea to evacuate a rack of servers (because someone told him vMotion Just Works with zero impact on VM performance).

Oh, and then there are the Happy Tuesdays.

Rule-of-thumb: You wouldn’t want the hogs to use more than approximately half of your bandwidth. If there’s nothing else you can do, mark the hogs, and keep them in a separate QoS class.

Know What Matters

You might get away with limiting the amount (or percentage) of bandwidth available to the hogs. For example, if backup traffic clogs your network, use QoS tools to give backup a guaranteed but limited amount of bandwidth.

Converged Storage Networks

iSCSI, NFS or FCoE could be another major source of data center traffic, but it’s pretty easy to predict. The storage traffic is limited by the total bandwidth of network adapters on your disk arrays ... unless you’re using peer-to-peer file systems (GFS comes to mind), in which case all bets are off (just ask Amazon).

If you expect large amounts of storage traffic, it might make sense to build a dedicated storage network – either starting with dedicated server ports and ToR switches (which will make you loved by storage admins and hated by the CFO), or by having shared ToR switches and a separate storage network spine layer with dedicated uplinks.

Have you noticed I just described access-layer FCoE networks with FC ports on ToR switches?

Actual network traffic

Finally, we’re in totally unpredictable waters. The amount of (production) network traffic depends heavily on your application architecture and the skills of your programmers (ever seen a transaction doing thousands of individual SQL queries because the programmer never got to the JOIN chapter of the SQL manual?). Add “unlimited mobility” and “business agility” to the mix and you have a winner.

There are only a few hints I can give you:

Standalone applications (example: single-VM LAMP stacks or IIS/SQL Server Express combos encountered in many web hosting environments) are always limited by the user-facing bandwidth (links at the edge of the data center).

Database traffic (for simple single-VM applications that use an external DB server) is always limited by the NIC bandwidth of the database server(s).

Ideally you have equidistant bandwidth between all end points (hint: leaf-and-spine aka Clos fabrics). Not really hard to do unless you plan to have tens of thousands of server ports or your CFO forces you to work with old garbage because he believes data center equipment has a 15-year depreciation period.

Back to rules-of-thumb

All things considered, if you don’t have reliable network statistics, 3:1 oversubscription does look reasonable (particularly if you use IP-based storage that is not attached to the ToR switch). I’ll stick with it.
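For reference, the oversubscription ratio of a ToR switch is simply the total server-facing bandwidth divided by the total uplink bandwidth. A minimal Python sketch follows; the port counts and speeds are hypothetical examples, not a recommendation.

```python
from fractions import Fraction


def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    """Server-facing bandwidth divided by uplink bandwidth, as a fraction."""
    downlink = Fraction(server_ports) * Fraction(server_gbps)
    uplink = Fraction(uplinks) * Fraction(uplink_gbps)
    if uplink <= 0:
        raise ValueError("need at least one working uplink")
    return downlink / uplink


def as_ratio(ratio):
    """Format a Fraction the way network designers write it, e.g. '3:1'."""
    return f"{ratio.numerator}:{ratio.denominator}"


# Hypothetical ToR switch: 48 x 10GE server ports, 4 x 40GE uplinks
print(as_ratio(oversubscription(48, 10, 4, 40)))  # 480G / 160G
# Same switch after losing one uplink (assuming perfect load balancing)
print(as_ratio(oversubscription(48, 10, 3, 40)))  # 480G / 120G
```

The example switch lands exactly on 3:1 with all uplinks working and drops to 4:1 after a single uplink failure, even before any load balancing imperfections are taken into account.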

Thank you!

Ethan Banks, Chris Marget, Jeremy Filliben and Greg Ferro provided invaluable feedback and interesting additional details during my writing struggles. Thank you!


2 comments:

  1. Hi Ivan,
    So you assumed that all networking nodes (LCs) are nonblocking?


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.