Networking vendors are quick to point out how the opaqueness (read: we don’t have the HW to look into it) of overlay networks presents visibility problems, and how their favorite shiny gizmo (whatever it is) gives you better results (they usually forget to mention the lock-in it creates).
Now let’s step back and ask a fundamental question: how much bandwidth do we need?
Disclaimer: If you’re running a large public cloud or anything similarly sized, this is not the post you’re looking for.
- We have a mid-sized workload of 10,000 VMs (that’s probably more than most private clouds see, but let’s err on the high side);
- The average long-term sustained network traffic generated by every VM is around 100 Mbps (I would love to see a single VM that’s not doing video streaming or network services generating that much traffic, but that’s another story).
The average bandwidth you need in your data center is thus 1 Tbps. Every pizza-box ToR switch you can buy today has at least 1.28 Tbps of non-blocking bandwidth. Even discounting for marketing math, you don’t need more than two ToR switches to satisfy your bandwidth needs (remember: if you have only two ToR switches, you have 1.28 Tbps of full-duplex non-blocking bandwidth).
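The arithmetic is simple enough to sketch in a few lines; the figures below are the deliberately generous assumptions from the text, not measurements:

```python
# Back-of-the-napkin: aggregate bandwidth a mid-sized private cloud needs.
vm_count = 10_000            # mid-sized workload (erring on the high side)
avg_mbps_per_vm = 100        # long-term sustained average per VM

total_tbps = vm_count * avg_mbps_per_vm / 1_000_000   # Mbps -> Tbps
print(f"Aggregate VM traffic: {total_tbps:.1f} Tbps")  # → 1.0 Tbps

tor_capacity_tbps = 1.28     # typical non-blocking pizza-box ToR capacity
print(f"Fits within a single ToR switch? {total_tbps <= tor_capacity_tbps}")
```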
If that’s not enough (or you think you should take into account traffic peaks), take a pair of Nexus 6000s or build a leaf-and-spine fabric.
In many cases VMs have to touch storage to deliver data to their clients, and that’s where the real bottleneck is. Assuming only 10% of the VM-generated data comes from the spinning rust (or SSDs), I’d love to see the storage delivering a sustained average throughput of 100 Gbps.
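The storage-side math follows directly from the earlier 1 Tbps figure:

```python
# If only 10% of VM-generated traffic is actually read from disks or SSDs,
# the storage back-end still has to sustain a serious throughput.
vm_traffic_gbps = 1_000      # 1 Tbps aggregate VM traffic, expressed in Gbps
storage_fraction = 0.10      # share of delivered data served from storage
storage_gbps = vm_traffic_gbps * storage_fraction
print(f"Sustained storage throughput: {storage_gbps:.0f} Gbps")  # → 100 Gbps
```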
How about another back-of-the-napkin calculation:
- A data center has two 10Gbps WAN links;
- 90% of the traffic stays within the data center (yet again erring on the high side – supposedly 70–80% is a more realistic number).
Based on these figures, the total bandwidth needed in the data center is 200 Gbps. Adjust the calculation for your specific case, but I don’t think many of you will get above 1-2 Tbps.
Obviously you might have bandwidth/QoS problems if:
- You use legacy equipment full of oversubscribed GE linecards;
- You still run a three-tier DC architecture with heavy oversubscription between tiers;
- You built a leaf-and-spine fabric with 10:1 oversubscription (yeah, I’ve seen that);
- You have no idea how much traffic your VMs generate and thus totally miscalculate the oversubscription factor;
... but that has nothing to do with overlay virtual networks – if any of the above is true, you have a problem regardless of what you run in your data center.
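If you do want to sanity-check your fabric, the oversubscription factor is trivial to compute. The leaf configuration below is a hypothetical example (48 × 10GE server ports, 4 × 40GE uplinks), not a recommendation:

```python
# Oversubscription factor of a leaf switch: server-facing vs fabric-facing
# capacity. A ratio above 1 means the uplinks can become a bottleneck.
def oversubscription(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    """Ratio of total downlink to total uplink capacity."""
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical leaf: 48 x 10GE down, 4 x 40GE up -> 480/160 = 3:1
print(oversubscription(48, 10, 4, 40))  # → 3.0
```

Compare that against the per-VM traffic you actually measure – if you have no idea what your VMs generate, no oversubscription factor you pick is anything more than a guess.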
Just in case you need more information
Check out these webinars:
- Data Center 3.0 if you’re new to data center networking;
- Clos Fabrics Explained if you’re building a new data center networking fabric;
- Data Center Fabrics if you can’t decide which vendor to choose.
All webinars are available as part of the yearly subscription and you can always ask me for a second opinion or a design review.