In my quest to understand how much buffer space we really need in high-speed switches, I encountered an interesting phenomenon: we've lost the gut feeling for what makes sense, sometimes going as far as assuming that 16 MB (or 32 MB) of buffer space per 10GE/25GE data center ToR switch is yet another $vendor shenanigan focused on cutting costs. Time for another set of Fermi estimates.
Let’s take a recent data center switch using the Trident II+ chipset with 16 MB of buffer space (source: the awesome packet buffers page by Jim Warner). Most switches using this chipset have 48 10GE ports and 4-6 uplinks (40GE or 100GE).
There are only two reasons we need buffers in network devices:
- To capture intermittent bursts;
- To prevent packet drops when an output interface becomes congested.
The most important assumption we have to make is this: where does the congestion happen? The obvious answer is that it depends on the traffic flows crossing the switch. I’ll assume we’re solving the incast problem: many senders sending data to the same destination connected to a 10GE link.
Next, we have to make a lot of assumptions about the black box called an ASIC, which we cannot possibly learn anything about because Broadcom hides its shortcomings under layers and layers of NDAs. To make my task easier, I’ll assume that:
- We’re dealing with large packets (1500 bytes);
- Buffers are shared across all ports.
You can obviously make a different set of assumptions and try to figure out whether the Broadcom ASIC uses fixed-size buffers (at maximum MTU) or splits incoming packets into fragments to optimize buffer usage. I wish you luck ;)
16 MB of buffer space is ~10,000 1500-byte packets, or ~200 packets per server port (assuming all ports experience congestion at the same time). Time for a reality check: when did you last see 200 packets sitting simultaneously in the output queues of all output ports?
Even in such an unrealistic scenario, there are 300,000 bytes of data sitting in each 10GE output queue, resulting in 240 microseconds of latency and pushing single-switch queuing latency toward SSD access latency. Not exactly what you’d want after paying an arm and a leg for an SSD-only disk array (or an all-NVMe server farm).
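The arithmetic above can be sketched in a few lines of Python. All inputs are the assumptions from the text (not vendor data), and exact division produces slightly less round numbers than the ~200 packets / 240 microseconds quoted above:

```python
# Fermi estimate: a shared 16 MB buffer on a 48-port 10GE ToR switch,
# assuming full-MTU packets and congestion on every server port at once.

BUFFER_BYTES = 16 * 10**6      # 16 MB of shared packet buffer
PACKET_BYTES = 1500            # large (full-MTU) packets
SERVER_PORTS = 48              # 10GE server-facing ports
LINK_BPS = 10 * 10**9          # 10GE link speed in bits per second

packets_total = BUFFER_BYTES // PACKET_BYTES       # packets the buffer can hold
packets_per_port = packets_total // SERVER_PORTS   # fair share per server port

queue_bytes = packets_per_port * PACKET_BYTES      # data sitting in one queue
queue_latency = queue_bytes * 8 / LINK_BPS         # seconds to drain that queue

print(f"{packets_total} packets total, {packets_per_port} per port")
print(f"queue latency ~ {queue_latency * 1e6:.0f} microseconds")
```

Running it prints roughly 10,600 packets total, ~220 per port, and ~266 microseconds of queuing latency, confirming the rounded estimates in the text.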
Another rule-of-thumb approach uses the bandwidth-delay product. The reasoning is simple: to keep data flowing, the amount of data in flight has to be at least as large as the end-to-end bandwidth times the one-way delay. With a 10GE link speed and a 100 microsecond delay we need 125 kilobytes of data in transit, and having more buffer space than that on any single device makes little sense unless you're dealing with a corner case (see above).
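The bandwidth-delay product calculation is a one-liner; the 100 microsecond one-way delay is the intra-data-center assumption from the text:

```python
# Bandwidth-delay product: data in flight needed to keep a link busy.

LINK_BPS = 10 * 10**9        # 10GE link speed in bits per second
ONE_WAY_DELAY = 100e-6       # 100 microsecond one-way delay (intra-DC assumption)

bdp_bytes = LINK_BPS * ONE_WAY_DELAY / 8   # convert bits in flight to bytes

print(f"BDP = {bdp_bytes / 1000:.0f} KB")  # 125 KB
```

Anything much beyond that 125 KB per device is buffering for a corner case, not for steady-state throughput.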
Conclusion: assuming the ASIC vendors got their **** together, shallow-buffer switches are more than good enough for intra-data-center traffic. Next time we’ll focus on what happens when we cross the LAN-to-WAN boundary.
Want to know more?
- To Drop or To Delay with Juho Snellman;
- TCP in the Data Center with Thomas Graf;
- Networks, Buffers and Drops free webinar with JR Rivers;