Switch Buffer Sizes and Fermi Estimates
In my quest to understand how much buffer space we really need in high-speed switches, I encountered an interesting phenomenon: we no longer have a gut feeling for what makes sense, sometimes going as far as assuming that 16 MB (or 32 MB) of buffer space per 10GE/25GE data center ToR switch is another $vendor shenanigan focused on cutting costs. Time for another set of Fermi estimates.
Let’s take a recent data center switch using the Trident II+ chipset with 16 MB of buffer space (source: the awesome packet buffers page by Jim Warner). Most switches using this chipset have 48 10GE ports and 4-6 uplinks (40GE or 100GE).
There are only two reasons we need buffers in network devices:
- To capture intermittent bursts;
- To prevent packet drops when an output interface becomes congested.
The most important assumption we have to make is this: where does the congestion happen? The obvious answer is it depends on the traffic flows crossing the switch. I’ll assume we’re solving the incast problem - many senders sending data to the same destination connected to a 10GE link.
Next, we have to make a lot of assumptions about the black box called an ASIC that we cannot possibly learn anything about, because Broadcom manages to hide its shortcomings under layers and layers of NDAs. To make my task easier, I’ll assume that:
- We’re dealing with large packets (1500 bytes);
- Buffers are shared across all ports.
You can obviously make a different set of assumptions and try to figure out whether the Broadcom ASIC uses fixed-size buffers (at maximum MTU) or splits incoming packets into fragments to optimize buffer usage. I wish you luck ;)
16 MB of buffer space is ~10,000 1500-byte packets, or ~200 packets per server port (assuming all ports are experiencing congestion at the same time). Time for a reality check: when was the last time you saw 200 packets sitting in the output queues of all ports at the same time?
Even under such an unrealistic scenario, there are 300,000 bytes of data sitting in each 10GE output queue, resulting in 240 microseconds of latency - single-switch queuing latency approaching SSD access latency. Not exactly what you’d want to have after paying an arm and a leg for an SSD-only disk array (or an all-NVMe server farm).
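Here’s the back-of-the-envelope arithmetic in a few lines of Python, assuming the values used above (16 MB of shared buffer, 48 server-facing ports, 1500-byte packets, 10GE links); the text above rounds the results down to ~200 packets and 240 microseconds:

```python
# Back-of-the-envelope check of the numbers above. Assumed values:
# 16 MB shared buffer, 48 server-facing ports, 1500-byte packets, 10GE links.
BUFFER_BYTES = 16 * 10**6       # total shared packet buffer
PACKET_BYTES = 1500             # full-size Ethernet frame
PORTS = 48                      # server-facing 10GE ports
LINK_BPS = 10 * 10**9           # 10GE link speed in bits per second

packets_total = BUFFER_BYTES // PACKET_BYTES        # ~10,600 packets
packets_per_port = packets_total // PORTS           # ~220 packets per port
queue_bytes = packets_per_port * PACKET_BYTES       # ~330 KB per output queue
queue_delay_us = queue_bytes * 8 / LINK_BPS * 1e6   # ~265 microseconds

print(f"{packets_total} packets total, {packets_per_port} per port")
print(f"queueing delay with a full per-port share: {queue_delay_us:.0f} us")
```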
Another rule-of-thumb approach uses the bandwidth-delay product estimate. The reasoning is very simple: if you want to keep data flowing, the amount of data in flight has to be at least as large as the end-to-end bandwidth times the one-way delay. Using 10GE link speed and a 100-microsecond delay, we need 125 kbytes of data in transit, and having more buffer space than that on any single device makes little sense unless you're dealing with a corner case (see above).
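The same kind of sketch works for the bandwidth-delay product; the 100-microsecond one-way delay is the assumption from the paragraph above:

```python
# Bandwidth-delay product for the same scenario. Assumed values:
# 10GE link, 100-microsecond one-way delay.
LINK_BPS = 10 * 10**9   # bits per second
DELAY_S = 100e-6        # one-way delay in seconds

bdp_bytes = LINK_BPS * DELAY_S / 8   # 125,000 bytes of data in flight
print(f"bandwidth-delay product: {bdp_bytes / 1000:.0f} KB")
```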
Conclusion: assuming the ASIC vendors got their **** together, shallow-buffer switches are more than good enough for intra-data-center traffic. Next time we’ll focus on what happens when we cross the LAN-to-WAN boundary.
Want to know more?
Check out:
- To Drop or To Delay with Juho Snellman;
- TCP in the Data Center with Thomas Graf;
- Networks, Buffers and Drops free webinar with JR Rivers.
Finally, join the How Networks Really Work series starting on June 18th. You could join the live session for free if you send me a really good question.
What about the uplinks? Do they also share the buffer space?
Pay attention to the fact that one-way delay != RTT. Also, your definition of latency would be helpful.
Another assumption would be that most if not all traffic is TCP-based. Then of course the bandwidth-delay product comes into play.
Now I'm looking forward to the deep buffers advocates to join the party.
PS: It seems you got inspired by my YouTube reference about buffer requirements in networking from your beloved vendor ;)
If I ever get as far as "it depends on..." followed by a solid (and justifiable) set of parameters, I'll be a happy person, and will know more than I know today.
As for "one-way delay versus RTT", I spent a lot of time thinking about that, and got to the conclusion that RTT is the one that really matters (but maybe I got it wrong together with everyone else).
And yeah, I would love to hear the cheers from the deep buffer crowd :D
More to come, stay tuned...
One example where you need deep buffers is when you forward market data over high-latency circuits. Let's say you've spent a lot of time and money to make sure your circuits are 100% clean; wouldn't you be a bit mad if your switches end up discarding packets and causing a few seconds of latency?
With shallow-buffer switches, if you increase the speed of your switch uplinks, you might have discards from uplinks to hosts. If you keep your uplinks at low speed, you will have discards from hosts to uplinks. If your uplinks are connected to a router, it might also be costly to get 40G/100G ports on that router, which means you will save some money (on the router side) if you use deep-buffer switches with 10G uplinks. Why would anyone do that? Well, what's the point of using 40G/100G uplinks if your router has a 10G circuit (not 40G/100G) to other locations? Might as well buffer on the (cheap) switch rather than on the expensive router.
Now, vendors using deep-buffer Broadcom ASICs were not really honest when they said you could buffer GBytes of data. What did they forget to tell you? That it depends on the frame size. Broadcom Jericho uses buffer cells of 1000 bytes. Any frame smaller than 1000 bytes uses an entire cell, so if you try to buffer 200-byte frames (about the average size of market data), you can only use 20% of the buffer; the rest is wasted. The switch spec shows 8 GBytes, but you can only use 300 Mbytes for 200-byte frames, less than some Catalyst switches.
All Broadcom ASICs use this buffer-cell method, with a different cell size for each ASIC (usually 208 bytes). The ASIC used in the Arista 7020R doesn't have this limitation (one buffer cell can hold multiple frames), and Jericho 2 doesn't have this problem either, I was told (to be tested). Anyway, what this means is that your deep-buffer switch's buffer is not as deep as you thought, and definitely not ready to replace your expensive PE, which can buffer up to 500 ms of data per port.
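A minimal sketch of the cell-rounding effect described in this comment; the 1000- and 208-byte cell sizes are the figures quoted above, not verified datasheet values:

```python
import math

# Fraction of the advertised buffer that carries actual frame data when
# every frame occupies a whole number of fixed-size cells. Cell sizes
# (1000 bytes on Jericho, 208 bytes elsewhere) are taken from the comment.
def usable_fraction(frame_bytes: int, cell_bytes: int) -> float:
    cells = math.ceil(frame_bytes / cell_bytes)
    return frame_bytes / (cells * cell_bytes)

print(usable_fraction(200, 1000))   # 0.2  -> only 20% of the buffer is usable
print(usable_fraction(200, 208))    # ~0.96 with a 208-byte cell
print(usable_fraction(1500, 208))   # ~0.90 for full-size frames
```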
One last comment: if you look at all the popular shallow-buffer ASICs and check how much buffer (in milliseconds) you have for a single port (let's say your port is 10G on Trident and 400G on Tomahawk III), you will notice the buffer size in ms keeps decreasing. It was 7.2 ms with Trident; it's now 1.3 ms with Tomahawk III. Do you really think that's enough buffer?
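A quick sanity check of these per-port figures; the total buffer sizes are rough assumptions (about 9 MB on Trident, 64 MB on Tomahawk III), while the port speeds come from the comment above:

```python
# How long a single port takes to drain the entire shared buffer,
# in milliseconds. Buffer sizes are rough assumptions, not datasheet values.
def buffer_ms(buffer_bytes: int, port_bps: int) -> float:
    return buffer_bytes * 8 / port_bps * 1000

print(f"Trident, 10G port:       {buffer_ms(9 * 10**6, 10 * 10**9):.1f} ms")    # ~7.2 ms
print(f"Tomahawk III, 400G port: {buffer_ms(64 * 10**6, 400 * 10**9):.1f} ms")  # ~1.3 ms
```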
- RTT is much more variable and can be much higher than a typical DC RTT
- The TCP congestion avoidance algorithm cannot be optimised for shallow buffering the way it can at Google, Amazon and so forth, as the SP is simply not in control of the TCP stacks/applications; it just transports them.
- There can still be a huge interface speed difference, especially at the leaf layer, going from, say, 100GE to 10GE.
- Last but not least, the assumptions at the basis of the 'Stanford Model' of buffer sizing (advocating small buffers in SP networks) have been challenged from several angles and by several papers (some of the challenges are listed below):
1. It’s not just link utilisation and queueing delay that have to be optimised, but also the quality of service perceived by TCP flows, because at the end of the day that's where your customer is...
2. A fixed number N of long-lived TCP flows is a pretty poor model; an SP network is more complex than that.
3. The model should differentiate between core and edge/access links, where the interface speed difference can be significant.
These challenges open up a much more complex scenario that has to be considered to properly model an SP network and come up with a more realistic formula for the best buffer size in every context. I guess this could be part of the next blog post, which Ivan said would be about inter-DC traffic (or maybe of a third one, since the environment I'm talking about is not inter-DC but the virtualized SP edge)?
Ciao
Andrea
https://twitter.com/enno_insinuator/status/1145389102906454016?s=12