Switch Buffer Sizes and Fermi Estimates

In my quest to understand how much buffer space we really need in high-speed switches, I encountered an interesting phenomenon: we no longer have a gut feeling for what makes sense, sometimes going as far as assuming that 16 MB (or 32 MB) of buffer space per 10GE/25GE data center ToR switch is another $vendor shenanigan focused on cutting costs. Time for another set of Fermi estimates.

Let’s take a recent data center switch using the Trident II+ chipset and having 16 MB of buffer space (source: the awesome packet buffers page by Jim Warner). Most switches using this chipset have 48 10GE ports and 4-6 uplinks (40GE or 100GE).

You can do a similar analysis with Trident 3- or Tomahawk-based switches; I found it easier to deal with 10GE links ;)

There are only two reasons we need buffers in network devices:

  • To capture intermittent bursts;
  • To prevent packet drops when an output interface becomes congested.

The most important assumption we have to make is where the congestion happens. The obvious answer is: it depends on the traffic flows crossing the switch. I’ll assume we’re solving the incast problem: many senders sending data to the same destination connected to a 10GE link.

Incast should be pretty rare in typical data center environments. You might encounter it in Map-Reduce clusters or distributed file systems (after adding a fresh node to the DFS cluster). Then there are corner cases where there's long-term congestion on ToR switch uplinks (details here). In those cases you obviously need faster links or deeper buffers.

Next, we have to make a lot of assumptions about the black box called an ASIC, about which we cannot possibly learn anything because Broadcom manages to hide its shortcomings under layers and layers of NDAs. To make my task easier, I’ll assume that:

  • We’re dealing with large packets (1500 bytes);
  • Buffers are shared across all ports.

You can obviously make a different set of assumptions and try to figure out whether the Broadcom ASIC uses fixed-size buffers (at maximum MTU) or splits incoming packets into fragments that optimize buffer usage. Wish you luck ;)

16 MB of buffer space is ~10,000 1500-byte packets, or ~200 packets per server port (assuming all ports experience congestion at the same time). Time for a reality check: when was the last time you saw 200 packets sitting simultaneously in the output queues of all output ports?

Even in such an unrealistic scenario, there are 300,000 bytes of data sitting in each 10GE output queue, resulting in 240 microseconds of queuing latency; a single switch thus adds a delay approaching SSD access latency. Not exactly what you’d want after paying an arm and a leg for an SSD-only disk array (or an all-NVMe server farm).
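Here’s a quick sanity check of that arithmetic (a minimal Python sketch using the assumptions above; the exact results differ slightly from the rounded numbers in the text):

```python
# Fermi estimate: 16 MB shared buffer on a 48 x 10GE ToR switch,
# 1500-byte packets, all server ports congested at once.
BUFFER_BYTES = 16 * 10**6      # 16 MB shared packet buffer (assumed)
PACKET_BYTES = 1500            # large-packet assumption
SERVER_PORTS = 48              # 48 x 10GE server-facing ports
PORT_SPEED_BPS = 10 * 10**9    # 10GE

packets_total = BUFFER_BYTES // PACKET_BYTES        # ~10,600 packets
packets_per_port = packets_total // SERVER_PORTS    # ~220 packets per port
queue_bytes = packets_per_port * PACKET_BYTES       # ~330 KB per output queue
delay_us = queue_bytes * 8 / PORT_SPEED_BPS * 1e6   # queue drain time

print(packets_total, packets_per_port, queue_bytes, round(delay_us))
# -> 10666 222 333000 266 (the text rounds to ~10,000 / 200 / 300,000 / 240)
```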

Another rule-of-thumb approach uses the bandwidth-delay product. The reasoning is simple: to keep data flowing, the amount of data in flight has to be at least as large as the end-to-end bandwidth times the one-way delay. Using 10GE link speed and a 100-microsecond delay, we need 125 kbytes of data in transit, and having more buffer space than that on any single device makes little sense unless you're dealing with a corner case (see above).
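The bandwidth-delay product estimate in code form (same sketch style; link speed and one-way delay are the values assumed above):

```python
LINK_SPEED_BPS = 10 * 10**9   # 10GE link
ONE_WAY_DELAY_S = 100e-6      # assumed intra-DC one-way delay

bdp_bytes = LINK_SPEED_BPS * ONE_WAY_DELAY_S / 8
print(bdp_bytes)  # -> 125000.0 bytes (~125 KB in flight)
```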

Conclusion: Assuming the ASIC vendors got their **** together, shallow-buffer switches are more than good enough for intra-data-center traffic. Next time we’ll focus on what happens when we cross the LAN-to-WAN boundary.

Want to know more?

Join the How Networks Really Work series starting on June 18th. You can attend the live session for free if you send me a really good question.

8 comments:

  1. Good guesswork but it might still not be enough for some of your followers to trust you on the shallow buffer mystery.
    What about the uplinks? Do they also share the buffer space?
    Pay attention to the fact that one-way delay != RTT. A definition of the latency you're measuring would also be helpful.
    Another assumption would be that most, if not all, traffic is TCP-based; then, of course, the bandwidth-delay product comes into play.
    Now I'm looking forward to the deep-buffer advocates joining the party.
    PS: It seems you got inspired by my YouTube reference about buffer requirements in networking from your beloved vendor ;)
    Replies
    1. "it might still not be enough for some of your followers to trust you on the shallow buffer mystery" << I don't have an answer (apart from "it depends") and we cannot solve the mystery because (A) it's really hard to generalize anything to a very wide range of environments (I've heard there's still a bit of a gap between general relativity and quantum mechanics ;) and (B) we don't have a solid understanding of how things work thanks to Broadcom & co.

      If I ever get as far as "it depends on..." followed by a solid (and justifiable) set of parameters, I'll be a happy person, and will know more than I know today.

      As for "one-way delay versus RTT", I spent a lot of time thinking about that, and got to the conclusion that RTT is the one that really matters (but maybe I got it wrong together with everyone else).

      And yeah, I would love to hear the cheers from the deep buffer crowd :D

      More to come, stay tuned...
  2. You also forgot about multiple incast frames arriving at exactly the same instant as another reason for buffering, but in that case you usually don't need massive buffers.
  3. The need for deep buffers definitely depends on your environment/types of traffic.

    One example where you need deep buffers is forwarding market data over high-latency circuits. Let's say you've spent a lot of time and money making sure your circuits are 100% clean; wouldn't you be a bit mad if your switches ended up discarding packets and causing a few seconds of latency?

    With shallow-buffer switches, if you increase the speed of your uplinks, you might get discards from uplinks to hosts; keep your uplinks at low speed, and you will get discards from hosts to uplinks. If your uplinks are connected to a router, it might also be costly to get 40G/100G ports on that router, which means you can save some money (on the router side) if you use deep-buffer switches with 10G uplinks. Why would anyone do that? Well, what's the point of using 40G/100G uplinks if your router has a 10G circuit (not 40G/100G) to other locations? Might as well buffer on the (cheap) switch rather than on the expensive router.

    Now, vendors using deep-buffer Broadcom ASICs were not really honest when they said you could buffer gigabytes of data. What did they forget to tell you? That it depends on the frame size. Broadcom Jericho uses buffer cells of 1000 bytes; any frame smaller than 1000 bytes consumes an entire cell, so if you try to buffer 200-byte frames (about the average size of market data), you can only use 20% of the buffer, and the rest is wasted (see the cell-rounding sketch after the comments). The switch spec sheet shows 8 GB, but you can only use 300 MB for 200-byte frames, less than some Catalyst switches.

    All Broadcom ASICs use this buffer-cell method, with a different cell size for each ASIC (usually 208 bytes). The ASIC used in the Arista 7020R doesn't have this limitation (one buffer cell can hold multiple frames), and I was told Jericho 2 doesn't have this problem either (to be tested). Anyway, what this means is that your deep-buffer switch's buffer is not as deep as you thought, and definitely not ready to replace your expensive PE router, which can buffer up to 500 ms of data per port.

    One last comment: if you look at all the popular shallow-buffer ASICs and work out how much buffer (in milliseconds) a single port gets (say, a 10G port on Trident and a 400G port on Tomahawk III), you will notice the buffer depth in milliseconds keeps decreasing: it was 7.2 ms with Trident, and it's down to 1.3 ms with Tomahawk III (see the calculation after the comments). Do you really think that's enough buffer?
  4. Very interesting topic and replies - well done everybody. For my part, I'll carry on being quite 'anal' about using the very same fabrics that were architected for DC traffic (in particular, leaves with 16 MB of total buffer, which can make sense in a DC, as highlighted in Ivan's article) in an environment that is far from being a DC: the SP's virtualized edge (NAS, NAM, and PE control and, more importantly, data planes deployed as VNFs, along with most of the services such as CGNAT, mobile packet core and so forth), where there's a huge difference in terms of the following:

    - RTT is much more variable and can be much higher than a typical DC RTT.
    - The TCP congestion-avoidance algorithm cannot be optimized for shallow buffering the way Google, Amazon and so forth can do it, because the SP is simply not in control of the TCP stacks and applications; it just transports them.
    - There can still be a huge interface speed difference, especially at the leaf layer, going from, say, 100GE to 10GE.

    - Last but not least, the assumptions underlying the 'Stanford Model' of buffer sizing (which advocates small buffers in SP networks) have been challenged from several angles and by several papers. Some of the challenges:
    1. It’s not just link utilization and queueing delay that should be optimized, but also the quality of service perceived by TCP flows, because at the end of the day that's where your customer is...
    2. A fixed number N of long-lived TCP flows is a pretty poor model; an SP network is more complex than that.
    3. The model should differentiate between core and edge/access links, where the interface speed difference can be significant.

    Such challenges opened up a much more complex scenario that has to be considered to properly model an SP network and come up with a more realistic formula for the best buffer size in every context. I guess this could be part of the next blog post, which Ivan said would be about inter-DC traffic (or maybe of a third one, since the environment I'm talking about is not inter-DC but the virtualized SP edge)?
    Ciao
    Andrea
  5. Hi Ivan, FYI Mr. Warner's incredible pages seem to have been discontinued (I bet the switch vendors are happy about that one). Maybe he retired; I doubt it's just a glitch, as he has disappeared from the list of "Staff individual home pages". He was there in the May 2019 capture on web.archive.org, but not today. Maybe we should petition to have those pages restored or curated by another interested party at UCSC. BR David Shephard
    Replies
    1. Thanks for the info! Enno Rey noticed that yesterday and saved a snapshot of the page.

      https://twitter.com/enno_insinuator/status/1145389102906454016?s=12
  6. Thank you guys for saving such key info - much appreciated
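Two quick sketches to back the numbers discussed in comment #3. First, the cell-rounding math (a minimal Python sketch; cell size, frame size and total buffer are the commenter's figures, and simple cell rounding reproduces the "20% usable" claim, while the lower 300 MB figure presumably reflects additional per-port limits not modeled here):

```python
import math

def usable_payload(total_buffer_bytes, cell_bytes, frame_bytes):
    """Payload that fits when every frame consumes whole buffer cells."""
    cells_available = total_buffer_bytes // cell_bytes
    cells_per_frame = math.ceil(frame_bytes / cell_bytes)  # a small frame burns a full cell
    return (cells_available // cells_per_frame) * frame_bytes

# Jericho-style example: 8 GB spec sheet, 1000-byte cells, 200-byte frames
print(usable_payload(8 * 10**9, 1000, 200) / 1e9)  # -> 1.6 (GB), i.e. 20% of 8 GB
```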
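Second, the per-port buffer depth in milliseconds from the last paragraph of comment #3 (a sketch; the total buffer sizes, roughly 9 MB for the original Trident and 64 MB for Tomahawk III, are assumptions chosen because they reproduce the quoted 7.2 ms and 1.3 ms figures):

```python
def buffer_time_ms(buffer_bytes, port_bps):
    """How long a single port can keep transmitting from a full shared buffer."""
    return buffer_bytes * 8 / port_bps * 1e3

print(buffer_time_ms(9 * 10**6, 10 * 10**9))    # Trident @ 10G        -> 7.2 ms
print(buffer_time_ms(64 * 10**6, 400 * 10**9))  # Tomahawk III @ 400G  -> 1.28 ms
```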