Does Small Packet Forwarding Performance Matter in Data Center Switches?

TL&DR: No.

Here’s another never-ending vi-versus-emacs-type discussion: merchant silicon like Broadcom Trident cannot forward small (64-byte) packets at line rate. Does that matter, or is it yet another stimulating academic talking point and/or red herring used by vendor marketing teams to justify their high prices?

Here’s what I wrote about that topic a few weeks ago:

Not many people care about 64-byte packet forwarding performance at 100 Gbps speeds. The only mainstream application of 64-byte packet forwarding I found was VoIP.

It’s a bit hard to generate large packets full of voice if you have to send them every 20 msec (the default interval specified in RFC3551), and it’s even harder to generate 3.2 Tbps of voice data going through a single top-of-rack switch. Unless my math is wrong, you’d need over 100 million concurrent G.729A-encoded VoIP calls to saturate the switch. Maybe we should focus on more realistic use cases.
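The "over 100 million" figure can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming plain RTP over UDP over IPv4 without header compression and counting bandwidth at the IP layer; the function names are made up for illustration, and the result counts one-way streams (a call has two of them).

```python
# Back-of-the-envelope check of the VoIP estimate.
# All constants are assumptions for illustration, not measurements.

def voip_stream_bps(codec_bps=8_000, interval_ms=20, header_bytes=40):
    """Bandwidth of one one-way RTP stream, measured at the IP layer.

    codec_bps:    G.729A produces 8 kbps of encoded voice
    interval_ms:  one packet every 20 ms
    header_bytes: IPv4 (20) + UDP (8) + RTP (12), no header compression
    """
    payload_bytes = codec_bps * interval_ms / 1000 / 8
    packets_per_second = 1000 / interval_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second


def streams_to_fill(switch_bps, stream_bps):
    """Number of concurrent streams needed to reach switch_bps."""
    return switch_bps / stream_bps


stream_bps = voip_stream_bps()
streams = streams_to_fill(3.2e12, stream_bps)
print(f"{stream_bps / 1000:.0f} kbps per stream, "
      f"{streams / 1e6:.0f} million streams for 3.2 Tbps")
```

Under these assumptions each stream is roughly 24 kbps, which puts the answer at about 133 million streams. Counting Ethernet framing overhead, or counting each call as two streams, moves the number around but not its order of magnitude.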

Not surprisingly, Minh Ha disagreed with me:

TCP ACK is min-size 64 bytes, and small packet 200-300 bytes or less are quite common in DC, including FB DCs. TCP ACKs take up a fair amount of traffic in the Internet and also inside a DC, so for this reason alone it’s worth having a chipset capable of processing 64-byte packets efficiently. And since 100GE links are normally aggregate links, they accumulate even more of those small packets from different sources. That makes small-packet processing even more important.

It’s obviously time to step back and do some Fermi estimates. Let’s take the Juniper QFX5130 switch (supposedly built on Trident-4). Juniper claims 25.6 Tbps and 5.68 Gpps forwarding performance. Dividing the first number by the second gives line-rate forwarding performance for packets longer than roughly 560 bytes… but they’re using marketing math[1], counting each packet as it enters the switch and leaves it, so the real line-rate forwarding performance starts around 280-byte packets. The true minimum packet size is probably a bit smaller due to FCS and inter-frame gap (yes, vendor marketers include them in bandwidth-focused forwarding performance).
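Here is the same Fermi estimate as a short script. It is only a sketch: the two claimed numbers come from the paragraph above, while the 20 bytes of per-frame wire overhead (preamble plus inter-frame gap) and the question of whether the pps figure is double-counted as well are assumptions.

```python
# Smallest packet size a switch can forward at line rate,
# derived from its advertised bandwidth and packet rate.

CLAIMED_BPS = 25.6e12         # advertised bandwidth
CLAIMED_PPS = 5.68e9          # advertised packet rate
WIRE_OVERHEAD_BYTES = 8 + 12  # assumed: preamble + inter-frame gap


def line_rate_threshold_bytes(bandwidth_bps, packets_per_second):
    """Bytes per packet at which bandwidth and pps limits coincide."""
    return bandwidth_bps / packets_per_second / 8


# Taking both advertised numbers at face value
as_claimed = line_rate_threshold_bytes(CLAIMED_BPS, CLAIMED_PPS)

# Bandwidth counted twice (ingress + egress), pps counted once
bandwidth_halved = line_rate_threshold_bytes(CLAIMED_BPS / 2, CLAIMED_PPS)

# Both numbers counted twice: the threshold does not move
both_halved = line_rate_threshold_bytes(CLAIMED_BPS / 2, CLAIMED_PPS / 2)

# If the bandwidth figure includes preamble and inter-frame gap,
# the frame itself can be that much smaller
frame_only = bandwidth_halved - WIRE_OVERHEAD_BYTES

for label, size in [("as claimed", as_claimed),
                    ("bandwidth halved", bandwidth_halved),
                    ("both halved", both_halved),
                    ("bandwidth halved, minus wire overhead", frame_only)]:
    print(f"{label}: {size:.0f} bytes")
```

The result hinges on which of the two advertised numbers is double-counted, so it's worth checking how a particular datasheet defines them before trusting either threshold.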

Anyway, that performance is still good enough for 200-300 byte packets. What about TCP ACKs? As it turns out, TCP usually sends an ACK packet for every data packet, which means that there could never be more 64-byte ACK packets than large (1500-byte) data packets. Assuming that mix, the average packet size would be around 780 bytes – yet again, Broadcom silicon’s performance is more than good enough.
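A quick sketch of that worst-case mix, assuming exactly one 64-byte ACK per 1500-byte data packet and 12.8 Tbps of traffic going through the switch (the helper names are illustrative, and wire overhead is ignored):

```python
# Average packet size for a traffic mix, and the packet rate
# needed to carry a given bandwidth at that average size.

def average_packet_bytes(mix):
    """mix is a list of (packet_bytes, relative_count) pairs."""
    total_packets = sum(count for _, count in mix)
    total_bytes = sum(size * count for size, count in mix)
    return total_bytes / total_packets


def required_pps(bandwidth_bps, avg_packet_bytes):
    """Packets per second needed to fill bandwidth_bps."""
    return bandwidth_bps / (avg_packet_bytes * 8)


one_ack_per_data_packet = [(1500, 1), (64, 1)]
avg_bytes = average_packet_bytes(one_ack_per_data_packet)
pps_needed = required_pps(12.8e12, avg_bytes)

print(f"average packet size: {avg_bytes:.0f} bytes")
print(f"packet rate at 12.8 Tbps: {pps_needed / 1e9:.2f} Gpps")
```

The required packet rate comes out at roughly 2 Gpps, below the advertised 5.68 Gpps even if that figure has to be halved.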

But what if I’m wrong and there are tons of smaller TCP data packets? In that case there’s either something badly wrong with your TCP stack, or you’re running some sort of request-response protocol using small packets, in which case you won’t be able to saturate the fabric links anyway due to RTT and endpoint latency. The details are left as an exercise for the reader.
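One possible way to start on that exercise is sketched below. Every number in it is hypothetical: 200-byte requests and responses, a 100-microsecond round trip including endpoint processing, and a single outstanding request per connection (no pipelining).

```python
# How many strictly request-response connections would it take
# to fill a switch? All inputs are hypothetical.

def per_connection_bps(request_bytes, response_bytes, rtt_us):
    """Throughput of a connection with one exchange in flight per RTT."""
    bits_per_exchange = (request_bytes + response_bytes) * 8
    return bits_per_exchange * 1_000_000 / rtt_us


def connections_to_fill(target_bps, connection_bps):
    """Concurrent connections needed to reach target_bps."""
    return target_bps / connection_bps


conn_bps = per_connection_bps(request_bytes=200, response_bytes=200, rtt_us=100)
needed = connections_to_fill(12.8e12, conn_bps)

print(f"{conn_bps / 1e6:.0f} Mbps per connection")
print(f"{needed:,.0f} concurrent connections for 12.8 Tbps")
```

With these made-up inputs a single connection tops out at 32 Mbps, so the servers behind one switch would have to keep about 400,000 such exchanges going at once. Plug in your own packet sizes and latencies before drawing conclusions.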

Could there be another scenario that would result in gazillions of small packets? Probably – and I’d love to hear about it. If you’re aware of a real-life application that can generate 12.8 Tbps worth of small packets on servers connected to a single ToR switch, please leave a comment.

For everyone else, here’s the TL&DR version of this debate: Broadcom doesn’t implement line-rate forwarding of small packets because their target audience knows they don’t need it, and implementing it would just make their silicon unnecessarily expensive.

But what if you’re running one of those ultra-high-speed unicorn applications that I haven’t heard of yet? Stop complaining, figure out your traffic profile, and use fewer ports on the switch if needed. Problem solved.


  1. Considering that they claim 25.6 Tbps forwarding performance on a switch with 32 x 400GE ports (= 12.8 Tbps of bandwidth), I’m pretty sure we can use the lower number.

6 comments:

  1. In their datasheet it's stated "system throughput (bidirectional)". That's because of full duplex. All vendors are calculating switching capacity like this. So for the calculation of the frame size, you have to divide it by two. The calculated ~281 bytes frame size is already pretty small from my point of view. What's the average frame size in an average environment? Around 500 bytes? With those specs they maybe could cover 99 % of the market. Would love to see an example in real life of exhausting 12.8 Tbps line rate with real life traffic over a reasonable amount of time.

  2. Hi Ivan,

    Very detailed analysis as usual, and I agree with most of what you said above :)) . The thing is, in my comment as you quoted above, I said 100-400GE links are mostly aggregate uplinks. So a 12.8 Tbps switch would normally be a backbone switch. In that case, each uplink port will be the accumulator of packets from many different sources, and these packets arrive unsynchronized, meaning they can come one right after another, or many at the same time. Add to that many-to-one, incast-prone traffic patterns, and the number of small packets, i.e. 64-300 byte packets, can overwhelm a backbone port's pps capacity in a busy DC.

    Also, for some HPC scenarios, they remove the 64-byte minimum packet size and have packets as small as 32-40 bytes:

    https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/

    So switches made for those environments (say high-energy physics apps which have very high data rate requirements) might be required to process a lot of small packets.

    You're very much on point about Broadcom not implementing high-end packet processing due to cost reasons. Economics come first. But I actually don't believe they, or anyone ATM, can do it at all. Processing 64-byte packets at 100GE requires a per-packet processing budget of 6.7 ns if your LC has only 1 port, and 6.7 ns/n for n-port LCs. For 400GE, the time budget is 4 times lower. I doubt that will ever be possible given technical constraints with TCAM speed, power budget and signal integrity problems at high speed. If I recall correctly, even at 10GE, the line rate starts at 150 or 160-byte packets or something similar. Those numbers seem to have stuck throughout the decades. That's why at very high speed, label switching is preferable to IP, esp. to IPv6, due to its greater speed and lesser power requirement.

    I bring this up so we can all think about it and discuss the nuances/implications, not as a vendor-bickering exercise :)) .

    Cheers Minh

    Replies
    1. Most ASICs I have encountered, including the Broadcom ones, have mechanisms in place to handle bursts of small packets arriving one after the other without overwhelming the forwarding engine.

      In 11 years of operating networks with datacenter merchant silicon across many hundreds of thousands of deployed switches I have never seen a switch come anywhere near hitting a PPS limit outside of lab test scenarios designed to find them.

      Once you include jumbo packets and TCP behaviors such as delayed ACK in the traffic profile, the average packet size is measured in KB.

  3. Btw Ivan, just a minor detail, related to vendor's math, using the above example of 25.6 Tbps and 5.68Gpps. Since they normally count both bandwidth and pps twice, I think the effective line rate would still be for packets 560 bytes -- 12.8/2.84.

  4. @Minh Ha:

    > Add to that many-to-one, incast-prone traffic patterns, and the number of small packets, i.e 64-300 byte packets, can overwhelm a backbone port's pps capacity, in a busy DC.

    Keep in mind that the spine switches are not over-subscribed. To get into the scenario you propose, you'd have to have at least 10 Tbps (per spine) worth of small packets. Yet again, I would love to see something generating that.

    > That's why at very high speed, label switching is preferable to IP, esp. to IPv6, due to its greater speed and lesser power requirement.

    It's just the question of looking up 20 bits (label) versus 64 bits (IPv6 prefix). There might be power issues (I know nothing about those), but assuming they use TCAM for lookups, I don't expect a longer lookup to be significantly slower. Table sizes are obviously a different story.

  5. If you have a fabric shared among many different apps with mixed traffic profiles and apps that scale horizontally, then 2-4 x 25 Gbps should be more than enough for any particular server connected to that fabric, and the volumes of n x 1 Tbps are then just efficiently spread across the DC fabric and its many servers. Is that reasonable thinking?

  6. @Vuk: As always, the correct answer is "it depends". If you want to do your design right, there's no way around understanding the traffic requirements. Once you have those, everything else becomes a piece of cake.
