Repost: On the Importance of Line-Rate Switching of Small Packets
I made a flippant remark in a blog comment…
While it’s academically stimulating to think about forwarding small packets (and applicable to large-scale VoIP networks), most environments don’t have to deal with those. It looks like such a non-issue that I couldn’t find recent data; in the good old days, ~50% of the packets were 1500 bytes long.
… and Minh Ha (by now a regular contributor to my blog) quickly set me straight with a lengthy comment that’s too good to be hidden somewhere at the bottom of a page. Here it is (slightly edited). Also, you might want to read other comments to the original blog post for context.
I don’t deny that small packets aren’t much of an issue on a daily basis, but there are several considerations that put the topic beyond academia. The first one is quite obvious: to benchmark how well one router/switch performs versus the next, a common ground is needed, and since routers are traditionally evaluated on their worst-case performance, small packets are useful for benchmarking.
Second, small-packet performance shows the basic capability of a router (let’s call them all routers, as most of them do layer 3): doing a plain destination-based lookup. This basic capability deteriorates quickly as we start turning on sophisticated features like security or QoS. The Cisco QuantumFlow processor, when it came out, could handle about 8 Mpps, degrading to 2 Mpps as more features were activated. So it’s always good to know at least the baseline performance capability, so we can have a rough expectation of how well a box will perform under stress.
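To make the “worst case” concrete: once you account for the fixed per-frame overhead on the Ethernet wire, a 10GE port has to sustain roughly 14.88 Mpps of lookups at 64-byte frames versus about 0.82 Mpps at 1500 bytes, an ~18x difference in lookup load at the same bit rate. A minimal sketch of the arithmetic (not tied to any particular platform):

```python
# Worst-case lookup load: max packets per second at line rate on Ethernet.
# Each frame carries 20 bytes of fixed wire overhead:
# 7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap.
WIRE_OVERHEAD = 20  # bytes

def max_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second at line rate for a given frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD) * 8)

if __name__ == "__main__":
    for size in (64, 512, 1500):
        print(f"{size:>5}B frames on 10GE: {max_pps(10e9, size):,.0f} pps")
```

Running it shows 64B frames requiring about 18x the lookup rate of 1500B frames, which is exactly why benchmarks lean on the small sizes.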
I believe back in the early 2000s, one of Juniper’s platforms was known for ingress-to-egress out-of-order delivery of packets under heavy traffic conditions. All routers with a buffered-crossbar architecture are susceptible to this, of course, but the situation arises when there’s a massive backlog of cells in the VOQs and the fabric, which happens when packet lookup/processing performance is less than optimal. So it’s always good to have a superior architecture that can handle packets smoothly, and that quality is reflected in how well a platform handles small packets.
Third, real-life traffic is not uniform, but tends to be long range dependent (LRD)/self-similar. Even Broadcom admits as much on page 5 of this presentation.
The nature of self-similarity/LRD can be due to heavy hitters and the elephant/mice flow distribution, the nature of file-size distributions, etc. TCP congestion control also contributes to LRD/self-similarity. With this kind of traffic, congestion spreading is a reality, and routers with weak architectures will be killed. So high-performance routing platforms are always good to have. Self-similar traffic also makes big buffers mostly useless, but I digress ;).
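A toy illustration of the point (a sketch, not a rigorous model, and the source counts and scales are my own choices): aggregating many ON/OFF sources whose ON/OFF periods are heavy-tailed (Pareto with shape below 2) produces traffic that stays bursty even after heavy averaging, unlike Poisson-like traffic, which smooths out quickly as you zoom out:

```python
import random

random.seed(42)

def onoff_source(slots: int, alpha: float = 1.5) -> list[int]:
    """One ON/OFF source: Pareto-distributed ON and OFF durations (in slots).
    alpha < 2 means infinite variance, the classic recipe for LRD aggregates."""
    series, on = [], True
    while len(series) < slots:
        dur = max(1, int(random.paretovariate(alpha)))
        series += [1 if on else 0] * dur
        on = not on
    return series[:slots]

def burstiness(series: list[int], scale: int) -> float:
    """Coefficient of variation of the series aggregated into scale-slot bins."""
    bins = [sum(series[i:i + scale]) for i in range(0, len(series), scale)]
    mean = sum(bins) / len(bins)
    var = sum((b - mean) ** 2 for b in bins) / len(bins)
    return (var ** 0.5) / mean if mean else 0.0

# Aggregate 50 sources; the coefficient of variation decays only slowly as
# the averaging scale grows -- the hallmark of self-similar traffic.
slots = 20_000
agg = [sum(col) for col in zip(*(onoff_source(slots) for _ in range(50)))]
for scale in (1, 10, 100):
    print(f"scale {scale:>3}: CoV = {burstiness(agg, scale):.3f}")
```

If the aggregate were Poisson-like, the burstiness at scale 100 would be close to zero; here it remains clearly non-zero, which is why a burst can outlast whatever buffer you throw at it.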
I’m aware that some time ago there was an argument going back and forth about whether small-packet performance is needed (someone was nit-picking on Arista, if I recall), and Arista’s response was something along the lines of “we produce switches to meet our customers’ use cases, and so far none of them demands superior small-packet handling, so it’s basically a non-issue”. While on the surface this seems reasonable, on a deeper look it isn’t, for all the reasons above. After all, if your router is so good, why wouldn’t it outperform competitors on the most basic benchmark of all, packet forwarding? If yours is indeed supreme, then it doesn’t matter what the benchmark is, right? You’d still outperform. If it performs well with small packets, then I can sleep well after buying it, knowing that I can trust its tenacity even in extreme or hostile conditions.
And speaking of hostile conditions, DDoS is another reason small-packet performance matters, as it translates directly into how much a router can take before it goes down. Of course there are other measures to protect against DDoS, but if all vendors provide similar capabilities in those areas, then what sets one apart from another is packet-handling capacity, as that decides who’s the last man standing in a heavy DDoS attack.
Some time ago, I came across this paper from IBM (released in 2013) that surveyed over 30,000 servers, located across more than 50 production data centers, over a two-year time span. Among its findings: 80%+ of packets are 500 bytes or less, and on top of that, more small packets come in than go out.
An excerpt from it about the packet size: “the average MTU is 1806.81 bytes due to the dominance of the traditional Ethernet MTU value equal to 1500 bytes as shown by the median MTU value. For the network load on each server, we see that: (i) the average server network traffic is roughly 1.16 Kpps and 5.7 Mbps; (ii) the average packet size is 300 bytes, and (iii) the weekend day network traffic is only slightly lower than the week day traffic.”
Re the AWS router, I completely agree with you that AWS doesn’t need all the fancy add-on features that Juniper (and other vendors) provide; that’s probably why they decided to do it themselves :)). Vendors pack all those features in to differentiate themselves and to charge more for their devices. A lot of those features go unused, not just by cloud providers but by the average enterprise/SP as well. By doing it themselves, AWS can use all the chip area that would otherwise be wasted on unneeded stuff to optimize for packet performance, which is mainly what the cloud needs. While their routers are most likely not top-notch in quality due to their inexperience, for utility computing that’s all they care about and need, as you rightly pointed out.
At the end of the day, I feel networking (and IT in general) has commoditized so much over the years, all the more so since the cloud became mainstream, that we can expect to see more and more vendors consolidate or simply go away, and those who want to survive will have to reinvent themselves and differentiate their products instead of relying on the likes of Broadcom to provide chipsets/ASICs for them, essentially turning themselves into Broadcom resellers. Packing in a lot of features that almost no one uses or needs is not the answer though 😜. There are always markets for good, competitive products that solve the right problems, just as AMD has proven by coming back from the brink of death several times after being dealt almost-mortal blows by Intel, which is feeling it more and more with each passing year as its loyal customers go AMD, go Arm, or do it themselves, like Apple.
Oh, and I too went through the Broadcom material shared by Oleg. Most of it is about physical-layer stuff, and there’s no treatment of the ASIC architecture: the amount of CAM/TCAM, the traffic manager, the fabric scheduler, the crossbar architecture, etc., the important stuff basically. Typical Broadcom… It’d be really unfortunate if the networking industry had to rely on them to innovate on its behalf and provide vendors with the backbone of their products. It kinda reminds me of how the server/PC industry stagnated for decades after Intel took center stage and eliminated most of the competition, leaving us stuck with their crappy CISC for a long, long time, up until recent years.
Always a pleasure reading your stuff. You and Ivan are an explosively powerful mix. :)
Totally agree with you on small packets.
A couple of things though for the time being:
I just want to take a stand, for once, in defense of the high-end vendors: they are asked for hundreds of sometimes pretty exotic features by their big service-provider customers, and therefore the code and the chipset get, as you pointed out, extremely crowded with stuff. I have the impression it is inevitable, and it’s not just done on purpose to make more money as such. I reckon they’d rather avoid it if they could.
The other thing: could you open up another blog post soon on the interesting statement "Self-similar traffic also makes big buffer mostly useless"? :)
Last thing: I might be naive, but our market needs, even more urgently now that the scenario is getting fuzzier and fuzzier, an independent testing facility for functionality AND performance. Universities should be involved, and the vendors too, of course. Stringent test procedures, well documented and thus reproducible. In short, proper stuff.
IMIX can also serve as a baseline; it would be a better benchmark than 64B packets alone. More important is to do multidimensional scale and performance tests on the same OS, not unidimensional tests on an OS version tuned per test, as some Chinese vendors have gotten used to doing.
In a test with a single feature, the performance may not drop significantly. A router OS image packed with many features requires additional CPU cycles to check IF statements and bypass code even when a feature is not turned on but its software module is on the line card. Some vendors have used pre-canned OS images containing just the code optimized for the particular feature under test. In reality we turn on many features at the same time, and a similar setup should be validated during the test. There are customers who wait several years for one official image. Be careful with such vendors.
One more: it's better to test with IMIX, as you have a higher probability of detecting malfunctions and bad architectures. With a fixed packet size, a vendor can use a simpler dedicated ASIC that performs better for that size but not for others. At the end of the day, DDoS is still IMIX.
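For reference, the classic "Simple IMIX" mix (7 small, 4 medium, 1 large frame per 12-packet cycle) works out to an average frame size of roughly 362 bytes, much closer to the small end than to 1500B, which is why it stresses lookup rates almost as hard as a pure 64B test. A quick sketch of the arithmetic:

```python
# Classic "Simple IMIX": 7 x 64B, 4 x 594B, 1 x 1518B frames per 12-packet cycle.
IMIX = [(64, 7), (594, 4), (1518, 1)]

def imix_avg(mix) -> float:
    """Weighted average frame size of a traffic mix given as (size, count) pairs."""
    total_pkts = sum(count for _, count in mix)
    total_bytes = sum(size * count for size, count in mix)
    return total_bytes / total_pkts

print(f"average IMIX frame size: {imix_avg(IMIX):.1f} bytes")
```

That average lines up nicely with the ~300-byte figure from the IBM data-center survey quoted earlier in the thread.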
Thx for the kind words :))! Yeah, I too am aware of vendors being asked by certain high-end SP customers to put all sorts of features onto their chipsets, in kitchen-sink-style RFPs. There are always such customers, and it's fine to address their requirements as such. But that's exactly why vendors have product lines, just like Broadcom creates different chipsets for different markets: Trident for the enterprise, Jericho for SPs, Tomahawk for the cloud, light on features and powerful in forwarding capacity. Hypothetically speaking, if a once-great market leader like Cisco had executed well, it wouldn't be struggling the way it is now. And I do mean struggling, because Cisco's revenue seems to have been stuck at the current level for a few years now, this year slightly worse than 2019. It looks like Cisco's core business is being slowly eaten away by cloudification and whiteboxing. It's been trying hard to diversify in recent years, something it should have started 20 years ago when its spirit was still vibrant.
I personally think the problem goes deeper than tech, because if it were just tech, Broadcom would be in the same rabbit hole. The problem, IMO, lies with vendor strategy, allocation of resources, and execution. Let's dissect Cisco. The company was a pioneer in the golden age of networking, the 90s. Cisco in its young life was prized, among other things, for its innovative strength and its dynamic culture. IOS, for example, was written by high-class programmers. EIGRP was another great invention: a protocol that, rightly configured and designed, is superior to link-state protocols in scalability and downright better in tree topologies, especially if one wants to implement valley-free routing without resorting to BGP unnecessarily.
But Cisco couldn’t take advantage of these, and of its incumbency, the way MS did in making Windows the de facto standard desktop OS. IOS got nowhere near that; instead, it got chopped into all sorts of variants, like IOS XE, IOS XR, etc. It got too complicated, and customers slowly started to complain about the complexity of it all. That was just plain stupid IMO, and it opened the doors for new entrants to come in and attempt to “differentiate with their NOS”. Cisco dug its own grave on that one.
As for EIGRP, as promising and elegant as it was, Cisco was never able to throw its weight around and make it into a quasi-standard either. Terrible execution, due in part to the then-management’s arrogance and excessive focus on the stock price and on “shareholder value”, I must say. Lucent was a prime example of how following such misguided practice can lead to destruction. Cisco is wiser, but only just, because right now its public image is that of a tired, aging company instead of the dynamic technology leader it was in its younger years. At one point, in March 2000, Cisco was the most valuable company in the world, bigger than MS; now its market cap is one-eighth of MS’s. Is that the result of sound strategy, good prioritization of resources, and excellent execution? Hardly :p.
And speaking of the market getting fuzzier and the need for an independent testing facility, I couldn’t agree with you more! The other day I saw someone mention the Innovium Teralynx ASIC reaching its one-million-ports-shipped milestone, so I decided to dig around on Teralynx. What I came across was a bunch of superficial whitepapers and articles praising its prowess, with no worthwhile architectural info. Here’s a sample of those totally worthless articles:
I did manage to find one nice piece of info though: “As for latency, Khemani says the typical port to port hop is on the order of 500 nanoseconds, but for typical alternatives it is more like 1 microsecond; the important thing is that this number is much lower than what Tomahawk 4 will deliver.” So a high-density 400GE switch offers essentially the same port-to-port latency as a 10GE switch? And Tomahawk is worse than that? And I thought the point of having 400G as a standard was so that things could improve on all fronts :p. Looks like the only things that did improve were the SerDes frequency and the serialization speed. This is basically the same as in computing: Intel (and others) come up with superfast parallel CPUs, only to have them dragged down by the memory-speed bottleneck, so application latency remains mostly unchanged.
These situations are allowed to happen because the whole networking industry relies on a handful of companies to provide the chipsets/ASICs that form the backbone of its products. That stifles innovation and leads to the obfuscation of info you rightly mentioned, and it doesn’t matter, since these guys, being the only sources the industry can turn to, make their own rules and get to throw their weight around.
And just as Piotr brought up, tests should be multidimensional, because even at the hardware level there’s no chipset architecture that works best for all scenarios. An ASIC may test well with one feature or at a certain offered load, only to fall apart as more features are activated or at higher utilization. Out-of-order packets, for example, are one outcome that shows up when buffered-crossbar chipsets are stress-tested heavily. Cisco probably learned from this, and so, when they designed their CRS platform, they ruled out bit-slicing/cell spraying, sacrificing throughput for in-order port-to-port delivery and better latency:
Re my statement about self-similar traffic and big buffers: self-similar/LRD traffic means traffic arrivals are bursty on many time scales. And since it’s bursty on many time scales, a big buffer helps little (how much bigger can you make your buffer if the burstiness persists?), not to mention that the bigger the buffer and the more it gets occupied, the larger the latency becomes. And by necessity, a big buffer means the memory has to be off-chip, most likely DRAM/RLDRAM, further increasing RTT and adding to total device latency. Even the use of HMC here won’t help, because HMC trades latency for bandwidth. So yes, big buffers have serious latency implications and are best avoided.
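To put a rough number on that latency cost: a full buffer in front of a port adds a queuing delay of buffer size divided by drain rate, so even a moderately deep buffer translates into milliseconds of added delay. A back-of-the-envelope sketch (the buffer sizes below are illustrative, not taken from any specific ASIC):

```python
def max_queuing_delay_ms(buffer_bytes: float, port_bps: float) -> float:
    """Worst-case added delay when the buffer draining into one port is full:
    time to transmit the whole backlog at the port's line rate."""
    return buffer_bytes * 8 / port_bps * 1000

# Hypothetical examples: a shallow on-chip buffer vs a deep off-chip buffer,
# both draining into a single 10GE port.
for label, buf in (("12 MB on-chip", 12e6), ("4 GB off-chip", 4e9)):
    print(f"{label}: up to {max_queuing_delay_ms(buf, 10e9):.1f} ms of queuing delay")
```

Twelve megabytes already means up to ~10 ms of standing delay on a 10GE port; multi-gigabyte off-chip buffers push that into seconds, which is the "bufferbloat" end of the trade-off.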
For more evidence of the existence of self-similarity in modern networks, you can read this one here:
Toward the end, it says: “Next, we studied the transmission properties of the applications in terms of the flow and packet arrival processes at the edge switches. We discovered that the arrival process at the edge switches is ON/OFF in nature where the ON/OFF durations can be characterized by heavy-tailed distributions.” That’s the sign of self-similarity, i.e. burstiness on different time scales. The paper mentions small packets as well: about 50% of the packets they sampled were of the small variety.
And with regard to the futility of using big buffers to address burstiness, especially microbursts in the DC, you can take a look here; this one involved MS Research, so its findings are empirical as well:
@Minh Ha: Regarding the comparison of Cisco and Microsoft
Microsoft had the guts to effectively rewrite the core operating system when going from 95/98 to XP, and again to rewrite lots of insecure parts with Vista/7.
Cisco never found the willpower to do anything similar with Cisco IOS, so it's still running as a single-memory-image blob within a Linux process on IOS XE (I'd be super glad to be corrected), with tons of lipstick applied to that aging piglet.
Cisco also got hooked on acquisitions because they were never able to put their house in order, and the internal overhead made R&D-through-acquisitions a much more palatable approach... and so we got IOS XR and Nexus OS (and a ton of other things).
I don't see eye to eye with Bill Gates & MS on quite a few things, but have it to give to them on their execution and the will to try and take great, calculated risk. That dynamic, vibrant spirit is exactly what's made them into what they are today. They gave up on their old OS code base and forged ahead with the new NT family code, starting with NT 3.5, 4.0, 2000, XP...They came up with Active Directory in Windows 2000 server which was a tsunami that'd taken the world by storm back in those days, totally departing from their flat, NetBIOS-based, centralized directory service in NT 4.0 into the DNS-based, distributed, eventually consistent directory service model. That was a huge move and great risk if it didn't pan out. But they knew what they did and got the right focus and determination, and managed to put out Novell and its E-directory out of business. And yes, I still remember the security part in Vista/7. And most obviously, MS knew when the software industry was saturated to transition their business quickly enough into a service provider model, like now, even giving away their iconic Windows OS in the process. That's big-league.
Cisco was much more of a mess :)). They did tons of M&A in the late 90s with the grand vision of being the one-stop shop for the converged world of data and voice. Among their acquisitions were optical-networking companies. AFAIK, this is 2021, and Cisco is still a nobody in optical networking, and in mobile networking as well. Huawei’s revenue was 180 billion in 2016, almost 4 times that of Cisco, and Huawei was nowhere to be seen while Cisco reigned supreme in the late 90s and early 2000s.
If Cisco were well run and didn't get obsessed with the stock price and "shareholder value", by now they could probably be too big to fail, too big to worry about the clouds and all the rest.
And speaking of all the Cisco OSes, you surely remember CatOS; for a time we were stuck running both IOS and CatOS because Cisco was having trouble integrating Crescendo after acquiring them ;).
There is another issue with bad small-packet performance: it shows that the architecture is not deterministic. For web browsing and YouTube that is not a problem, but for a lot of safety-critical, industrial, closed-loop applications it is. Many engineers are crying out for TDM and ATM again, but because of fashion they will not get them at any reasonable cost. The typical solution is the next turn of the development spiral: reinventing the wheel in a new form. DetNet is coming to the rescue… But the youngsters have to rediscover all the old problems, and that will take some time, instead of just producing more TDM/ATM. Maybe sometime they will also reinvent PNNI; DetNet would need it… :-) SDN has already reinvented the TDM/ATM management plane. What comes next in reinventions?
BTW, the TDM architecture was created because digital processing speed was not good enough for the higher physical link speeds. That is why you needed a very simple, hardware-implemented demultiplexing step first, and only then could you look at the meaning of the bits. As we go to terabit speeds, it may happen again. They will hide it under some new names, but there is nothing new under the sun. Just new clothes… :-)
Hi Ivan, I guess line-rate throughput for small packets may become more relevant with the current trend of populating the Internet with IoT. Those devices generate small-packet flows that have to be aggregated by the backbones. The more IoT devices there are to aggregate, the more obvious the need for small-packet line-rate throughput becomes.
@Minh Ha, regarding self-similarity and big buffers: thanks for the pointers to some very interesting material.
I just wanted to point out the tremendous amount of recent top-quality information from the 2019 Stanford University "Workshop on Buffer Sizing" at http://bufferworkshop.stanford.edu/.
The workshop was really fuelled by Nick McKeown's renowned 2004 paper "Sizing Router Buffers"… but 15 years later… see https://web.stanford.edu/class/cs244/papers/sizing-router-buffers-redux.pdf
Having said that, it's not black and white, as with any complex thing; one of the concluding remarks of the workshop speaks for itself: "We still know very little about buffer size".
Enjoy the reading!
It'd be great if this could be the subject of a series of Ivan's blog posts too. It could be a chance (call me naive…) for networkers, academics, and vendors to exchange ideas and experiences. Major router vendors were in fact missing at the workshop…
Andrea Di Donato
Sorry - still me.
Just to say that the working link to the workshop is: