Repost: Drawbacks and Pitfalls of Cut-Through Switching
Minh Ha left a great comment describing additional pitfalls of cut-through switching on my Chasing CRC Errors blog post. Here it is (lightly edited).
Ivan, I don’t know about you, but I think cut-through and deep buffer are nothing but scams, and it’s subtle problems like this [fabric-wide CRC errors] that open one’s eyes to the difference between reality and academia. Cut-through switching might improve nominal device latency a little bit compared to store-and-forward (SAF), but when one puts it into the bigger end-to-end context, it’s mostly useless.
Take the topology in the post: if the spine switch operates in cut-thru mode, it can forward frames a bit faster, only for them to be held up at the slower device(s) down the chain. Those are lower-end devices that can’t handle the higher rate (asynchronous speeds), and being at the lower end, they might be SAF devices themselves. The downstream devices then become the bottlenecks, so end-to-end latency is hardly any better. Also, as cut-thru switches can’t check the CRC before they start forwarding, they pass corrupted frames along; those frames get dropped further downstream and have to be retransmitted, which increases end-to-end latency.
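To put rough numbers on the end-to-end argument, here is a minimal sketch in Python of serialization delay along a path. It is an illustration under simplifying assumptions, not a model of any particular switch: propagation and pipeline delays are ignored, a cut-through switch is assumed to start forwarding once the first 64 bytes have arrived, and a switch whose egress link is faster than its ingress link is assumed to fall back to store-and-forward. The function names and parameters are made up for this example.

```python
def serialization_us(num_bytes, gbps):
    """Time to put num_bytes on a link of the given speed, in microseconds."""
    return num_bytes * 8 / (gbps * 1000)


def path_latency_us(frame_bytes, link_gbps, modes, header_bytes=64):
    """Time until the last bit of one frame reaches the receiver.

    link_gbps: speeds of the links along the path, sender side first.
    modes:     "ct" or "saf" for each switch between consecutive links.
    """
    assert len(modes) == len(link_gbps) - 1
    start = 0.0  # when transmission starts on the current link
    end = serialization_us(frame_bytes, link_gbps[0])  # last bit on that link
    for i, mode in enumerate(modes):
        ingress, egress = link_gbps[i], link_gbps[i + 1]
        if mode == "ct" and egress <= ingress:
            # Cut-through: start forwarding once the header has arrived.
            start = start + serialization_us(header_bytes, ingress)
        else:
            # Store-and-forward (or assumed fallback): wait for the whole frame.
            start = end
        end = start + serialization_us(frame_bytes, egress)
    return end


# Same-speed fabric: cut-through removes one frame time per switch.
print(path_latency_us(1500, [10, 10, 10], ["saf", "saf"]))
print(path_latency_us(1500, [10, 10, 10], ["ct", "ct"]))
# 40G spine in cut-through mode feeding a 10G store-and-forward leaf.
print(path_latency_us(1500, [40, 40, 10], ["ct", "saf"]))
print(path_latency_us(1500, [40, 40, 10], ["saf", "saf"]))
```

In this toy model the slow last hop contributes 1.2 µs in both mixed-speed cases, so the cut-through spine saves only about the 40G serialization time of one frame (roughly 0.3 µs).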
Even within the cut-thru switch itself, cut-thru mode is not always viable. Let’s use the Nexus 5k as an example. It uses a single-stage crossbar fabric with Virtual Output Queues (VOQ). If you have output contention, which you always do if your network is highly utilized, then packets need to be buffered at the input, waiting for the output to become available. In that case, cut-thru behavior is essentially the same as SAF; both have to buffer the packets and wait for their turn to transmit.
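The contention argument can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical names: a frame may start leaving the egress port once the switch is ready to forward it (header received for cut-through, whole frame received for SAF) and once the port has finished whatever it was already sending.

```python
def egress_start_us(frame_bytes, gbps, busy_until_us, mode, header_bytes=64):
    """Earliest time (in microseconds) a frame can start leaving the egress port.

    busy_until_us: when the egress port finishes its current transmission.
    mode:          "ct" (cut-through) or "saf" (store-and-forward).
    """
    def serialization_us(num_bytes):
        return num_bytes * 8 / (gbps * 1000)

    if mode == "ct":
        ready = serialization_us(header_bytes)   # header has arrived
    else:
        ready = serialization_us(frame_bytes)    # whole frame has arrived
    return max(ready, busy_until_us)


# Idle egress port: cut-through starts well before store-and-forward.
print(egress_start_us(1500, 10, 0.0, "ct"), egress_start_us(1500, 10, 0.0, "saf"))
# Egress port busy for another 5 microseconds: both modes start at the same time.
print(egress_start_us(1500, 10, 5.0, "ct"), egress_start_us(1500, 10, 5.0, "saf"))
```

As soon as the port stays busy longer than one frame time on the ingress link, the two modes become indistinguishable in this model.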
Also, the Nexus 5k (and other crossbar switches) uses a cell-based fabric + VOQ to deal with Head-of-Line (HOL) blocking. So basically the crossbar has to provide speed-up/overspeed to both compensate for the cell tax and evade the HOL blocking problem. Since the crossbar is therefore faster than the input interfaces, the asynchronous-speeds situation once again surfaces, and cells will have to be buffered before being sent across the crossbar. Plus, in each cell time, the crossbar schedulers make arbitration decisions about which cells get to enter the fabric, so buffering and waiting are inevitable.
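The cell tax is easy to illustrate: a packet that does not fill its last cell still occupies a whole cell in the fabric. The sketch below computes the fabric speed-up needed just to carry a packet of a given size at line rate. The 64-byte cell payload and the optional per-cell header are assumptions picked for illustration, not Nexus 5k figures, and the function name is hypothetical.

```python
import math


def cell_tax_speedup(packet_bytes, cell_payload_bytes=64, cell_header_bytes=0):
    """Ratio of bytes crossing the fabric to bytes in the original packet."""
    cells = math.ceil(packet_bytes / cell_payload_bytes)
    fabric_bytes = cells * (cell_payload_bytes + cell_header_bytes)
    return fabric_bytes / packet_bytes


for size in (64, 65, 1500):
    print(size, round(cell_tax_speedup(size), 3))
```

With these assumed sizes, the worst case is a packet one byte larger than the cell payload: it needs two cells, so the fabric has to run at almost twice the line rate to keep up with a stream of such packets.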
All in all, the (dubious) benefit of cut-thru switching seems to be almost totally nullified. Not to mention that cut-thru switches have more complicated ASICs and wiring than simple SAF switches, making them more expensive for no tangible gain.
I think cut-thru switching only makes sense if the whole network fabric runs the same model of switches, with a simple protocol stack. So the place where it makes sense is niche markets like HPC clusters running low-overhead InfiniBand, or HFT, but the latter are mostly criminals trying to front-run each other anyway, so not sure if it’s ethical to provide a tool for them to do damage to society.
In day-to-day networks that deal with a mixture of traffic types and aggressive traffic patterns, cut-thru switching, like deep buffer, is just a diversion, and provides yet another opportunity for vendors to sell their overpriced boxes.
I must admit, I did find Cisco very admirable for having the guts to come out and say it like it is, that these days Cut-thru and SAF are very similar performance-wise.
I'm glad you find my comment useful, very much so :))! And as always, a big thank you to you for bringing up important topics so different people can chip in and share their often knowledgeable and very helpful perspectives on them.
Re cut-thru switching, I think Arista is one of the prominent vendors that provide several good products that claim big numbers on cut-through performance. No doubt they are great products, but still I'd stand by the bigger viewpoint in my comment that cut-thru switching is mainly useful in niche segments. Just the other day I luckily ran across Arista's published whitepaper on the validity of Cisco's benchmark test of the Nexus 5k vs Arista 7100S. It's a good read if you're interested:
There're some important conclusions from there which I agree with, given what's made available about the architecture of Arista cut-through line of products:
It seems Arista is a bit reserved when it comes to sharing the architecture of their switches with the world, compared to, say, Cisco and Juniper. But through the 2 links, I can work out that they basically use a shared-memory architecture for their cut-thru portfolio, which is the best router/switch architecture when the bandwidth scale permits its use. Looks like Arista knows this well and uses it to their great advantage.
Going back to the points I agree with in their rebuttal paper, the first one I agree with straightaway is in the 4th paragraph: Nexus performs better under higher load, while the Arista 7100 outperforms under light load. Obviously, shared memory outperforms the VOQ crossbar architecture when both architectures are viable. So given 2 switches with the same fabric rate, the shared-memory one will outperform in lighter traffic conditions because it's the ideal architecture, independent of switching method (cut-through or SAF).
But shared memory has limitations that make it unable to scale to very large sizes. The memory speed can't keep up with line rate, bank collisions and timing issues slow it down, and a shared buffer suffers more under high load and tree saturation, while VOQ limits the congestion spread to certain ports and can still function under higher congestion.
So the conclusion in the paper basically sums it up: for lighter traffic patterns that demand very low latency, cut-through can help (with a shared-memory architecture, of course). For high-end switching in the Tbps and multi-Tbps range, with aggressive traffic patterns that involve heavy hitters, high fan-out multicast, or both running through the same chassis, cut-through degrades to SAF due to output contention, and the shared-memory architecture dies. VOQ and crossbar will have to be used, despite their significant complexity and generally higher latency compared to their simple and ideal shared-memory counterpart. The higher latency (regardless of switching method) can be alleviated by higher speedup (very hard at 25 or 50 Tbps) or better arbitration (equally hard at the same scale, much more so when multicast and unicast traffic are mixed together).
Also, the longer the path between 2 endpoints, the less effective cut-through switching is, because other components of delay dominate the total end-to-end delay. And at higher rates, say 100GbE and more, pipeline processing delay dominates serialization delay, so the effect of cut-through is also diminished. One other thing: in the case of jumbo-frame traffic, which gets a latency boost with cut-through, all else being equal, the reduced latency of one flow may come at the expense of another. Say 2 packets are destined for one egress port; one is a 9K jumbo packet, the other a 64-byte TCP ACK. If the 9K one gets scheduled ahead of the ACK, and assuming a shared-memory architecture as it's often used for fast cut-thru, the ACK will have to wait in line until the whole 9K packet gets across, increasing its delay considerably and causing high jitter. So low latency for one flow comes at a heavy cost for another. With a VOQ crossbar, because packets are cellified and the crossbar scheduler interleaves cells between cell times, this jitter problem is mitigated, increasing fairness.
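A quick back-of-the-envelope calculation in Python shows the scale of the jumbo-versus-ACK effect and how serialization delay shrinks at higher speeds. It assumes a 9000-byte jumbo frame and, for the cell-interleaved case, an idealized reading of the paragraph above in which the ACK waits for at most one 64-byte cell time; the cell size and the function name are assumptions for illustration only.

```python
def serialization_us(num_bytes, gbps):
    """Time to transmit num_bytes at the given link speed, in microseconds."""
    return num_bytes * 8 / (gbps * 1000)


JUMBO_BYTES = 9000  # assumed size of the "9K" jumbo frame
CELL_BYTES = 64     # hypothetical fabric cell size

for gbps in (10, 100):
    # ACK queued behind a whole jumbo frame (frame-by-frame scheduling).
    behind_frame = serialization_us(JUMBO_BYTES, gbps)
    # ACK waiting for at most one cell (idealized cell interleaving).
    behind_cell = serialization_us(CELL_BYTES, gbps)
    print(f"{gbps}G: behind jumbo {behind_frame:.4f} us, behind one cell {behind_cell:.4f} us")
```

At 10 Gbps the ACK sits behind the jumbo frame for 7.2 µs, which is large compared to its own 0.0512 µs serialization time; at 100 Gbps all of these figures shrink by a factor of ten, which is why other delay components start to dominate.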
So these are some subtle issues one should factor in when considering whether cut-through is the right thing. It would really help if vendors were open about their switching architecture and their pipeline latency, for example how much VXLAN increases the total processing delay, with or without recirculation. Generally they keep those details hush-hush, but you can always ask if you intend to make a big purchase :)).
Here's one of the primary reasons you'll see detailed packet walk and architecture presentations from Cisco but not from smaller vendors:
Cisco can afford to say **** B******* ;)