Response: Is Switching Latency Relevant?

Minh Ha left another extensive comment on my Is Switching Latency Relevant blog post. As is usually the case, it’s well worth reading, so I’m making sure it doesn’t stay in the small print (this time interspersed with a few comments of mine in gray boxes).


I found that Cisco apparently manages to push port-to-port latency down to 250 ns for L3 switching, which is astonishing, and way lower (sub-100 ns) for L1 and L2.

I don’t know where FPGAs fit into this ultra-low-latency picture, because an FPGA, compared to an ASIC, is bigger and a few times slower, due to its use of lookup tables in place of hard-wired gates, plus programmable interconnects.

No one is talking about this in public (for obvious reasons), but what I’ve heard points in the direction of pre-processing data streams in FPGAs before they even hit the switch-to-server links, probably to reduce server CPU load/latency.

You could do that in a switch or in a server NIC, and you could get plenty of bandwidth into a server these days, so the solution you’re trumpeting depends on what you’re selling ;)

In any case, looking at their L2 and L1 numbers, it’s obvious the measurements were taken with empty buffers and no contention. In the real world, with realistically bursty, correlated traffic, they all perform far worse than in the ideal case. Regardless, L3 switching at 250 ns even under ideal conditions is highly impressive, given that Trident couldn’t achieve it in any of the test scenarios.

To be fair, the Trident chipset was not designed for ultra-low-latency environments. It’s like comparing a Ford Fiesta with a Porsche 911 – one of them is much faster, but also a bit more expensive.

Again, I’m not bashing Broadcom. I just find it laughable to read the excuses in the report you linked to about how they don’t “optimize” for 64-byte packets (love their wording), and how they managed to make their competitor finish far behind in the tests. Granted, Mellanox tried the same thing in their test against Broadcom, so they’re even, and we should take these vendor-released test reports with a grain of salt.

Yet again, not many people care about 64-byte packet forwarding performance at 100 Gbps speeds. The only mainstream application of 64-byte packet forwarding I found was VoIP.

It’s a bit hard to generate large packets full of voice if you have to send them every 20 msec (the default interval specified in RFC3551), and it’s even harder to generate 3.2 Tbps of voice data going through a single top-of-rack switch. Unless my math is wrong, you’d need over 100 million concurrent G.729A-encoded VoIP calls to saturate the switch. Maybe we should focus on more realistic use cases.
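
For the curious, here’s a quick back-of-the-envelope sketch of that number. The codec and header overheads are my own assumptions (G.729A at 8 kbps with 20 ms packetization, RTP/UDP/IPv4 over Ethernet, preamble and inter-frame gap ignored), so treat it as a sanity check rather than a precise figure:

```python
# Rough sanity check of the "over 100 million calls" estimate above.
# Assumptions (mine): G.729A at 8 kbps, 20 ms packetization, RTP/UDP/IPv4
# headers, Ethernet header + FCS, no preamble/inter-frame gap counted.

CODEC_RATE_BPS  = 8_000                      # G.729A payload bit rate
PACKET_INTERVAL = 0.020                      # seconds (RFC 3551 default)

payload_bytes = CODEC_RATE_BPS * PACKET_INTERVAL / 8   # 20 bytes of voice
header_bytes  = 12 + 8 + 20 + 18             # RTP + UDP + IPv4 + Ethernet/FCS
frame_bits    = (payload_bytes + header_bytes) * 8      # 624 bits per frame

pps_per_call  = 1 / PACKET_INTERVAL          # 50 packets per second
bps_per_call  = frame_bits * pps_per_call    # ~31.2 kbps per call direction

SWITCH_CAPACITY_BPS = 3.2e12                 # 3.2 Tbps top-of-rack switch
print(f"{SWITCH_CAPACITY_BPS / bps_per_call / 1e6:.1f} million calls")
# ~102.6 million concurrent call legs to saturate the switch
```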

The elephant in the room that you alluded to is most likely endpoint latency. It’s pretty irrelevant to talk about nanosecond middlebox latency when the endpoints operate in the millisecond range :p . And endpoint latency gets even worse when features like interrupt coalescing and LSO/GRO are in place. That must be part of the reason why the Cloud’s performance for scientific apps sucks, and funnily enough, they actually admit it, as I found out recently.

But IMO, that only means the server operating system, the hypervisor, the software switch, etc. are the ones that need to innovate and up their game, instead of using their pathetic latency figures as an excuse not to keep improving routers’ and switches’ performance. The overlay model is notoriously slow because it’s layer on top of layer (think BGP convergence vs. IGP convergence), and as mentioned in your previous post, Fail fast is failing fast: “If you live in a small town with a walking commute home, you can predictably walk in the front door at 6:23 every evening. As you move to the big city with elevators, parking garages, and freeways to traverse, it’s harder to time your arrival precisely.” That kind of overburdened, complex architecture is not deterministic and is no good for applications with real-time requirements. InfiniBand shied away from TCP/IP for the same reason and used a minimal-stack model to keep its latency low.

The Cloud and its overlay model are definitely a step backward in terms of progress. By doing it cheap, they make it slow. Good for their greedy shareholders; it sucks for consumers who truly need the numbers.

Don’t blame the shareholders for what the customers are not willing to pay for. Public cloud is like public transport – it will eventually get you reasonably close to where you want to go at a reasonable price.

If you need to get there faster, or if you want to get to a weird place far away from public transport, you take a taxi (bare-metal instance), and if you move around so much that taxis become too expensive, you buy a car. Using any utility in a way it was not designed to be used (because it’s cheaper that way) and complaining about the results you deserve is ridiculous.

Well, I guess I can stop complaining now that bare-metal instances are a thing. But yeah, taken as a whole, the technology winter basically seems to continue. These days, about the only kind of progress we have is corporate-PR progress.

Speaking of HFT, there was a lot of fanfare around it when it was big some 10 years ago, and FPGAs were often mentioned as the way firms reduced their end-to-end latency. But a while back I ran across comments from some of the people who actually did HFT for a living, and they said it’s mostly hot air: the bulk of what they optimize is at the software level, such as doing away with message queuing (and thus safely getting rid of locks) to unburden their programs of concurrency synchronization, which is a big latency killer. Staying away from languages that perform garbage collection is another, as there’s no one-size-fits-all garbage-collection algorithm optimized for every use case, and regardless, it’s an additional layer compared to explicit memory management, and more layers means slower.
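
To make the “get rid of queuing locks and allocation” idea a bit more concrete, here’s a toy sketch of a pre-allocated single-producer/single-consumer ring buffer. It’s in Python purely for readability – a real trading system would use C++ or similar with explicit atomics and memory ordering – and the class name and sizes are made up for illustration:

```python
# Conceptual sketch only: a pre-allocated SPSC ring buffer that coordinates
# through two indices instead of a lock, and never allocates per message.

class SpscRing:
    def __init__(self, capacity=1024):
        self.buf = [None] * capacity   # pre-allocated slots: no per-message allocation
        self.capacity = capacity
        self.head = 0                  # advanced only by the consumer
        self.tail = 0                  # advanced only by the producer

    def push(self, item):
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:           # ring full; caller decides to drop or retry
            return False
        self.buf[self.tail] = item
        self.tail = nxt                # "publish" the slot by advancing the tail
        return True

    def pop(self):
        if self.head == self.tail:     # ring empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return item
```

The point is the design, not the language: a fixed buffer, index-only coordination, and no garbage-collector pressure on the hot path.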

I would suspect that not all HFT players are equal. Some of them sit at a stock exchange, and try to make money by being as fast as possible. For example, IEX claims it can execute trades in 320 microseconds and uses equal-length fiber cables to all participants to ensure fairness. Shaving off microseconds makes perfect sense in such an environment.

Others work globally, care about latency on trans-continental links, and optimize their software stack. Using FPGAs makes no sense if your baseline latency is 50 msec.

From what I know of RenTech, one of the biggest, if not the biggest, HFT firms (they also do other algorithmic trading besides HFT), they rely on software with big-data models, not fancy hardware.

3 comments:

  1. Hi Ivan, thx a lot for your feedback :)) ! Thanks to your reminder, I remember the FPGA-for-HFT story now. Basically, some 10 years ago Xilinx was proposing to implement the whole trading program within the FPGA itself and cut end-to-end latency down to a few microseconds, thanks to the FPGA's massively parallel and fast, deterministic performance. The FPGA's reconfigurability is used to update the trading program whenever they change the code – something an ASIC obviously can't do.

    IEX, which you mentioned, is an exchange. Exchanges like IEX were created to promote fairness after there were so many claims/complaints/court cases about HFT firms using their speed advantage for predatory tactics like front-running, and profiting unfairly. So naturally, as you mentioned, they'd use things like equal-length fibre cables to implement speed bumps and level the playing field for all participants. But that equal-length speed bump only applies equally to traders located close enough to the exchange, i.e. within the same country. It doesn't make sense for someone too far away to attempt HFT against, say, US firms on a US-based exchange, because their latency, in the ms range, would place them at a great disadvantage against the US-based players despite the speed bump. They'd get front-run and lose out. In any case, HFT is in decline these days – thank God – after all the low-hanging fruit has been picked, so it's a niche market.

    Re the 64-byte packet thing, I mentioned the reasons previously already, and again in the comment responding to Andre's, but beyond that, a TCP ACK is a minimum-size 64-byte frame, and small packets of 200–300 bytes or less are quite common in data centers, including Facebook's. TCP ACKs make up a fair amount of traffic on the Internet and inside a DC, so for that reason alone it's worth having a chipset capable of processing 64-byte packets efficiently. And since 100GE links are normally aggregate links, they accumulate even more of those small packets from different sources, which makes small-packet processing even more important.

    Also, with a powerful ASIC capable of massive packet-processing rates – which is what small-packet processing amounts to – there's less need for on-chip buffering, so the chipset can accommodate a higher port count/density/radix, translating into better cost efficiency and a higher return on investment for both vendors and users (there's a quick packet-rate sketch after the comments that puts numbers on this). A faster processing engine alone also translates into higher density, because there's obviously no point jamming more ports into a line card unless its ASIC can handle the aggregate packet rate.

    I understand the laws of physics are absolute, and I have no problem at all with vendor chipsets being unable to improve their horsepower beyond a certain point. What I find laughable is vendors sugarcoating the challenges they face. While I always bash Cisco C-level execs for their pathetic business practices, I keep finding myself in love with Cisco tech people's honesty. People like Lukas Krattiger and many more are very down-to-earth, possess good old common sense, and say it like it is. For example, one can look at Cisco's public docs and see that they acknowledge when they can't do small packets, and they don't try to dress it up with poor-taste jokes like "we don't optimize for this and that". I'm happy to be corrected and will change my opinion of Cisco accordingly.

    And I still consider the Cloud a step backward in terms of progress, because it commoditizes technology and disincentivizes new breakthroughs in both software and hardware. Please don't take my comment personally – I say this to everyone I know, not just you, Ivan :)). I don't even want to bring up the other filthy aspects of the Cloud, as you've already made a post about how much of a scam AI is ;). And yeah, their shareholders have a lot to answer for, technologically and socially, but that's a multi-disciplinary discussion and far too long to get into here.

  2. While on the subject of ASICs – have there been any updates on Cisco's Unified Access Data Plane (UADP) ASIC that the community is aware of?

  3. Hi Jeff,

    The latest update about the UADP ASIC that I can find is this blog post from two weeks back:

    https://ciscolicense.com/blog/cisco-uadp-asic-chipset/

    So it looks like there hasn't been much change since last year, when they made a presentation about it here:

    https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2020/pdf/BRKCRS-2901.pdf

    On page 46, it's mentioned that UADP 3.0 was supposed to use a 14 nm process and have 19.2 billion transistors. It seems a lot of that transistor budget goes into the 36 MB of on-chip buffer.
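
As an aside, here's the quick packet-rate sketch referenced in the first comment – a rough estimate of how many minimum-size frames per second a forwarding engine has to handle at line rate. The only assumptions are standard Ethernet framing overheads:

```python
# Worst-case (64-byte frame) packet rates at line rate.
# 8-byte preamble + 12-byte inter-frame gap are standard Ethernet overheads.

FRAME_BYTES    = 64                  # minimum Ethernet frame, including FCS
OVERHEAD_BYTES = 8 + 12              # preamble + inter-frame gap
bits_per_frame = (FRAME_BYTES + OVERHEAD_BYTES) * 8   # 672 bits on the wire

for name, rate_bps in [("100GE port", 100e9), ("3.2 Tbps switch", 3.2e12)]:
    pps = rate_bps / bits_per_frame
    print(f"{name}: {pps / 1e6:,.1f} Mpps of 64-byte frames")

# 100GE port: 148.8 Mpps
# 3.2 Tbps switch: 4,761.9 Mpps (close to 5 billion packets per second)
```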
