What Is Ultra Ethernet All About?
If you’re monitoring the industry press (or other usual hype factories), you might have heard about Ultra Ethernet, a dazzling new technology that will be developed by the Ultra Ethernet Consortium[1]. What is it, and does it matter to you (TL;DR: probably not[2])?
As always, let’s start with What Problem Are We Solving?
For decades, we’ve known how to divide numerous computational problems into smaller parallel computations. Weather forecasting is among them, as are many physics, chemistry, and astronomy simulations. The whole field of High-Performance Computing is built around the ability to split a large problem into smaller parts and combine their results. Big data is no different (remember map-reduce algorithms?), and neither is machine learning, whether you’re training an ML model or trying to get usable answers from the trained model.
However, AI/ML is hot, and all the other stuff is boring[3], so let’s claim we need Ultra Ethernet for AI/ML workloads to generate attention, sell more stuff, and attract VC funding.
OK, it looks like the problem we’re trying to solve is building a fabric to support distributed computations running in large clusters. Why do we need a new Ethernet technology to do that? What’s wrong with the existing Ethernet?
As it turns out, for whatever weird reason, the programmers writing the software solving real-life problems want to focus on those problems, not on exchanging data between parts of the distributed computation. It’s easiest for them to assume their algorithms run in multiple threads that can share memory. That approach works beautifully as long as the nodes running the computation share actual memory, be it many CPU cores in a server or multiple GPUs in a single chassis, but what happens when you have to grow beyond that? We knew the answer decades ago: Remote Direct Memory Access (RDMA). This solution allows a program to write directly into another program’s memory even when the other program runs on a different host.
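To make the difference in programming models concrete, here is a purely conceptual Python sketch; the class and method names are invented for illustration and do not correspond to any real RDMA API such as libibverbs.

```python
# Conceptual sketch only: the class and method names are made up to illustrate
# the programming model, not any real RDMA library (e.g. libibverbs) API.

class MessagePassing:
    """Classic model: the remote CPU must run code to receive the data."""
    def __init__(self):
        self.inbox: list[bytes] = []

    def send(self, peer: "MessagePassing", payload: bytes) -> None:
        peer.inbox.append(payload)          # remote side later calls recv() and copies

class RemoteDirectMemoryAccess:
    """RDMA model: the NIC places the data straight into registered remote memory."""
    def __init__(self, remote_buffer: bytearray):
        self.remote_buffer = remote_buffer  # memory region the peer registered for RDMA

    def rdma_write(self, offset: int, payload: bytes) -> None:
        # No receive call, no copy through the remote CPU: the write lands
        # directly at the agreed offset, just like a local memory store.
        self.remote_buffer[offset:offset + len(payload)] = payload
```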
RDMA was designed to run over Infiniband. As expected, someone inevitably said[4], “Hold my beer, I can make this work over Ethernet,” so RoCE (RDMA over Converged Ethernet) was born. There are just a few tiny little problems with RoCE:
- Like FCoE, RoCE works best over lossless transport – that’s why it runs over Converged Ethernet with Priority Flow Control[5] (a minimal sketch follows this list).
- RoCE cannot deal gracefully with packet reordering.
- RoCE’s performance catastrophically degrades when faced with packet drops or out-of-order packets.
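A minimal Python sketch of per-priority pause behavior, assuming made-up XOFF/XON thresholds and a hypothetical send_pause callback standing in for the transmission of a PFC pause frame:

```python
# Minimal sketch of per-priority flow control (PFC) on a single ingress queue.
# Threshold values are made up; real switches derive them from buffer headroom.
XOFF = 80   # queue depth (in cells) at which we ask the sender to pause
XON  = 40   # queue depth at which we allow it to resume

class PriorityQueueWithPFC:
    def __init__(self, priority: int):
        self.priority = priority
        self.queue: list[bytes] = []
        self.paused = False

    def enqueue(self, frame: bytes, send_pause) -> None:
        self.queue.append(frame)
        if len(self.queue) >= XOFF and not self.paused:
            send_pause(self.priority, pause=True)    # lossless, but upstream now blocks
            self.paused = True

    def dequeue(self, send_pause) -> bytes | None:
        frame = self.queue.pop(0) if self.queue else None
        if self.paused and len(self.queue) <= XON:
            send_pause(self.priority, pause=False)   # resume the paused priority
            self.paused = False
        return frame
```

The catch is already visible in the sketch: once a priority is paused, the backlog simply moves one hop upstream, which is exactly how congestion spreads across a lossless fabric.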
Running RoCE across a single switch is a piece of cake, and if you buy one of the newest data center switches[6], you’ll get over 50 Tbps of bandwidth in a single box – enough to build a cluster of 128 high-end GPUs with 400 GbE uplinks. You won’t need more than that in most environments, so you probably don’t have to care about Ultra Ethernet. Congratulations, you can stop reading.
Still here? I have a fun fact for you: some people need GPU clusters with tens of thousands of nodes, and the only (sane) way to connect them is to build a vast leaf-and-spine fabric, and that’s where RoCE rears its ugly head:
- Data exchange between a pair of nodes is a single UDP flow – goodbye, efficient per-flow load balancing (see the sketch after this list).
- Lossless transport across a fabric inevitably has to deal with congestion avoidance, and getting fabric-wide congestion avoidance to work is usually an exciting, resume-enhancing challenge[7].
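Here is the promised sketch of why per-flow load balancing falls apart: with per-flow ECMP, hashing a fixed 5-tuple always selects the same uplink. The hash function, addresses, and number of uplinks below are illustrative (although RoCEv2 really does use UDP destination port 4791).

```python
import hashlib

def ecmp_uplink(src_ip, dst_ip, proto, src_port, dst_port, n_uplinks=8):
    """Illustrative per-flow ECMP: hash the 5-tuple, pick one uplink."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_uplinks

# A RoCEv2 session between two GPUs is one UDP flow (destination port 4791),
# so every packet of that (potentially multi-hundred-gigabit) transfer hashes
# to the same uplink, no matter how many parallel links exist.
print(ecmp_uplink("10.0.0.1", "10.0.1.1", 17, 49152, 4791))  # always the same index
```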
Want to know more? Read Datacenter Ethernet And RDMA: Issues At Hyperscale[8].
As always, you can fix the problem in three ways:
- Fix the application software or middleware (fat chance)
- Develop a new networking technology and sell tons of new ASICs (the Ultra Ethernet way)
- Insert a smart NIC between the application and the network and let it handle the hard stuff[9], like pretending that out-of-order packet delivery never happened.
Broadcom figured out they already have the new technology[10]. Many chassis switches use internal cell-based transport, spraying cells across all (intra-switch) fabric links, with the egress linecard reordering the cells and reassembling packets before forwarding them to the egress Ethernet port. That solves the optimal load-balancing part of the challenge.
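A rough Python sketch of that idea, with made-up cell sizes and data structures, just to show the spray-at-ingress / reassemble-at-egress split:

```python
from dataclasses import dataclass
from itertools import cycle

CELL_SIZE = 256  # bytes; the real cell size is ASIC-specific

@dataclass
class Cell:
    packet_id: int
    seq: int        # position of this cell within the packet
    last: bool
    payload: bytes

def spray(packet_id: int, packet: bytes, fabric_links: list[list[Cell]]) -> None:
    """Ingress: chop the packet into cells and spray them round-robin over all links."""
    chunks = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    links = cycle(fabric_links)
    for seq, chunk in enumerate(chunks):
        next(links).append(Cell(packet_id, seq, seq == len(chunks) - 1, chunk))

def reassemble(received: list[Cell]) -> dict[int, bytes]:
    """Egress: put cells back into order per packet before hitting the Ethernet port."""
    per_packet: dict[int, list[Cell]] = {}
    for cell in received:
        per_packet.setdefault(cell.packet_id, []).append(cell)
    return {
        pid: b"".join(c.payload for c in sorted(cells, key=lambda c: c.seq))
        for pid, cells in per_packet.items()
        if any(c.last for c in cells) and len(cells) == max(c.seq for c in cells) + 1
    }
```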
These switches also use Virtual Output Queues (VOQs): the output-port queues sit on the ingress linecards, and an ingress linecard transmits data across the fabric only when the egress port can accept it. Virtual output queues move the congestion closer to the ingress port, so a congested output port doesn’t spread congestion across the fabric. That solves the “congestion avoidance in large fabrics is a hard problem” part of the challenge.
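Here is a toy model of the VOQ-plus-credits idea; the granting logic of a real scheduler is far more involved, this only shows where the queues live and when data is allowed to cross the fabric.

```python
from collections import defaultdict, deque

class IngressLinecard:
    """Toy VOQ model: one queue per egress port, drained only when credits arrive."""
    def __init__(self):
        self.voqs = defaultdict(deque)     # egress_port -> queue of packets
        self.credits = defaultdict(int)    # egress_port -> cells the egress will accept

    def enqueue(self, egress_port: int, packet: bytes) -> None:
        self.voqs[egress_port].append(packet)   # congestion builds *here*, at ingress

    def grant(self, egress_port: int, cells: int) -> None:
        self.credits[egress_port] += cells      # egress scheduler hands out credits

    def transmit(self) -> list[tuple[int, bytes]]:
        sent = []
        for port, queue in self.voqs.items():
            while queue and self.credits[port] > 0:
                sent.append((port, queue.popleft()))
                self.credits[port] -= 1
        return sent   # only traffic the egress port can actually absorb crosses the fabric
```

A congested egress port simply stops issuing credits, so its backlog accumulates in the ingress VOQs instead of clogging the shared fabric.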
Now imagine you disaggregate[11] a chassis switch, repackage the linecards and fabric modules into standalone boxes, and replace the internal copper[12] links with transceivers and fibers. Congratulations, you’ve built a virtual switch that can supposedly have up to 32,000 ports. BTW, remember those virtual output queues from the previous paragraph? Each ingress switch needs an output queue for each output port[13], so you need at least 32,000 fabric-facing queues per ASIC. All that complexity is bound to result in an expensive ASIC. Lovely, but what else did you expect?
There’s just one tiny problem with Broadcom’s approach: while it (probably) works, it’s proprietary and will likely stay that way forever[14]. That might upset other vendors (more so if they don’t have a comparable ASIC), so they’re trying hard to hammer the square peg (RoCE) into the round hole (Ethernet with minimal modifications). The Ultra Ethernet position paper claims they plan to:
- Add multipathing and packet spraying to Ethernet.
- Get rid of the “Ethernet is not IP and does not reorder packets” constraint.
- Invent new magic[15] that will finally solve fabric-wide congestion challenges, because “None of the current algorithms meet all the needs of a transport protocol optimized for AI.”[16]
Now for the final piece of the puzzle: how do you run a protocol that cannot tolerate packet reordering across a network that reorders packets due to packet spraying? It turns out you can do some heavy handwaving in existing RoCE NICs; for the details, see To spray or not to spray: Solving the low entropy problem of the AI/ML training workloads in the Ethernet Fabrics by Dmitry Shokarev.
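The handwaving essentially boils down to resequencing in the NIC before the RoCE engine sees the packets. A toy sketch, assuming packets carry a usable sequence number and ignoring losses and retransmissions:

```python
class ReorderBuffer:
    """Toy NIC-side resequencer: hand packets to RoCE strictly in sequence order."""
    def __init__(self):
        self.expected = 0
        self.stash: dict[int, bytes] = {}   # packets that arrived early

    def receive(self, seq: int, packet: bytes) -> list[bytes]:
        in_order = []
        self.stash[seq] = packet
        while self.expected in self.stash:          # release the longest in-order run
            in_order.append(self.stash.pop(self.expected))
            self.expected += 1
        return in_order   # the RoCE engine above never sees a gap or a swap
```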
Behind the Scenes
J Metz described some of the reasons the Ultra Ethernet Consortium was created in a LinkedIn post:
> We didn’t create UEC because “IEEE was too slow.” We created UEC because in order to ensure that the network fabric was tuned for the workloads correctly, there were too many touchpoints for IEEE (or IETF or NVMe or SNIA or DMTF or OCP or… or… or..) to cope with.

> We entertained the notion that we could attempt to insert our ideas into each and every standards or specification body as needed, but that was a fool’s errand. Instead, we have opted to create our own SDO (Standards Development Organization) and, instead, partner or align with these other organizations to ensure industry-wide compatibility.

> So far, with working relationships with IEEE 802.3, OCP, and many more in the works, this strategy appears to be succeeding. UEC has tapped into a much bigger need than even we anticipated at the onset.
Revision History
- 2023-12-13: Added the Behind the Scenes section
1. Because they’re sick and tired of the glacial speed at which IEEE moves.
2. Unless you work for a hyperscaler or train ML models on GPU clusters with tens of thousands of nodes, in which case I hope you’re not reading this blog for anything else but its entertainment value.
3. After the big data hype dispersed, we realized we were sugarcoating applied statistics.
4. Around 2010, almost a decade and a half ago.
5. Did you notice I wrote the “What is PFC” blog post around the same time RoCE was launched? I promise that’s pure coincidence.
6. Assuming you can get them anytime soon.
7. It’s not that we wouldn’t have been aware of these challenges in the past. We encountered them with FCoE and iSCSI, but we never tried to run those protocols at such a scale.
8. Another article that seems to hint it’s time to replace TCP in the data center ;)
9. Proving yet again RFC 1925 rule 6a.
10. For more details, watch the Networking Field Day 32 videos.
11. Isn’t that a great word when you want to sound smart?
12. Or whatever it is these days.
13. Or bad things start to happen, like a congested port slowing down adjacent ports. How do you think we know that? ;)
14. At least considering the history of Broadcom’s openness.
15. For ages, people claimed they solved Internet-wide QoS or implemented a centralized control plane for WAN networks. Unfortunately, most of those solutions had an unexpected glitch or two. I don’t expect this one to be any different.
16. Not “optimized for long-lived high-volume flows” or “low number of high-volume flows,” it has to be “optimized for AI” because that hype sells unicorn farts.
Comments

I think the amount of reordering caused by packet spraying in data centers is widely overestimated; see also https://engineering.purdue.edu/~ychu/publications/infocom13_pktspray.pdf. If the paths turn out to be too asymmetrical in delay, flowlets or something more advanced such as HULA might be worth considering. At that scale, though, it might be worth looking into optical circuit switching.
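A minimal sketch of flowlet-based path selection, assuming a made-up idle-gap threshold; real implementations (for example CONGA or HULA) add path-quality feedback on top of this basic mechanism.

```python
import time

FLOWLET_GAP = 0.0005   # 500 microseconds; the real threshold is tuned per fabric

class FlowletLoadBalancer:
    """Toy flowlet switching: re-pick a path only after an idle gap in the flow."""
    def __init__(self, n_paths: int):
        self.n_paths = n_paths
        self.state: dict[tuple, tuple[int, float]] = {}   # flow -> (path, last_seen)

    def pick_path(self, flow: tuple, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        path, last_seen = self.state.get(flow, (None, 0.0))
        if path is None or now - last_seen > FLOWLET_GAP:
            # The gap exceeded the worst-case path-delay skew, so switching paths
            # cannot reorder packets that are already in flight.
            path = hash((flow, now)) % self.n_paths
        self.state[flow] = (path, now)
        return path
```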
From what I see, 'DC' is a blanket term that has lost its meaning, just like 'cloud'. Different kinds of DCs have different workloads and traffic patterns. Flowlet switching solutions, including Cisco's own (I forgot its name), are not designed for HPC scale, so they're untested in that kind of environment and likely won't scale.
Turbulence modeling in particular is extremely data- and network-intensive due to the nature of turbulence -- the number of degrees of freedom scales exponentially with model size, which is why weather forecasting cannot forecast very far, while climate modeling is totally hopeless. These are the ultimate big-data scenarios and, as usual, almost never get mentioned in the mainstream press because they're super hard and not trendy. But they require as much horsepower as the biggest supercomputer fabrics can muster just to solve a tiny fraction of their problems. So claims that AI/ML requires more processing power than what came before it -- which is the impression I'm getting these days -- don't stand up to scrutiny.
HPC DCs that work on things like turbulence often use the Dragonfly fabric topology -- to minimize network diameter -- with per-packet adaptive routing, so asymmetric routing and out-of-order delivery become a bigger problem as fabric utilization increases.
Circuit switching using optical paths does away with reordering, but it's physically expensive and therefore doesn't get implemented in the big HPC fabrics, and it isn't mentioned in the Broadcom solution either.
> Flowlet switching solutions, including Cisco's own (I forgot its name), are not designed for HPC scale, so they're untested in that kind of environment and likely won't scale.
Whether they scale depends entirely on the implementation. If you work around having to hold states for all available paths, it scales. Untested for that kind of environment, yes, but only because most people don't have access to one.
> Circuit switching using optical paths does away with reordering, but it's physically expensive and therefore doesn't get implemented in the big HPC fabrics
It also requires reasonably-stable traffic distribution to work well. That's why it works for Google -- if you have enough (somewhat predictable) traffic, the patterns are stable enough at the network core for optical switching to work well.
> and is not mentioned in the Broadcom solution either.
Because Broadcom does not sell anything along those lines?
> Circuit switching using optical paths does away with reordering, but it's physically expensive and therefore doesn't get implemented in the big HPC fabrics

> It also requires reasonably-stable traffic distribution to work well. That's why it works for Google -- if you have enough (somewhat predictable) traffic, the patterns are stable enough at the network core for optical switching to work well.
Google has tooling around topology engineering (ref: Section 4 of https://dl.acm.org/doi/pdf/10.1145/3544216.3544265 and https://dl.acm.org/doi/pdf/10.1145/3603269.3604836), which reconfigures logical connectivity as the traffic and topology evolve. Having systems that can operate this kind of network is crucial, and it's generally out of reach for most folks once you dig into what it takes to build them.
Looks like the network bottleneck can become severe even for ML workloads:
https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
"But, in many training workloads, GPU utilization hovers around 50% or less due to memory and network bottlenecks."
"Google trained all its LLM models, LaMDA, MUM, and PaLM, in a TPU v4 pod and they claim very high TPU utilizations (over 55%) for training."
55% TPU cluster utilization is considered good, so obviously they want to improve. But will Ultra Ethernet provide the answer?
So I watched "Broadcom Ethernet Fabric for AI and ML at Scale". They already introduced this concept of a cell-spraying fabric in a paper a few years back; looks like they're now putting that concept into action. The problem is: scale matters. Inside a crossbar (xbar) switch, high throughput and ECMP are achieved through cell switching -- the power of a central fabric scheduler like iSLIP is therefore vital to xbar performance -- while low latency and freedom from head-of-line (HOL) blocking are achieved with VOQs. Without VOQs, an input-buffered switch's throughput tops out at about 58% even for the most benign traffic pattern, due to HOL blocking.
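The roughly 58% ceiling is easy to reproduce with a quick simulation of a saturated FIFO input-queued switch; the port count and slot count below are arbitrary.

```python
import random

def hol_throughput(n_ports: int = 32, slots: int = 20_000) -> float:
    """Saturated FIFO input-queued switch: HOL blocking caps throughput near 58.6%."""
    random.seed(1)
    queues = [[random.randrange(n_ports) for _ in range(4)] for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        contenders: dict[int, list[int]] = {}
        for inp, q in enumerate(queues):
            contenders.setdefault(q[0], []).append(inp)   # only the HOL cell competes
        for _output, inputs in contenders.items():
            winner = random.choice(inputs)                # each output picks one HOL cell
            queues[winner].pop(0)
            queues[winner].append(random.randrange(n_ports))  # keep the input saturated
            delivered += 1
    return delivered / (n_ports * slots)

print(hol_throughput())   # roughly 0.59 for a 32-port switch; 2 - sqrt(2) ~ 0.586 as N grows
```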
But when they try to turn it into a network-wide fabric, they don't have a central scheduler to arbitrate cells efficiently, so what do they rely on? Distributed scheduling at each node, from what I understand. But that solution is no longer the simple VOQ crossbar found in the single-stage Cisco 12000 series (designed by Nick McKeown) or the multi-stage CRS router.
Memory-less xbar routers (the xbar itself doesn't have any buffers) don't have an out-of-order delivery problem, so output interfaces don't worry about cell reordering; they only perform the SAR (segmentation and reassembly) function. A virtual switch built from an aggregation of smaller switches, like the solution Broadcom proposes in their videos, DOES. This is the same out-of-order delivery problem that affects buffered-crossbar routers.
How does Broadcom deal with reordering? I don't see that mentioned -- if I did miss something in their presentation(s), please point it out. Reordering at the destination switch? Damn. That's a lot of work. They did mention credit-based flow control, which is nonsense. Priority flow control creates HOL blocking, the very problem that caps input-buffered switches at about 58% for the most benign traffic and that VOQ was designed to address. Plus, PFC (credit-based FC being a variant) makes the circuitry more complex and therefore harder to produce and test for; all of this means a higher price for the network devices. Which leads us right to the next problem: a "virtual switch that can supposedly have up to 32,000 ports".
If this kind of chassis is built according to the VOQ doctrine, I don't think it's physically possible, because in a VOQ system each port -- in this case, each physical port on every switch in the fabric -- has to have a VOQ for each of the other 32,000 ports in the chassis. Let's say a leaf switch has 64 ports. That means 64 x 32,000 VOQs. This is just too much. How are you going to design the physical circuitry for this? How big will the PCB become, and how many layers will it have? What about the heat this monster will dissipate -- will it melt or otherwise damage the material? And who's paying for the power bill? Color me skeptical, but I'll believe it when I see it.
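A back-of-the-envelope calculation of the VOQ counts being discussed; the per-VOQ descriptor size is a pure guess, only meant to show the order of magnitude, and the per-ASIC variant assumes the VOQ state is kept once per ASIC rather than once per front-panel port.

```python
# Rough numbers only; the descriptor size is hypothetical.
fabric_ports   = 32_000
ports_per_leaf = 64
bytes_per_voq  = 64       # guessed descriptor: head/tail pointers, counters, ...

voqs_if_per_port = ports_per_leaf * fabric_ports   # the 64 x 32,000 figure above
voqs_if_per_asic = fabric_ports                    # one VOQ per fabric-wide output, per ASIC

print(f"{voqs_if_per_port:,} VOQs per leaf if replicated per port")          # 2,048,000
print(f"{voqs_if_per_asic:,} VOQs per ASIC, "
      f"~{voqs_if_per_asic * bytes_per_voq / 2**20:.0f} MiB of descriptors")
```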
For these reasons, I think their solution will be subpar, just like the others mentioned in the APNIC article.
Also, a solution for this particular kind of problem already exists, in the form of Dragonfly HPC fabrics with adaptive routing. No need for crappy PFC; deadlock is avoided with virtual channels and some form of valley-free routing.
> How does Broadcom deal with reordering? I don't see that mentioned
Of course not, that's their secret sauce.
> Reordering at the destination switch? Damn. That's a lot of work.
Not if you use a trick similar to RDMA -- use fixed-length cells, and copy incoming cells into the right place in the buffer memory (in reality: probably use some scatter/gather trick to avoid copying overhead). The only thing left is to deal with packet losses and retransmissions (or you ignore them like ATM did).
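A sketch of that trick with fixed-length cells, assuming a per-packet cell sequence number: "reordering" degenerates into writing each cell at its own offset.

```python
CELL = 256   # fixed cell size in bytes; the real value is implementation-specific

def place_cell(buffer: bytearray, seq: int, cell: bytes) -> None:
    """With fixed-length cells, 'reordering' is just writing each cell at seq * CELL.

    No sorting and no shuffling of already-received data: a cell that arrives
    early or late still lands exactly where it belongs, the way an RDMA write
    lands at a known offset in the target buffer.
    """
    buffer[seq * CELL : seq * CELL + len(cell)] = cell

# Cells arriving out of order still produce the original packet:
buf = bytearray(4 * CELL)
for seq, payload in [(2, b"C" * CELL), (0, b"A" * CELL), (3, b"D" * CELL), (1, b"B" * CELL)]:
    place_cell(buf, seq, payload)
assert bytes(buf) == b"A" * CELL + b"B" * CELL + b"C" * CELL + b"D" * CELL
```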
I can answer some of this since I worked on a Broadcom Jericho DDC in 2015-2017. Cells are indeed put back into order at the egress ASIC (otherwise packets would be corrupted). I think packets are also kept in order although I'm not sure how. Forwarding logic isn't duplicated for each port so you don't need 64x32,000 VOQs, just 32,000 in each ASIC which can be stored in on-chip SRAM. All the ASICs are now 500W monsters and all the PCBs are thicc af (as the kids say) so IMO you might as well use the best ASIC. At the time we found that Jericho+Ramon was actually slightly cheaper and noticeably faster than Tomahawk.
Note that Ultra Ethernet works differently from DNX/Jericho so today's DNX networks aren't a preview of how Ultra Ethernet will look (although one's available today and the other is not). It sounds like Ultra Ethernet will have NIC-to-NIC congestion management and reordering while DNX does everything inside the switch. Mellanox is already selling some flavor of NIC/DPU-based congestion management.
It's easy to say something like adaptive routing with virtual channels and valley-free routing like all the academic papers but Ethernet ASICs either don't support those features or they've never been turned on (maybe they don't work). It's possible that Ultra Ethernet will get those features in a more reliable and standardized way.
@Wes, thx for the info!! Regarding the VOQs being stored in on-chip SRAM: basically, Broadcom has traded one problem for another -- space for time. If they created separate 32K VOQs per port, it would blow up the size of their ASIC and increase power draw considerably, so they opted to have the VOQs share the on-chip SRAM. Now speed becomes the problem, considering situations when thousands or tens of thousands (or way more) of cells arrive simultaneously, which happens because of the sheer size of their proposed fabric. SRAM speed (even the multiport variety, which enables several accesses at once) is always the weak link in the chain when it comes to ASIC processing power, and now it's made far worse by the sheer number of VOQs. SRAM is heat-prone too, so the more of it, the worse.
OTOH, if they split the memory into 32K VOQs, that will help the speed at the expense of space and power. So my guess is they opted for a midway approach between the two extremes. Still, due to the sheer number of VOQs, situations like the one I mentioned will create big bottlenecks and congestion spreading.
"I think packets are also kept in order although I'm not sure how."
This is the big problem. Like I said, a memoryless xbar has no OOO (out-of-order) delivery problem, but buffered xbar routers do experience OOO cell arrival, and you'd expect more of it under high loads. Indeed, our friend Andrei brought up one such case from his own experience stress-testing Juniper routers a few decades back:
https://blog.ipspace.net/2020/05/ip-packet-reordering.html#64
This is not a bug but works as designed, and it applies equally to a virtual xbar fabric made out of member switches, as we've been discussing here. The OOO problem gets worse with the buffer depth of the transit switches, the offered-load level, or both. And don't forget, xbar devices achieve their performance not just with VOQs but, just as importantly, with fabric speedup, generally 2x (the xbar running at twice the collective speed of all the ports in the fabric); the higher the speedup, the better the performance and work conservation. Speedup becomes very difficult as the virtual fabric grows, for obvious reasons: it requires either much more powerful switches in the middle acting as the xbar, or many more of them than at the edge. That adds to both CAPEX and OPEX and reduces MTBF. The fabric can be built without the speedup, but its performance will be underwhelming, with OOO getting worse.
HPC people realize this, the absurdity of building a monstrous fabric. They opt for making smaller fabrics and interconnecting them instead.
"It's easy to say something like adaptive routing with virtual channels and valley-free routing like all the academic papers but Ethernet ASICs either don't support those features or they've never been turned on"
As I mentioned in my previous comment to Dip, adaptive routing in hardware with virtual channels and Valiant routing (a form of valley-free routing) has been in place for many years in HPC clusters, and it works fine. So if Ethernet can't provide similar functionality, it will keep underperforming.
Overall, while I think it's good for Broadcom to go ahead and implement the cell-spraying concept they put in their paper a few years back, given what we've discussed so far, this fabric lags behind existing HPC architectures, which were designed to deal with bursty traffic patterns -- the hardest type of traffic -- and are therefore well placed to handle the more stable AI/ML workloads. But again, scale matters, so certain customers with less demanding workloads might find it worth the money for their needs. This discussion is not meant to belittle Broadcom, only to weigh the pros and cons as we find them.
Also, as you said, UltraEthernet might be different. So let's wait and see what it'll look like when it finally comes out.
@Dip, thx for the links!! Going through the first one -- the second paper is basically a more particular application of the first -- gives me the impression that Google got a fair bit of inspiration from the Cray Cascade Dragonfly HPC fabric design. Cray Cascade is the predecessor of Cray Slingshot, the topology/fabric used by some of today's biggest supercomputers, including Aurora. Cray's architecture, in a nutshell, is a Dragonfly topology using adaptive routing in hardware, with virtual channels and Valiant routing (a form of valley-free routing) employed to avoid deadlock, increase throughput, and reduce latency. Google's direct-connect and non-shortest-path routing are also part of the philosophy underpinning Cray's design.
Indeed in section 7, they mentioned it: "The direct-connect Jupiter topology adopts a similar hybrid approach as in Dragonfly+ [32] where aggregation blocks with Clos-based topology are directly connected. Compared to hardware-based adaptive routing in Dragonfly+, Jupiter provides software-based traffic engineering that can work on legacy commodity silicon switches."
There are some minor differences, including Google using SDN to compute paths for topology engineering (a fancy term for traffic engineering) while Cray uses adaptive routing in hardware, but the overarching concepts are the same. For me, this is a testimony to the power of the Dragonfly design.
Ivan raised a valid point as well: a relatively stable traffic matrix is needed for optical switching to work, because, as we know, MEMS optical switching is much slower than electrical switching, so a circuit-switched design that sidesteps the problem is a much better fit for optical switches. Long-lived elephant flows (stable traffic), which characterize AI/ML traffic, fit this paradigm well; that's why Google pursued it. They said as much in the two papers.
Section 4.6 of the first paper also admits that topology engineering is an infrequent exercise, and that the bulk of traffic engineering is still done by good old-fashioned routing:
"Thanks to OCS, topology engineering can be optimized to be on-par with (or faster than) routing changes. However, based on simulation results, we find that block-level topology reconfiguration more frequent than every few weeks yields limited benefits [46]. This is because of two main reasons: 1) concerning throughput, most traffic changes can be adequately handled by routing. The kind of traffic change that requires assistance from topology tends to happen rather slowly. 2"
Goes to show that timeless fundamentals never go out of style, and simplicity trumps complexity most of the time. And for this particular reason, I disagree with Ivan's statement: "Unless you work for a hyperscaler or train ML models on GPU clusters with tens of thousands of nodes, in which case I hope you're not reading this blog for anything else but its entertainment value." As with all fields, no one knows it all, and everyone makes mistakes here and there. So for people like Ivan who understand the history of networking and its timeless principles, blog entries -- especially in-depth articles like this one -- will always offer food for thought for networkers of all levels. Working for the biggest providers in the world is no substitute for understanding history and its universal patterns, so it's not a sufficient condition for being a great networker.
Also, Google's first paper mentions a great bit of technology: the optical circulator. It enables bidirectional operation of DCNI links to halve the number of required OCS ports and fibers, using birefringent (double-refractive-index) crystals and taking advantage of the bosonic (not subject to Pauli's exclusion principle, which applies to fermions like electrons) and chargeless nature of photons, so that superimposed light beams don't destroy each other (in the linear-optics regime, which is what goes on inside the fibers). I consider this kind of thing real progress (not done by Google, btw), because hardware is always a lot more difficult than software.
Cheers!
Reading this article and these comments has been a true pleasure. I'm going out on a limb here a bit, but what about using mesh PCIe (or later buses/protocols) to virtualize the resources of all interconnected machines and likewise transfer data using whatever protocols are specified by the application layer and above? This has the benefit of expanding the available memory (or compute) to the sum of all chipset limits while doing nifty networking fun in the process. The bus architecture of PCIe isn't ideal, so hopefully they flatten that architecture (and maybe even return to parallel, with future-proofing toward a return to analog).
I’ve recently read your insightful blog on Ultra Ethernet and its predecessor. I’ve also gone through the Ultra Ethernet Consortium’s progress towards the v1.0 set of specifications and the position paper on Ultra Ethernet. Both resources mention the Ultra Ethernet Transport protocol, described as an open protocol specification designed to operate over IP and Ethernet.
In the specification article, the UEC stack includes a Transport layer, and a dedicated Transport working group is developing specifications for an AI/HPC transport. This leads me to question whether TCP/UDP will be superseded by a new transport specifically designed for AI/HPC applications. If this is the case, will there be any adapter or converter libraries available to facilitate this transition? Could you provide further details on this matter?