Rant: Broadcom and Network Operating System Vendors
Minh Ha left the following rant as a comment on my 5-year-old What Are The Problems with Broadcom Tomahawk? blog post. It’s too good to be left gathering dust there. Counterarguments and other perspectives are highly welcome.
So basically a lot of vendors these days are just glorified Broadcom resellers :p. It’s funny how some of them try to talk themselves up by claiming they differentiate their offerings with their network OS.
First off, since all the heavyweight data forwarding is done in the line cards (LCs) and the fabric, their devices, regardless of brand name, are essentially Broadcom routers/switches first and last. No matter how their OS is written, they are stuck with the Broadcom-provided fabric architecture and forwarding pipeline, with all of its limitations, sometimes severe ones like the horrible fabric scheduler that got pointed out in the Mellanox test.
I won’t go into too much detail here as it would dilute the bigger point I’m trying to get at, but as we’ve discussed before, providing line rate for 64-byte packets at 100Gbps is mostly mission impossible, considering that packets arrive every 6.7 ns and you probably have less than half that time to finish an IP lookup if you want to make it in time. Couple that with a port count of 32 or 64, and classic crossbar fabric arbitration using a central scheduler doesn’t work anymore: the scheduler becomes both the bottleneck and a power hot-spot, and one has to use distributed schedulers to scale.
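The 6.7 ns figure is easy to verify with back-of-the-envelope arithmetic, using the standard Ethernet framing overhead (8-byte preamble plus 12-byte inter-frame gap) on top of the 64-byte minimum frame:

```python
# Sanity-check the 6.7 ns inter-packet arrival claim for 64-byte frames at 100GbE.
# Framing assumptions: 8-byte preamble + 12-byte inter-frame gap (standard Ethernet).

LINE_RATE_BPS = 100e9        # 100GbE
FRAME_BYTES = 64             # minimum Ethernet frame
OVERHEAD_BYTES = 8 + 12      # preamble + inter-frame gap

wire_bits = (FRAME_BYTES + OVERHEAD_BYTES) * 8   # 672 bits occupy the wire per packet
arrival_ns = wire_bits / LINE_RATE_BPS * 1e9     # time between successive packet starts
pps = LINE_RATE_BPS / wire_bits                  # worst-case packets per second

print(f"inter-packet arrival: {arrival_ns:.2f} ns")   # 6.72 ns
print(f"line rate: {pps / 1e6:.1f} Mpps")             # 148.8 Mpps
```

That 148.8 Mpps worst case is the number the whole argument below hangs on.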
From what Enrique observed, and the way it played out accordingly in the Mellanox test, Broadcom’s approach to scheduling (basically centralized, attempting to compensate for its inefficiency by grouping ports under two schedulers) is nonsensical and pathetic. Why bother promoting 400GbE when you’re clearly incapable of handling 100GbE properly? And believe it or not, small-packet line-rate processing is very important: DDoS survival, for example, needs it, and it also means that when you start providing value-added classification beyond basic IP lookup, your router won’t fall over.
Second, when it comes to the network OS, Cisco and Juniper have been in business for 36 and 24 years respectively, and since they had all the highest-class protocol implementers, like Dave Katz and Henk Smit, working for them during their formative years, their core routing and switching code had matured and stabilized before some of the newer vendors even came to be. For younger vendors, trying to differentiate themselves using the OS is laughable. Who trusts their code to be bug-free, especially when it comes to complex protocols like OSPF, BGP, and yes, xLFA :p.
All in all, these two points make it totally unjustifiable for merchant-silicon vendors these days to charge a premium for what amounts to an essentially commoditized, off-the-shelf product.
Monopoly is never good for any industry, and networking is no exception. Among other things, it stifles innovation. The result is what we’ve been witnessing for a long time now: lazy networking and second-tier products. Many vendors hardly invent anything these days except crappy ‘new’ encapsulation formats. And Broadcom, with no sizable competitor that can threaten their domination, won’t feel the need to improve and fix all the nasty dogshit in their chipsets anymore.
And vendors still wonder why cloud companies are taking their business away :p. They have only themselves and their laziness to blame. The price for not reinventing yourself is a one-way ticket to oblivion.
On that note, vendors like Cisco and Juniper, who are trying hard, the old-school way, to produce their own silicon, are worth admiring, no matter the quality of their products. They deserve credit for at least trying.
<i>providing line rate for 64-byte packets at 100Gbps is mostly mission impossible, considering that packets arrive every 6.7 ns and you probably have less than half that time to finish an IP lookup if you want to make it in time</i>
I did not clearly understand this. It seems to imply that the switch forwarding delay (as defined by RFC 4689) needs to be lower than the packet transmission time in order to get line-rate processing. As far as I know, there are no commercial switches with sub-10 ns forwarding latency; Mellanox advertises their 300 ns cut-through latency at 100Gbps as the best available.
If the comment refers to the TCAM access becoming a bottleneck, I believe there are mechanisms to pipeline TCAM accesses and avoid the bottleneck (or other tricks such as banked implementations with parallel bank accesses, reusing previously cached results, etc.), or simply replication mechanisms to increase the throughput of this element.
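The banking/replication idea mentioned above is easy to quantify with an idealized model: with k independent banks (or replicated copies) each taking a fixed access time per lookup, and perfect load balancing across them (a strong assumption), aggregate throughput scales linearly with k even though single-lookup latency does not improve. A tiny sketch:

```python
# Idealized throughput model for banked/replicated lookup memory.
# Assumes perfect load balancing across banks, which real hardware only approximates.

def effective_rate_mlps(banks: int, access_ns: float) -> float:
    """Aggregate lookup rate in millions of lookups per second."""
    return banks / access_ns * 1e3   # 1 lookup per access_ns per bank

print(effective_rate_mlps(1, 4.0))   # 250.0  -> one 4 ns bank
print(effective_rate_mlps(4, 4.0))   # 1000.0 -> four banks in parallel
```

Note this raises throughput only; each individual lookup still takes the full access time, which matters when lookups are dependent (as with a hierarchical FIB).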
Could someone clarify this? Thanks!
I don't know what I did to be mentioned in the same sentence as Dave Katz. I apologize. I didn't do it on purpose, I swear.
When I said 6.7 ns, I was referring to the inter-packet arrival time at 100Gbps for 64-byte packets. It is 6.7 ns because the total frame size is in fact 84 bytes after adding the preamble and inter-frame gap. To get line-rate processing at 100Gbps for 64-byte packets, therefore, the line card’s packet-processing pipeline stage delay has to be less than or equal to this, otherwise packets arrive faster than they can be processed. That translates to about 149 million packets per second.

This is, of course, assuming that the entire LC has only one port. If an LC has, say, 8x100Gbps ports and only one packet-processing pipeline inside the ASIC, then the pipeline stage delay has to be eight times shorter. Since the lookup is one stage within the pipeline, it has to complete and return a result in 6.7 ns or less, and many times faster if the LC has more ports. The TCAM’s clock rate therefore has to be fast enough to support this kind of latency.

In fact, even when the LC houses only one 100G port, the TCAM has to finish way faster than 6.7 ns, because for many years now vendors have used a hierarchical FIB instead of a flat FIB to avoid the slow update issue inherent in TCAM, something that kills OpenFlow fine-grained flow-based switching. BGP PIC is one of those attempts to avoid unnecessary FIB updates on re-convergence. With a hierarchical FIB, there are multiple lookups to get the destination interface, so the TCAM has to finish much quicker than 6.7 ns if the whole lookup is to complete within the time budget.
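Putting the three factors together (inter-packet arrival time, ports sharing one pipeline, and dependent lookups in a hierarchical FIB), the per-lookup budget shrinks multiplicatively. A rough illustration, with the port and lookup counts being assumptions picked from the examples above:

```python
# Rough per-lookup time budget. Assumptions: one shared packet-processing
# pipeline per ASIC, worst-case 64-byte packets at 100GbE, and a hierarchical
# FIB needing several dependent lookups per packet.

ARRIVAL_NS = 6.72   # inter-packet arrival, 64-byte frames at 100GbE

def lookup_budget_ns(ports: int, lookups_per_packet: int) -> float:
    """Time available for each individual lookup, in nanoseconds."""
    return ARRIVAL_NS / (ports * lookups_per_packet)

# Single 100G port, flat FIB (one lookup per packet):
print(lookup_budget_ns(1, 1))   # 6.72 ns

# 8x100G line card, hierarchical FIB with 3 dependent lookups:
print(lookup_budget_ns(8, 3))   # ~0.28 ns per lookup
```

Sub-nanosecond per-lookup budgets are why the lookup memory's raw clock rate, not just aggregate throughput, becomes the hard constraint.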
That's why for very high-speed switching MPLS is preferred over IP: it offers higher performance. And MPLS uses CAM, not TCAM, so it's more power-efficient as well. There's no substitute for label switching at the highest speeds.
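The performance gap comes down to the nature of the two lookups: an MPLS label is an exact-match key (what a CAM or hash table does in one access), while an IP destination requires a longest-prefix match across all prefix lengths. A toy sketch with made-up tables:

```python
# Exact-match label lookup vs. longest-prefix match, illustrated in software.
# The tables and interface names below are hypothetical.
import ipaddress

# MPLS: label -> (outgoing interface, outgoing label); one exact-match access.
label_fib = {100: ("eth1", 200), 101: ("eth2", 201)}

def mpls_lookup(label):
    return label_fib[label]

# IP: prefix -> outgoing interface; must find the LONGEST matching prefix.
prefix_fib = {
    ipaddress.ip_network("10.0.0.0/8"): "eth1",
    ipaddress.ip_network("10.1.0.0/16"): "eth2",
}

def ip_lookup(dst):
    addr = ipaddress.ip_address(dst)
    best = None
    for net, out_if in prefix_fib.items():   # every prefix length is a candidate
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, out_if)
    return best[1] if best else None

print(mpls_lookup(100))        # ('eth1', 200)
print(ip_lookup("10.1.2.3"))   # 'eth2' -- the /16 wins over the /8
```

Hardware obviously doesn't iterate like this, but the asymmetry is the same: exact match is one associative access, while LPM needs TCAM (or multi-stage SRAM tries) to resolve overlapping prefixes.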
Mellanox's 300 ns cut-through latency is the total device latency. To be honest, from what I've read, for cut-through switching the method of measurement is the time gap between when the first bit enters the ingress and when that same bit reaches the egress, not counting the time between reaching the egress and actually leaving it, i.e. the time sitting in the reordering buffer and output queue. That's likely why they can claim this kind of latency. And even then, I think this kind of total latency only applies when they do L2 cut-through, the simplest, most lightweight form of packet processing, and with uniform synthetic traffic anyway.
The total/nominal device latency = ingress pipeline processing time + cell segmentation delay + input buffering delay + fabric arbitration time + transmission through the fabric + cell reassembly (in the case of a cell-based fabric) + output queueing delay. That's why there's a big gap between the inter-packet arrival time and the total device delay. Devices with a high-performance architecture and a superior fabric scheduler generally have lower total delay. Packet buffering is the dominating factor, as the buffers are generally the slowest part of the processing due to high memory round-trip time.
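The additive breakdown above can be made concrete with placeholder numbers; every figure here is invented purely for illustration, and only the structure of the sum mirrors the formula:

```python
# Illustrative total-device-latency breakdown per the formula above.
# All component values are made-up placeholders, not vendor measurements.

latency_ns = {
    "ingress pipeline processing": 80,
    "cell segmentation":           20,
    "input buffering":             60,
    "fabric arbitration":          40,
    "fabric transmission":         30,
    "cell reassembly":             25,
    "output queueing":             45,
}

total = sum(latency_ns.values())
print(f"total device latency: {total} ns")          # 300 ns with these placeholders

# Compare with the 6.72 ns inter-packet arrival time (64B frames at 100GbE):
print(f"ratio to arrival time: {total / 6.72:.0f}x")
```

The point of the exercise: even a sub-microsecond device latency is dozens of packet times, which is why per-stage pipeline delay, not end-to-end device latency, is what determines line-rate capability.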
As for TCAM pipelining, I'm aware of SRAM pipelining, as people seek a higher-performance alternative lookup memory to TCAM given that TCAM clock rates are generally slow, but SRAM pipelining suffers from complex wiring and memory wastage, and so is not yet found in mainstream routers. I'm not aware of TCAM pipelining, though. I know vendors have been using chip-level parallelism with TCAM for years to overcome the low clock rate and high power usage, as they need more and more megabits to store ever-bigger routing tables, but that's parallelism, not pipelining; the nature of TCAM doesn't lend itself to pipelining AFAIK. Chip parallelism is also the reason why vendors can afford tricks like TCAM carving/resizing and UFT.