What Are The Problems with Broadcom Tomahawk? We Don’t Know

One of my readers has customers who have already experienced performance challenges with Tomahawk-based data center switches. He sent me an email along these lines:

My customers are concerned about buffer performance for packets that are 200 bytes and under. MORE IMPORTANTLY, a customer informed me that there were performance issues when running 4x25GE connections with one group of ports talking to another group.

Reading the report Mellanox published not so long ago, it seems there really is something fishy going on with Tomahawk.

Let’s be realistic: every chipset has limitations. Every silicon vendor has to make tradeoffs. Every vendor loves to find a chink in another vendor’s product and tries to present it as a glaring abyss.

However, there are vendors that are pretty open about their chipset architectures. Well, make it A Vendor – Cisco has some amazing “a day in the life of a packet” Cisco Live presentations that anyone with an email address can watch (if there’s someone else as open as Cisco is about their internal architectures, please post a comment).

Then there are vendors that document the realistic boundaries of their architectures. For example, when Cisco was selling oversubscribed 10GE linecards for Nexus 7000, they documented which ports share the forwarding bandwidth. Arista was pretty open about the forwarding performance of their 7300 switches (so it was easy to deduce they are not line rate at small packet sizes). Brocade documented their ASIC limitations with regard to link aggregation.

So what’s the problem with Broadcom chipsets? We don’t know what the limitations are because they’re hiding the information, and everyone who does know what’s really going on can’t talk about it.

Well, the real problem is that Broadcom doesn’t have to care – every major data center switching vendor (apart from Brocade, a few Arista and Cisco switches, and Juniper QFX10K) is using some Broadcom chipset.

Before someone starts listing a dozen reasons why hiding that information makes sense, let me point out that Intel publishes all of their documentation.

The only way to get any information on how the stuff that's sitting inside most data center switches these days really works is to try to reverse-engineer tidbits from various vendor presentations or their configuration guidelines, or to rely on Broadcom competitors spilling the beans disguised as a test report. And yet some people call this state of affairs progress.

For a more articulate take on the same topic, read the excellent blog post by Carlos Cardenas, and if you want to know more about implementing a networking operating system, register for my online course with Russ White as the guest speaker.

8 comments:

  1. Hi Ivan,

    You are completely right: while the internal design can remain secret, ASIC designers and switch vendors should clearly define the capabilities and limitations of their products.

    The Mellanox report that you link to is really interesting and insightful. While I don't know the internal architecture of the ASIC, the unfairness problems in Tomahawk seem to come from a hierarchical organization of the switch. One such organization was presented by Kim et al. in "Microarchitecture of a High-Radix Router", ISCA'05 (link). The design proposed in the paper organizes the ports in groups of 8 sharing the same "row", with different groups connected by "columns".

    The problem with hierarchical organizations is global fairness. While a simple round-robin (RR) arbitration mechanism is locally fair, a multi-level RR mechanism is not globally fair. This problem is clearly presented by Abts et al. in "Age-based packet arbitration in large-radix k-ary n-cubes", SC'07 (see Figure 4 in this link), and seems to be related to the fairness issues of the Tomahawk.

    The results in Figures 2 and 3 of the report are consistent with a hierarchical switch organization in which ports are divided into two groups of 16 ports, in particular with the following division: group 1 containing ports 1-8 and 25-32, and group 2 containing ports 9-24. If you look carefully at the results, you can notice that each of the two groups always gets 50% of the bandwidth, which is then divided evenly between the active ports of each group (this may not be obvious at first sight since the ports in group 1 are not consecutive, particularly in test 2 in Figure 3, but it holds in all cases); see the small arbitration sketch at the end of this comment. The other latency and packet loss results suggest that Tomahawk employs a lossy internal architecture, while Mellanox probably inherits its design from its lossless InfiniBand switches.

    It is very interesting to consider that solutions designed to provide fairness at L2 (such as DCB QCN, which converges to fair shares between flows) will fail if the individual switches are not fair. Related to this, there was a European company called Gnodal (from Bristol, UK) that designed an Ethernet solution with global fairness. It employed a proprietary overlay encapsulation mechanism within its fabric, which included an "Age" field in the frame header used to provide per-port global fairness. Too bad they went bankrupt and their IP assets were bought by the American HPC company Cray (which had already used age-based arbitration in its XC supercomputer series); maybe in the near future we'll find some HPC solution from them using Ethernet with global fairness.

    If Broadcom didn't have such a dominant market share, these types of issues could cost them significant business. Let's see if some more companies enter the 100G ASIC market.
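
    For what it's worth, here's a tiny Python sketch of the effect I'm describing (a hypothetical two-level arbiter with made-up port names, not Broadcom's documented design), compared with the age-based arbitration used by Abts et al. and Gnodal. With three active senders in one group and one in the other, the hierarchical arbiter gives the lone sender half the bandwidth, while age-based arbitration converges to a fair 25% per sender:

      # Toy model: ingress ports compete for one congested egress port.
      # Hypothetical sketch of the behaviour discussed above, not a model
      # of the actual Tomahawk scheduler.
      from collections import Counter
      from itertools import cycle

      def two_level_rr(groups, slots):
          """groups: dict group_id -> list of active ingress ports.
          Level 1 picks a group round-robin, level 2 a port within it."""
          group_rr = cycle(sorted(groups))
          port_rr = {g: cycle(ports) for g, ports in groups.items()}
          served = Counter()
          for _ in range(slots):
              served[next(port_rr[next(group_rr)])] += 1
          return {p: n / slots for p, n in sorted(served.items())}

      def age_based(groups, slots):
          """Always serve the head packet that has waited longest; this is
          globally fair when every port keeps its queue full."""
          ports = [p for g in sorted(groups) for p in groups[g]]
          head_arrival = {p: 0 for p in ports}
          served, now = Counter(), 0
          for _ in range(slots):
              oldest = min(ports, key=lambda p: head_arrival[p])
              served[oldest] += 1
              now += 1
              head_arrival[oldest] = now   # next packet from that port arrives now
          return {p: n / slots for p, n in sorted(served.items())}

      # Test-2-like scenario: three senders in group 1, one sender in group 2
      scenario = {1: ["p1", "p2", "p25"], 2: ["p9"]}
      print(two_level_rr(scenario, 12000))  # p9 ~50%, the other three ~16.7% each
      print(age_based(scenario, 12000))     # 25% per sender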
  2. Juniper has an amazing "An Expert Packet Walkthrough on the MX Series 3D ..." publication by David Roy. Comparable to, or even better than, the ones from Cisco.
    Replies
    1. David Roy is not a Juniper employee though; he basically debugged the MX internal structure by himself as a customer (and continues to do so). But yes, it's an incredible booklet, really helpful, and I wish every vendor would write something like that from time to time.

      PS: David Roy's blog is just as incredible; I love his posts: periodic packet management details, BGP FlowSpec, event scripts, helpful hints for traffic capture... well, everything.
  3. All of the Juniper books (MX, QFX) always have an amazingly detailed packet walkthrough. I believe the Juniper QFX5100 book was the first to include a very detailed Broadcom walkthrough.
  4. Cisco has also found that Tomahawk cannot avoid dropping pause frames. So much for PFC, FCoE and RoCE!
  5. At Broadcom there seem to be different ASIC lines:
    - Trident+ => Trident II => Tomahawk
    - Broadcom DNX

    So, what about that DNX Qumran MX+? Does this ASIC have the same problems?

    => http://community.hpe.com/hpeb/attachments/hpeb/switching-a-series-forum/7787/2/H3C-2016.pdf
  6. "Intel publishes all of their documentation."
    That is not a true statement. They also require an NDA/SLA for their newest silicon.
  7. Thx for pointing to this blog post, Ivan! I've read it along with the comments, the test, and the papers linked.

    So basically a lot of vendors these days are just glorified Broadcom resellers :p. It's funny how some of them try to talk themselves up by saying they differentiate their offerings with their network OS. First off, since all the heavy-weight data forwarding is done in the LC and the fabric, their devices, regardless of brand name, are essentially Broadcom routers/switches first and last. No matter how their OS is written, they're stuck with the Broadcom-provided fabric architecture and forwarding pipeline, with all its limitations, sometimes severe ones like the horrible fabric scheduler pointed out in the Mellanox test.

    I won't go into too much detail here as it would dilute the bigger point I'm trying to get at, but as we've discussed before, providing line rate for 64-byte packets at 100 Gbps is mostly mission impossible, considering that packets arrive every 6.7 ns and you probably have less than half that time to finish an IP lookup if you want to make it in time (see the quick arithmetic sketch at the end of this comment). Couple that with a port count of 32 or 64, and classic crossbar fabric arbitration using a central scheduler doesn't work anymore, as the scheduler becomes both the bottleneck and a power hot-spot, so one has to use distributed schedulers to scale. From what Enrique observed and the way it played out in the Mellanox test, Broadcom's approach to scheduling (basically centralized, attempting to compensate for its inefficiency by grouping ports under two schedulers) is nonsensical and pathetic. Why bother promoting 400GbE when you're clearly incapable of handling 100GbE properly? And believe it or not, small-packet line-rate processing is very important; DDoS survival, for example, needs it, and it also means that when you start providing value-added classification besides basic IP lookup, your router won't fall over.

    Second, when it comes to the network OS, Cisco and Juniper have been in business for 36 and 24 years respectively, and as they had top-class protocol implementers like Dave Katz and Henk Smit working for them during their formative years, their core routing and switching code had matured and stabilized before some of the new vendors even came to be, so for younger vendors, trying to differentiate themselves on the OS is laughable. Who trusts their code to be bug-free, esp. when it comes to complex protocols like OSPF, BGP, and yes, xLFA :p.

    All in all, these two points make it totally unjustifiable for merchant-silicon-based vendors these days to charge a premium for what amounts to essentially commoditized, off-the-shelf products.

    Monopoly is never good for any industry, and networking is no exception. Among other things, it stifles innovation. The result is what we've been witnessing for a long time now: lazy networking and second-tier products. Many vendors hardly invent anything these days anymore, except crappy 'new' encapsulation formats. And since Broadcom doesn't have any sizable competitor that can threaten their domination, they won't feel the need to improve and fix all the nasty dogshit in their chipsets anymore.

    And vendors still wonder why cloud companies are taking their business away :p. They only have themselves and their laziness to blame. The prize for not reinventing yourself is a one-way ticket to oblivion.

    On that note, vendors like Cisco and Juniper, who are still trying hard the old-school way to produce their own silicon, are worth admiring, no matter what the quality of their products. They deserve credit for at least trying.
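
    Just to back up the 6.7 ns number above, here's a quick arithmetic sketch in Python (assuming standard Ethernet framing overhead, nothing vendor-specific):

      # Minimum-size Ethernet frames at 100 Gbps, counting the on-wire
      # overhead (preamble + SFD and inter-frame gap) per packet.
      LINE_RATE_BPS = 100e9      # 100GbE
      FRAME_BYTES = 64           # minimum Ethernet frame
      PREAMBLE_SFD_BYTES = 8
      IFG_BYTES = 12

      wire_bits = (FRAME_BYTES + PREAMBLE_SFD_BYTES + IFG_BYTES) * 8   # 672 bits
      pkt_time_ns = wire_bits / LINE_RATE_BPS * 1e9                    # ~6.72 ns
      pps_per_port = LINE_RATE_BPS / wire_bits                         # ~148.8 Mpps

      print(f"{pkt_time_ns:.2f} ns per packet, {pps_per_port / 1e6:.1f} Mpps per port")
      print(f"32-port aggregate: {32 * pps_per_port / 1e9:.2f} Gpps")
      # -> 6.72 ns per packet, 148.8 Mpps per port, ~4.76 Gpps for a 32-port box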
