
What Are The Problems with Broadcom Tomahawk? We Don’t Know

One of my readers has customers who have already experienced performance challenges with Tomahawk-based data center switches. He sent me an email along these lines:

My customers are concerned about buffer performance for packets that are 200 bytes and under. MORE IMPORTANTLY, a customer informed me that there were performance issues when running 4x25GE connections when one group of ports speaks to another group.

Judging by the report Mellanox published not so long ago, there really is something fishy going on with Tomahawk.

Let’s be realistic: every chipset has limitations. Every silicon vendor has to make tradeoffs. And every vendor loves to find a chink in a competitor’s product and present it as a glaring abyss.

However, there are vendors that are pretty open about their chipset architectures. Well, make that A Vendor – Cisco has some amazing “a day in the life of a packet” Cisco Live presentations that anyone with an email address can watch (if there’s another vendor as open about its internal architectures as Cisco, please post a comment).

Then there are vendors that document the realistic boundaries of their architectures. For example, when Cisco was selling oversubscribed 10GE linecards for Nexus 7000, they documented which ports share the forwarding bandwidth. Arista was pretty open about the forwarding performance of their 7300 switches (making it easy to deduce they’re not line rate at small packet sizes – see the quick math below). Brocade documented their ASIC limitations regarding link aggregation.
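
A quick back-of-the-envelope sketch (standard Ethernet framing overhead, nothing vendor- or chipset-specific) shows why small packets are the usual stress test: every frame carries 20 extra bytes on the wire (7-byte preamble, 1-byte SFD, 12-byte inter-frame gap), so the packet rate needed to fill a link grows rapidly as frames shrink:

```python
# Packets-per-second needed to saturate a link at a given frame size.
# Wire overhead per frame: 7 B preamble + 1 B SFD + 12 B inter-frame gap = 20 B
# (the frame size itself already includes the 4 B FCS).
def line_rate_mpps(link_gbps, frame_bytes, overhead=20):
    return link_gbps * 1e9 / ((frame_bytes + overhead) * 8) / 1e6

for gbps, frame in [(10, 64), (25, 200), (100, 64)]:
    print(f"{gbps} GE @ {frame}-byte frames: {line_rate_mpps(gbps, frame):.2f} Mpps")
# 10 GE @ 64-byte frames: 14.88 Mpps
# 25 GE @ 200-byte frames: 14.20 Mpps
# 100 GE @ 64-byte frames: 148.81 Mpps
```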

So what’s the problem with Broadcom chipsets? We don’t know what the limitations are because they’re hiding the information, and everyone who does know what’s really going on can’t talk about it.

Well, the real problem is that Broadcom doesn’t have to care – every major data center switching vendor (apart from Brocade, a few Arista and Cisco switches, and Juniper QFX10K) is using some Broadcom chipset.

Before someone starts listing a dozen reasons why hiding that information makes sense, let me point out that Intel publishes all of their documentation.

The only way to get any information on how the stuff sitting inside most data center switches these days really works is to reverse-engineer tidbits from various vendor presentations and configuration guidelines, or to rely on a Broadcom competitor spilling the beans disguised as a test report. And yet some people call this state of affairs progress.

For a more articulate take on the same topic, read the excellent blog post by Carlos Cardenas, and if you want to know more about implementing a networking operating system, register for my online course with Russ White as the guest speaker.

6 comments:

  1. Hi Ivan,

    You are completely right: while the internal design can remain secret, ASIC designers and switch vendors should clearly define the capabilities and limitations of their products.

    The Mellanox report that you link is really interesting and insightful. While I don't know the internal architecture of the ASIC, the unfairness problems in Tomahawk seem to come from a hierarchical organization of the switch. One such organization was presented by Kim et al. in "Microarchitecture of a High-Radix Router", ISCA'05 (link). The design proposed in the paper organizes the ports in groups of 8 sharing the same "row", with different groups connected by "columns".

    The problem with hierarchical organizations is global fairness: while a simple round-robin (RR) arbitration mechanism is locally fair, a multi-level RR mechanism is not globally fair. This problem is clearly presented by Abts et al. in "Age-based packet arbitration in large-radix k-ary n-cubes", SC'07 (see Figure 4 in this link), and seems to be related to the fairness issues of the Tomahawk.

    The results in Figures 2 and 3 of the report are consistent with a hierarchical switch organization in which ports are divided into two groups of 16 ports: group 1 containing ports 1-8 and 25-32, and group 2 containing ports 9-24. If you look carefully at the results, you can see that each of the two groups always gets 50% of the bandwidth, which is then divided evenly between the active ports of each group (this may not be obvious at first sight since the ports in group 1 are not consecutive, particularly in test 2 in Figure 3, but it holds in all cases – see the short sketch after this comment). The other latency and packet loss results suggest that Tomahawk employs a lossy internal architecture, while Mellanox probably inherits its design from its lossless InfiniBand switches.

    It is very interesting to consider that solutions designed to provide fairness at L2 (such as DCB QCN, which converges to fair shares between flows) will fail if the individual switches are not fair. Related to this, there was a European company called Gnodal (from Bristol, UK) that designed an Ethernet solution with global fairness: it employed a proprietary overlay encapsulation mechanism within its fabric, with an "Age" field in the frame header used to provide per-port global fairness. Too bad they went bankrupt; their IP assets were bought by the American HPC company Cray (which had already used age-based arbitration in its XC supercomputer series), so maybe in the near future we will see an HPC solution from them using Ethernet with global fairness.

    If Broadcom didn't have such a dominant market share, these types of issues could cost it significant business. Let's see if more companies enter the 100G ASIC market.

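    To make the arithmetic concrete, here's a minimal Python sketch of the two-level round-robin sharing described above. The port grouping is the hypothetical one inferred from the report, not documented Tomahawk internals:

    ```python
    # Hypothetical two-level (hierarchical) arbitration model -- illustrates
    # the fairness argument, not actual Tomahawk internals.
    GROUP1 = set(range(1, 9)) | set(range(25, 33))   # ports 1-8 and 25-32 (assumed)
    GROUP2 = set(range(9, 25))                       # ports 9-24 (assumed)

    def per_port_share(active_ports):
        """Bandwidth share per active port if the arbiter first splits the
        egress bandwidth evenly between groups, then evenly within each group."""
        g1 = [p for p in active_ports if p in GROUP1]
        g2 = [p for p in active_ports if p in GROUP2]
        groups = [g for g in (g1, g2) if g]
        return {p: 1 / len(groups) / len(g) for g in groups for p in g}

    # Three senders in group 1 and one in group 2 congesting the same egress:
    print(per_port_share([1, 2, 25, 9]))
    # ports 1, 2 and 25 get ~16.7% each while port 9 alone gets 50%;
    # a globally fair arbiter would give every port 25%
    ```
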
  2. Juniper has an amazing "An Expert Packet Walkthrough on the MX Series 3D ..." publication by David Roy, comparable to or even better than the ones from Cisco.

    1. David Roy is not a Juniper employee, though; he basically debugged the MX internal structure by himself as a customer (and continues to do so). But yes, it's an incredible booklet, really helpful, and I wish every vendor would write something like that from time to time.

      PS: David Roy's blog is just as incredible; I love his posts: periodic packet management details, BGP FlowSpec, event scripts, helpful hints for traffic capture – well, everything.

  3. All of the Juniper books (MX, QFX) have an amazingly detailed packet walkthrough. I believe the Juniper QFX5100 book was the first to include a very detailed Broadcom walkthrough.

  4. Cisco has also found that Tomahawk cannot avoid dropping pause frames. So much for PFC, FCoE and RoCE!

  5. At Broadcom there seem to be two different ASIC lines:
    - Trident+ => Trident II => Tomahawk
    - Broadcom DNX

    So, what about the DNX Qumran MX+? Does that ASIC have the same problems?

    => http://community.hpe.com/hpeb/attachments/hpeb/switching-a-series-forum/7787/2/H3C-2016.pdf

