Broadcom Tomahawk 101

Juniper recently launched their Tomahawk-based switch (QFX5200) and included a lot of information on the switching hardware in one of their public presentations (similar to what Cisco did with Nexus 9300), so I got a non-NDA glimpse into the latest Broadcom chipset.

You’ll get more information on QFX5200 as well as other Tomahawk-based switches in the Data Center Fabrics Update webinar in spring 2016.

Here’s what I understood from the presentation:

  • Each 100 GE port can be channelized into 4 x 10GE, 4 x 25GE, 2 x 50GE or 1 x 40GE. It seems like each port can run 4 lanes at either 10 Gbps or 25 Gbps;

I may be totally wrong, but the way I understand the specs, the 100GE ports use the 100GBASE-SR4 (802.3bm) standard and would thus be incompatible with switches using the older 100GBASE-SR10 (802.3ba) standard, although they would work with all 40 Gbps switches using 40GBASE-SR4.
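The lane arithmetic behind the channelization options can be sketched in a few lines of Python. This is only an illustration of the "4 lanes at 10 or 25 Gbps" reading above; the function name, the constants and the list of allowed Ethernet speeds are assumptions made for the example, not something taken from the Juniper presentation.

```python
# Hypothetical model: one port = 4 SerDes lanes, every lane runs at 10 or 25 Gbps.
LANES_PER_PORT = 4
LANE_SPEEDS_GBPS = (10, 25)
# Assumption: only these Ethernet speeds are offered as channel speeds.
ETHERNET_SPEEDS_GBPS = {10, 25, 40, 50, 100}


def channel_modes():
    """Return the set of (channel_count, channel_speed_gbps) combinations."""
    modes = set()
    for lane_speed in LANE_SPEEDS_GBPS:
        for lanes_per_channel in (1, 2, 4):
            channel_speed = lane_speed * lanes_per_channel
            if channel_speed in ETHERNET_SPEEDS_GBPS:
                modes.add((LANES_PER_PORT // lanes_per_channel, channel_speed))
    return modes


for count, speed in sorted(channel_modes(), key=lambda mode: mode[1]):
    print(f"{count} x {speed}GE")
```

Under these assumptions the model yields 4 x 10GE, 4 x 25GE, 1 x 40GE, 2 x 50GE and the native 1 x 100GE, which matches the list in the presentation.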

  • Similar to Trident-2, Tomahawk becomes line-rate (3.2 Tbps) at packet sizes above 250 bytes;
  • Presentation claims overlay routing (VXLAN-to-VXLAN or VXLAN-to-VLAN) is not supported, which is a bit surprising as the forwarding pipeline includes tunnel termination before L2 and L3 lookup, which should be good enough;
  • The switching silicon has 10 queues per port (nice!);
  • Switching latency is approximately 500 ns and can be reduced to 300 ns if the chipset is reconfigured into doing only simple L2 switching;
  • Unified forwarding table (UFT; 128K entries) is split in memory banks that can be allocated to L2 entries, ARP entries and L3 LPM entries;
  • One of the printouts in the presentation hinted at 1K LPM IPv6 prefixes longer than /64;
  • Tomahawk supports exact matching of ACL entries in UFT (not TCAM). UFT split with filter-mode profile can have 64K ACL entries, 16K IP LPM entries and 8K ARP/MAC entries;
  • There are 43 queues between the switching silicon and CPU, and you can configure control-plane policing parameters on each queue;
  • The hardware supports 16K MPLS labels (must be a separate MPLS lookup table, not TCAM tricks);
  • TCAM slicing is too tricky for me to understand, but it seems you’ll get between 512 and 6K TCAM entries based on the complexity of the matching conditions. Based on the matching length used by Junos you get up to 512 port- or VLAN ACL entries or up to 1024 IP ACL entries;
  • TCAM is not wide enough for all possible IPv6 matching conditions, so the hardware uses address compression. It seems you can have at most 128 source and destination IPv6 addresses in all filters deployed on the box;
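The line-rate claim in the list above (3.2 Tbps at packet sizes above 250 bytes) can be turned into a rough packets-per-second figure. The sketch below assumes that "packet size" means the Ethernet frame without preamble and inter-packet gap, and adds the standard 20 bytes of per-frame wire overhead; the resulting numbers are back-of-the-envelope arithmetic, not figures from the presentation.

```python
# Per-frame overhead on the wire: 8-byte preamble/SFD + 12-byte inter-packet gap.
ETHERNET_OVERHEAD_BYTES = 20
TOMAHAWK_BANDWIDTH_BPS = 3.2e12  # 32 x 100GE


def line_rate_pps(bandwidth_bps, frame_bytes):
    """Packets per second needed to fill the bandwidth with frames of this size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return bandwidth_bps / bits_on_wire


pps_at_250 = line_rate_pps(TOMAHAWK_BANDWIDTH_BPS, 250)
pps_at_64 = line_rate_pps(TOMAHAWK_BANDWIDTH_BPS, 64)

print(f"250-byte frames: {pps_at_250 / 1e9:.2f} Gpps")  # 1.48 Gpps
print(f" 64-byte frames: {pps_at_64 / 1e9:.2f} Gpps")   # 4.76 Gpps
```

If the assumption holds, the chipset forwards roughly 1.5 billion packets per second, which is about a third of what full line rate with minimum-size frames would require.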

Have I missed or misunderstood something? Please write a comment!

19 comments:

  1. Nice write-up! What do you think about IP-Interfaces attached to SPB? (This was a limitation before, thus the "tunnel termination before L2 and L3 lookup" sounds exactly like that.)
  2. BroadView Instrumentation in the Tomahawk ASIC provides visibility into data plane traffic and hardware table/buffer utilization.
  3. Only time will tell when HP will release a relabeled H3C S6820: www.h3c.com.cn/pub/2015_Event/S6820/index.html

    Btw, HP's A5510 HI switches (== H3C S5560-EI) support multiple table capacity modes:

    0: 32000 MAC, 16000 ARP or 8000 ND, 8000 MPLS labels
    1: 64000 MAC, 16000 ARP or 8000 ND, 4000 MPLS labels
    2: 32000 MAC, 32000 ARP or 16000 ND, 8000 MPLS labels

    It would be interesting to know what Broadcom ASIC they use on those
  4. Also interesting, a new H3C s5800ei with 4GB RAM
    "16K to 128K MAC addresses (The number is configurable...)"
    "VLAN mapping 64K entries (configurable)"
    "ND 32K entries (configurable)"
    "64 OpenFlow instances"
    "6K extensibility flow entries"
    "MAC-IP flow table"

    But what ASIC?

    P.S.: The problem with HP is they don't know their products well enough to answer those questions, which is sad...
  5. Thanks for the post. I agree with Peter, telemetry is a very cool feature introduced with Tomahawk. Also, the latency gets us nearer to InfiniBand, which is good for HPC. Is the buffer 16 MB? Some time ago I heard 85 MB for the StrataXGS. I'm also a bit surprised about VXLAN routing. Apart from that, it's really similar to the Z9100-ON.
  6. Hey Ivan, Serge here from VMware - what are your thoughts on shallow buffer chipsets as opposed to their Dune-based ones for datacenter ToR/Spine deployments where you have vSphere w/ bursty vMotion patterns and also lots of storage traffic like iSCSI or NFS mounts on ESX hosts?
    Replies
    1. I don't know enough about switching silicon to comment on this question... but I know someone who does, and I'm trying to persuade him to do a webinar for me ;)
  7. I have heard from two independent sources (so it must be true, eh?) that the Tomahawk has four switch cores. They did that to try to keep up with the forwarding rate implied by the sum of the interface rates. If that's true, I suspect that the 16 MB of packet buffer has 4 MB attached to each core. The max buffer available to any flow is capped by what its core can give it. This is all written like it's "something I know." It really is a question. Is this how the buffer falls out? Older Tridents are all single-forwarding-core designs.

    Jim Warner, Univ Cal Santa Cruz
    Replies
    1. You got it. The Juniper presentation also mentions 4 x 4 MB packet buffer. Thanks for pointing this out!
  8. In April 2015 Broadcom announced the Trident II+. Are there any products that actually use this part? Is it shipping? It seems like it would have performance advantages over the older Trident II part.

    -jim warner, UCSC
    Replies
    1. Haven't heard anything so far...
    2. Cisco Nexus 3100V uses T2+
      http://www.cisco.com/c/en/us/products/collateral/switches/nexus-3000-series-switches/datasheet-c78-736608.html
    3. Another Trident2+ switch, this time from Accton. AS5812-54X.
      Twisted-pair and SFP+ models. No 100 Gb/s ports. It will be interesting to see if this is the start of a raft of products from the usual suspects.
    4. Juniper has announced one, Arista might already have one... I'm getting lost in all the recent announcements, whenever I have to verify something I open my Data Center Fabrics slide deck (yeah, I know it sounds ridiculous ;).
  9. Ivan,

    UFT split with filter mode will in theory provide 64K ACL entries, but how does this translate to only 512 to 1K with complex matches? This part is not clear, wondering if there are any datasheets that can shed more light on this.

    Replies
    1. The way I understood the Juniper presentation was that UFT handles ACL entries with fixed bitmask (it's a hash table, not TCAM), for other entries you have to use TCAM, which can be split in a few ways, but is still small.

      It would be a miracle if you'd manage to find more specific publicly available information ;) in which case please post a link!
  10. Thanks for the info, Ivan! Yes, I will post a link if I find any more details on this.
  11. AFAIK, unlike the Trident series, Tomahawk doesn't provide hierarchical queuing. There are no provisions for clubbing two or more unicast queues under a queue group. It will be interesting to see what kind of (limited) ETS support a Tomahawk-based switch provides.
  12. What Broadcom ASICs are those:
    H3C S6520-EI => http://download.h3c.com.cn/download.do?id=2485742 => 12.8Tbps
    H3C S5560X-EI => http://download.h3c.com.cn/download.do?id=2697173 => 5.98Tbps