Most recently launched data center switches use the Trident 2 chipset, and yet we know almost nothing about its capabilities and limitations. It might not work at line rate, it might have L3 lookup challenges when faced with L2 tunnels, there might be other unpleasant surprises… but we don’t know what they are, because you cannot get Broadcom’s documentation unless you work for a vendor who signed an NDA.
Interestingly, the best source of Trident 2 technical information I’ve found so far happens to be the Cisco Live Nexus 9000 Series Switch Architecture presentation (BRKARC-2222). Here are a few tidbits I got from that presentation and Broadcom’s so-called datasheet.
Number of GE ports. The Trident 2 chipset supports 32 40GE ports with 128 10GE SerDes circuits, so you can split each 40GE port into four independent 10GE ports. The 10GE ports support 1GE interfaces, but it seems you cannot have more than 104 GE interfaces.
Forwarding of small packets is not done at line rate (BRKARC-2222 slide 25). Vendors using the Trident 2 chipset usually claim to have line rate performance. In reality, you get full line rate forwarding of small packets on 24 out of 32 40GE ports – the limiting factor is the performance of the Trident 2 forwarding engine.
Some vendors (including Arista and Juniper) specify maximum throughput and packet forwarding capacity of their switches. Fair enough – do the math and decide whether the numbers are good enough for your workload. Cisco decided to use 24 out of 32 ports – that explains the “weird” number of ports (24, 36, 48 or 96 instead of 32, 64 or 128) on some switches and linecards.
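Here is a rough sketch of that math. It assumes the standard 20 bytes of per-frame Ethernet overhead (preamble, start-of-frame delimiter, inter-frame gap) and, more importantly, assumes that the forwarding engine's packet rate equals exactly 24 ports' worth of 64-byte frames. The second assumption is my reading of the slide, not a published Broadcom figure, and the function names are made up for illustration.

```python
import math
from fractions import Fraction

# Per-frame overhead on the wire: preamble (7) + SFD (1) + inter-frame gap (12)
ETHERNET_OVERHEAD_BYTES = 20
FORTY_GE_BPS = 40_000_000_000


def line_rate_pps(link_bps: int, frame_bytes: int) -> Fraction:
    """Packets per second needed to saturate one link with frames of this size."""
    return Fraction(link_bps, (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def min_line_rate_frame(ports: int, link_bps: int, engine_pps: Fraction) -> int:
    """Smallest frame size (bytes) at which all ports can run at line rate."""
    wire_bytes = Fraction(ports * link_bps) / (engine_pps * 8)
    return max(64, math.ceil(wire_bytes - ETHERNET_OVERHEAD_BYTES))


per_port_pps = line_rate_pps(FORTY_GE_BPS, 64)

# ASSUMPTION: the forwarding engine tops out at 24 x 40GE of 64-byte frames.
engine_pps = 24 * per_port_pps
needed_pps = 32 * per_port_pps

print(f"64-byte line rate per 40GE port: {float(per_port_pps) / 1e6:.2f} Mpps")
print(f"Assumed engine capacity:         {float(engine_pps) / 1e6:.0f} Mpps")
print(f"Needed for 32 ports at 64 bytes: {float(needed_pps) / 1e6:.0f} Mpps")
print(f"All 32 ports at line rate from:  "
      f"{min_line_rate_frame(32, FORTY_GE_BPS, engine_pps)}-byte frames")
```

Under these assumptions, all 32 ports would reach line rate once frames are around 92 bytes or larger, so whether the limitation matters depends on your traffic mix.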
Unified forwarding tables (BRKARC-2222 slide 38). The Trident 2 chipset has a 16K TCAM table (used for traditional longest prefix matching – LPM) and a large Unified Forwarding Table (UFT) that can be used for exact matches (MAC addresses, IP host routes and ARP/ND entries, IP multicast entries) as well as prefix matches in Algorithm LPM mode.
Not surprisingly, the numbers given in the BRKARC-2222 presentation match the numbers found in Arista data sheets (32K to 288K MAC entries, 32K to 288K IPv4 host routes…) – just keep in mind that:
- You cannot reach all maximums at the same time (you cannot have 128K IPv4 routes with 288K IPv4 ARP entries);
- We don’t know what the granularity of UFT allocations is. You probably cannot trade one ARP entry for one MAC address;
- You might have to specify how you want the UFT sliced up (sdm prefer, anyone?)
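The first two caveats can be illustrated with a toy model. Everything in it is hypothetical: the 32K dedicated tables, the four shared banks and the 64K bank size are values I picked so that the per-table range comes out as 32K to 288K, and the rule that a whole bank goes to a single table is a guess. The real bank layout and allocation granularity are not public.

```python
from dataclasses import dataclass

K = 1024


@dataclass
class UftModel:
    """Toy UFT: each table has a dedicated base; shared banks are assigned whole."""
    dedicated: dict      # table name -> dedicated entries (HYPOTHETICAL values)
    shared_banks: int    # number of shared banks (HYPOTHETICAL)
    bank_entries: int    # entries per shared bank (HYPOTHETICAL)

    def banks_needed(self, table: str, entries: int) -> int:
        """Shared banks one table needs beyond its dedicated space."""
        extra = max(0, entries - self.dedicated.get(table, 0))
        return -(-extra // self.bank_entries)  # ceiling division

    def fits(self, demand: dict) -> bool:
        """True if all requested table sizes fit into the shared banks together."""
        total = sum(self.banks_needed(t, n) for t, n in demand.items())
        return total <= self.shared_banks


# ASSUMPTION: illustrative sizes only, not Broadcom's actual layout.
model = UftModel(dedicated={"mac": 32 * K, "host": 32 * K},
                 shared_banks=4, bank_entries=64 * K)

print(model.fits({"mac": 288 * K}))                    # one table at its maximum
print(model.fits({"mac": 288 * K, "host": 288 * K}))   # both maximums at once
print(model.fits({"mac": 160 * K, "host": 160 * K}))   # an even split
print(model.fits({"mac": 161 * K, "host": 160 * K}))   # 1K more MACs costs a whole bank
```

In this model, a single table can reach its maximum, both tables cannot, and asking for slightly more than a bank boundary consumes an entire extra bank, which is why you probably cannot trade one ARP entry for one MAC address.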
The multistage forwarding (linecard+fabric) in Nexus 9500 uses an interesting UFT optimization (BRKARC-2222 slide 47):
- UFTs on linecard chipsets contain MAC and ARP/ND (IP host route) information;
- UFTs on the fabric modules contain longest-prefix match entries (IP FIB).
End result: 160K MAC entries, 88K ARP entries and 128K IP routes at the same time.
The Nexus 9500 non-hierarchical forwarding mode (BRKARC-2222 slide 49) moves the IP LPM table into the linecard UFT. This forwarding mode reduces inter-subnet forwarding latency between ports on the same linecard at the expense of a reduced number of forwarding entries (LPM entries share the UFT space with MAC and IP host route entries).
Finally, the Max Host mode uses the fabric UFT for IPv4 and the linecard UFT for IPv6, maximizing the number of ARP/ND entries at the expense of the FIB table size.
No routing with overlays (BRKARC-2222 slide 81). The Trident 2 chipset doesn’t support routing of VXLAN-encapsulated packets, and based on some other vendors’ limitations it seems to have the same challenges with any overlay technology (including TRILL and potentially SPB).
It’s my understanding (based on the scarce information available) that the problem might lie in the structure of the forwarding pipeline – by the time the chipset figures out it’s the overlay tunnel endpoint for the incoming packet and performs the L2 lookup of the destination MAC address, it’s too late for another L3 lookup.
The workaround is hinted at in the BRKARC-2222 presentation: the packet has to be recirculated through the forwarding pipeline.
Remember the front-panel cables between F2 and M1 linecards on Nexus 7000? Same idea, implemented in silicon, probably resulting in similar performance.
Cisco solved the problem with its ACI Leaf Engine (ALE) chipset. One could also implement L3 forwarding on fabric modules in modular switches, or use a second Trident 2 chipset (building a leaf-and-spine architecture within the ToR switch).
Takeaway: Trident 2 has challenges performing L3 forwarding in combination with L2 tunnels. Have a long discussion with your vendor before implementing a design that uses the two features together, even when the datasheets imply everything works just fine.
Finally, looking at the Nexus 9300 architecture (BRKARC-2222 slide 59), there are only eight 40GE lanes between the Trident 2 chipset and the ALE chipset on the Nexus 93128TX, which means that you won’t get line rate VXLAN routing on that switch.
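The arithmetic behind that claim, as a quick sketch. The 8 x 40GE figure comes from the slide; the 96 front-panel 10GE ports are my assumption based on the Nexus 93128TX port count, and the calculation assumes the worst case in which traffic from every front-panel port needs VXLAN routing and therefore has to cross the links between the Trident 2 and ALE chipsets.

```python
def oversubscription(front_panel_gbps: float, internal_gbps: float) -> float:
    """Ratio of offered front-panel bandwidth to the internal link bandwidth."""
    return front_panel_gbps / internal_gbps


# ASSUMPTION: 96 x 10GE front-panel ports attached to the Trident 2 chipset.
front_panel_gbps = 96 * 10
# From the slide: 8 x 40GE between the Trident 2 and ALE chipsets.
internal_gbps = 8 * 40

ratio = oversubscription(front_panel_gbps, internal_gbps)
print(f"Front panel: {front_panel_gbps} Gbps, Trident 2 to ALE: {internal_gbps} Gbps")
print(f"Worst-case oversubscription for routed VXLAN traffic: {ratio:.0f}:1")
```

With these inputs the worst case is 3:1 oversubscription. Real traffic rarely needs VXLAN routing on every packet, so the practical impact depends on how much of your traffic crosses subnets within the overlay.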
Last week we recorded the second part of the 2014 Data Center Fabrics Update webinar. The videos and slide deck are already available to all participants of past webinars, anyone who bought the webinar recording at any time in the past, and webinar subscribers.