Cisco Nexus 3548: A Victory for Custom ASICs?
Autumn must be a perfect time for data center product launches: last week Brocade launched its core VDX switch and yesterday Arista and Cisco launched their new low-latency switches (yeah, the simultaneous launch must have been pure coincidence).
I had the opportunity to listen to Cisco’s and Arista’s product briefings, continuously experiencing a weird feeling of déjà vu. The two switches look like twin brothers … but there are some significant differences between the two:
- Cisco’s Nexus 3548 is narrowly focused on high-performance trading, Arista’s traditional market. Arista’s 7150 is a more generic top-of-rack switch with features targeting private clouds (example: VXLAN termination);
- Arista’s switches use merchant silicon, Nexus 3548 runs on a new generation of Cisco ASICs (read Colin McNamara’s blog post for more details);
- Both switches have comparable table sizes: 64K MAC addresses and 64K adjacent hosts (ARP/ND table). Arista’s switch has significantly bigger IP forwarding tables (84K IP routes versus 16K IP routes in Nexus 3548);
- 7150S-64 has 64 10GE ports, Nexus 3548 has 48 10GE ports;
- Surprisingly, the typical power draw of Nexus 3548 is almost identical to Arista’s 7152 (52-port switch);
- And finally (my favorite): only one of the two supports IPv6.
Focusing on additional software and hardware features, it’s obvious Cisco was reading Arista’s HPT playbook: both switches can combine four 10GE ports into a single 40GE port (not a LAG), have microburst management, APIs, precision timing with PTP, timestamps in SPAN/mirrored packets, and hardware NAT with ridiculously low latencies.
The latency game
The true difference between the two switches is the packet forwarding latency.
Arista was traditionally the market leader in this space, and its new switch raised (actually lowered) that bar significantly to ~380 nanoseconds … but only for a few moments – Nexus 3548 has 250-nanosecond cut-through latency, which can be further reduced to 190 nanoseconds in warp mode (yes, you do need an additional software license to enable the warp drive). The trick to the reduced latency is a reduced MAC table size: 8K addresses in warp mode.
Nexus 3548 also has mindboggling hub-like warp SPAN performance: mirroring packets from input to a set of output ports takes 50 nanoseconds (or ~60 bytes @ 10 Gbps). Obviously this trick only works with cut-through switching (which can’t be done from 10GE to 40GE ports or vice versa) on idle output ports.
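The “~60 bytes @ 10 Gbps” figure is easy to verify with back-of-the-envelope arithmetic – 50 nanoseconds on a 10GE link corresponds to roughly one minimum-size Ethernet frame on the wire:

```python
# Sanity check on the warp SPAN figure: how much data crosses a 10GE link
# in the 50 ns it takes to mirror a packet?
LINK_BPS = 10e9          # 10 Gbps
SPAN_LATENCY_S = 50e-9   # 50 nanoseconds

bits_on_wire = LINK_BPS * SPAN_LATENCY_S   # 500 bits
bytes_on_wire = bits_on_wire / 8           # 62.5 bytes
```

62.5 bytes is just under the 64-byte minimum Ethernet frame size, hence the “~60 bytes” shorthand.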
Do we care?
A bit of perspective: the speed of light is finite – light travels one meter in roughly 3 nanoseconds. Signal propagation in fiber or copper is a bit slower, taking approximately 5 nanoseconds per meter. 10-meter cables thus introduce ~100 ns of latency (50 ns on each leg) … and then there’s the latency introduced by SFP+ transceivers.
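To put the switch latencies in context, here’s the same cabling arithmetic as a quick sketch (assuming the ~5 ns/m propagation figure from above):

```python
# Cable propagation delay at ~5 ns per meter (signals travel at roughly
# 2/3 the speed of light in fiber or copper).
NS_PER_METER = 5

def one_way_delay_ns(cable_length_m):
    """Propagation delay over a single cable run, in nanoseconds."""
    return cable_length_m * NS_PER_METER

# Two 10 m cable legs (in and out of the switch) add ~100 ns --
# more than the 60 ns warp-mode advantage of the Nexus 3548.
total_cable_delay = 2 * one_way_delay_ns(10)
```

In other words, moving your servers a few meters closer to the switch buys you as much latency as a software license for warp mode.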
I am positive there are people out there who think they need this kind of performance and are willing to pay for it. I am also positive that almost all of us (particularly those who still have to work with data residing on disk drives) stopped caring a long while ago, when forwarding latencies dropped to a few microseconds.
Both switches also support NAT...
My feeling is that this is more a customized ASIC based on a merchant ASIC than a custom homemade ASIC.
Last time I checked, the latency crown was held by Gnodal, which advertises sub-150 ns cut-through switching latency.
Not really sure if 50-100 ns makes that much difference, but HFT people really are shaving off microseconds here and there. The actual production network (server-to-server) is usually InfiniBand with RDMA and similar technologies.
Some reading on low-latency infra, if you're interested (it's PPT, unfortunately) - http://www.informatix-sol.com/docs/LowLatency101.pdf
A little dated, perhaps, but mostly still relevant.
You say in your blog post: "The trick to reduced latency is reduced MAC table size: 8K addresses in warp mode."
What does reducing the MAC table actually do in silicon? I understand the buffer indexing the MAC addresses will be smaller, but where does the extra space go?
Short summary: can you do a deep dive on this and enlighten me? :)
Many thanks !
MAC tables are usually organized as either TCAMs or hash tables. In both cases, accessing a larger table might take an extra (hardware) step, resulting in higher latency. Just guessing.
Key features tab, Arista 7150S Data Sheet (page 3):
Layer 3 Features
21K IPv6 Routes*
2K IPv6 Multicast Routes*
* Supported in a future software release
Also, Arista’s older 7124SX and the newer 7150S use the (Intel) Fulcrum ASIC – the key here is consistent latency and jitter across all packet sizes, the same for layer 2, layer 3, or any other feature. I would like to see the Cisco switch tested in that manner.
I was wondering how they managed to cram 64 SFP cages into a 1u chassis :-)
As for vMotion over L3, see http://blog.ipspace.net/2014/09/vmotion-enhancements-in-vsphere.html