Cisco Nexus 3548: A Victory for Custom ASICs?

Autumn must be a perfect time for data center product launches: last week Brocade launched its core VDX switch and yesterday Arista and Cisco launched their new low-latency switches (yeah, the simultaneous launch must have been pure coincidence).

I had the opportunity to listen to Cisco’s and Arista’s product briefings, continuously experiencing a weird feeling of déjà vu. The two switches look like twin brothers … but there are some significant differences between the two:

  • Cisco’s Nexus 3548 is narrowly focused on high-performance trading, Arista’s traditional market. Arista’s 7150 is a more generic top-of-rack switch with features targeting private clouds (example: VXLAN termination);
  • Arista’s switches use merchant silicon; the Nexus 3548 runs on a new generation of Cisco ASICs (read Colin McNamara’s blog post for more details);
  • Both switches have comparable L2 table sizes: 64K MAC addresses and 64K adjacent hosts (ARP/ND table), but Arista’s switch has significantly bigger IP forwarding tables (84K IP routes versus 16K IP routes in the Nexus 3548);
  • 7150S-64 has 64 10GE ports, Nexus 3548 has 48 10GE ports;
  • Surprisingly, the typical power draw of the Nexus 3548 is almost identical to that of Arista’s 7150S-52 (the 52-port model);
  • And finally (my favorite): only one of the two supports IPv6.

Focusing on additional software and hardware features, it’s obvious Cisco has been reading Arista’s high-performance trading playbook: both switches can combine four 10GE ports into a single 40GE port (not a LAG), and both have microburst management, APIs, precision timing with PTP, timestamps in SPAN/mirrored packets, and hardware NAT with ridiculously low latencies.

The latency game

The true difference between the two switches is the packet forwarding latency.

Arista has traditionally been the market leader in this space, and its new switch raised (actually, lowered) that bar significantly to ~380 nanoseconds … but only for a few moments – the Nexus 3548 has 250-nanosecond cut-through latency, which can be further reduced to 190 nanoseconds in warp mode (yes, you do need an additional software license to enable the warp drive). The trick to reduced latency is reduced MAC table size: 8K addresses in warp mode.

The Nexus 3548 also has mind-boggling hub-like warp SPAN performance: mirroring packets from an input port to a set of output ports takes 50 nanoseconds (or ~60 bytes @ 10 Gbps). Obviously, this trick only works with cut-through switching (which can’t be done from 10GE to 40GE ports or vice versa) and idle output ports.
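
A quick back-of-the-envelope check of that number – plain Python arithmetic, nothing vendor-specific:

    # How much data flies past on a 10 Gbps link while the switch
    # spends 50 ns mirroring a packet?
    LINK_SPEED_BPS = 10e9          # 10 Gbps
    SPAN_LATENCY_S = 50e-9         # 50 ns warp SPAN latency

    bytes_on_wire = LINK_SPEED_BPS * SPAN_LATENCY_S / 8
    print(bytes_on_wire)           # 62.5 -- hence "~60 bytes @ 10 Gbps"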

Do we care?

A bit of perspective: the speed of light is finite – it takes a signal roughly 3 nanoseconds to travel one meter in a vacuum. Signal propagation in fiber or copper is a bit slower, at approximately 5 nanoseconds per meter. A pair of 10-meter cables thus introduces ~100 ns of latency (50 ns on each leg) … and then there’s the latency introduced by the SFP+ transceivers.
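
The same arithmetic, spelled out (a trivial Python sketch using the ~5 ns/m propagation figure from above):

    PROP_DELAY_NS_PER_M = 5        # signal propagation in fiber/copper
    CABLE_LENGTH_M = 10

    per_leg_ns = PROP_DELAY_NS_PER_M * CABLE_LENGTH_M   # 50 ns per cable
    total_ns = 2 * per_leg_ns                           # 100 ns for both legs
    print(per_leg_ns, total_ns)                         # 50 100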

I am positive there are people out there who think they need this kind of performance and are willing to pay for it. I am equally positive that almost all of us (particularly those who still have to work with data residing on disk drives) stopped caring a long while ago, once forwarding latencies dropped to a few microseconds.

23 comments:

  1. Hi Ivan,

    Both switches also support NAT...
    My feeling is that this is more a customized ASIC based on a merchant ASIC than a custom homemade ASIC.

    Fabian
  2. These latency games are all about automated trading – it’s probably the most important application for these boxes.
  3. Hi Ivan,

    Last time I checked, the latency crown was held by Gnodal; they advertise sub-150 ns cut-through switching latency.
    http://www.gnodal.com/Products/GS-Series/

    Not really sure whether 50-100 ns makes that much difference, but HFT people are really shaving off microseconds here and there. The actual production network (server-to-server) is usually InfiniBand with RDMA and other stuff.
    Some reading on low-latency infra, if you're interested (it's PPT, unfortunately) - http://www.informatix-sol.com/docs/LowLatency101.pdf
    A little dated, perhaps, but mostly still relevant.
  4. Hi Ivan.

    You say in your blog post: "The trick to reduced latency is reduced MAC table size: 8K addresses in warp mode."

    What does reducing the MAC table really do on the silicon? I mean, the buffer for indexing the MAC addresses will be smaller, but where does the extra space go?

    Short summary: can you do a deep dive into this and enlighten me, please? :)

    Many thanks !

    Nic
    Replies
    1. Can't deep-dive, that's all the information I got.

      MAC tables are usually organized as either TCAMs or hash tables. In both cases, accessing a larger table might take an extra (hardware) step, resulting in higher latency. Just guessing.
    2. Actually, imagine a packet showing up on all interfaces at the same time. Since you can only look up one packet at a time in a MAC table, what 'warp' mode does is split the MAC table into 8 tables with all the entries replicated, so you can look up 8 packets in parallel. For a 48-port switch, each replicated table then serves 6 input ports, so in the worst case a given packet has to wait for 5 lookups to get its turn (see the sketch after these replies).
    3. Also, hash tables have collisions (the table effectively consists of buckets holding chained lists of entries); the lookup algorithm then sequentially walks the list bound to a specific bucket.
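
      To make the (admittedly speculative) replies above concrete, here is a toy Python model of a replicated, chained-bucket MAC table. It is purely illustrative – Cisco has not published the ASIC internals, and all names and sizes below are made up:

        NUM_REPLICAS = 8                # replicated copies of the MAC table
        NUM_PORTS = 48
        PORTS_PER_REPLICA = NUM_PORTS // NUM_REPLICAS   # 6 ports share a copy
        NUM_BUCKETS = 1024              # arbitrary toy size

        def bucket_of(mac: int) -> int:
            return mac % NUM_BUCKETS    # stand-in for the hardware hash

        # each replica is a list of buckets; each bucket is a chained list
        replicas = [[[] for _ in range(NUM_BUCKETS)] for _ in range(NUM_REPLICAS)]

        def learn(mac: int, port: int) -> None:
            for table in replicas:      # a learned MAC is written into every copy
                table[bucket_of(mac)].append((mac, port))

        def lookup(ingress_port: int, mac: int):
            table = replicas[ingress_port // PORTS_PER_REPLICA]  # "own" copy
            for entry_mac, port in table[bucket_of(mac)]:        # walk the chain
                if entry_mac == mac:
                    return port
            return None                 # unknown unicast -> flood

        learn(0x0000_5E00_5301, port=7)
        print(lookup(ingress_port=13, mac=0x0000_5E00_5301))     # -> 7
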
  5. Which one of the two supports IPv6?
    Replies
    1. Is it too hard to check the vendor datasheets? I provided links to both datasheets in the article.
    2. Neither of the datasheets mentions IPv6...
    3. http://www.aristanetworks.com/en/products/7150-series/7150-datasheet – see the key features tab.
    4. IPv6 should be no problem for the 71xx.
  6. The Arista 7150S switch family supports IPv6:

    http://www.bradreese.com/blog/9-20-2012.htm

    Sincerely,

    Brad Reese
    Replies
    1. Supported in a future software release
    2. Yes, you're correct:

      Arista 7150S Data Sheet (page 3)

      Layer 3 Features

      21K IPv6 Routes*
      2K IPv6 Multicast Routes*

      (Page 4)

      * Supported in a future software release

      http://www.aristanetworks.com/media/system/pdf/Datasheets/7150S_Datasheet.pdf

      Sincerely,

      Brad Reese
  7. Latency is key to HFT, but the winner for HFT will remain the combination of low latency AND consistently low jitter.

    Also, Arista’s older 7124SX and the newer 7150S use the (Intel) Fulcrum ASIC – the key point is consistent latency and jitter across all packet sizes, whether doing layer 2 or layer 3 forwarding or using any other feature. I would like to see the Cisco switch tested in that manner.
  8. BTW, 7150S-64 has "48x1/10GbE and 4 x 10/40GbE", not 64x 10GE. You might want to modify that.

    I was wondering how they managed to cram 64 SFP cages into a 1U chassis :-)
    Replies
    1. It has 48 10GE ports and four 40GE ports, each of which you can split into four 10GE ports with a breakout cable. Their data sheet claims a total of 64 ports.
    2. Thanks for pointing that out. They could certainly be clearer about that on the product's page...
  9. Nexus 3548: warp mode further reduces latency to 190 ns for small-to-midsize layer 2 and layer 3 deployments. That always triggers a two-sided argument in my head: is L3 slower than L2 on any account? Even with hardware switching, you need to read more of the header to make the forwarding decision, and you have to rewrite headers and recalculate checksums. On the other hand, the difference might amount to a couple of meters of fiber? :)
    Replies
    1. L3 could be slower in theory; I'm not sure it's any slower in practice (and you'd have to be using cut-through switching anyway to notice the difference).
    2. Gee, I've been saying that since the '90s, when some Cisco (MPLS) course claimed that routing is slower ... while at the same time showing how CEF does L3 switching. VMware did not like/support vMotion across L3 boundaries because of latency, too. Myths? :)
    3. In those days MPLS was marginally faster, as it took a single linear table lookup versus multiple N-ary tree/trie lookups for IP (see the sketch below for the principle).

      As for vMotion over L3, see http://blog.ipspace.net/2014/09/vmotion-enhancements-in-vsphere.html
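
      To illustrate the lookup-cost argument, here is a minimal sketch – a naive linear longest-prefix match in Python. Real ASICs and CEF use tries or TCAMs, so only the principle carries over, and all addresses and port names are made up:

        mac_table = {"00:50:56:01:02:03": "port1"}  # L2/MPLS: one exact-match probe

        def ip_to_int(ip: str) -> int:
            a, b, c, d = (int(x) for x in ip.split("."))
            return (a << 24) | (b << 16) | (c << 8) | d

        routes = {                                  # (prefix, length) -> next hop
            (ip_to_int("10.0.0.0"), 8): "port2",
            (ip_to_int("10.1.0.0"), 16): "port3",
        }

        def lpm(dst: str):
            d = ip_to_int(dst)
            for plen in range(32, -1, -1):          # longest prefix first
                mask = ~((1 << (32 - plen)) - 1) & 0xFFFFFFFF
                hop = routes.get((d & mask, plen))
                if hop:
                    return hop                      # may take many probes
            return None

        print(mac_table["00:50:56:01:02:03"])       # one probe -> port1
        print(lpm("10.1.2.3"))                      # 17 probes -> port3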