Worth Reading: Understanding Table Sizes on the Arista 7050QX-32
Arista published a blog post describing the details of forwarding table sizes on 7050QX-series switches. The description includes the base mode (fixed tables), unified forwarding tables and even the IPv6 LPM details, and dives deep into what happens when the switch runs out of forwarding table entries.
Too bad they’re describing an ancient Trident-2 ASIC (I last mentioned switches using it in 2017 Data Center Fabrics update). Did NDA expire on that one?
Surprisingly detailed documentation on their Trident 4 (including user guide and programming guide):
Going through the blog post, I’ve several questions that most likely will be applicable to other chipsets as well, as they can overlap a lot in terms of architecture.
First off, how come in just about every vendor’s docos about how the FIB works, there’s absolutely no info on how fast the TCAM is, in terms of clock frequency? Buyers need to know this, to have a rough estimate of the maximum lookup capability of their platform. For ex, if the TCAM is 170-200MHZ, that means a LC having even 1 100GE port most likely won’t be able forward at line rate for small packets. NDA could be the reason it’s being hidden here, but at the same time, this opens room for competition if certain vendors can come up with higher-performance chipsets.
Second, how much of the CAM/TCAM is on-chip, and how much is off-chip? Considerable performance penalty can apply to off-chip lookup, due to the high RTT resulting from interconnect delay, and that’s physics so it can’t be changed. It’d be good to know the arrangements of the memory banks. Here, for ex, Jericho LC chipset, which is used on NCS, has enough TCAM for 4m entries, but the bulk of it is off-chip. Broadcom. Cisco either made no mention of performance hit or claimed that it didn’t affect lookup performance on NCS website as I recall, but that’s selling:
I also notice Broadcom’s (and vendors’) docs usually just throw umbrella terms like chipset, ASIC, without specifying whether the ASIC is the LC or the fabric chipset. They do different things and so the distinction is very important. In this blog post, Arista was talking about the functionality of their LC Trident ASIC.
Next, in the blog post, it said:
“ the entry resources in the LPM table are large enough to be able to handle 64 bits worth of prefix information, so we’re able to program two 32 bit IPv4 routes per LPM entry. The first IPv4 route in the routing table consumes an LPM entry, and the next IPv4 route is able to be programmed in the second half of the same entry.”
Unless I’m missing something, I don’t thing that’s how the hardware works. For TCAM LPM search, all of the entries are searched in parallel and then the first one among those having the longest prefix, is returned. So an entry either matches or it doesn’t. How can you write the search logic to search half the entry??? My understanding is 8190 LPM entries are partitioned for maximum-length/double-wide IPv6 entries, meaning 2 entries are reported as one in EOS interface. On the hardware level, that would mean the TCAM chips have 16380 physical TCAM entries, with single-wide IPv6 and Ipv4 prefixes each occupying 1 entry probably 72-80 bit long, and double-wide/up-to-128 netmask IPv6 prefixes each occupying 2 entries for a total of 144-160 bits, making the lookup time twice as long for the double-wide IPv6 ones. The rest of the post seems to fit this description.
Their ALPM trick for FIB compression is cool, but it comes with caveat. As now you have to do one more lookup to get to the precise prefix, lookup latency has increased.
All in all, looking at how complicated things are just for the basic lookup phase, one can see the effect of cut-through switching adds minimal improvement to the total port-to-port delay, due to the many steps involved in getting the packets from ingress to egress, and the resulting latency. If Arista manages to improve nominal device delay, the improvement most likely comes from some of the architectural innovation in the forwarding pipeline or in the fabric, not from cut-through paradigm.
Also Ivan, if you haven't read this one, pls have a quick glance:
Interesting Juniper made no mention of TCAM whatsoever in their QFX10k architecture paper. One gets the feeling that they use HMC for FIB, but that somehow doesn’t make much sense to me. First off, HMC is off-chip, so it’s slow. Second, HMC’s access latency is still slower than TCAM, so don’t think they can do away with TCAM. It’s just weird no info can be found anywhere re their FIB capacity.
And Uncle Jeff finally has decided to jump in and make his own routers, displacing Juniper eventually:
I don’t know what to make of this, but I doubt AWS has the know-how to create an efficient router on the hardware/data-plane level. Writing a NOS is nowhere near has hard as making small packets switch at line rate within a constraint power budget. But these days, as feature-size shrinkage has made things good/fast enough for a lot of use-cases, coupled with lazy people, I think some vendors might shrink a fair bit or go out of business because of AWS’ decision, which will probably be followed suit by MS and other big Cloudy providers. Rapid-fire acquisitions and big consolidations in networking could be underway.
@Oleg: Thanks for the pointers. Downloaded them before someone discovers they were published by accident ;)
@Minh: I don't think we'll get much more information on how QFX10K works. Cisco sometimes told us more in various Cisco Live presentations (but even that was mostly marketing), haven't seen something equivalent from Juniper.
As for AWS router:
In the end, I'd say it's mostly the question of (A) can you limit the scope of what you want to something reasonable (AWS never had a problem with that), (B) can you grab some of the engineers who know how to get things done (ditto) and (C) does it make financial sense? AWS seems to be anchored in reality as opposed to another other hyperscaler who wants to get things done their way just because.
@Oleg: Took a closer look at those PDFs. Dang, you made me too excited ;)) They are just expanded marketing collateral.
@Ivan, thx a lot for your feedback :))! Re small packets, I don't deny it's not too much of an issue on a daily basis, but there're several considerations that put it beyond academia. First one is quite obvious. To benchmark how well a router/switch performs vs the next one, a common ground needs to be used, and as traditionally routers are evaluated using their worst-case performance, small packets are useful for bench-marking.
Second, small-packet performance shows the basic capability of a router (let's call them all routers as most of them do layer 3) for doing plain destination-based lookup. This basic capability deteriorates quickly as we start putting sophisticated features on, like security or QoS. Cisco QuantumFlow processor, when it came out, could handle about 8Mpps and degraded to 2Mpps as more features are activated. So it's always good to know at least the baseline performance capability, so we can have a rough expectation of how well it will perform under stress. I believe back in the early 2000s one of Juniper platforms, was known for ingress-to-egress-port out-of-order delivery of packets, under heavy traffic condition. All routers having buffered-crossbar architecture are susceptible to this of course, but the situation is caused when there's massive backlog of cells in the VOQ and fabric, a situation that happens when the packet lookup/processing performance is less than optimal. So it's always good to have a superior architecture that can handle packets smoothly, and that metric is reflected in how well a platform handles small packets.
Third, real-life traffic is not uniform, but tends to be long range dependent/self-similar. Even Broadcom admits as much on page 5 of this presentation:
The nature of self-similarity/LRD can be due to heavy hitters/elephant-mice flow distribution, the nature of file distribution etc. TCP congestion control also contributes to LRD/self-similarity. With this kind of traffic, congestion spreading is reality, and weak-architecture routers will be killed. So high-performance routing platforms are always good to have. Self-similar traffic also makes big buffer mostly useless, but I digress ;).
I'm aware that sometime ago, there was an argument going back and forth about small packet performance needed or not --someone was nit-picking on Arista if I recall -- and Arista's response was something along the line of we produce switches to meet our customer's use cases, and by far none of them demands superior small-packet handling, so it's basically a non-issue. While on the surface, this seems reasonable, when looking deeper, not so much, for all the above reasons. After all, if your router is so good, why wouldn't it outperform competitors on the most basic of all benchmarks, packet forwarding? If yours is indeed supreme, then it doesn't matter what benchmark right, you'd still outperform. If it can perform well with small packets, then I can sleep well buying it, knowing that I can trust its tenacity even in extreme or hostile conditions.
And speaking of hostile conditions, DDOS is another reason small-packet performance matters, as it translates directly to how much a router can take before it goes down. Of course there're other measures to protect against DDOS, but if all vendors provide similar capability in those aspects, then what stands one apart from another would be packet-handling capacity, as that will decide who's the last man standing in a heavy DDOS attack.
Sometime ago, I came across this paper, from IBM , released in 2013, that surveyed over 30,000 servers, located across more than 50 production data centers, over a two year time span. Their findings, among other things, are that 80%+ of packets are 500 bytes or less, and at that, more small packets coming in than going out. You might find it interesting too Ivan:
An excerpt from it about the packet size: "the average MTU is 1806.81 bytes due to the dominance of the traditional Ethernet MTU value equal to 1500 bytes as shown by the median MTU value. For the network load on each server, we see that: (i) the average server network traffic is roughly 1.16 Kpps and 5.7 Mbps; (ii) the average packet size is 300 bytes, and (iii) the weekend day network traffic is only slightly lower than the week day traffic."
Re AWS router, I completely agree with you that AWS doesn't need all the fanciful add-on features that Juniper (and other vendors) provides -- that's probably why they decide to do it themselves :)) . Vendors pack all those features on in an attempt to differentiate, and in order to charge higher for their devices. A lot of those features go unused, not just by Cloud providers, but the average enterprise/SP as well. By doing it themselves, AWS can use all the chip areas that otherwise would be wasted on unneeded stuff, to optimize for packet performance, which is mainly what the Cloud needs. While their routers are most likely not top-notch in quality due to their inexperience, for utility computing, that's all they care and need, as you rightly pointed out.
At the end of the day, I feel networking (and IT in general ) has commoditized so much over the years -- all the more so since the Cloud became mainstream -- that we can expect to see more and more vendors compress or simply go away, and those who want to survive will have to reinvent themselves and differentiate their products instead of relying on the likes of Broadcom to provide chipsets/ASICs for them, essentially turning themselves into Broadcom resellers. Packing a lot of features that almost no one uses/needs is not the answer though :p. There're always markets for good and competitive products that solve the right problems, just like AMD has proven by going back from the brink of death several times after being dealt almost mortal blows by Intel, who's feeling it more and more with each passing year as their loyal customers either go AMD, ARM or do it themselves, like Apple.
Oh, and I too did go thru the pdf shared by Oleg. Most of it is about physical layer stuff, and there's no treatment of the ASIC architecture, like the amount of CAM/TCAM, the traffic manager, the fabric scheduler, the crossbar architecture etc, the important stuff basically. Typical Broadcom :p. It'd be really unfortunate if the networking industry needs to rely on them to innovate on its behalf and provide vendors with the backbone of their products. Kinda reminds me of how the server/PC industry has stiffened for decades after Intel took center stage and eliminated most of the competition, making us stuck with their crappy CISC for a long, long time up until recent years.