Beware of Vendors Bringing White Papers
A few weeks ago I wrote about tradeoffs vendors have to make when designing data center switching ASICs, followed by another blog post discussing how to select the ASICs for various roles in data center fabrics.
You REALLY SHOULD read the two blog posts before moving on; here’s the buffer-related TL&DR for those of you ignoring my advice ;)
- You don’t need large buffers in non-oversubscribed spine switches;
- You’d better have some more buffer space in edge switches, in particular when there’s plenty of traffic going from high-speed ports toward low-speed ports.
- You might need deep buffers when there’s a large mismatch between ingress and egress speeds or link latency.
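The speed-mismatch bullet can be turned into back-of-the-envelope arithmetic. The sketch below is a simplified fluid model, not a description of how any ASIC allocates buffers: it assumes a single burst arriving at line rate on one ingress port and draining through one egress port, and it ignores packetization, shared-buffer policies, and flow control. The function name and the example numbers are made up for illustration.

```python
def peak_queue_bytes(burst_bytes: int, ingress_gbps: float, egress_gbps: float) -> float:
    """Peak egress queue depth for a single burst (simplified fluid model).

    While the burst arrives at the ingress rate, the egress port drains it
    at its own rate. The queue grows only by the difference between the two.
    """
    if ingress_gbps <= egress_gbps:
        # Egress is at least as fast as ingress: nothing accumulates.
        return 0.0
    return burst_bytes * (ingress_gbps - egress_gbps) / ingress_gbps


# Hypothetical example: a 1 MB burst arriving on a 100GE port.
same_speed = peak_queue_bytes(1_000_000, 100, 100)  # 100GE -> 100GE
mismatch = peak_queue_bytes(1_000_000, 100, 10)     # 100GE -> 10GE
print(same_speed, mismatch)
```

In this model nothing queues up when the speeds match, while a 100GE-to-10GE mismatch leaves 90% of the burst sitting in the buffer (or dropped).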
I haven’t received a single comment saying “you’re totally wrong and here’s a good technical article proving that” so I’m assuming I’m not horrendously off the mark.
Would you expect vendor product marketers to agree with the above? Of course not. Years ago, Arista was enamored of deep-buffer switches… until Cisco launched a Jericho-based data center switch. At that point buffers stopped mattering… unless you were reading Cisco white papers.
There were tons of rebuttal blog posts written at that time, so one would hope that the vendors got the message. That’s too much to hope for: one of my readers kindly pointed me to a Juniper white paper claiming just the opposite of the above TL&DR:
An article from Juniper[1] I found on the web was saying quite the opposite [from what you were saying]: Low On-Chip Memory ASIC for low buffer leafs, Large External Memory ASIC for high speed/high buffer leaf and spine.
No surprise there. Juniper is selling numerous switches based on Broadcom merchant silicon (QFX 5000 series), and deep buffer (100 milliseconds of buffers per port) QFX 10000 switches using in-house silicon. What do you think they want you to buy?
The white paper my reader found compared switches using Broadcom Tomahawk ASIC with switches using Juniper Q5 ASIC, and wrongly concluded that you should use Tomahawk ASIC at the edge and Q5 ASIC at the core.
Tomahawk ASIC is a pretty bad choice for a data center fabric edge – it’s missing a lot of functionality available in Broadcom Trident chipset (for example, VXLAN Routing In and Out of Tunnels), and it has less buffer space than a Trident family ASIC with comparable throughput.
What about deep buffer switches at the spine layer? Do you really think you need tens of milliseconds of buffer space per port on a spine switch? Is that what you want the fabric latency to be?
Will it hurt to have deep buffers on spine switches? Probably not, particularly if you don’t care about latency, but you would be paying through the nose for functionality you might not need. But then, if you have infinite budget, go for it.
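To put "tens of milliseconds of buffer space per port" into perspective, here is a rough conversion between buffer depth expressed as time and as bytes. It is a sketch under simple assumptions: the buffer drains at full line rate, and the port speeds in the example are illustrative choices, not figures from the white paper. The function names are hypothetical.

```python
def buffer_bytes(port_gbps: int, buffer_ms: int) -> int:
    """Bytes needed to hold buffer_ms worth of traffic at line rate."""
    bits = port_gbps * 1_000_000_000 * buffer_ms // 1000
    return bits // 8


def drain_time_ms(queued_bytes: int, port_gbps: int) -> float:
    """Time to drain a queue at line rate, which is also the worst-case
    queueing delay a packet at the tail of that queue would experience."""
    return queued_bytes * 8 * 1000 / (port_gbps * 1_000_000_000)


# 100 ms of buffering on ports of various (assumed) speeds
for gbps in (10, 40, 100):
    print(f"{gbps} Gbps x 100 ms = {buffer_bytes(gbps, 100) / 1e6:.0f} MB")
```

Read in the other direction, a completely full 100 ms buffer adds 100 ms of delay on that hop, which is the latency question raised above.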
To wrap up: when a white paper comparing Tomahawk and Q5 ASICs is saying…
Switching platforms based on low on-chip memory ASICs are best suited for cost-effective, high-speed, high-density server access deployments.
… they really mean:
We want you to buy QFX10K for your spine switches, but we can’t justify it any other way, so we’ll claim shallow-buffer switches are only good for the network edge.
Last question: why wouldn’t Juniper recommend a Trident-based edge switch? Because the white paper was written before they launched QFX5130 and nobody bothered to fix it?
Long Story Short
- Never forget Rule#2 of good network design: beware of vendors bringing white papers[2].
- When you decide to design a network based on vendor white papers, you’ll get the network you deserve.
Finally, a note for the vendors: I understand you have to present an alternate view of reality that’s focused on what you want to sell, but could you at least fix it when you launch new products? That document was written in 2015, was removed from Juniper’s web site only in June 2022, and happily confused unaware networking engineers in the meantime.
- The white paper was removed from Juniper’s web site
[1] Now gone, but I have a downloaded copy somewhere ;)
[2] Based on “beware of Greeks bearing gifts”, in particular when they look like a wooden horse.
Any chance the DC landscape of 2014 could have influenced their recommendations?
I think they still had two competing L2 fabric solutions (QFabric and VCF), and VXLAN was either CRB or mcast flood and learn.
Unsure if these would justify their position at the time, that and the fact that they didn't have a Jericho platform until 2021?
It would be acceptable if they had written "use QFX10K as spine switches to get VXLAN routing, because Broadcom didn't implement it in Tomahawk ASIC". Still bending the truth by omission (because VXLAN routing did work in newer Trident ASICs), but it wouldn't be plain wrong like their buffering argument.
Ivan - Thanks for bringing that ancient white paper to our attention. The strange thing is the link goes to the pre-staging area for stuff before we take it live to our site. The WP definitely was not on our public website. But surprised that you were able to get through the link. Anyway, we just took it down. Sorry for any confusion. BTW, our DC portfolio kinda kicks ass now with Apstra running DF fabric automation & mgmt. We're getting a ton of great customer feedback.
Wow, that was amazingly fast. Thank you! Will update the blog post tomorrow (it's getting late over here)
And yes, I totally agree - Apstra kicks ass ;))
Does mixing and matching upstream / downstream ports with different data lane / speed configurations contribute to the need for buffering? For example: 100-Gig uses 4x 25 Gig lanes while a 40-Gig port uses 10-Gig lanes, etc... Do you think this matters at all or is it just a factor of total port-speed regardless of number or speed of lanes?
I'm inclined to think where we're buffering doesn't really know or care what lane configuration is used as at that point it's abstracted and only knows raw total speed but I'm not totally sure honestly. Any thoughts?
It's my understanding that 40GE or 100GE uses lanes for bit striping, sending a single frame across all four lanes -- that's why it's better to use the four 25GE lanes as a 100GE link instead of a LAG bundle of four 25GE links.
A 40GE/100GE interface thus appears as a single higher-speed interface for the purposes of buffering discussions.
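A quick way to see the difference between a striped 100GE link and a LAG of four 25GE links is to compare per-frame serialization delay. This is simplified arithmetic that ignores preamble, inter-frame gap, and encoding overhead; the function name is hypothetical. It also assumes the usual LAG behavior of hashing each flow onto a single member link.

```python
def serialization_ns(frame_bytes: int, link_gbps: int) -> float:
    """Time to put one frame on the wire, in nanoseconds.

    bits / (Gbit/s) conveniently comes out in nanoseconds.
    """
    return frame_bytes * 8 / link_gbps


frame = 1500  # bytes, a typical full-size Ethernet payload

# 100GE: the frame is striped across all four lanes, so it sees 100 Gbps.
striped = serialization_ns(frame, 100)

# LAG of 4 x 25GE: a given frame (and its flow) uses one 25 Gbps member.
lag_member = serialization_ns(frame, 25)

print(striped, lag_member)
```

Under these assumptions a single flow on the LAG is also limited to the speed of one member link, while the same flow on the 100GE link can use the full interface speed.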
Makes sense! Thanks for responding with your thoughts!
You might want to add another consideration. If you have a lot of traffic aggregation, you can still have congestion even when the ingress and egress ports run at roughly the same speed or when the egress port has more capacity. Then you have two strategies: buffer and suffer jitter and delay, or drop and hope that the upper layers will detect the loss and reduce their sending rate by shaping. But you are right, deep buffers mean high jitter that might be converted into high delay at some point by dejitter buffers.
If your traffic pattern is a kind of distribution into multiple directions, you do not need to worry about congestion and deep buffers. That holds even when the distributing interfaces are slower, as long as they still fit the disaggregation pattern, for example Fast Ethernet access ports and Gigabit uplinks. There are many situations like that. As usual, an optimum network design requires knowledge of your traffic patterns. Otherwise, you can just overdesign as much as your budget allows.
It is also important to take the policer/shaper chaining rules into account. If you want to reduce congestion, then there must be a shaper at the sender that complies with the policer at the receiver. The physical capacity of the link is a natural policer, even if you have not configured a policer explicitly. Inside the router/switch you have a similar pairing; you just have to account for the aggregation factor, too.
Such a chain of shapers is very difficult to manage, so an edge-to-edge or end-to-end backpressure mechanism would be needed to come to the rescue. Unfortunately, some traffic does not have backpressure, such as certain UDP streams. Then you are out of luck and have to either configure all the shapers based on the traffic patterns or compromise by tolerating packet loss and jitter.
I have seen a lot of problems in networks when shapers were forgotten... You cannot work miracles, so you have to take them into account...
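The shaper/policer pairing described above can be illustrated with a toy simulation. This is a minimal sketch of a generic single-rate token-bucket policer, not any vendor's implementation; the function name, the 10 Mbps rate, and the 3000-byte burst size are assumptions picked for the example.

```python
def policer_drops(arrivals, rate_bps, burst_bytes):
    """Count packets dropped by a single-rate token-bucket policer.

    arrivals: list of (time_seconds, size_bytes) tuples sorted by time.
    """
    tokens = float(burst_bytes)  # bucket starts full
    last = 0.0
    dropped = 0
    for t, size in arrivals:
        # Refill tokens for the elapsed time, capped at the bucket size.
        tokens = min(burst_bytes, tokens + (t - last) * rate_bps / 8)
        last = t
        if size <= tokens:
            tokens -= size   # conforming packet
        else:
            dropped += 1     # out-of-profile packet is dropped
    return dropped


RATE = 10_000_000  # 10 Mbps policer (assumed)
BURST = 3000       # bucket holds two 1500-byte packets (assumed)

# Unshaped sender: ten 1500-byte packets back to back at t=0.
bursty = [(0.0, 1500)] * 10

# Shaped sender: same packets paced at 10 Mbps (one every 1.2 ms).
shaped = [(i * 0.0012, 1500) for i in range(10)]

print(policer_drops(bursty, RATE, BURST), policer_drops(shaped, RATE, BURST))
```

In this toy setup both senders transmit the same ten packets, but only the unshaped one loses traffic at the policer: the first two packets fit into the bucket and the remaining eight are dropped.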
"Never forget Rule#2 of good network design"
Where do we find the rest of the rules?
Someone should collect them from the hundreds of blog posts I wrote ;)
Anyway, #1 should be "figure out what (business) problem you're trying to solve". Then there's the Russ White rule "if you haven't found the tradeoffs, you haven't looked hard enough"