The Impact of Jumbo Maximum Frame Size on Data Center Switches

Sander Steffann sent me an intriguing question a long while ago:

I was wondering if there are any downsides to setting “system mtu jumbo 9198” by default on every switch? I mean, if all connected devices have MTU 1500 they won’t notice that the switch could support longer frames, right?

That’s absolutely correct, and unless the end hosts get into UDP fights, things will always work out (aka TCP MSS saves the day)… but there must be a reason switching vendors don’t use maximum frame sizes larger than 1514 bytes by default (Cumulus Linux seems to be an exception, and according to Sébastien Keller, Arista’s default maximum frame size is between 9214 and 10178 bytes depending on the platform).
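
To illustrate the TCP MSS part: here’s a minimal sketch (assuming IPv4 and TCP without options; the helper functions are mine, purely for illustration) of how the MSS advertised during the handshake keeps both hosts from ever sending segments larger than the smaller of their MTUs, regardless of the switch maximum frame size:

```python
# Minimal sketch (assumed helper names, IPv4/TCP without options): the MSS each
# host advertises is derived from its own interface MTU, and both ends send
# segments no larger than the smaller advertised MSS -- so the switch maximum
# frame size never enters the picture.

IPV4_HEADER = 20   # bytes
TCP_HEADER = 20    # bytes

def advertised_mss(interface_mtu: int) -> int:
    """MSS a host puts into its SYN, derived from its own interface MTU."""
    return interface_mtu - IPV4_HEADER - TCP_HEADER

def segment_size(local_mtu: int, remote_mtu: int) -> int:
    """Largest TCP payload either side will send after the MSS exchange."""
    return min(advertised_mss(local_mtu), advertised_mss(remote_mtu))

# Two 1500-byte-MTU hosts behind a switch configured for 9198-byte frames:
print(segment_size(1500, 1500))   # 1460 -- nobody ever emits a jumbo frame
# A jumbo-enabled server talking to a legacy 1500-byte-MTU client:
print(segment_size(9000, 1500))   # 1460 -- TCP MSS saves the day
```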

See the To MTU or Not to MTU section at the end of this blog post if you’re wondering why I’m avoiding the MTU acronym.

The only reason I could see for the persistent use of the ancient 1514-byte maximum frame size might be the hardware buffering architecture. I’ve seen two ways of organizing (hardware) input buffers:

  • A single input buffer must be able to hold the largest possible input packet. Increasing the interface or system maximum frame size results in a smaller number of packet buffers1 and a reduced ability to absorb input bursts (example: wasting a 9K buffer to hold a 64-byte VoIP packet); see the back-of-the-envelope sketch after this list.
  • An input packet can spill over into several buffers (sometimes called particles) – a mechanism called scatter/gather or vectored I/O. The maximum frame size doesn’t matter, as large packets fill as many particles as needed.
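
A quick back-of-the-envelope sketch of the difference (the buffer memory size and particle size are assumptions, not vendor data):

```python
# Back-of-the-envelope sketch (assumed numbers, not vendor data): how many
# input buffers the same memory yields with fixed-size buffers sized for the
# maximum frame, versus how a particle-based design consumes buffer space.

BUFFER_MEMORY = 16 * 1024 * 1024   # assume 16 MB of input buffer memory
PARTICLE_SIZE = 512                # assume 512-byte particles

def fixed_size_buffer_count(max_frame_size: int) -> int:
    """Every buffer must be able to hold the largest possible frame."""
    return BUFFER_MEMORY // max_frame_size

def particles_needed(frame_size: int) -> int:
    """A frame spills over into as many particles as it needs (scatter/gather)."""
    return -(-frame_size // PARTICLE_SIZE)   # ceiling division

print(fixed_size_buffer_count(1518))   # ~11,000 buffers with the classic frame size
print(fixed_size_buffer_count(9216))   # ~1,800 buffers after enabling jumbo frames
print(particles_needed(64))            # 1 -- a tiny VoIP packet wastes almost nothing
print(particles_needed(9216))          # 18 -- a jumbo frame simply uses more particles
```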

The only particle-based platforms I encountered were Cisco 7200 and VIP line cards in Cisco 7500 (because they used repackaged Cisco 7200s inside). I went through the publicly available Cisco Cloud ASIC presentations and while they’re full of “intelligent buffering” bragging, they don’t mention anything even vaguely resembling scatter/gather architecture.

The two O’Reilly books describing Juniper hardware that I happen to have on my bookshelf are uselessly vague2. Expert Packet Walkthrough on the MX Series 3D has more details; the way I read it, it claims the MX uses parcels (320 bytes) to store the input packets and fixed-size cells (64 bytes) to move the packets between forwarding engines.

Finally, Broadcom hates anyone telling us anything useful, but as both Arista and Cumulus use large default values, it might be a solved problem. In any case, it would be nice to know the problem has been solved in data center switching ASICs (high-end ASICs like Jericho or router platforms are a totally different story), and so far I haven’t seen anything that would look convincing.

To recap: while systems with fixed-size buffers lose the ability to absorb input bursts when you increase the maximum frame size, particle-based systems don’t care. Unfortunately we often have absolutely no idea how the expensive hardware we just bought works… or I’ve overlooked something, in which case I’d appreciate useful pointers in the comments.

To MTU or Not to MTU

Adam Chappell tweeted me a nice warning:

Sometimes there’s benefit in reserving the term MTU for the layer 3 protocols that may have ways of doing their own segmentation/reassembly, and calling that jumbo setting on switches exactly what it is, “maximum frame size”.

I almost became a wiseass replying that it’s called Maximum Transmission Unit for a reason, but decided to check the ultimate source of all truth, which claimed that…

In computer networking, the maximum transmission unit (MTU) is the size of the largest protocol data unit (PDU) that can be communicated in a single network layer transaction.

Ignoring the single network layer transaction elephant cluelessly wandering around the room, the article further explains:

The MTU relates to, but is not identical to the maximum frame size that can be transported on the data link layer, e.g. Ethernet frame.

Here we go. As switches receive Ethernet frames into their buffers, we’re dealing with maximum frame size. Thank you, Adam!
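
To make the difference concrete, here’s a tiny sketch (standard Ethernet framing assumed; vendors differ on whether the FCS counts toward their limit) mapping a layer-3 MTU to the resulting maximum frame size:

```python
# Tiny sketch: layer-3 MTU versus layer-2 maximum frame size on Ethernet
# (standard framing assumed; vendors differ on whether the FCS is counted).

ETHERNET_HEADER = 14   # destination MAC + source MAC + EtherType
DOT1Q_TAG = 4          # optional 802.1Q VLAN tag
FCS = 4                # frame check sequence

def max_frame_size(ip_mtu: int, tagged: bool = False, count_fcs: bool = False) -> int:
    """Largest Ethernet frame produced by packets of a given layer-3 MTU."""
    return ip_mtu + ETHERNET_HEADER + (DOT1Q_TAG if tagged else 0) + (FCS if count_fcs else 0)

print(max_frame_size(1500))                               # 1514 -- the ancient default
print(max_frame_size(1500, tagged=True, count_fcs=True))  # 1522 -- same MTU on the wire
print(max_frame_size(9000, tagged=True))                  # 9018 -- fits under a 9198-byte limit
```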

Revision History

2022-02-19
Added information about Arista’s defaults and related conclusions
2022-02-17
Changed MTU to maximum frame size

  1. … given the same amount of buffer memory ↩︎

  2. No wonder, one of them is describing a product using Broadcom ASIC. ↩︎

3 comments:

  1. "fixed-size cells (64 bytes) to move the packets between forwarding engineers."

    so, if we allow for an evil oversimplification, ATM actually never died but retreated into the brains?

    Replies
    1. "between forwarding engineers" << damn autocorrect 🤣

      And yes, while ATM died, we still have cell-based transport where latency matters. Did you notice that the cells have a sane size this time?

  2. I usually set all my switches to their maximum, but ran into an issue with a Cisco 4500-X, which has a maximum MTU of only 9170. OSPF routing with a 3850 wouldn't work with the MTU set to 9198, so I had to move it down to 9170 to match.

    Replies
    1. When I use an IP MTU higher than 1500B, I usually set it to 9000B, since this is widely supported on networking devices from various vendors, and it's the maximum supported by VMware ESXi. This avoids silently losing frames sent from a system with a larger MTU to one with a lower (Ethernet) MTU, which can easily happen when using the maximum value of a given device and then introducing a different model into the network.

  3. Hi Ivan, according to the MX Walkthrough doco, the parcel is generated by the WI block of the MQ/XM chip to send the first 256 bytes of a packet to the Lookup Unit if the packet size is > 320 bytes, or the entire packet to the LU otherwise (pg 22). Hence the name parcel. The packet itself is stored in the MQ memories. This kind of arrangement seems to be pretty common for VOQ-based (3rd-generation) devices. What seems rather inefficient is transporting the parcel back to the MQ after the lookup is done. From an engineering POV, since we're talking ns here, duplication and interchip communication should be avoided whenever possible. And since the MX is a VOQ-based router, the MQ chip is also responsible for cellification of packets to transmit across the crossbar to the egress, in ATM-like cells. The cell headers get stored in VOQs (pg 35). This memory, being on-chip, is accessed much faster than the bulk memory used to store the packets themselves.
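
    A toy restatement of that rule (sizes as quoted above) just to make the cut-off explicit:

    ```python
    # Toy restatement of the parcel rule described above (pg 22 of the MX doc):
    # packets up to 320 bytes go to the lookup unit in full, larger packets send
    # only their first 256 bytes (the parcel) while the whole packet sits in MQ memory.

    WHOLE_PACKET_LIMIT = 320   # bytes -- send the entire packet below this size
    PARCEL_SIZE = 256          # bytes -- otherwise send just this much to the LU

    def to_lookup_unit(packet: bytes) -> bytes:
        """What the WI block hands to the lookup unit for a given packet."""
        return packet if len(packet) <= WHOLE_PACKET_LIMIT else packet[:PARCEL_SIZE]

    print(len(to_lookup_unit(bytes(64))))    # 64  -- small packet goes in full
    print(len(to_lookup_unit(bytes(9000))))  # 256 -- only the parcel; the rest stays in MQ
    ```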

    With this kind of buffering, GBs worth of packets can be stored, so whatever the frame size, it's not a problem. Remember, only the headers are in the VOQ, and that's fixed. The problem is that the packet memory, being off-chip and having a much slower RTT than TCAM, presents a latency issue if lots of packets come in at the same time.

    The MQ's off-chip buffer used to store the packets can follow any buffering paradigm: contiguous or particle-based/scatter-gather. The NCS platform has documented its memory architecture, and it's pretty similar. They don't reveal it all because they use Broadcom chipsets, but I doubt Broadcom has a much better idea.

    The 7200 and 7500 series were 2nd-generation devices; they still used a CPU, not an ASIC, to forward packets, so they essentially operate just like a computer: CPU for lookup, RAM to store packets, no TCAM. That's why particle buffering is the only form they have, vs. the much more sophisticated (and power-hungry) 3rd-gen VOQ devices with different kinds of memory. Again, 7200 and 7500 routers don't have any problem with different frame sizes.

    In a word, the hardware can deal with any frame size, so the impact of jumbo frames is mostly potential MTU mismatches and packet loss.

    Replies
    1. As always, thanks for an extensive comment...

      > The MQ's off-chip buffer used to store the packets can follow any buffering paradigm: contiguous or particle-based/scatter-gather.

      Would you happen to have a pointer to something explaining that (or is it in that book but I missed it?)

      > NCS platform has documented their memory architecture, and it's pretty similar.

      All I found was an excellent document describing how they use on-chip/off-chip buffers

      https://xrdocs.io/ncs5500/blogs/2020-09-04-ncs-5500-buffering-architecture/

      It contained nothing about the buffer structure, and described Jericho, which is a totally different beast from Trident or Tomahawk.

      It's also interesting that while there were several replies along the lines of "everyone is using jumbo frames" (which is good to hear, so things are working in production), nobody sent me anything along the lines of "hey, that's a solved problem, stop worrying", so I still remain skeptical about the behind-the-scenes implementation details.

    2. Hi Ivan,

      No, there's no mention of how the packets get stored in off-chip memory in the MX chipset, so I speculate. After all, there are only two forms of memory storage anyway: contiguous or scatter/gather :p.

      The xrdoc is the same one I referred to. That's what I meant by the NCS ASIC's physical memory structure (sorry if I sounded ambiguous), not the off-chip packet storage buffering scheme. I meant that the NCS's Jericho ASIC physical memory structure and the MX's are rather similar.

      On pg 42 of the MX doc, it says "the default maximum data buffer value is 100ms." So it looks like with 10Gbps interfaces, the MX chipset (Trio, I think) stores up to GBs worth of packets.

      Also, while it's true the hardware buffer can deal with any frame size because of its large size, jumbo frames have other impacts when one thinks about it more closely. For example, jumbo frames can cause unpredictable delay for smaller packets when there's a mix of jumbo and tiny frames in the same queue, potentially delaying certain kinds of traffic more than they can tolerate. So when jumbo frames are enabled, it's best to use CoS to separate elephant and mice traffic, depending on the levels of criticality.
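
      To put rough numbers on that delay (a quick sketch assuming a 10 Gbps port, 9216-byte jumbo frames, and no frame preemption):

      ```python
      # Quick sketch: serialization delay a small packet picks up when it lands in
      # the same queue behind jumbo frames (10 Gbps port and 9216-byte jumbos assumed).

      PORT_SPEED_BPS = 10e9      # 10 Gbps
      JUMBO_FRAME_BYTES = 9216

      def delay_us(frame_bytes: int, jumbos_ahead: int = 0) -> float:
          """Microseconds to serialize a frame plus the jumbo frames queued ahead of it."""
          total_bytes = frame_bytes + jumbos_ahead * JUMBO_FRAME_BYTES
          return total_bytes * 8 / PORT_SPEED_BPS * 1e6

      print(round(delay_us(64), 2))      # ~0.05 us on an empty queue
      print(round(delay_us(64, 1), 2))   # ~7.42 us behind a single jumbo frame
      print(round(delay_us(64, 10), 2))  # ~73.78 us behind ten of them
      ```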

      Another corner case: with a buffered crossbar, cells arriving at the egress out of order is inherent to the architecture, as observed in practice by Andrea:

      https://blog.ipspace.net/2020/05/ip-packet-reordering.html#64

      When this happens with many jumbo frames in the mix, the total cell waiting time at the egress interface can be exceeded due to cells arriving out of order, causing cell drops and consequently frame drops, necessitating retransmission. So that's a potential side effect. In a bufferless crossbar, this is not an issue.

      Minh
