Duty Calls: CPU Is Not Designed for Packet Forwarding

Junhui Liu added this comment to my Where Do We Need Smart NICs? blog post:

CPU is not designed for the purpose of packet forwarding. One example is packet order retaining. It is impossible for a multicore CPU to retain the packet order as is received after parallel processing by multiple cores. Another example is scheduling. Yes CPU can do scheduling, but at a very high tax of CPU cycles.

Duty calls.

Nobody can argue with the “CPU is not designed for packet forwarding” argument. A CPU can, however, be good enough in many cases, and it can be the optimal solution for complex packet forwarding like TCP session termination (including retransmissions), defragmentation, or out-of-order packet processing. All these functions can be hard-coded in an ASIC or NPU, but once the packet forwarding functionality requires a complex algorithm, CPUs tend to be cheaper than the alternatives due to economies of scale.

Also, do keep in mind that most low-speed packet forwarding (up to a few gigabits these days) is done in the CPU. Using a reasonable CPU is cheaper than the alternatives.

How about the claim that “it’s impossible to retain packet order in multi-core packet processing”? Doing a bit of research (it took me about 10 minutes, but then I knew where to look) before making broad claims usually helps.

Let’s define the problem first. Retaining strict source-to-destination packet ordering across a generic IP network is usually a Mission Impossible, and if your application requires that, you might be using the wrong transport technology. What we’re usually looking for is in-session packet order: packets of a single TCP or UDP session are not reordered while traversing a network.

Now for a tiny dose of reality. I downloaded the Intel Ethernet Controller I350 Datasheet (because I couldn’t be bothered to go through the 1700 pages of the XL710 data sheet), browsed through it to find Receive Side Scaling (the functionality that assigns incoming packets to multiple queues, which can then be assigned to multiple cores), and found this in section 7.1.2 of the data sheet:

  • Multiple hashing functions are used on incoming packets based on the packet type (TCP, UDP, other IPv4, other IPv6, others). As expected, those hashing functions use the usual 5-tuple for TCP and UDP, and source- and destination IP addresses for other IPv4 and IPv6. Even more, individual hashing functions can be enabled or disabled. For example, you could disable UDP or TCP hash functions if you want to retain strict source-to-destination packet ordering.
  • The low-order 7 bits of the 32-bit hash result are used to select an entry in the RSS queue indirection table.
  • The RSS queue indirection table maps hash results into one of eight queues (the RSS Output Index), which can then be assigned to different CPU cores; the sketch below walks through the whole selection process.
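
Here’s a minimal sketch of the whole selection process in plain C. It is not driver code, and it uses a trivial stand-in for the Toeplitz hash the I350 actually implements; the point is only to show how a 5-tuple ends up pinned to one of eight receive queues:

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified stand-in for the NIC's receive-side-scaling hash. The I350
     * really uses a Toeplitz hash with a configurable secret key; this FNV-
     * style mixer is here only to produce a deterministic 32-bit value. */
    static uint32_t rss_hash(uint32_t src_ip, uint32_t dst_ip,
                             uint16_t src_port, uint16_t dst_port,
                             uint8_t proto)
    {
        uint32_t fields[3] = { src_ip, dst_ip,
                               ((uint32_t)src_port << 16) | dst_port };
        uint32_t h = 2166136261u;

        for (int i = 0; i < 3; i++) {
            h ^= fields[i];
            h *= 16777619u;
        }
        return h ^ proto;
    }

    /* 128-entry indirection table: the low-order 7 bits of the hash select
     * an entry, the entry selects one of 8 receive queues (RSS Output Index) */
    static uint8_t indirection_table[128];

    static int rss_select_queue(uint32_t src_ip, uint32_t dst_ip,
                                uint16_t src_port, uint16_t dst_port,
                                uint8_t proto)
    {
        uint32_t hash = rss_hash(src_ip, dst_ip, src_port, dst_port, proto);
        return indirection_table[hash & 0x7F];      /* low-order 7 bits */
    }

    int main(void)
    {
        /* Spread the 128 indirection-table entries evenly across 8 queues */
        for (int i = 0; i < 128; i++)
            indirection_table[i] = (uint8_t)(i % 8);

        /* Every packet of this TCP session (10.0.0.1:49152 -> 10.0.0.2:443)
         * hashes to the same value and therefore lands in the same queue. */
        printf("queue = %d\n",
               rss_select_queue(0x0A000001, 0x0A000002, 49152, 443, 6));
        return 0;
    }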

Lesson learned: Intel I350 NIC preserves in-session packet order by storing all packets received from a single session in a single receive queue. If the packet forwarding code in the CPU does not cause packet reordering (and why should it do so if retaining in-session packet order is your goal), you will NOT get in-session packet reordering in a multi-core packet forwarding environment.

What about the “very high tax” of scheduling? Even my sense of duty is time-limited (and I have lunch to make), but if you want to know how people with years of experience in fast CPU-based packet forwarding solve this problem, listen to these Software Gone Wild podcasts:

TL&DL: CPU cores dedicated to poll-based packet processing
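
If you want to see what that means in practice, here’s a back-of-the-napkin sketch of the loop each dedicated core runs. The struct pkt, nic_rx_burst and forward_packet names are made up for illustration; real implementations use a poll-mode framework like DPDK, VPP or netmap, and the stubs below exist only so the sketch compiles and runs:

    #include <stdint.h>
    #include <stdio.h>

    #define BURST_SIZE 32

    /* Minimal stand-in for a packet descriptor */
    struct pkt {
        uint32_t flow_id;
        uint32_t seq;
    };

    /* Hypothetical NIC helper. A real poll-mode driver (DPDK, VPP, netmap)
     * pulls the burst from the NIC receive ring; this stub fabricates a few
     * packets so the sketch actually runs. */
    static int nic_rx_burst(int queue_id, struct pkt pkts[], int max_pkts)
    {
        static uint32_t seq = 0;
        int n = (seq < 8) ? 2 : 0;          /* pretend the traffic dries up */

        if (n > max_pkts)
            n = max_pkts;
        for (int i = 0; i < n; i++)
            pkts[i] = (struct pkt){ .flow_id = (uint32_t)queue_id,
                                    .seq = seq++ };
        return n;
    }

    static void forward_packet(const struct pkt *p)
    {
        printf("forwarding flow %u, packet %u\n",
               (unsigned)p->flow_id, (unsigned)p->seq);
    }

    /* One of these loops runs on every core dedicated to packet forwarding.
     * Each core polls its own RSS queue, so packets of a single session are
     * always handled by the same core and stay in order. No interrupts, no
     * kernel scheduler; the core does nothing but forwarding. */
    static void poll_loop(int queue_id, int max_idle_polls)
    {
        struct pkt pkts[BURST_SIZE];
        int idle = 0;

        while (idle < max_idle_polls) {      /* real loops spin forever */
            int n = nic_rx_burst(queue_id, pkts, BURST_SIZE);

            if (n == 0) {                    /* empty poll: burning the core */
                idle++;                      /* is the "tax" you pay to avoid */
                continue;                    /* interrupt/scheduling overhead */
            }
            idle = 0;
            for (int i = 0; i < n; i++)
                forward_packet(&pkts[i]);
        }
    }

    int main(void)
    {
        poll_loop(0, 3);                     /* demo run for a single queue */
        return 0;
    }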

4 comments:

  1. Ivan is right. Even Cisco’s high-end firewalls/NGFWs use only the CPU complex for packet filtering and forwarding. The Cisco FP9300 SM-56 module contains a lot of CPU cores (not sure, but I think it’s 48 cores per socket), and each SM can handle 80 Gbps of stateful traffic, albeit with the help of some special Intel NICs and controllers.

    The problem is that if you want to DIY and build an open-source NFV solution, it’s hard, and you need a good level of expertise in that field, something the regular network/security engineer is lagging behind on. That secret sauce is what big companies like Cisco hide from others to their benefit. Some commercial products use VPP for software-based packet forwarding, but there is no real, ready-to-use open-source NFV solution based on those technologies. Cisco uses VPP in most devices that do packet forwarding on the CPU (like the ISR4K and the new Boost license that doubles router performance).

    Another concern is that the maximum speed for software-based packet forwarding is in the multi-Gbps range, not Tbps. I really, really like Ivan’s idea that “you only need two switches for most DCs”: just load your servers with the maximum number of CPU sockets, RAM and disks, and you can have 1500 VMs in a single rack. But how can I effectively insert those L4-7 NFV appliances when each ToR switch can handle Tbps of traffic while my NFV is struggling with light traffic? Again, Ivan had a good topic about “stateless ACLs” years ago, and I totally agree with him.

    Cisco introduced the Nexus 9364-GX2 with 64x400G ports (2U, 25 Tbps) a month ago. How are current NFV solutions that struggle with just a few Gbps of traffic going to cope with that kind of traffic? If a company like VMware let a technology like VEPA work with their virtual switch, I would put all traffic filtering and forwarding back into the hardware as new hardware improves. The Nexus 9364-GX2 TCAM is much larger than on older N9Ks: it supports 72K ACEs in hardware, which is large enough for micro-segmentation and regular packet filtering.

  2. Check Point is also heavily CPU-based, but you allocate CPU cores to different purposes. Each core gets its own copy of the firewall kernel to process traffic.

  3. First, I agree with all the facts listed in this article, but my intended use cases are actually not covered. With RSS, packet order is retained per flow (or maybe better, per CPU core), but there are elephant flows that just can’t be handled by a single core. This is where a hardware packet order retaining mechanism is needed, which is common in NPUs. ASICs, especially BRCM ASICs, work in a pipeline mode, so packet order is always retained. For scheduling, I don’t think a CPU is capable of multi-queue, multi-discipline scheduling across multiple cores. (Sometimes multiple hierarchies are also needed for scheduling, but I’d like to stick to data center use cases.) Just think about the simple use case of 8 queues with mixed strict priority, WRR scheduling and per-queue bandwidth caps. The point is that some jobs just can’t be done easily in a distributed mode.

  4. @Junhui: Agree on megaflows being an exception that probably requires an NPU or smart NIC for packet forwarding (session termination is a different story, and left as an exercise for the reader)... but then I never claimed that we can solve all networking problems in this world with a properly-sized x86 server.

    As for queuing on a “regular” NIC, check out the Intel XL710 datasheet... and if that’s not good enough, you could always dedicate a core to output queue processing.
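
A footnote to the scheduling discussion in comments #3 and #4: once you dedicate a core to output queue processing, the scheduling logic itself is not the hard part. Here’s a minimal single-core sketch in plain C of eight queues with mixed strict priority, WRR and per-queue token-bucket caps. All names and numbers are made up for illustration, and the sketch conveniently ignores the genuinely hard bits Junhui is pointing at (line-rate operation, multiple cores, and scheduling hierarchies):

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_QUEUES 8

    /* One output queue: strict-priority flag, WRR weight, token-bucket cap */
    struct queue {
        int  depth;        /* packets currently waiting                    */
        bool strict;       /* serve before any WRR queue if true           */
        int  wrr_weight;   /* packets per WRR round                        */
        int  wrr_credit;   /* packets left in the current round            */
        long tokens;       /* token bucket (bytes) enforcing the rate cap  */
    };

    /* Pick the next queue to transmit from: strict-priority queues first
     * (lowest index wins), then weighted round-robin among the rest. A queue
     * with an empty token bucket is skipped even if it holds packets; that
     * is the per-queue bandwidth cap. Returns -1 if nothing is eligible. */
    static int schedule(struct queue q[], int pkt_len)
    {
        for (int i = 0; i < NUM_QUEUES; i++)
            if (q[i].strict && q[i].depth > 0 && q[i].tokens >= pkt_len)
                return i;

        for (int pass = 0; pass < 2; pass++) {
            for (int i = 0; i < NUM_QUEUES; i++) {
                if (q[i].strict || q[i].depth == 0 || q[i].tokens < pkt_len)
                    continue;
                if (q[i].wrr_credit > 0)
                    return i;
            }
            for (int i = 0; i < NUM_QUEUES; i++)    /* start a new WRR round */
                q[i].wrr_credit = q[i].wrr_weight;
        }
        return -1;
    }

    static void transmit(struct queue q[], int i, int pkt_len)
    {
        q[i].depth--;
        q[i].tokens -= pkt_len;
        if (!q[i].strict)
            q[i].wrr_credit--;
        printf("sent %d bytes from queue %d\n", pkt_len, i);
    }

    int main(void)
    {
        /* Queue 0: strict priority; queues 1 and 2: WRR 3:1, queue 1 capped */
        struct queue q[NUM_QUEUES] = {
            [0] = { .depth = 2, .strict = true,  .tokens = 10000 },
            [1] = { .depth = 5, .wrr_weight = 3, .tokens = 3000  },
            [2] = { .depth = 5, .wrr_weight = 1, .tokens = 10000 },
        };
        int i, pkt_len = 1500;

        /* A token refill (rate * elapsed time) would run here in real life */
        while ((i = schedule(q, pkt_len)) >= 0)
            transmit(q, i, pkt_len);
        return 0;
    }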
