Is Linux TCP/IP Stack Really That Slow?

Most people casually involved with virtual appliances and network function virtualization (NFV) believe that replacing the Linux TCP/IP stack with user-mode packet forwarding (example: Intel’s DPDK) boosts performance from a meager 1 Gbps to tens of gigabits (and thus makes hardware forwarding obsolete).

Having data points is always better than having opinions; today let’s look at the Receiving 1 Mpps with Linux TCP/IP Stack blog post.

2015-07-18: The blog post was updated based on feedback by Kristian Larsson.

Long story short: it’s always been possible to get good packet forwarding performance on Linux, and the solutions have been well known for years.

Before we start

You might miss the bigger picture by focusing solely on packet forwarding performance.

In many cases, 1 Gbps of forwarding performance is more than good enough. In others, you cannot use hardware forwarding anyway because the problem cannot be solved in dedicated hardware at reasonable cost (example: large-scale TCP optimization).

Finally, sometimes the amount of processing done on a single packet limits the throughput (example: deep packet inspection), and there’s not much you can do apart from throwing more cores at the problem (Palo Alto has a firewall with 100 Gbps throughput … using 400 cores).

And now let’s see how badly Linux TCP/IP stack did

The author of the blog post I mentioned above used several tricks to achieve the target performance:

  • Sending and receiving multiple messages per system call instead of a single message at a time, which got him to 350 kpps (see the recvmmsg() sketch after this list);
  • Using multi-queue NICs to spread the load across multiple CPU cores, which increased the throughput to 440 kpps;
  • A multi-threaded application, which finally got him to 1 Mpps (or 1.4 Mpps on a finely-tuned memory architecture); one common way of implementing the multi-queue/multi-threaded part of the recipe is sketched further down.
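Here’s a minimal sketch of the first trick (my own illustration, not the code from the quoted blog post): a plain UDP receiver that uses recvmmsg() to pull a whole batch of datagrams out of the kernel with a single system call. The port number and batch size are arbitrary assumptions.

/* Sketch only: batch-receive UDP datagrams with recvmmsg() instead of
   paying one system call per packet. Port and batch size are arbitrary. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 64        /* datagrams fetched per system call */
#define BUFSZ 2048      /* per-datagram receive buffer */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(4321);              /* hypothetical port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    static char bufs[BATCH][BUFSZ];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BUFSZ;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    for (;;) {
        /* One syscall returns up to BATCH datagrams; msg_len holds each size */
        int n = recvmmsg(fd, msgs, BATCH, 0, NULL);
        if (n < 0) {
            perror("recvmmsg");
            break;
        }
        /* process msgs[0..n-1] here (count packets, forward them, ...) */
    }
    close(fd);
    return 0;
}

The point is simply to amortize the per-syscall overhead; the same idea applies on the sending side with sendmmsg().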

Not surprisingly, these are the same tricks that tools like ntop and PF_RING have used for years to get decent networking performance on Linux.
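The multi-queue/multi-threaded part of the recipe is commonly implemented with SO_REUSEPORT (again my own illustration, not necessarily the exact approach used in the quoted blog post; port and thread count are arbitrary): every worker thread opens its own socket bound to the same UDP port, and the kernel hashes incoming flows across those sockets, so each core ends up servicing a share of the traffic.

/* Sketch only: per-thread sockets bound to the same UDP port with
   SO_REUSEPORT so the kernel spreads incoming flows across threads. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define THREADS 4       /* arbitrary; ideally matched to RX queues/cores */

static void *rx_loop(void *arg)
{
    (void)arg;
    int one = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(4321);              /* hypothetical port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return NULL;
    }

    char buf[2048];
    for (;;)                /* count or process received datagrams here */
        recv(fd, buf, sizeof(buf), 0);
    return NULL;
}

int main(void)
{
    pthread_t workers[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&workers[i], NULL, rx_loop, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}

Compile with -pthread. Combining batched receives with per-thread sockets (and pinning the threads to the cores servicing the corresponding RX queues) is roughly what gets you into Mpps territory on stock Linux.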

Where’s the problem?

With all this being said, why don’t we see better forwarding performance in virtual appliances doing simple packet processing?

In most cases, the answer is surprisingly simple: the vendors ported their existing code into a VM format and replaced direct access to dedicated hardware with calls to the Linux kernel (making every possible mistake along the way). Vendors that spent time optimizing their code (Vyatta, Juniper) got the performance you’d expect (Juniper managed to push 160 Gbps through its vMX).

So what was I trying to say?

Kristian Larsson left a nice comment saying "OK, so what exactly were you trying to say?" Let me try to organize my thoughts at least a bit:

(A) Contrary to what some Software Defined Evangelists think, there’s no magic bullet or universal culprit.

(B) The tricks people keep reinventing have been well known (if not exactly well documented) for years. See also the Scaling in the Linux Networking Stack documentation on kernel.org.

(C) There’s always a weakest link, and if you don’t know what it is, you’ll have performance problems no matter what.

(D) Once you work around the weakest link, there’s another one waiting for you.

(E) If you don’t know what you’re doing, you’ll get the results you deserve.

(F) And finally, sometimes good enough is good enough.

And all I need now is a hacker misunderstanding my post and telling me how stupid I am ;)

Interested in virtual forwarding performance?

You’ll find tons of useful information in the Software Gone Wild podcast.

I’ll also cover these topics within the Network Function Virtualization webinar (coming in early autumn, just before the SDN workshops).

12 comments:

  1. Sorry to sound like a politician, but "hear, hear".
  2. Hey Ivan,
    You should be very interested in this young French company, 6Wind, which has just been "validated" by Cisco: http://www.prnewswire.com/news-releases/cisco-investments-invests-in-6wind-for-high-performance-networking-300108853.html.
    They design several high-performance networking software products, such as a "Virtual Accelerator".

    It leverages DPDK to offer 20 Gbps of OVS bandwidth per Intel Xeon E5-2600 v2 core, scaling linearly with the number of cores. Features include VXLAN & VRF. It is supported on all major Linux distributions, with multi-vendor NIC support from Avago/Emulex, Intel, and Mellanox: http://www.6wind.com/products/6wind-virtual-accelerator/
    Replies
    1. You're talking about this young French company, right?

      http://blog.ipspace.net/2012/02/6wind-solving-virtual-appliance.html
  3. Great ;) I thought they deserved to appear somehow in your interesting article, although they've not been invited - yet? - to your "Software Gone Wild" podcast.
  4. The 1.4 Mpps mentioned in that article is far from what the Linux stack can actually deliver.

    Here is an article (Google Translate required) where a guy describes how to build an app that can capture 9 Mpps.

    http://habrahabr.ru/post/261161/
  5. Overall I find this incoherent. You start off with an opinion held by many, that switching from kernel to user-space forwarding is necessary for good performance, while your tone implies this isn't true. The link you post explains how to achieve good performance, and you summarize it into three "tricks". All involve using more than one core, so we can at least conclude that the kernel stack scales well.

    You move on to an analysis of VM performance. Most VMs don't achieve good performance because they make the mistake of relying on kernel calls and not "optimising it". You mention that Juniper vMX is optimized. Since vMX uses DPDK and SR-IOV to achieve good performance, I'm not sure where we stand. Is the kernel good enough or not?

    What is the point of the article? If you don't want to reach a conclusion on one being better than the other, I think you need to better explain the pros and cons of each solution. For example, DPDK won't help you much if you are trying to optimise your web server, but if you are trying to build a virtual router or load balancer, it probably is the tool for the job.

    Also, I find the topic of "good performance" to be an interesting one and really quite subjective. You mention 1 Gbps being good enough for most, but I don't think that means we can talk about it as "good performance" in absolute terms. Is the kernel performing well when you need 10 cores to achieve wire-speed 10GE if you can do the same with Snabb Switch on a single core?
  6. If you go and name vendors: ALU managed to push 320 Gbps on their vSR...
  7. I'm confused.... you are either forwarding your packets using the kernel's IP forwarding feature, or you are using something else.

    My understanding is that Vyatta does the same as other distributions: it just sets iptables rules for any firewall restrictions and relies on the kernel to do the forwarding.

    And Juniper vMX doesn't do that.

    So you're saying you think Vyatta is optimized, and it's OK to just use the kernel's IP forwarding features when implementing a firewall?

    Do you have any pointers then, to how I would go about optimizing it, if I wished to engineer a graphical firewall frontend based on a Linux kernel doing the packet forwarding, since whatever Vyatta does is deemed good enough?

    I don't think Vyatta uses a modified Linux kernel, so if the kernel is doing all the forwarding, I am wondering literally what sort of things you think their optimization consists of.

    Replies
    1. Vyatta 5600 uses DPDK

      https://www.brocade.com/en/products-services/software-networking/network-functions-virtualization/5600-vrouter.html
  8. Regarding letter (D), weakest links... the Linux OS was designed to operate end stations/servers, not traffic-forwarding nodes like routers/switches. Do you think this will soon be exposed as the "weakest link" in the networking industry?
    Replies
    1. True - Unix (and consequently Linux) was designed for servers (or control-plane processes). To get around this weakest link, you have to bypass Linux as much as possible, either by downloading forwarding information into hardware (the Cumulus way) or by doing packet processing in user-mode processes running on dedicated cores (the DPDK/PF_RING way).
  9. The thing is, if I start with that same single-threaded 300 kpps of 128-byte UDP on a Mellanox CX3 card, I get to 3.6 Mpps just doing LD_PRELOAD=libvma.so.

    Mellanox engineers I've spoken with claim this will scale to 37 Mpps with multi-threaded senders.