Can Virtual Routers Compete with Physical Hardware?

A participant in the Carrier Ethernet LinkedIn group asked a great question:

When we install a virtual router from any vendor on an ordinary server (with a general-purpose microprocessor), can it really compete with a physical router that has ASICs, network processors…?

Short answer: No … and here’s my longer answer (cross-posted to my blog because not all of my readers participate in that group).

While a software-only forwarding process can reach 200 Gbps or more on a multi-core Xeon server, you cannot get anywhere close to the pps-per-$ price point of an equivalent hardware solution.

Before someone starts making list-price comparisons, keep in mind that when you buy a switch or a router from a mainstream manufacturer, you're not paying for the hardware but (mostly) for software and support, as well as sales and marketing expenses. Hardware is usually less than 30% of the total cost (just look at the gross margins of any major networking hardware vendor).
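
To put the pps-per-$ argument in perspective, here's a back-of-the-envelope sketch in Python. Every number in it (server price, switch list price, ASIC throughput, average packet size) is an illustrative assumption, not vendor data; the only inputs taken from the text above are the ~200 Gbps software-forwarding figure and the ~30% hardware share of list price.

```python
# Back-of-the-envelope pps-per-$ comparison. All figures are illustrative
# assumptions, not vendor data; only the ~200 Gbps software figure and the
# ~30% hardware cost share come from the text above.

AVG_PACKET_BYTES = 400                      # assumed average packet size

def pps(throughput_gbps: float) -> float:
    """Convert throughput in Gbps into packets per second."""
    return throughput_gbps * 1e9 / (AVG_PACKET_BYTES * 8)

# Hypothetical x86 server running a software router
server_cost = 10_000                        # USD, assumed
server_pps = pps(200)                       # ~200 Gbps software forwarding

# Hypothetical ASIC-based switch/router
switch_list_price = 30_000                  # USD list price, assumed
switch_hw_cost = switch_list_price * 0.3    # hardware is <30% of total cost
switch_pps = pps(3_200)                     # e.g. a 3.2 Tbps switching ASIC, assumed

print(f"Software router: {server_pps / server_cost:,.0f} pps per $")
print(f"ASIC hardware:   {switch_pps / switch_hw_cost:,.0f} pps per $ (hardware cost only)")
```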

On the other hand, lower-speed routers use CPU-based forwarding anyway - replacing them with a VM-based form factor (a virtual router) is a no-brainer.

Finally, while it might make sense (from a speed-of-deployment perspective) to use virtual routers, many NFV deployments I see today deploy virtual firewalls, protocol translation/termination devices, load balancers or DPI devices. The appliance versions of these devices usually use CPU-based forwarding anyway (potentially augmented by an internal switch to ensure traffic is distributed deterministically to multiple cores) - yet again making them a perfect fit for VM-based deployment.
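
To make the "distributed deterministically to multiple cores" part concrete, here's a minimal sketch of RSS-style flow hashing: packets belonging to the same 5-tuple always land on the same worker core. The hash function and core count are arbitrary choices for illustration, not what any particular appliance uses.

```python
# Minimal sketch of deterministic flow-to-core distribution (RSS-style):
# hash the 5-tuple and use the result to pick a worker core, so all packets
# of a flow stay on one core. Hash choice and core count are illustrative.

import zlib

NUM_CORES = 8

def core_for_packet(src_ip: str, dst_ip: str, proto: int,
                    src_port: int, dst_port: int) -> int:
    """Map a flow's 5-tuple to a worker core deterministically."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_CORES

# Packets of the same flow always hit the same core ...
assert core_for_packet("10.0.0.1", "192.0.2.7", 6, 51515, 443) == \
       core_for_packet("10.0.0.1", "192.0.2.7", 6, 51515, 443)
# ... while different flows spread across the available cores.
print(core_for_packet("10.0.0.1", "192.0.2.7", 6, 51515, 443))
print(core_for_packet("10.0.0.2", "192.0.2.7", 6, 51516, 443))
```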

The only good reason I've found so far for hardware-assisted appliance functionality is RSA key exchange in SSL termination. This process is really slow when done in software and can be done much faster on dedicated coprocessors.
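
If you want a feel for how slow RSA private-key operations are on a general-purpose core, a quick measurement like the sketch below will do. It uses the widely available `cryptography` Python package (my choice, not something the post prescribes), and the iteration count is arbitrary.

```python
# Rough measurement of RSA-2048 private-key (signing) operations per second
# on a single core. Requires the "cryptography" package; the iteration count
# is arbitrary.

import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"handshake-sized blob"

iterations = 500
start = time.perf_counter()
for _ in range(iterations):
    key.sign(message, padding.PKCS1v15(), hashes.SHA256())
elapsed = time.perf_counter() - start

print(f"{iterations / elapsed:,.0f} RSA-2048 signatures per second on one core")
```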

For more details on NFV forwarding performance, register for my NFV webinar.

10 comments:

  1. Even more so if you consider pps-per-TCO-$.

    A recent (even a five-year-old) ASIC-based vendor router is almost certainly using less power for the same traffic. This might not matter that much if you don't need terabits of performance, though.

    I do think it's a shame that there's no vendor (at least that I'm aware of) making a box that is a nice Xeon server with a fully-plumbed EZchip (one of the higher-end ones capable of doing internet-scale routing) and a decent SW plumbing layer on top. The software to do useful carrier internet routing on Linux is finally getting complete enough that you could consider deploying it.

    The Pluribus boxes come closest of those I'm aware of, but have the downsides of a second switch chip and of needing to run their OS.
    Replies
    1. Xeon+EZchip would be an awesome combo, particularly with a modular OS on top of it (Cumulus comes to mind). We just need someone who wants to buy thousands of these boxes ;))

      As for Pluribus - if I got it right (and I have no idea, because they never got to the technical details in their Tech Field Day presentations), all they have in their proprietary hardware is extra 10GE lanes to the Xeon CPU, making it possible to do more than you can squeeze across the PCI bus between the Trident-2 and the CPU. Obviously that advantage goes away the moment you deploy their SW on whitebox HW.
    2. Cumulus do seem best placed on the software side.

      For the hardware, the best I've seen is what I have in my personal lab: a 2-slot ATCA chassis with a Xeon box in one slot and an EZchip-based switch in the other. Of course this is all ancient kit, but current-gen equivalents do exist and are probably feasible.

      Shame there don't seem to be any vendors really looking to roll this up; it would be a fun thing to build.
  2. Hosting routers in VMs poses the same risks as doing anything else in a shared-resource environment. I've seen one VM degrade performance and cause network interruptions for other VMs on the same hypervisor. Latency is also generally higher with VM-based routers.
    Replies
    1. Agreed on both counts. However:

      * If you want a reliable NFV deployment, you _SHOULD_ deploy VNFs (a fancy name for VMs) on dedicated infrastructure and carefully manage the oversubscription;

      * While software-based forwarding always incurs more latency than hardware-based forwarding, I don't think it matters the moment the traffic hits the first WAN link.
    2. The latency impact will be felt mostly when the virtual router handles local traffic that doesn't originate or terminate in VMs. One has to consider carefully which flows will traverse the VR when replacing a physical device with a virtual one. Virtualized environments are generally not great at handling tasks that require real-time scheduling.

      And for the sake of example, the Juniper vSRX (Firefly) adds 5-10 ms of latency under very light loads. E.g., if it serves some lonely VoIP call late at night, quality will suffer.
    3. 5-10 msec of latency just for traversing a VM is ridiculous, and has (IMHO) nothing to do with virtualization and everything to do with a suboptimal implementation.

      Thanks for the data point - it will definitely come in handy ;)
      Ivan
  3. With regard to SSL termination... "openssl speed rsa2048" does 829 private-key ops (equivalent to an RSA signature) per core per second on my 3-year-old laptop.

    So, let's round and guess 1500 *new* TLS connections per core per second on a modern server. Modern Xeon servers have at least 16 cores... (the resulting capacity is worked out in the sketch after the comments).

    Any application that is connecting and disconnecting that frequently is broken by design. The user experience would be terrible even without TLS overhead. Use keep-alive connections for HTTP, along with session resumption, and it really isn't a problem.

    We run a mid-size SaaS application doing SSL termination on just four Xeon cores spread across two load-balancing nginx instances. They hover around 15% CPU utilization, and that includes handshakes and bulk crypto.

    CloudFlare, Google, Facebook, etc. do NOT use hardware for SSL acceleration, because it just doesn't matter if you have HTTP keep-alive and session cache/ticketing enabled. I believe Google said turning on SSL increased their front-end server load by something like 2-3% overall.
  4. Two key things are AES-NI support in the processor and a modern TLS library with good AES-NI support. Without either of those, you're on the slow train.
  5. Interesting. I wonder how this will change with more use of HTTP/2 or SPDY.
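
To put the numbers from the SSL-termination comment above into perspective, here's the same back-of-the-envelope arithmetic as a short sketch. The per-core handshake rate and core count are the commenter's figures; the session-resumption ratio is my own assumption.

```python
# Back-of-the-envelope TLS termination capacity, based on the figures quoted
# in the SSL-termination comment above (per-core handshake rate and core count
# are the commenter's numbers; the session-resumption ratio is an assumption).

new_conns_per_core = 1_500     # full TLS handshakes per core per second (comment)
cores = 16                     # cores in a modern Xeon server (comment)

full_handshakes = new_conns_per_core * cores
print(f"~{full_handshakes:,} brand-new TLS connections per second per server")

# With HTTP keep-alive and session resumption, only a fraction of client
# connections need a full RSA handshake (the 95% ratio below is an assumption).
resumption_ratio = 0.95
client_conns = full_handshakes / (1 - resumption_ratio)
print(f"~{client_conns:,.0f} client connections per second "
      f"if 95% of them resume an existing session")
```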