Virtual Appliance Performance Is Becoming a Non-Issue

Almost exactly two years ago I wrote an article describing the benefits and drawbacks of virtual appliances, where I listed virtualization overhead as one of the major sore spots (still partially true). I also wrote: “Implementing routers, switches or firewalls in a virtual appliance would just burn the CPU cycles that could be better used elsewhere.” It’s time to revisit this claim.

The Easy Ones

A few data points are obvious:

  • A $0.02 CPU like the ones used in SoHo routers is good enough for speeds up to ~10 Mbps (see also: OpenWrt), and reasonably-sized x86 platforms are good enough for anything between 100 Mbps and 1 Gbps, depending on the functionality you need and the value of reasonable.
  • High-speed packet forwarding (e.g. a ToR switch @ 1+ Tbps) is way cheaper to implement in hardware.
  • High-end packet forwarding gear (CRS-1, MX-960) will remain hardware-based for a very long time.
  • Hardware encryption is still faster than software encryption, but at least AES is included in the instruction sets of recent Intel and AMD processors (RSA is still a CPU burner). Is hardware-assisted SSL offload cheaper than throwing more cores at the problem? I don’t know; shop around and do the math.
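To put the "AES is cheap now" point in perspective, here's a back-of-envelope sketch. Both inputs are assumptions, not measurements: published figures put AES-NI bulk encryption somewhere around one cycle per byte, and the 3 GHz clock is just a typical server core.

```python
# Rough estimate of per-core AES-NI bulk-encryption capacity.
# ASSUMPTIONS: ~1 cycle/byte for AES with AES-NI (ballpark from vendor
# whitepapers), 3 GHz core clock. Treat the result as an order-of-magnitude
# figure, not a benchmark.
cycles_per_byte = 1.0
clock_hz = 3.0e9

bytes_per_sec = clock_hz / cycles_per_byte   # bytes encrypted per second
gbps_per_core = bytes_per_sec * 8 / 1e9      # convert to gigabits per second
print(gbps_per_core)  # 24.0
```

Even if the real number is a few times worse, a single core comfortably saturates a 10 Gbps link with encrypted traffic; RSA key exchange remains the expensive part.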

Vanilla Virtual Appliances

Virtual appliances are clearly good enough for low-volume loads. VMware claims the firewalling performance of the vShield Edge Compact (1 vCPU) appliance is ~3 Gbps. Probably true under ideal conditions (I got similar results testing an older version of vShield Edge with netperf).

The HTTP load balancing performance of the vShield Edge Large (2 vCPU) appliance is ~2.2 Gbps. F5 claims its BIG-IP LTM VE can do up to 3 Gbps in a 2 vCPU vSphere-hosted VM. Either one should be good enough unless you plan to push most of your data center traffic through a single virtual appliance (hint: don’t ... although I’ve heard the F5 VE license isn’t exactly cheap).

Aiming for higher speeds? A10 claims its SoftAX virtual appliance can push up to 8 Gbps of load-balanced traffic. I have no idea what’s required to get that number; the hardware requirements are in the installation guide, which is hidden behind a regwall. It seems A10 is another one of those companies that never learn.

Getting Beyond 10Gbps

What about even higher speeds? It’s possible to push 50 Gbps through the Linux TCP stack, and if you do smarter things (a custom stack, bypassing the kernel entirely, or using Intel’s DPDK or the 6WIND equivalent) you can get the same performance with lower overhead.
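To see why people bother with kernel bypass, it helps to translate bandwidth into packets per second; the packet rate, not the byte rate, is what kills a software forwarder. A quick back-of-envelope calculation (the 20 bytes of per-frame overhead are Ethernet's preamble, SFD and inter-frame gap; the link speeds and frame sizes are just illustrative):

```python
# Packets per second (in Mpps) needed to fill a link at a given frame size.
def mpps(link_gbps: float, frame_bytes: int) -> float:
    wire_bytes = frame_bytes + 20            # + preamble (7) + SFD (1) + IFG (12)
    return link_gbps * 1e9 / (wire_bytes * 8) / 1e6

print(round(mpps(10, 64), 2))    # worst case on 10GbE: ~14.88 Mpps
print(round(mpps(10, 1500), 2))  # bulk transfers: ~0.82 Mpps
print(round(mpps(50, 1500), 2))  # 50 Gbps at large frames: ~4.11 Mpps
```

Large-frame TCP throughput is a few hundred thousand packets per second per 10 Gbps, which a vanilla kernel handles fine; minimum-size frames at line rate are an entirely different game.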

However, all the figures quoted in the previous paragraph don’t include the virtualization tax (the performance loss, not this one). Getting comparable performance from a VM typically requires some sort of hypervisor bypass that allows the VM to work directly with the physical NICs, but that approach usually requires dedicated NICs (not really useful) and disables live VM mobility. You can get rid of both problems with Cisco’s VM-FEX and VMware’s vMotion with VMDirectPath, but that’s the only combo I’m aware of that gives you “physical” interfaces (which you need to avoid hypervisor overhead) on a migratable VM.

Good news: the hypervisor landscape seems to be changing rapidly. 6WIND is demonstrating DPDK-accelerated Open vSwitch at the Open Networking Summit, and they claim they can accelerate both the OVS data plane and VXLAN encapsulation, resulting in 50 Mpps performance on a 10-core server. The IMIX traffic profile should be pretty relevant when evaluating load balancers and firewalls, and using the IMIX average packet size of 340 bytes, 50 Mpps translates into more than 130 Gbps of L2 virtual switching throughput. Good enough I’d say ;)
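For the record, the arithmetic behind that throughput figure (both inputs come straight from the paragraph above: 50 Mpps and the 340-byte IMIX average packet size):

```python
# Convert the claimed packet rate into throughput at the IMIX average size.
pps = 50e6            # 50 Mpps, the claimed OVS+DPDK forwarding rate
avg_pkt_bytes = 340   # simple IMIX average packet size

gbps = pps * avg_pkt_bytes * 8 / 1e9
print(gbps)  # 136.0
```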

Finally, Intel just announced their reference architecture (using, among other things, DPDK-accelerated OVS): the hardware is available now, and DPDK-accelerated OVS should ship in Q3 of this year. The Open Networking Platform server is scheduled to enter alpha testing in the second half of the year.

Summary: In a year or two, we’ll have plenty of software solutions and/or generic x86 hardware platforms capable of running very high speed virtual appliances. I would strongly recommend considering that in your planning and purchasing process. Obviously some firewall/load balancing vendors will adapt (major load balancing players already did) while others will stick to their beloved hardware and slowly fade into oblivion.


  1. Would like to point out a test showing Vyatta pulling off 500Mbps on a single core (even on a Cisco UCS server ;-)

    (yes yes, self-serving post, but it is still true =)
    1. Thanks for the link. I am not exactly impressed by 500 Mbps, and I’m really wondering what Cisco managed to do to burn one vCPU @ 50 Mbps. Are they process switching all the traffic through the IOS process running on Linux?
    2. That is the REALLY weird part

      For Vyatta (open source version used in test) they did that 500Mbps on just ONE core

      for the Cisco test it took FOUR cores to do just 50Mbps

      something is really odd there...

    3. Folks are getting 5Gb/s+ @ 1500 byte frame forwarding on VMs with a single core - 500Mb/s isn't very much at all. Not to mention 50Mb/s.
  2. Hey Ivan,

    What about services like IPS and load balancers with rules that include L4-L7 parsing? Do you think it’s feasible to implement them in software?

    Good post as usual


    1. As you go higher up in the stack, the algorithms become more complex (compare an HTTP-level IDS with a packet filter), and thus it makes less and less sense to implement them in hardware.

      Most advanced load balancers are implemented primarily in software. For IDS data points, read the erratasec blog posts I linked to.
    2. Almost every single IPS box can be easily bypassed w/ variations of the exploit payload, because the signature matching is done in hardware w/ minimal protocol parsing & decoding around the context of the vulnerability. The systems that did the most protocol decoding and little hardware offload for signature matching did the best job. Compare ISS Proventia vs IntruVert: ISS was mostly software and had the best decoding, and was much harder to bypass with exploits carrying custom payloads, since it decoded up to the vulnerability in many cases, while signature-based, hardware-accelerated solutions get walked around by hackers all day long w/ polymorphic attacks.

      Do you want a system that can easily be updated w/ software and scale out / up as processors get faster, or do you want a limited set of features that work really fast in hardware? There are pros and cons to both; let’s review a decade from now and see where L4-L7 services get realized. Many arguments can be made one way or the other, but I surely would not bet against the software & Intel combo myself...
    3. Hi, Ivan.
      Thanks for the interesting post, as usual ;-)
      So, in the near future do you think that all of today’s standalone physical appliances will become virtual and distributed, having just the portion of state and rules relevant to the local bunch of VMs (say, one per hypervisor), with rules and state migrating along with the VM?

    4. Virtual? Yes. Distributed? It depends. You can do distributed firewalling, but not load balancing. Yeah, already in the to-write queue ;)
    5. Citrix NetScaler does distributed load-balancing for people who need that. Also works in a virtual appliance.
  3. Virtual appliance performance is comparable to the equivalent physical appliance unless the latter uses its own ASICs (for a good reason), e.g. Palo Alto with its new-generation firewalls...
    1. Open up the top vendors' largest appliances that handle L4-L7 services and you will find that, unless you are doing crypto offload (Cavium, SafeNet), signature matching in hardware (of questionable value, given the bypass vectors), or microflow balancing / spreading load across conventional x86 processors, most features are implemented in software on x86. You could also go down the Cavium Octeon or network processor path, but why, given the DPDK/x86 performance capabilities?

      Also, do you really want to keep forklifting your firewall/LB networking gear for the next rev of contract-manufactured hardware, or does it make sense to align with "Moore's Law networking" on commodity servers? The server upgrade cycle is 2-3 years; contract-manufactured L4-L7 appliances typically have a lifecycle of 5-7 years. Open your 5-year-old top-end firewall and there is a good chance your desktop processor is faster...
  4. Every time someone says implement your network equipment virtually I think: so what about ternary CAM (TCAM)?

    I've seen so many blogs and read so many books that harp on not sending packets to the CPU in a hardware switch/router because it will impact performance, and that TCAM is needed for large IP tables, various ACLs, etc.

    What I really feel when discussing virtual switches/appliances is that they offer the basic features that the 2600s of ye olde would handle. If you're comfortable running your network on a 2600 - virtualize. If not........
    1. Hold on. What I said was "L4-7 in software makes sense, high-speed L2-3 in software is too expensive". Also, keep in mind that a 2600 probably has a $0.02 CPU - 40-50 Gbps packet forwarding through a Xeon-based server (with minimal packet processing) is very doable.
  5. I think I can agree and disagree on many things. If you are positioning a distributed cloud based enterprise, the vAppliance is "good enough" in most cases. However, cloud providers have caught on to these appliances and are starting to charge for CPU more than disk in some cases.

    That said, there are many hardware appliances that far outstrip their virtual brethren simply because of hardware acceleration. In the case of load balancing, SSL stripping and re-encryption CANNOT be done at line rate without specialized hardware. The same goes for HTTPS inspection of UTM, IPS, and firewall traffic. Packet forwarding is a quaint topic: claiming high throughput for routing pretty much means nothing, as those functions are increasingly commodity and being stuffed into devices that are capable of much more. If you think stateful firewalls still make your network safe, you need to lift the rock you have been living under.

    Sorry to be so harsh, but some of these articles are very myopic and don't really address the issues of a modern network. That is why the state funded phrackers (my word, it is a play on water-fracking for natural gas, because this is analogous to how modern 'hackers' mine data from networks) pwn you.
    1. Be more precise - RSA keying benefits from special hardware; AES-NI is available on all modern x86 CPUs.

      You might also want to read the follow-up blog post: x86 silicon is slower, but also cheaper (per Gbps), than whatever awesomesauce your vendor is selling you.

      You might not like my conclusions (most hardware vendors don't) but price lists speak for themselves.

      As for perspective problems - I always love constructive feedback, and since you wrote "__some__ of these articles are very myopic" I assume you're a regular reader, and would appreciate a list of articles you disagree with (and why).