
Flow-based Forwarding Doesn’t Work Well in Virtual Switches

I hope it’s obvious to everyone by now that flow-based forwarding doesn’t work well in existing hardware. Switches designed for a large number of flow-like forwarding entries (NEC ProgrammableFlow switches, Enterasys data center switches and a few others) might be an exception, but even they can’t cope with the tremendous flow update rate required by reactive flow setup ideas.

One would expect virtual switches to fare better. Unfortunately that doesn’t seem to be the case.

A few definitions first

Flow-based forwarding is sometimes defined as forwarding of individual transport-layer sessions (sometimes also called microflows). Numerous failed technologies are pretty good proof that this approach doesn’t scale.

Other people define flow-based forwarding as anything that is not destination-address-only forwarding. I don’t really understand how this definition differs from MPLS Forwarding Equivalence Class (FEC) and why we need a new confusing term.

Microflow forwarding in Open vSwitch

Initial versions of Open vSwitch were a prime example of an ideal microflow-based forwarding architecture: the in-kernel forwarding module performed microflow forwarding and punted all unknown packets to the user-mode daemon.

The user-mode daemon would then perform packet lookup (using OpenFlow forwarding entries or any other forwarding algorithm) and install a microflow entry for the newly discovered flow in the kernel module.
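The punt-and-install cycle described above can be sketched in a few lines of Python. This is only a toy model, not OVS code: the `MicroflowSwitch` class, the `FiveTuple` key and the `lookup_by_destination` slow path are names I made up for illustration, and the "kernel" table is just a dictionary keyed on the full session 5-tuple.

```python
from typing import Callable, Dict, NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical microflow key: one entry per transport-layer session."""
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


class MicroflowSwitch:
    """Toy model: exact-match fast path, everything unknown goes to the slow path."""

    def __init__(self, slow_path: Callable[[FiveTuple], str]) -> None:
        self.slow_path = slow_path                # stands in for the user-mode daemon
        self.fast_path: Dict[FiveTuple, str] = {}  # stands in for the kernel flow table
        self.punts = 0

    def forward(self, pkt: FiveTuple) -> str:
        action = self.fast_path.get(pkt)
        if action is None:
            # Unknown microflow: punt, let the slow path decide,
            # then install an exact-match entry for the rest of the session.
            self.punts += 1
            action = self.slow_path(pkt)
            self.fast_path[pkt] = action
        return action


def lookup_by_destination(pkt: FiveTuple) -> str:
    # Assumed forwarding policy: only the destination address matters.
    return "port1" if pkt.dst_ip.startswith("10.0.1.") else "port2"


switch = MicroflowSwitch(lookup_by_destination)
packets = [FiveTuple("10.0.0.1", "10.0.1.5", 6, 40000 + i, 80) for i in range(3)]
results = [switch.forward(p) for p in packets + packets]
print(switch.punts, len(switch.fast_path))
```

Three sessions to the same web server need three punts and three fast-path entries, even though the slow path looked only at the destination address. The second packet of each session is forwarded without a punt.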

Third parties (example: Midokura Midonet) use the Open vSwitch kernel module in combination with their own user-mode agent to implement non-OpenFlow forwarding architectures.

If you’re old enough to remember the Catalyst 5000, you’re probably getting unpleasant flashbacks of Netflow switching … but the problems we experienced with that solution must have been caused by poor hardware and an underperforming CPU, right? Well, it turns out virtual switches aren't much better.

Digging deep into the bowels of Open vSwitch reveals an interesting behavior: flow eviction. Once the kernel module hits the maximum number of microflows, it starts throwing out old flows. Makes perfect sense – after all, that’s how every caching system works – until you realize the default limit is 2500 microflows, which is barely good enough for a single web server and definitely orders of magnitude too low for a hypervisor hosting 50 or 100 virtual machines.
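To see why a cache that is smaller than the number of active flows hurts so much, here is a minimal simulation. It assumes a hard-limit cache with least-recently-used eviction and flows that take turns sending packets; both are my simplifications and not a description of the actual OVS eviction algorithm. `LruFlowCache` and `replay` are hypothetical names.

```python
from collections import OrderedDict


class LruFlowCache:
    """Toy flow cache with a hard entry limit and LRU eviction (an assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, flow_id: int) -> None:
        if flow_id in self.entries:
            self.entries.move_to_end(flow_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1                       # a miss means a punt to user mode
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # throw out the oldest flow
        self.entries[flow_id] = True


def replay(capacity: int, active_flows: int, rounds: int):
    """Send one packet per flow, round-robin, for the given number of rounds."""
    cache = LruFlowCache(capacity)
    for _ in range(rounds):
        for flow_id in range(active_flows):
            cache.lookup(flow_id)
    return cache.hits, cache.misses


print(replay(2500, 2000, 3))  # (4000, 2000): only the first packet of each flow misses
print(replay(2500, 3000, 3))  # (0, 9000): every single packet misses
```

Round-robin traffic is the worst case for LRU, so real traffic would look less dramatic, but the cliff is real: once the working set exceeds the cache size, the fast path stops being fast.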

Why, oh why?

The very small microflow cache size doesn’t make any obvious sense. After all, web servers easily handle 10,000 sessions and some Linux-based load balancers handle an order of magnitude more sessions per server. While you can increase the default OVS flow cache size, one’s bound to wonder what the reason for the dismally low default value is.

I wasn’t able to figure out the underlying root cause, but I suspect it has to do with per-flow accounting – flow counters have to be transferred from the kernel module to the user-mode daemon periodically. Copying hundreds of thousands of flow counters over a user-to-kernel socket at short intervals might result in “somewhat” noticeable CPU utilization.

Did I get it all wrong? Please correct me in the comments ;)

How can you fix it?

Isn’t it obvious? You drop the whole notion of microflow-based forwarding and do things the traditional way. OVS moved in this direction with release 1.11, which implemented megaflows (coarser OpenFlow-like forwarding entries) in the kernel module and moved flow eviction from the kernel to the user-mode OpenFlow agent (which makes perfect sense as kernel forwarding entries almost exactly match user-mode OpenFlow entries).
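A rough way to see what coarser entries buy you is to count how many cache entries the same traffic needs when the key is the full session tuple versus only the fields the forwarding decision depends on. The sketch below hard-codes that field list; the `Packet` type and the `entries_needed` helper are hypothetical names for illustration and say nothing about how OVS computes its megaflow masks.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def entries_needed(packets: List[Packet], fields: Tuple[str, ...]) -> int:
    """Number of distinct cache entries when matching only on the given fields."""
    return len({tuple(getattr(p, f) for f in fields) for p in packets})


# 200 client sessions spread across 10 servers (made-up addresses).
packets = [
    Packet(f"10.0.0.{c}", f"10.0.1.{c % 10}", 40000 + c, 80)
    for c in range(200)
]

microflow_entries = entries_needed(packets, Packet._fields)  # full session tuple
megaflow_entries = entries_needed(packets, ("dst_ip",))      # destination only
print(microflow_entries, megaflow_entries)  # 200 10
```

With destination-only forwarding the number of entries tracks the number of destinations, not the number of sessions, so the table size and the punt rate no longer grow with every new TCP connection.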

Not surprisingly, no other virtual switch uses microflow-based forwarding. VMware vSwitch, Cisco’s Nexus 1000V and IBM’s 5000V make forwarding decisions based on destination MAC addresses, Hyper-V and Contrail based on destination IP addresses, and even VMware NSX for vSphere uses the distributed vSwitch and an in-kernel layer-3 forwarding module.

More information

Check out SDN resources page @ ipSpace.net.

6 comments:

  1. Does this mean we shouldn't use Open vSwitch in any real production deployments? Are there better free/open source alternatives?

    1. Absolutely not. Like any other product, OVS has its benefits and drawbacks that you have to understand and deal with. Also, make sure you use a recent version with megaflow support.

      Beyond that, I hope you do pilots and performance testing before using any untested product in a real production deployment, and deploying OVS should be no different.

  2. Unknown packet punting is a problem... we should discard instead of punting... and have another protocol for discovering destination addresses, that doesn't rely on punting :)

    1. Totally agree with you … but it won't happen as long as people are trying to emulate thick coax cable with an ever-increasing number of abstraction/indirection layers.

  3. Great article, I agree with the overall gist, with a couple of minor comments (full disclosure: I'm a major OVS contributor).

    That 2500 "flow-eviction-threshold" isn't actually a hard limit. It's the number of datapath flows required before the flow eviction process starts at all. The actual number of datapath flows can grow significantly beyond this depending on your traffic patterns. That said, in older versions of OVS, about 10k is about as high as it can scale. The larger point you're making is correct.

    You mention that the flow eviction handling moved from the kernel to the userspace ovs-vswitchd daemon. This isn't actually true: flow eviction has always been handled by the userspace daemon.

    I'd highly recommend checking out OVS 2.1, we've made some major changes in how all this works. That 2500 limit is long gone, with a more realistic limit being in the 100k range (depending on traffic patterns). All in all, the OVS dev community recognizes that this is a major issue and is therefore spending a great deal of time on it. 2.1 is just the beginning, stay tuned =)

    Ethan

    1. Thank you for setting me straight! Much appreciated.

