I hope it’s obvious to everyone by now that flow-based forwarding doesn’t work well in existing hardware. Switches designed for a large number of flow-like forwarding entries (NEC ProgrammableFlow switches, Enterasys data center switches and a few others) might be an exception, but even they can’t cope with the tremendous flow update rates required by reactive flow setup.
A few definitions first
Flow-based forwarding is sometimes defined as forwarding of individual transport-layer sessions (sometimes also called microflows). Numerous failed technologies are pretty good proof that this approach doesn’t scale.
Other people define flow-based forwarding as anything that is not destination-address-only forwarding. I don’t really understand how this definition differs from MPLS Forwarding Equivalence Class (FEC) and why we need a new confusing term.
Microflow forwarding in Open vSwitch
Initial versions of Open vSwitch were a textbook example of a microflow-based forwarding architecture: the in-kernel forwarding module performed microflow forwarding and punted all unknown packets to the user-mode daemon.
The user-mode daemon would then perform packet lookup (using OpenFlow forwarding entries or any other forwarding algorithm) and install a microflow entry for the newly discovered flow in the kernel module.
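The punt-and-install loop described above can be sketched as follows. This is a minimal illustration of the idea, not the actual OVS code; the names `Datapath` and `slow_path_lookup` are made up for this example:

```python
# Hypothetical sketch of the original OVS split: an in-kernel microflow
# cache keyed on the full header tuple, with cache misses punted to a
# user-mode daemon that installs an exact-match entry afterwards.

def microflow_key(pkt):
    # A microflow is identified by the complete header tuple, so two TCP
    # sessions between the same pair of hosts miss the cache separately.
    return (pkt["src_mac"], pkt["dst_mac"], pkt["src_ip"], pkt["dst_ip"],
            pkt["proto"], pkt["src_port"], pkt["dst_port"])

class Datapath:
    def __init__(self, slow_path_lookup):
        self.cache = {}                           # "kernel" microflow table
        self.slow_path_lookup = slow_path_lookup  # "user-mode daemon"

    def forward(self, pkt):
        key = microflow_key(pkt)
        actions = self.cache.get(key)
        if actions is None:
            # Cache miss: punt to the daemon, which consults the full
            # forwarding tables and installs an exact-match entry, so the
            # rest of the session stays in the fast path.
            actions = self.slow_path_lookup(pkt)
            self.cache[key] = actions
        return actions
```

Only the first packet of every session takes the expensive slow path; all subsequent packets of the same microflow hit the exact-match cache.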
Third parties (example: Midokura Midonet) use the Open vSwitch kernel module in combination with their own user-mode agents to implement non-OpenFlow forwarding architectures.
If you’re old enough to remember the Catalyst 5000, you’re probably getting unpleasant flashbacks of Netflow switching … but the problems we experienced with that solution must have been caused by poor hardware and underperforming CPU, right? Well, it turns out virtual switches aren't much better.
Digging deep into the bowels of Open vSwitch reveals an interesting behavior: flow eviction. Once the kernel module hits the maximum number of microflows, it starts throwing out old flows. Makes perfect sense – after all, that’s how every caching system works – until you realize the default limit is 2500 microflows, which is barely good enough for a single web server and definitely orders of magnitude too low for a hypervisor hosting 50 or 100 virtual machines.
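The eviction behavior is easy to model: once the table is full, installing a new flow throws out an old one, and the next packet of the evicted flow has to take the slow path again. A simplified sketch (LRU-style eviction, which is an assumption on my part; the actual OVS eviction policy may differ):

```python
# Illustrative flow table with eviction: once the table hits its limit
# (2,500 entries by default in early OVS), the least-recently-used flow
# is evicted to make room for the new one.
from collections import OrderedDict

class FlowTable:
    def __init__(self, max_flows=2500):
        self.max_flows = max_flows
        self.flows = OrderedDict()   # insertion/recency-ordered entries

    def install(self, key, actions):
        if key in self.flows:
            self.flows.move_to_end(key)
        elif len(self.flows) >= self.max_flows:
            self.flows.popitem(last=False)   # evict least-recently-used flow
        self.flows[key] = actions

    def lookup(self, key):
        actions = self.flows.get(key)
        if actions is not None:
            self.flows.move_to_end(key)      # refresh recency on a hit
        return actions
```

With a busy hypervisor churning through far more than 2,500 concurrent microflows, flows get evicted while their sessions are still active, and the punt-evict-punt cycle starts eating CPU.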
Why, oh why?
The very small microflow cache size doesn’t make any obvious sense. After all, web servers easily handle 10,000 concurrent sessions, and some Linux-based load balancers handle an order of magnitude more sessions per server. You can increase the default OVS flow cache size, but one is still bound to wonder why the default value is so dismally low.
I wasn’t able to figure out the root cause, but I suspect it has to do with per-flow accounting: flow counters have to be transferred from the kernel module to the user-mode daemon periodically, and copying hundreds of thousands of flow counters over a kernel-to-user socket at short intervals might result in “somewhat” noticeable CPU utilization.
Did I get it all wrong? Please correct me in the comments ;)
How can you fix it?
Isn’t it obvious? You drop the whole notion of microflow-based forwarding and do things the traditional way. OVS moved in this direction with release 1.11, which implemented megaflows (coarser, OpenFlow-like wildcard forwarding entries) in the kernel module and moved flow eviction from the kernel to the user-mode OpenFlow agent (which makes perfect sense, as kernel forwarding entries now almost exactly match user-mode OpenFlow entries).
Not surprisingly, no other virtual switch uses microflow-based forwarding. VMware vSwitch, Cisco’s Nexus 1000V and IBM’s 5000V make forwarding decisions based on destination MAC addresses, Hyper-V and Contrail based on destination IP addresses, and even VMware NSX for vSphere uses distributed vSwitch and in-kernel layer-3 forwarding module.
Check out SDN resources page @ ipSpace.net.