Update: Performance of Hash Table Lookups

In the Myths That Refuse to Die: Scalability of Overlay Virtual Networking blog post I wrote that the “number of MAC addresses has absolutely no impact on the forwarding performance until the MAC hash table overflows”, which happens to be almost true.

As always, the devil is in the details: assuming you dedicate one or more CPU cores to the forwarding process, as Intel DPDK or Snabb Switch do, a small MAC table might fit into the CPU cache, while a hash lookup in a large MAC table inevitably results in accesses to main memory, which are way slower than reads from the CPU cache.
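
If you want to see the effect yourself, here's a minimal C sketch (my illustration, not code from any vSwitch) that times random lookups in open-addressing hash tables of growing size. The table sizes, entry layout and hash function are arbitrary choices, but on a typical server you should see the per-lookup cost jump once the table no longer fits into the CPU caches:

```c
/* Hypothetical micro-benchmark: random lookups in open-addressing hash
 * tables of growing size. Compile with: gcc -O2 -o hash_bench hash_bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

struct entry {
    uint64_t mac;   /* MAC address in the low 48 bits; 0 marks an empty slot */
    uint32_t port;  /* the "forwarding decision" we're looking up */
};

/* FNV-1a - an arbitrary choice, any decent 64-bit hash would do */
static uint64_t fnv1a(uint64_t key)
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < 8; i++) {
        h ^= (key >> (i * 8)) & 0xff;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    /* from "fits in L1" to "spills into main memory" */
    size_t sizes[] = { 1 << 10, 1 << 14, 1 << 18, 1 << 22 };
    const size_t lookups = 10 * 1000 * 1000;

    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t slots = sizes[s] * 2;          /* 50% load factor */
        struct entry *tab = calloc(slots, sizeof(*tab));
        if (!tab) return 1;

        /* insert random MAC addresses using linear probing */
        for (size_t i = 0; i < sizes[s]; i++) {
            uint64_t mac = ((uint64_t)rand() << 31 | rand()) | 1;
            size_t slot = fnv1a(mac) & (slots - 1);
            while (tab[slot].mac)
                slot = (slot + 1) & (slots - 1);
            tab[slot].mac = mac;
            tab[slot].port = (uint32_t)i;
        }

        /* time random lookups (mostly misses - same probe path either way) */
        uint64_t sum = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < lookups; i++) {
            uint64_t mac = ((uint64_t)rand() << 31 | rand()) | 1;
            size_t slot = fnv1a(mac) & (slots - 1);
            while (tab[slot].mac && tab[slot].mac != mac)
                slot = (slot + 1) & (slots - 1);
            sum += tab[slot].port;            /* keep the compiler honest */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%8zu entries: %5.1f ns/lookup (checksum %llu)\n",
               sizes[s], ns / lookups, (unsigned long long)sum);
        free(tab);
    }
    return 0;
}
```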

I don’t have any vSwitch results that would corroborate this line of thinking, but engineers running properly designed tests of hash tables in scripting languages came to the same conclusion – see Real Measurement Is Hard for details (and enjoy the beauty of the Perl code, which proves that Perl doesn’t have to look like line noise).

If you understood the line noise reference above, you’re way too old. Welcome to the “grumpy old engineers” club ;)

On a slightly tangential note, OVS developers have been trying really hard to implement fast hash table lookups in recent OVS releases – if you like reading Donald Knuth’s books, you’ll thoroughly enjoy Optimistic Concurrent Cuckoo Hash (HT: highscalability.com – an absolute must-read if you’re interested in any aspect of scalability).
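
In case you haven't met cuckoo hashing yet, here's a minimal two-table sketch (again my illustration, not the OVS implementation – the optimistic concurrent variant adds version counters so readers never block on writers). The beauty of the scheme: a lookup touches at most two buckets, so the worst case stays O(1):

```c
/* Minimal two-table cuckoo hash over 64-bit keys - an illustration of the
 * basic idea, not the OVS variant (which adds optimistic concurrency). */
#include <stdio.h>
#include <stdint.h>

#define SLOTS     1024          /* per table, must be a power of two */
#define MAX_KICKS 32            /* give up after this many evictions */

static uint64_t keys[2][SLOTS]; /* 0 marks an empty slot */

/* two independent hash functions (multiplicative, arbitrary odd constants) */
static size_t h1(uint64_t k) { return (k * 0x9E3779B97F4A7C15ULL) >> 54; }
static size_t h2(uint64_t k) { return (k * 0xC2B2AE3D27D4EB4FULL) >> 54; }

/* a lookup touches at most two buckets - the worst case is O(1) */
static int lookup(uint64_t k)
{
    return keys[0][h1(k)] == k || keys[1][h2(k)] == k;
}

/* insert kicks the current occupant out to its alternate table if needed */
static int insert(uint64_t k)
{
    for (int i = 0; i < MAX_KICKS; i++) {
        int t = i & 1;          /* alternate between the two tables */
        size_t slot = t ? h2(k) : h1(k);
        uint64_t evicted = keys[t][slot];
        keys[t][slot] = k;
        if (!evicted)
            return 1;
        k = evicted;            /* re-home the evicted key next round */
    }
    return 0;                   /* a real table would rehash or grow here */
}

int main(void)
{
    for (uint64_t k = 1; k <= 500; k++)
        insert(k);
    printf("42 present: %d, 9999 present: %d\n", lookup(42), lookup(9999));
    return 0;
}
```

The price is paid at insert time: a new key might evict a whole chain of existing keys into their alternate buckets, and a real implementation rehashes or grows the table when the kick-out loop runs too long.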

5 comments:

  1. CPU L1 cache has a lot of similarities to TCAM, so the performance scalability concerns make a lot of sense. Dedicating a core or 2 to forwarding gives you a dedicated cache.

    Keeping the table sizes small is important - really not too dissimilar to the concerns you'd have in a ToR switch. Some tricks, like conversational L2 & L3 learning that pushes only active entries into the cache, might help here.
  2. Ivan,

    Reading both threads, you suggest that:

    1) Scale is not a problem, as there is always unlimited memory on the hypervisor.

    2) There is no performance difference between tunnels (NVO) and non-tunnels (VLANs or no VLANs), but performance depends on CPU cache size: if the table size exceeds the CPU cache, there is a performance hit.

    3) Another angle to the discussion is cost. Which of the following is less expensive?
    - Burning extra CPUs for forwarding
    - NIC offloads
    - Dedicated hardware
    Replies
    1. #1 - Almost. Memory is always limited unless you have an ideal Turing machine ;) ... but I don't expect anyone to build a virtual network that would be large enough to exceed the memory size of modern servers ;))

      #2 - Coming to that one (tomorrow). The actual performance hit depends on whether the NIC hardware supports TCP offload over tunnels, but there won't be much performance difference (if any, given all the other things that need to be done) based on the lookup table sizes.

      #3 - As always, the answer is "it depends". In most environments, the hypervisors have some spare CPU capacity, but of course it's always better to have dedicated hardware (particularly if it's bundled with the default server hardware configuration).
  3. I believe the Linux/BSD kernels have the concept of a "forwarding cache" – a small hash table, keyed on a 5- or 7-tuple, that fits in a few cache lines and holds forwarding information for recently seen flows (see the sketch after this comment).

    Wouldn't this mean that forwarding performance on software-based hosts does in fact depend on the number of *flows*, degrading rapidly once there are enough flows to push hash table accesses into the L2/L3 caches?

    OVS, DPDK, and other user-space forwarding techniques could behave differently from kernel-based forwarding, I suppose.
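
Here's a minimal sketch of the flow-cache idea from the comment above – a direct-mapped cache keyed on the IPv4 5-tuple. The field layout, cache size and hash function are illustrative assumptions, not the actual Linux/BSD kernel data structures:

```c
/* Hypothetical direct-mapped flow cache keyed on the IPv4 5-tuple -
 * illustrative only, not the actual Linux/BSD kernel data structures. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct flow_key {               /* the classic 5-tuple */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    uint8_t  pad[3];            /* explicit padding keeps memcmp() safe */
};

struct flow_entry {
    struct flow_key key;
    uint32_t out_port;          /* the cached forwarding decision */
    uint8_t  valid;
};

#define CACHE_SLOTS 256         /* small enough to stay in L1/L2 */
static struct flow_entry cache[CACHE_SLOTS];

/* FNV-1a over the key bytes - an arbitrary choice, not the kernel's hash */
static uint32_t hash_key(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* returns the cached output port, or -1 on a miss (slow path: full lookup) */
static int flow_cache_lookup(const struct flow_key *k)
{
    struct flow_entry *e = &cache[hash_key(k) & (CACHE_SLOTS - 1)];
    if (e->valid && !memcmp(&e->key, k, sizeof(*k)))
        return (int)e->out_port;
    return -1;
}

static void flow_cache_insert(const struct flow_key *k, uint32_t port)
{
    struct flow_entry *e = &cache[hash_key(k) & (CACHE_SLOTS - 1)];
    e->key = *k;                /* direct-mapped: a collision just overwrites */
    e->out_port = port;
    e->valid = 1;
}

int main(void)
{
    /* 10.0.0.1:12345 -> 10.0.0.2:80, TCP */
    struct flow_key k = { 0x0a000001, 0x0a000002, 12345, 80, 6, {0} };
    printf("before insert: %d\n", flow_cache_lookup(&k));  /* miss: -1 */
    flow_cache_insert(&k, 7);
    printf("after insert:  %d\n", flow_cache_lookup(&k));  /* hit:   7 */
    return 0;
}
```
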
  4. Love that joke with line noise!