OpenFlow and Fermi Estimates

Rapid advances in networking technologies (and the pixie dust sprinkled on them) have blinded us: we’ve lost our gut feeling and our rules of thumb. Guess what: contrary to what we love to believe, networking isn’t unique. Physicists faced the same challenge for a long time; one of them, Enrico Fermi, was so good at it that they named the whole problem category after him.

Every time someone tries to tell you what your problem is, and how their wonderful new gizmo will solve it, it’s time for another Fermi estimate.

Let’s start with a few examples.

Data center bandwidth. A few weeks ago a clueless individual working for a major networking vendor wrote a blog post (which unfortunately got pulled before I could link to it) explaining how network virtualization differs from server virtualization because we don’t have enough bandwidth in the data center. A quick estimate shows that a few ToR switches have all the bandwidth you usually need (you might need more due to traffic bursts and the number of server ports you have to provide, but that’s a different story).

VM mobility for disaster avoidance needs. A back-of-the-napkin calculation shows you can’t evacuate more than half a rack per hour over a 10GE link. The response I usually get when I prod networking engineers into doing the calculation: “OMG, that’s just hilarious. Why would anyone want to do that?”
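Here is a minimal sketch of that napkin calculation. The rack profile (40 servers with 256 GB of RAM each) is an illustrative assumption, not a number from the original calculation, and the sketch counts only VM memory on a perfectly utilized link, so real-life results would be worse.

```python
# Fermi estimate: how much VM memory fits through one 10GE link in an hour?
LINK_GBPS = 10            # from the text: a single 10GE link
SECONDS_PER_HOUR = 3600

# Hypothetical rack profile (assumptions, not from the original post)
SERVERS_PER_RACK = 40
RAM_PER_SERVER_GB = 256


def evacuated_gb_per_hour(link_gbps: float, efficiency: float = 1.0) -> float:
    """Gigabytes transferable in one hour at the given link efficiency."""
    return link_gbps / 8 * SECONDS_PER_HOUR * efficiency


rack_ram_gb = SERVERS_PER_RACK * RAM_PER_SERVER_GB
moved_gb = evacuated_gb_per_hour(LINK_GBPS)
fraction_of_rack = moved_gb / rack_ram_gb

print(f"Moved per hour: {moved_gb:.0f} GB")
print(f"Rack memory:    {rack_ram_gb} GB")
print(f"Fraction of rack evacuated per hour: {fraction_of_rack:.2f}")
```

With these assumed numbers the link moves about 4.5 TB per hour, which is less than half of the rack's memory. Change the two rack constants to match your own environment.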

And now for the real question that triggered this blog post: some people still think we can implement stateful OpenFlow-based network services (NAT, FW, LB) in hardware. How realistic is that?

Scenario: web application(s) hosted in a data center with 10GE WAN uplink.

Questions:

  • How many new sessions are established per second (how many OpenFlow flows does the controller have to install in the hardware)?
  • How many parallel sessions will there be (how many OpenFlow flows does the hardware have to support)?

Facts (these are usually the hardest to find)

Using facts #3 and #4 we can estimate the total number of sessions needed for a single web page. It’s anywhere between 20 and 120, let’s be conservative and use 20.

Using fact #1 and the previous result, we can estimate the amount of data transferred over a typical HTTP session: 50 KB (a page of roughly 1 MB spread across 20 sessions).

Assuming a typical web page takes 5 seconds to load, a typical web user receives 200 KB/second (1.6 Mbps) over 20 sessions, or 10 KB/second (80 kbps) per session. That seems low, but do remember that the browser (or the server) spends most of the time waiting due to RTT latency and TCP slow start.

Assuming a constant stream of users with these characteristics, we get 125,000 new sessions over a 10GE link every 5 seconds (10 Gbps divided by 80 kbps per session), or 25,000 new sessions per second per 10 Gbps.
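The chain of estimates above can be sketched in a few lines of Python. The page size of roughly 1 MB is inferred from the numbers in the text (20 sessions of 50 KB each), and the sketch assumes decimal units (1 KB = 1000 bytes) and a fully loaded link.

```python
# Fermi estimate: new HTTP sessions per second on a saturated 10GE uplink.
PAGE_SIZE_KB = 1000              # inferred: 20 sessions x 50 KB per session
SESSIONS_PER_PAGE = 20           # conservative end of the 20..120 range
PAGE_LOAD_SECONDS = 5            # assumed page load time from the text
LINK_BPS = 10_000_000_000        # 10GE WAN uplink

kb_per_session = PAGE_SIZE_KB / SESSIONS_PER_PAGE
session_bps = kb_per_session * 1000 * 8 / PAGE_LOAD_SECONDS
parallel_sessions = LINK_BPS / session_bps
new_sessions_per_second = parallel_sessions / PAGE_LOAD_SECONDS

print(f"Data per session:        {kb_per_session:.0f} KB")
print(f"Bandwidth per session:   {session_bps / 1000:.0f} kbps")
print(f"Sessions filling 10GE:   {parallel_sessions:,.0f}")
print(f"New sessions per second: {new_sessions_per_second:,.0f}")
```

Every constant is a knob: use 120 sessions per page instead of 20 and the session rate goes up sixfold.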

Always do a reality check. Is this number realistic? Load balancing vendors support way more connections per second (cps) @ 10 Gbps speeds. F5 BIG-IP 4000s claims 150K cps @ 10 Gbps, and VMware claims its NSX Edge Services Router (improved vShield Edge) will support 30K cps @ 4 Gbps. It seems my guesstimate is on the lower end of reality (if you have real-life numbers, please do share them in comments!).
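One way to do the reality check is to normalize all three numbers to connections per second per Gbps. This is only a rough sketch: vendor cps and throughput figures are usually separate maximums, so treating them as a single operating point is an assumption.

```python
# Normalize claimed connection rates to cps per Gbps for a rough comparison.
# Figures are the ones quoted in the text: (cps, throughput in Gbps).
claims = {
    "Fermi estimate (this post)": (25_000, 10),
    "F5 BIG-IP 4000s": (150_000, 10),
    "VMware NSX Edge Services Router": (30_000, 4),
}

cps_per_gbps = {name: cps / gbps for name, (cps, gbps) in claims.items()}

# Print from lowest to highest normalized rate.
for name, value in sorted(cps_per_gbps.items(), key=lambda kv: kv[1]):
    print(f"{name:33s} {value:8.0f} cps per Gbps")
```

The estimate comes out lowest of the three, so it is unlikely to overstate the problem.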

Modern web browsers use persistent HTTP sessions. Browsers want to keep sessions established as long as possible, while web servers serving high-volume content commonly drop them after ~15 seconds to reduce the server load (Apache is notoriously bad at handling a very high number of concurrent sessions). 25,000 cps x 15 seconds = 375,000 flow records.
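The multiplication above is an instance of Little's Law (items in the system = arrival rate x time each item stays). A minimal sketch, where the 5-second and 60-second timeouts are illustrative assumptions added to show how sensitive the result is to the server's idle timeout:

```python
# Little's Law: concurrent flows = session arrival rate x session hold time.
NEW_SESSIONS_PER_SECOND = 25_000   # estimate derived earlier in the text


def concurrent_flows(sessions_per_second: int, hold_time_seconds: int) -> int:
    """Flow records that must exist at once for the given rate and hold time."""
    return sessions_per_second * hold_time_seconds


# 15 seconds is the timeout from the text; 5 and 60 are hypothetical.
flows_by_timeout = {
    timeout: concurrent_flows(NEW_SESSIONS_PER_SECOND, timeout)
    for timeout in (5, 15, 60)
}

for timeout, flows in flows_by_timeout.items():
    print(f"{timeout:3d} s timeout -> {flows:>9,} flow records")
```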

Trident-2-based switches can (supposedly, see also comments) handle 100K+ L4 OpenFlow entries (at least BigSwitch claimed so when we met @ NFD6). That’s well below the number of flow records required at 10 Gbps; do keep in mind that the total throughput of a typical Trident-2 switch is above 1 Tbps, or two orders of magnitude higher. Enterasys switches support 64M concurrent flows @ 1 Tbps, which seems to be enough.

The flow setup rate on Trident-2-based switches is supposedly still in low thousands, or an order of magnitude too low to support a single 10 Gbps link (the switches based on this chipset usually have 64 10GE interfaces).
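To put the hardware numbers next to the requirements, here is a rough sketch. The flow setup rate of 2,000 per second is an assumed stand-in for "low thousands", and scaling the per-link requirement linearly to all 64 ports is a simplification.

```python
# Compare estimated requirements against the hardware figures quoted above.
REQUIRED_FLOWS_PER_10G = 375_000       # 25,000 cps x 15 s (from the text)
REQUIRED_SETUPS_PER_10G = 25_000       # new sessions per second per 10 Gbps

TRIDENT2_FLOW_ENTRIES = 100_000        # claimed "100K+" L4 OpenFlow entries
TRIDENT2_SETUPS_PER_SECOND = 2_000     # assumption for "low thousands"
TRIDENT2_10GE_PORTS = 64

ENTERASYS_FLOWS = 64_000_000           # concurrent flows @ 1 Tbps
ENTERASYS_GBPS = 1000

table_shortfall = REQUIRED_FLOWS_PER_10G / TRIDENT2_FLOW_ENTRIES
setup_shortfall = REQUIRED_SETUPS_PER_10G / TRIDENT2_SETUPS_PER_SECOND
full_switch_flows = REQUIRED_FLOWS_PER_10G * TRIDENT2_10GE_PORTS
full_switch_setups = REQUIRED_SETUPS_PER_10G * TRIDENT2_10GE_PORTS
enterasys_flows_per_10g = ENTERASYS_FLOWS * 10 // ENTERASYS_GBPS

print(f"Flow table shortfall (one 10GE link): {table_shortfall:.2f}x")
print(f"Flow setup shortfall (one 10GE link): {setup_shortfall:.1f}x")
print(f"Flows needed for all 64 ports:        {full_switch_flows:,}")
print(f"Setups/s needed for all 64 ports:     {full_switch_setups:,}")
print(f"Enterasys flows per 10 Gbps:          {enterasys_flows_per_10g:,}")
```

Even with a generous setup-rate assumption, the gap on a single link is roughly an order of magnitude, and it grows with every additional loaded port.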

Now is the time for someone to invoke the ultimate Moore’s Law spell and claim that the hardware will support whatever number of flow entries in the not-so-distant future. Good luck with that; I’ll settle for an Intel Xeon server that can be pushed to 25 Mpps. OpenFlow has its uses, but large-scale stateful services are obviously not among them.

More information

If you wonder where I got the HTTP and TCP numbers I used in the guesstimate, it’s high time you watched my TCP, HTTP & SPDY webinar (also available on Udemy).

5 comments:

  1. Google published some metrics about average page sizes and number of elements/hosts etc. here: https://developers.google.com/speed/articles/web-metrics

  2. "Trident-2-based switches can handle 100K+ L4 OpenFlow entries". How can they do it? Maybe it's LPM? Otherwise Trident-2 doesn't have such large flow tables.

    Replies
    1. As I wrote, that was what BigSwitch engineers told me @ NFD6. Whenever I tried to get something more specific from them, they started hiding behind the NDA curtain. Maybe it was all a fairy tale ...

  3. "Trident-2-based switches can (supposedly, see also comments) handle 100K+ L4 OpenFlow entries (at least BigSwitch claimed so when we met @ NFD6)."

    Does this apply to a switch acting as an (L3) VTEP for an SDN like Nuage Networks? (next Broadcom chipset, or the ones that HP uses in their 7900 series)

    From my understanding, in this case, only "static" rules would have been pushed from the controllers, and not with each new flow (1:1 NAT, routes to VTEP for each VM, ...)

    HP seems to push their 7900 as a VTEP gateway in data centers. With their 65k flows (from an example in the docs), it wouldn't be of much use if each connection required a new flow.

    Replies
    1. Obviously it's impossible to answer your question without knowing the details of the Trident-2 chipset (and you can't get them unless you sell your soul and sign their NDA in blood ;), but I would hope they can use the MAC address table for L2 VXLAN entries.

      The L3 VTEP on the Nuage box uses internal packet recirculation (see my VXLAN webinar and VXLAN routing blog posts for details), so the L3 lookup uses traditional L3 tables (LPM table and ARP table).
