OpenFlow and Fermi Estimates
Fast advances in networking technologies (and the pixie dust sprinkled on them) blinded us – we lost our gut feeling and rules of thumb. Guess what: contrary to what we love to believe, networking isn’t unique. Physicists have faced the same challenge for a long time; one of them was so good at quick estimates that the whole problem category got named after him.
Every time someone tries to tell you what your problem is, and how their wonderful new gizmo will solve it, it’s time for another Fermi estimate.
Let’s start with a few examples.
Data center bandwidth. A few weeks ago a clueless individual working for a major networking vendor wrote a blog post (which unfortunately got pulled before I could link to it) explaining how network virtualization differs from server virtualization because we don’t have enough bandwidth in the data center. A quick estimate shows that a few ToR switches provide all the bandwidth you usually need (you might need more of them because of traffic bursts and the number of server ports you have to provide, but that’s a different story).
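If you want to check that claim yourself, here’s a minimal sketch in Python; the server count, average per-server traffic and ToR switching capacity are my own assumptions, not numbers from the pulled blog post, so plug in your own:

```python
# Rough sanity check: how much aggregate bandwidth does a "typical" data
# center actually push, and how many ToR-class ASICs would that fill?

servers = 1000                 # assumed server count
avg_gbps_per_server = 1.0      # assumed average (not peak) traffic per server
tor_capacity_gbps = 1280       # assumed switching capacity of one modern ToR

aggregate_gbps = servers * avg_gbps_per_server
tors_needed = aggregate_gbps / tor_capacity_gbps

print(f"Aggregate average traffic: {aggregate_gbps:.0f} Gbps")
print(f"ToR switches worth of capacity: {tors_needed:.1f}")
# ~1 Tbps of average traffic fits into the switching capacity of roughly one
# ToR-class ASIC; you deploy more of them for ports and burst headroom,
# not for raw bandwidth.
```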
VM mobility for disaster avoidance needs. A back-of-the-napkin calculation shows you can’t evacuate more than half a rack per hour over a 10GE link. The response I usually get when I prod networking engineers into doing the calculation: “OMG, that’s just hilarious. Why would anyone want to do that?”
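Here’s one way to redo that napkin math; the rack density, per-server RAM and usable link fraction are assumptions on my part, so adjust them to your environment:

```python
# Back-of-the-napkin VM evacuation estimate (assumed numbers, adjust to taste)

link_gbps = 10                   # evacuation runs over a 10GE link
link_efficiency = 0.8            # assumed usable fraction after protocol overhead
servers_per_rack = 40            # assumed rack density
ram_per_server_gb = 256          # assumed RAM per server, all of it in use

usable_gbps = link_gbps * link_efficiency
rack_ram_gb = servers_per_rack * ram_per_server_gb
hours_to_evacuate = (rack_ram_gb * 8) / (usable_gbps * 3600)

print(f"Hours to evacuate one rack: {hours_to_evacuate:.1f}")
# ~2.8 hours for a full rack, i.e. well under half a rack per hour, and that
# ignores dirty-page retransmissions during live migration.
```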
And now for the real question that triggered this blog post: some people still think we can implement stateful OpenFlow-based network services (NAT, FW, LB) in hardware. How realistic is that?
Scenario: web application(s) hosted in a data center with 10GE WAN uplink.
Questions:
- How many new sessions are established per second (how many OpenFlow flows does the controller have to install in the hardware)?
- How many parallel sessions will there be (how many OpenFlow flows does the hardware have to support)?
Facts (these are usually the hardest to find):
- Size of an average web page is ~1MB
- An average web page loads in ~5 seconds
- An average web page uses ~20 domains
- An average browser can open up to 6 sessions per hostname
Using facts #3 and #4 we can estimate the total number of sessions needed for a single web page: anywhere between 20 and 120. Let’s be conservative and use 20.
Using fact #1 and the previous result, we can estimate the amount of data transferred over a typical HTTP session: 50KB.
Assuming a typical web page takes 5 seconds to load, a typical web user receives 200 KB/second (1.6 Mbps) over 20 sessions, or 10 KB/second (80 kbps) per session. Seems low, but do remember that most of the time the browser (or the server) is waiting due to RTT latency and TCP slow start issues.
Assuming a constant stream of users with these characteristics, we get 125,000 new sessions over a 10GE uplink every 5 seconds, or 25,000 new sessions per second per 10 Gbps.
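The whole chain of reasoning fits in a few lines of Python if you want to play with the inputs (the inputs are simply the facts listed above):

```python
# Fermi estimate: new TCP sessions per second on a 10 Gbps WAN uplink

page_size_kb = 1000          # ~1 MB per average web page
page_load_time_s = 5         # ~5 seconds to load a page
sessions_per_page = 20       # conservative: one session per domain

kb_per_session = page_size_kb / sessions_per_page         # ~50 KB per session
user_kbps = page_size_kb * 8 / page_load_time_s            # ~1600 kbps per user
uplink_kbps = 10_000_000                                    # 10 Gbps

concurrent_users = uplink_kbps / user_kbps                  # ~6,250 users
new_sessions_per_s = concurrent_users * sessions_per_page / page_load_time_s

print(f"Data per session: {kb_per_session:.0f} KB")
print(f"New sessions per second: {new_sessions_per_s:,.0f}")   # ~25,000
```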
Always do a reality check: is this number realistic? Load balancing vendors claim way more connections per second (cps) @ 10 Gbps speeds: F5 BIG-IP 4000s claims 150K cps @ 10 Gbps, and VMware claims its NSX Edge Services Router (improved vShield Edge) will support 30K cps @ 4 Gbps. It seems my guesstimate is on the lower end of reality (if you have real-life numbers, please do share them in the comments!).
Modern web browsers use persistent HTTP sessions. Browsers want to keep sessions established as long as possible, but web servers serving high-volume content commonly drop them after ~15 seconds to reduce the server load (Apache is notoriously bad at handling a very high number of concurrent sessions). 25,000 cps × 15 seconds = 375,000 flow records.
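Continuing the sketch, the flow table size is simply the arrival rate times the average session lifetime (the 15-second lifetime is the assumed server-side keepalive timeout from the previous paragraph):

```python
# Concurrent flow records = arrival rate x average session lifetime
new_sessions_per_s = 25_000      # result of the previous estimate
session_lifetime_s = 15          # assumed server-side keepalive timeout

flow_records = new_sessions_per_s * session_lifetime_s
print(f"Concurrent flow records: {flow_records:,}")   # 375,000
```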
Trident-2-based switches can (supposedly, see also comments) handle 100K+ L4 OpenFlow entries (at least BigSwitch claimed so when we met @ NFD6). That’s definitely on the low end of the required number of sessions at 10 Gbps; do keep in mind that the total throughput of a typical Trident-2 switch is above 1 Tbps, or two orders of magnitude higher. Enterasys switches support 64M concurrent flows @ 1 Tbps, which seems to be enough.
The flow setup rate on Trident-2-based switches is supposedly still in the low thousands, or an order of magnitude too low to support even a single 10 Gbps link (and the switches based on this chipset usually have 64 10GE interfaces).
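To put a fully loaded switch (not just a single 10GE uplink) in perspective, here’s the same arithmetic scaled to 64 ports; the 100K table size is the claim quoted above, and the 2K setup rate is my stand-in for “low thousands”:

```python
# How far short does a Trident-2-class switch fall if every port carried
# web traffic like the 10 Gbps uplink above?

ports_10ge = 64
flows_per_10g = 375_000          # concurrent flows per 10 Gbps (estimate above)
cps_per_10g = 25_000             # new flows per second per 10 Gbps

table_size = 100_000             # claimed L4 OpenFlow table size
setup_rate = 2_000               # assumed value for "low thousands" cps

flows_needed = ports_10ge * flows_per_10g
cps_needed = ports_10ge * cps_per_10g

print(f"Flow table shortfall: {flows_needed / table_size:.0f}x")   # ~240x
print(f"Setup rate shortfall: {cps_needed / setup_rate:.0f}x")     # ~800x
```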
Now is the time for someone to invoke the ultimate Moore’s Law spell and claim that the hardware will support whatever number of flow entries in the not-so-distant future. Good luck with that; I’ll settle for an Intel Xeon server that can be pushed to 25 Mpps. OpenFlow has its uses, but large-scale stateful services are obviously not one of them.
More information
If you wonder where I got the HTTP and TCP numbers I used in the guesstimate, it’s high time you watch my TCP, HTTP & SPDY webinar.
Does this apply to a switch acting as an (L3) VTEP for an SDN like Nuage Networks? (next Broadcom chipset, or the ones HP uses in their 7900 series)
From my understanding, in this case only "static" rules would be pushed from the controllers, not a new rule for each flow (1:1 NAT, routes to the VTEP for each VM, ...)
HP seems to push their 7900 as a VTEP gateway in data centers. With their 65K flows (from an example in the docs), it wouldn't be of much use if each connection required a new flow.
The L3 VTEP on the Nuage box uses internal packet recirculation (see my VXLAN webinar and VXLAN routing blog posts for details), so the L3 lookup uses traditional L3 tables (LPM table and ARP table).