Response: Are Open-Source Controllers Ready for Carrier-Grade Services?
My beloved source of meaningless marketing messages led me to a blog post with a catchy headline: are open-source SDN controllers ready for carrier-grade services?
It turned out the whole thing was a simple marketing gig for Ixia testers, but supposedly “the response of the attendees of an SDN event was overwhelming”, which worries me… or makes me happy, because it’s easy to see plenty of fix-and-redesign work in the future.
Anyway, let’s walk through the presentation.
What was the testbed? Ixia software emulated numerous OpenFlow switches connecting to a single instance of an open source OpenFlow controller. The switches were connected in a linear topology (N 2-port switches in sequence), which is the least likely topology you’ll ever see in a network.
What were they measuring? Pretty useless stuff that’s easy to measure:
- How many OpenFlow switches can connect to a single controller instance?
- How long does it take the controller to install a single flow across all switches?
- How long does it take a controller to discover network topology?
Also, it’s impossible (from the presentation published on Ixia web site) to figure out what exactly they were measuring, and whether it's relevant. For example, they assume the controller discovered the network topology when the LLDP packets generated by the controller were delivered back to the controller.
Why are those metrics useless? Let’s go through them one-by-one:
- How many OpenFlow switches can connect to a controller? A single OpenFlow domain is a single failure domain, and unless you plan to use overlay virtual networking (= mimic wireless controllers) you don’t want your failure domain to be too large. Also, a decent carrier-grade controller would have a scale-out architecture (no, not a cluster of two controllers, but a real scale-out architecture with eventual consistency), which would make this metric moot.
- How long does it take the controller to install a single flow? This one might expose internal workings of a controller (is the controller programming flows in switch-by-switch sequence or in parallel), but measuring anything beyond a few dozen switches (= number of hops across the network) is plain ridiculous. Not surprisingly, the “interesting” behavior emerges in the totally-ridiculous territory (500+ switches in sequence), so let’s put that on the slide and claim victory.
- How long does it take to discover network topology? Measuring this on a chain of 100 switches in linear topology is absolutely meaningless. What would make sense are questions like “how quickly is a topology change that is not signaled via an interface down message detected?” or “how quickly are N thousand flows rerouted after a topology change?” We still don’t know.
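To see why the sequential-versus-parallel question matters only at absurd hop counts, here's a back-of-the-envelope model. Everything in it is an assumption made for illustration: the function names, the 2 ms controller-to-switch round trip, and the 0.05 ms cost of sending one flow-mod are invented, not numbers from the presentation or from any real controller.

```python
def sequential_install_ms(hops: int, rtt_ms: float) -> float:
    """Controller programs one switch, waits for its confirmation,
    then moves on to the next switch: one round trip per hop."""
    return hops * rtt_ms


def parallel_install_ms(hops: int, rtt_ms: float, send_cost_ms: float) -> float:
    """Controller sends all flow-mods back to back, then waits for the
    confirmations, which overlap: one send cost per hop plus one round trip."""
    return hops * send_cost_ms + rtt_ms


if __name__ == "__main__":
    RTT_MS = 2.0          # assumed controller-to-switch round trip
    SEND_COST_MS = 0.05   # assumed per-message send cost
    for hops in (5, 20, 500):
        seq = sequential_install_ms(hops, RTT_MS)
        par = parallel_install_ms(hops, RTT_MS, SEND_COST_MS)
        print(f"{hops:4d} hops: sequential {seq:7.1f} ms, parallel {par:6.1f} ms")
```

With these made-up numbers the two strategies differ by milliseconds at realistic hop counts, and the gap becomes dramatic only with hundreds of switches in sequence.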
Finally, while it seems (at least from the presentations like this one) that the main focus of SDN is reinventing bridges (because dynamic MAC learning really needs to get reinvented), everyone conveniently ignores the scalability challenges of running linecard protocols across hundreds of switches from a central controller. BFD anyone?
What has this to do with readiness for carrier-grade services? Absolutely nothing. The setup is irrelevant (no carrier would use a single-instance controller), the switches used (2-port switches) and the linear topology are meaningless, and the metrics they measured don’t reflect real-life scenarios.
The only link to carrier-grade services I could find is the need for a catchy headline.
Ready for a dose of reality?
- Start with the free Introduction to SDN webinar if you need the answer to the “What is SDN?” question.
- Read the SDN and OpenFlow (the Harsh Reality) digital book, because it’s easier to read a book than recursively read over 350 blog posts.
- Watch the OpenFlow Deep Dive webinar to discover true OpenFlow scalability limitations.
I have seen these slides before and I agree that it is useless marketing bla-bla.
This is just an attempt to redirect service providers back to ODL from ONOS. It will not happen. Even academic networks are starting to use ONOS on a global scale.
The important differences between ODL and ONOS are in development philosophy.
Performance does not matter that much, since small scale-out clusters will be used that should be organized into a hierarchy. So how an individual controller instance performs is totally unimportant.
Huawei is also too big an organization, so the right hand does not know what the left hand is doing. The departments do not talk to each other, they redevelop the same things multiple times, they do not ask each other for help, and they even fight each other. I have seen a lot of this...
Ixia is just a testing tool and if you're an experienced tester then you can simulate and test practically anything. As a former system test engineer I used Ixia to simulate T1 scale of routers and test every possible protocol. I agree that a 500x500 grid of ISIS routers doesn't look like any real network, but if you want to see whether your brand new router is capable of handling such a database of nodes, then it's a very good test.
IMO, in this case the physical topology is meaningless because the Device Under Test (DUT) is the controller. With Ixia you can send a million topology change messages while having a single simulated switch with one port. It's just a CPU generating packets, nothing more. The idea is to simulate a mass event, send control packets to the DUT, verify the correct response and measure the response time. This can be a pretty good indication of whether the controller is capable of handling carrier-grade tasks or not.
With different types of events, messages and scale one can test most of a controller's functionality in a pretty good way and compare different types of controllers.
If you have a better suggestion on how to benchmark an SDN controller, please share.
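The "send events, verify the response, measure the response time" loop can be sketched roughly as follows. This is a toy illustration, not how Ixia or any real controller works: `handle_event` is a hypothetical stand-in for the DUT, `is_correct` is a hypothetical checker, and the nearest-rank percentile is just one simple way to summarize the samples.

```python
import time


def run_burst(handle_event, events, is_correct):
    """Feed each event to the device under test, check the reply,
    and record how long each call took in milliseconds."""
    latencies_ms = []
    for event in events:
        start = time.perf_counter()
        reply = handle_event(event)  # hypothetical stand-in for the controller
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not is_correct(event, reply):
            raise AssertionError(f"unexpected reply for event {event!r}")
        latencies_ms.append(elapsed_ms)
    return latencies_ms


def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


if __name__ == "__main__":
    # Toy DUT that simply acknowledges every event it receives.
    latencies = run_burst(
        handle_event=lambda event: ("ack", event),
        events=range(100_000),
        is_correct=lambda event, reply: reply == ("ack", event),
    )
    print(f"median {percentile(latencies, 50):.4f} ms, "
          f"p99 {percentile(latencies, 99):.4f} ms")
```

A real test would send the events over the control channel concurrently rather than calling a function in a loop, but the measure-and-verify structure stays the same.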
I agree with everything you wrote. However, my points were that:
A) some of the tests were ridiculous and have nothing to do with real life (for example, there's no need to ever establish a flow across a sequence of 1000 switches controlled by a single controller);
B) claiming anything about readiness for carrier-grade services based on the tests described in the presentation makes no sense.