NEC+IBM: Enterprise OpenFlow you can actually touch

I didn’t expect we’d see a multi-vendor OpenFlow deployment any time soon. NEC and IBM decided to change that, and Tervela, a company specializing in messaging-based data fabrics, decided to verify their interoperability claims. Janice Roberts, who works with NEC Corporation of America, helped me get in touch with them, and I was pleasantly surprised by their optimistic view of OpenFlow deployment in typical enterprise networks.

A bit of background

Tervela’s data fabric solutions typically run on top of traditional networking infrastructure, and an underperforming network (particularly long outages triggered by suboptimal STP implementations) can severely impact the behavior of the services running on their platform.

They were looking for a solution that would perform way better than what their customers are typically using today (large layer-2 networks), while at the same time being easy to design, provision and operate. It seems that they found a viable alternative to existing networks in a combination of NEC’s ProgrammableFlow Controller and IBM’s BNT 8264 switches.

Easy to deploy?

As long as your network is not too big (NEC claimed their controller can manage up to 50 switches in their Networking Tech Field Day presentation), design and deployment aren’t too hard according to Tervela’s engineers:

  • They decided to use an out-of-band management network and connected the management port of each BNT 8264 to it (they could also have used any other switch port).
  • All you have to configure on the individual switch is the management VLAN, a management IP address and the IP address of the OpenFlow controllers.
  • The ProgrammableFlow controller automatically discovers the network topology using LLDP packets sent from the controller through individual switch interfaces.
  • After those basic steps, you can start configuring virtual networks in the OpenFlow controller (see the demo NEC made during the Networking Tech Field Day).
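
The LLDP-based discovery step can be pictured roughly like this: the controller sends an LLDP frame out of every switch port, and whichever switch receives it hands it back to the controller, revealing one link. The sketch below is only an illustration of that idea, not NEC's implementation; the report format, the function name and the switch names are all hypothetical.

```python
from collections import defaultdict


def build_topology(lldp_reports):
    """Build an undirected adjacency map from LLDP observations.

    Each report is (sender_switch, sender_port, receiver_switch, receiver_port):
    the controller sent an LLDP frame out of sender_port and got it back
    from receiver_switch. The tuple layout is an assumption for this sketch.
    """
    links = set()
    for s_sw, s_port, r_sw, r_port in lldp_reports:
        # Normalize so A->B and B->A observations collapse into one link.
        links.add(tuple(sorted([(s_sw, s_port), (r_sw, r_port)])))

    adjacency = defaultdict(set)
    for (sw_a, _), (sw_b, _) in links:
        adjacency[sw_a].add(sw_b)
        adjacency[sw_b].add(sw_a)
    return links, dict(adjacency)


# Hypothetical two-leaf, two-spine fabric.
reports = [
    ("leaf1", 1, "spine1", 1),
    ("spine1", 1, "leaf1", 1),  # same link seen from the other side
    ("leaf1", 2, "spine2", 1),
    ("leaf2", 1, "spine1", 2),
    ("leaf2", 2, "spine2", 2),
]
links, adjacency = build_topology(reports)
print(len(links), sorted(adjacency["leaf1"]))
```

With the sample reports, the five observations collapse into four links, and leaf1 ends up adjacent to both spines.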

Obviously, you’d want to follow some basic design rules, for example:

  • Make the management network fully redundant (read the QFabric documentation to see how that’s done properly);
  • Connect the switches into a structure somewhat resembling a Clos fabric, not in a ring or a random mess of cables.

Test results – Latency

Tervela’s engineers ran a number of tests, focusing primarily on latency and failure recovery.

They found out that (as expected) the first packet exchanged between a pair of VMs experiences an 8-9 millisecond latency because it’s forwarded through the OpenFlow controller, while the latency of subsequent packets was too low for them to measure (their tool has a 1 msec resolution).

Lesson#1 – If the initial packet latency matters, use proactive programming mode (if available) to pre-populate the forwarding tables in the switches.

Lesson#2 – Don’t do full 12-tuple lookups unless absolutely necessary. You’d want to experience the latency only when the inter-VM communication starts, not for every TCP/UDP flow (not to mention that capturing every flow in a data center environment is a sure recipe for disaster).
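
Both lessons can be illustrated with a toy flow table that counts how many packets miss the table and have to take the slow path through the controller. This is a simplified sketch, not how any real switch works: the `Switch` class, the field names and the traffic pattern (100 TCP sessions between one pair of VMs) are made up for illustration, and the fine-grained match uses only a subset of the 12-tuple.

```python
class Switch:
    """Toy flow table that counts table misses sent to the controller."""

    def __init__(self, match_fields):
        self.match_fields = match_fields
        self.flow_table = set()
        self.controller_punts = 0

    def _key(self, packet):
        return tuple(packet[field] for field in self.match_fields)

    def preinstall(self, packet):
        # Proactive mode: the controller installs the entry before traffic starts.
        self.flow_table.add(self._key(packet))

    def forward(self, packet):
        key = self._key(packet)
        if key not in self.flow_table:
            # Reactive mode: table miss, packet goes to the controller,
            # which then installs a matching entry.
            self.controller_punts += 1
            self.flow_table.add(key)


# 100 TCP sessions between the same two VMs, differing only in source port.
packets = [
    {"src_mac": "vm-a", "dst_mac": "vm-b", "proto": "tcp",
     "src_port": 40000 + i, "dst_port": 80}
    for i in range(100)
]

coarse = Switch(["src_mac", "dst_mac"])
fine = Switch(["src_mac", "dst_mac", "proto", "src_port", "dst_port"])
proactive = Switch(["src_mac", "dst_mac"])
proactive.preinstall(packets[0])

for packet in packets:
    coarse.forward(packet)
    fine.forward(packet)
    proactive.forward(packet)

print(coarse.controller_punts, fine.controller_punts, proactive.controller_punts)
# prints: 1 100 0
```

The coarse match pays the first-packet penalty once per VM pair, the per-session match pays it for every new TCP session, and the pre-populated table never pays it.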

Test results – Failure recovery

Very fast failure recovery was another pleasant surprise. They tested just the basic scenario (parallel primary/backup links) and found that in most cases the traffic switches over to the second link in less than a millisecond, indicating that NEC/IBM engineers did a really good job and pre-populated the forwarding tables with backup entries.

If it takes 8-9 milliseconds for the controller to program a single flow into the switches (see latency above), it’s totally impossible that the same controller would do a massive reprogramming of the forwarding tables in less than a millisecond. The failure response must have been preprogrammed in the forwarding tables.
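
To make the idea of preprogrammed backup entries concrete, here is a minimal sketch of a forwarding entry that carries an ordered list of output ports, so the switch can pick the first live one locally without asking the controller. How NEC and IBM actually implemented the failover was not disclosed; the data layout and names below are assumptions, similar in spirit to the fast-failover group type in later OpenFlow versions.

```python
def select_port(entry, port_up):
    """Return the first live port of a flow entry, or None if all are down.

    entry["ports"] lists the primary port first, followed by backups.
    port_up maps port number -> True/False (local link state).
    """
    for port in entry["ports"]:
        if port_up.get(port, False):
            return port
    return None


# Hypothetical entry: port 1 is the primary link, port 2 the backup.
entry = {"match": ("vm-a", "vm-b"), "ports": [1, 2]}

before_failure = select_port(entry, {1: True, 2: True})
after_failure = select_port(entry, {1: False, 2: True})
both_down = select_port(entry, {1: False, 2: False})
print(before_failure, after_failure, both_down)
# prints: 1 2 None
```

The switchover is a purely local decision, which is consistent with the sub-millisecond results and with the controller not being involved at failure time.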

There were a few outliers (10-15 seconds), probably caused by a lack of failure detection on the physical layer. As I wrote before, detecting link failures via control packets sent by the OpenFlow controller doesn’t scale; you need distributed linecard protocols (LACP, BFD) if you want to have a scalable solution.
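
A back-of-the-envelope calculation shows why controller-driven failure detection doesn't scale. The sketch below assumes BFD-like timers (detection time is roughly the transmit interval times the detect multiplier) and a hypothetical fabric of 50 switches with 48 monitored ports each; the port count and the timer values are assumptions, only the 50-switch figure comes from NEC's presentation.

```python
def detection_time_ms(tx_interval_ms, detect_multiplier):
    """Approximate failure detection time for a hello-based protocol."""
    return tx_interval_ms * detect_multiplier


def controller_probe_rate(num_switches, ports_per_switch, tx_interval_ms):
    """Probes per second if a central controller originates every hello."""
    return num_switches * ports_per_switch * 1000 / tx_interval_ms


detect_ms = detection_time_ms(tx_interval_ms=50, detect_multiplier=3)
probes_per_second = controller_probe_rate(
    num_switches=50, ports_per_switch=48, tx_interval_ms=50
)
print(detect_ms, probes_per_second)
# prints: 150 48000.0
```

Under these assumptions the controller would have to generate (and process the replies to) tens of thousands of probes per second to get 150 ms detection, while a linecard-based protocol handles just its own ports.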

Finally, assuming their test bed allowed the ProgrammableFlow controller to prepopulate the backup entries, it would be interesting to observe the behavior of a four-node square network, where it’s impossible to find a loop-free alternate path unless you use virtual circuits like MPLS Fast Reroute does.
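
The square-network problem can be checked with the basic loop-free alternate inequality: a neighbor N of node S is a safe backup next hop toward destination D only if dist(N, D) < dist(N, S) + dist(S, D). The sketch below applies that test to a four-node square and, for contrast, to a triangle; the topologies and unit link costs are illustrative assumptions.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra: return shortest-path cost from source to every node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def loop_free_alternates(graph, source, destination, primary_next_hop):
    """Neighbors of source that would not loop traffic back through source."""
    dist = {node: shortest_distances(graph, node) for node in graph}
    return [
        n for n in graph[source]
        if n != primary_next_hop
        and dist[n][destination] < dist[n][source] + dist[source][destination]
    ]


square = {
    "A": {"B": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1, "A": 1},
}
triangle = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "C": 1},
    "C": {"A": 1, "B": 1},
}

square_lfa = loop_free_alternates(square, "A", "B", primary_next_hop="B")
triangle_lfa = loop_free_alternates(triangle, "A", "B", primary_next_hop="B")
print(square_lfa, triangle_lfa)
# prints: [] ['C']
```

In the square, D is equally far from B via C and via A, so D might send the traffic straight back to A; that is why a plain backup next hop is not enough there and some form of explicit path is needed.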

Test results – Bandwidth allocation and traffic engineering

One of the interesting things OpenFlow should enable is bandwidth-aware flow routing. Tervela’s engineers were somewhat disappointed to discover the software/hardware combination they were testing doesn’t meet those expectations yet.

They were able to reserve a link for high-priority traffic and observe automatic load balancing across alternate paths (which would be impossible in an STP-based layer-2 network), but they were not able to configure statistics-based routing (route important flows across underutilized links).
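
For readers wondering what statistics-based routing would look like, here is a minimal sketch of the decision a controller could make if it had per-link utilization figures: choose the candidate path whose busiest link is the least loaded. This is a hypothetical illustration of the concept, not a feature of the tested products; the path list, link names and utilization numbers are invented.

```python
def least_loaded_path(paths, utilization):
    """Pick the path whose most-utilized link has the lowest utilization.

    paths: list of node sequences, e.g. ["leaf1", "spine1", "leaf2"].
    utilization: maps (node, next_node) -> fraction of link capacity in use.
    """
    def bottleneck(path):
        return max(utilization[link] for link in zip(path, path[1:]))

    return min(paths, key=bottleneck)


paths = [
    ["leaf1", "spine1", "leaf2"],
    ["leaf1", "spine2", "leaf2"],
]
utilization = {
    ("leaf1", "spine1"): 0.7,
    ("spine1", "leaf2"): 0.2,
    ("leaf1", "spine2"): 0.3,
    ("spine2", "leaf2"): 0.4,
}

best = least_loaded_path(paths, utilization)
print(best)
# prints: ['leaf1', 'spine2', 'leaf2']
```

The path through spine1 has a 70% loaded link, so the important flow would be steered through spine2, whose worst link is only 40% loaded.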

Next steps?

Tervela’s engineers said the test results made them confident in the OpenFlow solution from NEC and IBM. They plan to run more extensive tests, and if those tests work out, they’ll start recommending OpenFlow-based solutions as a Proof-of-Concept-level alternative to their customers.

A huge thank you!

This blog post would never have happened without Janice Roberts, who organized the exchange of ideas, and Michael Matatia, Jake Ciarlante and Brian Gladstein from Tervela, who were willing to spend time with me sharing their experience.

4 comments:

  1. Anonymous coward, 22 February, 2012 12:24

    Totally irrelevant tech. 8-9 MILLIseconds on first frame, and using a tool that only has granularity down to the ms to measure performance. You know better than this Ivan, we measure these things in microseconds in modern networks.
    Thanks for letting me know about solutions I can implement in 1985.

  2. Wow... they actually used an existing protocol in the setup: LLDP. That makes me happy.

  3. Hi Coward, "we" measure in microseconds. Who is "we?" While I definitely agree microseconds and even nanoseconds are used by vendors to compare speeds and feeds mainly for those in a low latency vertical like FSI or HPC, there are MANY customers and MANY applications that wouldn't care about 8-9ms.

  4. I'm not sure why it would matter if the delay occurs only on the first frame between a pair of VMs.

    Also, it's nice to hear you were able to implement such a solution in 1985. You don't know how lucky you've been ... we had to deal with 64 kbps modems, routers driven by Motorola 68000 CPU, AppleTalk and the like.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.