FIB update challenges in OpenFlow networks

Last week I described the problems high-end service provider routers (or layer-3 switches, if you prefer that terminology) face when they have to update a large number of entries in their forwarding tables (FIBs). Will these problems go away when we introduce OpenFlow into our networks? Absolutely not; OpenFlow is just another mechanism for downloading forwarding entries (this time from an external controller), not a laws-of-physics-changing miracle.

NEC, the only company I’m aware of that has production-grade OpenFlow deployments and is willing to talk about them, admitted as much in their Networking Tech Field Day 2 presentation (watch the ProgrammableFlow Architecture and Use Cases video around 12:00). Their particular controller/switch combination can set up 600-1000 flows per switch per second, which is still way better than what the researchers testing HP switches documented in the DevoFlow paper: they found the switches could set up roughly 275 flows per second.

Now imagine the core of a simple L2 network built from tens of switches, connecting hundreds of servers and thousands of VMs. Using traditional L2 forwarding techniques, each switch would have to know the MAC address of every VM ... and the core switches would have to update thousands of entries after a link failure, resulting in multi-second convergence times. Obviously OpenFlow-based networks need prefix-independent convergence (PIC) as badly as anyone else.
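A back-of-the-envelope calculation shows why per-endpoint flow entries hurt. The flow setup rates are the ones quoted above; the VM count is an illustrative assumption, not a measured figure:

```python
# Rough convergence estimate for a flow-per-endpoint design: how long a
# switch needs to reinstall its forwarding entries after a link failure.

def convergence_time(entries_to_update, flows_per_second):
    """Seconds needed to reinstall forwarding entries on one switch."""
    return entries_to_update / flows_per_second

vms = 4000                      # assumed number of VMs affected by the failure
for rate in (275, 600, 1000):   # flows/second figures cited above
    print(f"{rate:>5} flows/s -> {convergence_time(vms, rate):5.1f} s")
```

Even at the optimistic 1000 flows per second, reinstalling 4000 entries takes four seconds per switch, which is exactly the multi-second convergence mentioned above.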

OpenFlow 1.0 could use flow matching priorities to implement primary/backup forwarding entries, and OpenFlow 1.1 provides a fast failover mechanism in its group tables that could be used for prefix-independent convergence (the general public is still left wondering what OpenFlow 1.2 does: 45 days after the “open-mostly-in-name” spec was ratified, it’s still not publicly available) ... but it’s questionable how far you can get with existing hardware devices, and PIC doesn’t work in all topologies anyway.
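To illustrate the OpenFlow 1.1 fast failover idea, here's a toy model of a fast-failover group: the switch forwards through the first bucket whose watch port is still live, with no controller involvement. The data structures are illustrative Python, not real flow-mod or group-mod syntax:

```python
# Toy model of an OpenFlow 1.1 fast-failover group. Any number of flow
# entries can point at this one group, so a link failure is repaired in a
# single step regardless of prefix count - the essence of PIC.

failover_group = [
    {"watch_port": 1, "out_port": 1},   # primary path
    {"watch_port": 2, "out_port": 2},   # backup path
]

def pick_bucket(group, live_ports):
    """Return the first bucket whose watch port is still up."""
    for bucket in group:
        if bucket["watch_port"] in live_ports:
            return bucket
    return None  # no live path left - traffic is dropped

# With both ports up, traffic follows the primary bucket...
assert pick_bucket(failover_group, {1, 2})["out_port"] == 1
# ...after port 1 fails, forwarding flips to the backup bucket immediately.
assert pick_bucket(failover_group, {2})["out_port"] == 2
```

The OpenFlow 1.0 variant is clumsier: the controller pre-installs a lower-priority backup entry per prefix and must delete the primary entries after a failure, so the failover work is again proportional to the number of prefixes.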

It’s possible to build OpenFlow 1.1-compliant switches with existing hardware, but you probably wouldn’t be willing to pay for them. As long as everyone is using merchant silicon from Broadcom, we’ll be stuck where we are until Broadcom decides it’s time to expand their feature list.

Just in case you’re wondering how existing L2 networks work at all – their data plane performs dynamic MAC learning and populates the forwarding table in hardware; the communication between the control and the data plane is limited to the bare minimum (which is another reason why implementing OpenFlow agents on existing switches is like attaching a jetpack to a camel).

Is there another option? Sure – it’s called forwarding state abstraction or, for those more familiar with MPLS terminology, Forwarding Equivalence Class (FEC). While you might have thousands of servers or VMs in your network, you have only hundreds of possible paths between switches. The trick every single OpenFlow controller vendor has to use is to replace endpoint-based forwarding entries in the core switches with path-indicating forwarding entries. Welcome back to virtual circuits and the BGP-free MPLS core. It’s amazing how the old tricks keep resurfacing in new disguises every few years.
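A quick sketch shows how dramatically the abstraction shrinks the core forwarding tables. All numbers here are assumptions chosen for illustration:

```python
# Core forwarding-table size: endpoint-based versus path-based (FEC)
# forwarding. Topology numbers are illustrative assumptions.

edge_switches = 20
vms_per_edge = 500
paths_per_switch_pair = 2   # e.g. two ECMP paths between any pair of edges

# Endpoint-based forwarding: every core switch knows every VM.
endpoint_entries = edge_switches * vms_per_edge

# Path-based forwarding: core switches only know the paths between edge
# switches; the edge switches map VMs onto those paths.
path_entries = edge_switches * (edge_switches - 1) * paths_per_switch_pair

print(endpoint_entries, path_entries)   # 10000 versus 760
```

The core state now scales with the number of switches, not the number of VMs, and a link failure affects a few hundred path entries instead of thousands of endpoint entries.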

Need more information?

You might want to read a few other OpenFlow posts I wrote or register for the OpenFlow webinar sponsored by BigSwitch that Greg Ferro and I will run in a few days.

The fun part – Where am I going?

Tim did a great job guessing what the next article after the Prefix-Independent Convergence one would be. Now that you can see more of the path, can you guess where it will end, or is it still too foggy?

6 comments:

  1. Dmitri Kalintsev, 31 January 2012, 09:05

    > Where am I going?

    "Why do we need OF if we can do most of that magic with MPLS"? ;) I think at least certain Kireeti is with you on this one, judging by the contents of a presentation slide pack I saw a couple days back. :)

    But then again, MPLS can't help if you want to direct traffic received at the edge into FECs with arbitrary granularity, which I'm perceiving is one of the major attractive parts of the OF value proposition.

    Anyway - tradeoffs, tradeoffs everywhere, nor any drop to drink. :)

  2. Dmitri Kalintsev, 31 January 2012, 09:15

    Pic mildly related :)

  3. No surprise we're in sync with a certain Kireeti (and Juniper's party line). Martin Casado came to the same conclusions in one of his blog posts (although he did not mention MPLS).

    BTW, I guess I already wrote about the whole hybrid concept: http://blog.ioshints.info/2011/11/openflow-deployment-models.html

  4. I'm still stuck on flow recognition at high speed in all the OF switches. How can that be reasonably cost-effective until it is in merchant silicon? And how close is that to happening? That's where I can relate to MPLS: label recognition at high speed seems a lot easier to build. And building LSP's for flows seems easier to troubleshoot.

    The info about flow update speed is very interesting. I've just been reading about TCAM update speeds being a limiting factor in convergence in some situations. I advised on a very bizarre system design a while back where someone was trying to use central SNMP to control a switch fabric. I pointed out that at the time you'd be lucky to get maybe tens of SNMP sets per second; the spec called for 1000+. Decentralized control and some other tricks (pre-configured static IGMP joins, trading bandwidth for speed) came a whole lot closer to meeting their needs!

    Anyway, for OF, central programming means updating N flow items times M switches in the fabric; that doesn't sound like it would scale well!

    I'm watching with interest to see how OF overcomes these challenges. Am I exhibiting aging in the form of resistance to new ideas, or are there indeed challenges there?

  5. It would be nice to see what exactly the switches can recognize in hardware (and what the fallback mechanisms are - software switching or failure to inject flow), but the assumption has always been that the flow recognition will be done by the silicon and at least the very basic operations (matching on destination MAC, VLAN tag or source/destination IP) can be done in hardware.

    I'm not that concerned about the MxN problem (although with 600 flows/sec and 50 switches, that's 30K updates/sec ... hmmm ...); I'm far more concerned about all the other details (including fast feedback loops that the controller can't possibly cope with). Anyhow, speaking with the OpenFlow realists (the people developing production-focused solutions), I got the feeling that they're aware of the limitations and challenges, are working hard to address them, and see an opportunity to do things in a different way.

    As for the "aging" part, we're in the same boat, and I like to think we've become somewhat immune to the reality distortion fields around us, not resistant to change :-P

  6. This is where the software switching overlay starts to make sense. Switch/link failures in the physical network do not change the topology of the virtual network. The physical network doesn't need to carry the flow state of the VMs, nor maintain any tables about where VMs are located.


You don't have to log in to post a comment, but please do provide your real name/URL. Anonymous comments might get deleted.

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.