
Exception Routing with BGP: SDN Done Right

One of the holy grails of data center SDN evangelists is controller-driven traffic engineering (throwing more leaf-and-spine bandwidth at the problem might be cheaper, but definitely not sexier). Obviously they don’t call it traffic engineering as they don’t want to scare their audience with MPLS TE nightmares, but the idea is the same.

Interestingly, you don’t need new technologies to get as close to that holy grail as you wish; Petr Lapukhov got there with a 20-year-old technology – BGP.

The Problem

I’ll use a well-known suboptimal network to illustrate the problem: a ring of four nodes (it could be anything, from a monkey-designed fabric to a stack of switches) with heavy traffic between nodes A and D.

In a shortest-path forwarding environment you cannot spread the traffic between A and D across all links (although you might get close with a large bag of tricks).

Can we do any better with controller-based forwarding? We definitely should. Let’s see how we can tweak BGP to serve our SDN purposes.

Infrastructure: Using BGP as IGP

If you want to use BGP as the information delivery vehicle for your SDN needs, you MUST ensure it’s the highest priority routing protocol in your network. The easiest design you can use is a BGP-only network using BGP as a more scalable (albeit a bit slower) IGP. EBGP is better than IBGP as it doesn't need an underlying IGP to get reachability to BGP next hops.

BGP-Based SDN Controller

After building a BGP-only data center, you can start to insert controller-generated routes into it: establish an IBGP session from the controller (cluster) to every BGP router and use higher local preference to override the EBGP-learned routes. You might also want to set no-export community on those routes to ensure they aren’t leaked across multiple routers.
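To make this more concrete, here is a minimal sketch of how a controller could assemble such a route injection, assuming an ExaBGP-style API (the prefix, next-hop address, and local-preference value of 200 are arbitrary examples, not part of Petr’s design):

```python
def announce(prefix, next_hop, local_pref=200, communities=("no-export",)):
    """Build an ExaBGP-style 'announce route' command for one IBGP peer.

    The higher local-preference makes the controller-injected route win
    over EBGP-learned paths; the no-export community keeps it from being
    leaked any further.
    """
    return (
        f"announce route {prefix} next-hop {next_hop} "
        f"local-preference {local_pref} "
        f"community [{' '.join(communities)}]"
    )

# Hypothetical example: override the path to 192.0.2.0/24 on one router
cmd = announce("192.0.2.0/24", "10.0.0.2")
```

A production controller would obviously also have to withdraw routes and track session state; this only shows the shape of a single update.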

Obviously I’m handwaving over lots of moving parts – you need topology discovery, reliable next hops, and a few other things. If you really want to know all those details, listen to the Packet Pushers podcast where we take a deep dive into them (hint: you could also engage me to help you build it).

Results: Unequal-Cost Multipath

The SDN controller in our network could decide to split the traffic between A and D across multiple paths. All it has to do to make it work is to send the following IBGP routing updates for prefix D:

  • Two equivalent BGP paths (one with next hop B, the other with next hop D) to A (to ensure the BGP route selection process in A uses BGP multipathing);
  • A BGP path with next hop C to B (B might otherwise send some of the traffic for D to A, resulting in a forwarding loop between B and A).
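The update computation itself is simple; the helper below is a hypothetical sketch (not Petr’s actual controller code) that turns a set of loop-free paths into the per-router next-hop sets listed above:

```python
def updates_for_paths(paths):
    """Map each router to the set of next hops it should be given.

    Each path is a list of routers from source to destination; every
    router on a path forwards to its successor. A router that appears
    on only one path (like B) gets a single next hop, which prevents
    the B -> A forwarding loop described above.
    """
    next_hops = {}
    for path in paths:
        for node, succ in zip(path, path[1:]):
            next_hops.setdefault(node, set()).add(succ)
    return next_hops

# Traffic from A to D over the four-node ring: the direct link plus A-B-C-D
nh = updates_for_paths([["A", "D"], ["A", "B", "C", "D"]])
# nh["A"] is {"B", "D"} (multipath on A); nh["B"] is {"C"} (no loop via A)
```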

You can get even fancier results if you run MPLS in your network (hint: read the IETF draft on remote LFA to get a few crazy ideas).

  1. Thanks for the great write-up, Ivan! BGP SDN looks awkward at first... but works like a champ ;)

  2. Nice write-up!
    I have some ideas regarding using BGP for "exception routing":

    - There is no great need to run EBGP as an IGP. A real IGP (OSPF...) is more suitable. As long as (I/E)BGP is configured with a higher preference in all routers, SDN-injected routes will always take precedence.
    Advantages are:
    - Faster IGP convergence
    - No need to use BGP communities to limit the scope of the injected BGP routes (no BGP peering is needed between the routers)
    - More "standard" deployment, using a real IGP as an IGP :) :)
    - Easy to introduce into a network already in production...



    3. There is still a major difference - in RCP you had to parse topological information from the OSPF LSDB, while in our approach it is explicitly visible via the BGP peering structure. It is arguable which approach is simpler, but I believe that with the small amount of code implementing our controller, compared with OSPF's inherently complex design, our approach gets closer to the goal.

  3. Great and easy-to-follow write-up, and it shows that there are many ways to instruct switches and routers how to forward packets, even using protocols that have been around forever (and I love BGP).

    However, the key to SDN is (or should be) *what* do you tell your routers and switches, not so much *how* you tell them. The *what* is what implements the use case, the actual deployment scenario. Unfortunately the *how* has received most of the hype and attention (hello OpenFlow).

    Marten @ Plexxi

  4. It seems to me that all of this is backwards. The purpose of SDN is to minimize the hardware. In compute and storage, software definitions are used to standardize the hardware and remove its administration. Rack and stack, baseline, and then orchestrate in software. SDN is trying to do the same with data center networks: provide a physical bus and allow software to utilize the bus as needed.

    So, when I deploy a new service, I want to select 'so much' compute, 'so much' storage, and 'so much' networking. Submit this request and have the operating systems and applications come online; have the network segments, routers, firewall policies, and logging become available; and destroy it all when finished.

    Compute and storage have already enabled the consumer to hold the keys to service delivery, networking needs to catch up.

    I also need to ask, in your diagrams above, why would you ever want traffic to traverse B and C unless there is a failure?

    1. - What is presented here does not conflict with what you say. You described what SDN should deliver, and the post described how to deliver it (one of many possible solutions addressing traffic forwarding and engineering...)

      To answer your question: why wouldn't you want to traverse B and C if doing so can double your global DC throughput, or if A<-->B<-->C<-->D provides a low-delay path required by and dedicated to some applications, or the opposite, or...?

      It's all about the use case!

    2. If your primary concern is to make the network more "plug and play" and you are willing to accept IGP shortest paths, then you do not need exception routing at all. You can just let OSPF or IS-IS do its job. If you want to deploy policy then you will need some mechanism to populate the forwarding tables on the routers (or router software running on a server, if that is your model) to implement the policy. In this case BGP is used as the protocol by software running on both the controller and routers.

      One of the challenges for SDN, in my opinion, is that network resources are much more complex than compute or storage resources in that raw bandwidth is not the only (or even in many cases the most important) quantity to optimize. As others have said, it is all about the use case.

  5. How can I do the same but route by source?
    I can't find anything except SDN

    1. SDN will not magically solve that problem. OpenFlow might ;)

      Of course you should start by asking yourself "Do I REALLY need routing by source?"

    2. I'm thinking of a security feature avoiding SPAN ports and firewall bottlenecks. You could have only one (IDS|NGFW|...) and redirect all suspicious traffic to it.
      Imagine you have the IDS at node C of the above scenario. You could have a list of "bad" IPs and redirect all their traffic to this IDS with the BGP controller.
      On the other hand, imagine you suspect that some IP in your internal network is infected; if you could route by source with the BGP controller, you could redirect all its traffic to the IDS.

    3. In theory you can do all that with BGP FlowSpec. In practice, you need Juniper gear to do it (because they are the only vendor that implemented FlowSpec).

    4. The ALU 7750 also supports BGP FlowSpec (although I have only seen it used in conjunction with Arbor).

  6. Petr mentioned in the podcast that the controller sets up IBGP sessions with the switches in-band. How does the controller learn the switches' IP addresses, and vice versa? They cannot be learning each other's IP addresses via BGP, otherwise the session won't come up.

    1. Actually, I'll take that back. However, I would still like to know how the controller can form an IBGP session with switches in a different ASN - I don't think the "local-as" feature is applicable to IBGP sessions.

    2. Two ways

      1) You know the IP -> ASN mapping via a bootstrap file (static config)
      2) Switches initiate iBGP sessions to anycast IPs of the controller(s), and the controller responds with a BGP OPEN reflecting the incoming ASN
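      For illustration, the bootstrap file in option 1 could be as simple as a flat text file mapping switch IPs to AS numbers; the file format in this Python sketch is made up:

```python
def parse_bootstrap(text):
    """Parse a made-up 'switch-ip asn' bootstrap file into an IP -> ASN
    map, so the controller knows which AS number to expect (and reflect
    in its BGP OPEN) when a given switch connects."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, asn = line.split()
        mapping[ip] = int(asn)
    return mapping

peers = parse_bootstrap("""
# switch-ip  asn
10.0.1.1  65001
10.0.1.2  65002
""")
```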

    3. Ahh... that makes sense. You can only do something like this with code that does not have a unique ASN, unlike Cisco and other vendors' code where the router belongs to a specific ASN.

  7. We deployed something similar at Boeing about 10 years ago.

    Of course, people who are scared of MPLS-TE also will not like LFA, and for that matter are typically afraid of doing complex routing tricks with BGP.

    1. Hi John, thanks for sharing! :) I missed that one when doing historical research on similar projects; RCP was the closest. I would say that your idea is mainly using BGP for route injection, while we attempted to build an overlay link-state protocol with an API to control its multi-topology databases.

  8. One of the comments made during the presentation was about BGP FlowSpec. But reading about it a bit tells me that you can do the same thing with BGP FlowSpec because BGP FlowSpec goes through a validation process (to avoid spoofing) that essentially means that only the originator of the route can change the flow... unless I am reading it wrong.

    1. I meant, you CANNOT do the same thing with BGP FlowSpec.

    2. FlowSpec is getting support for a proper forwarding action. There are also wide communities. However, one of our intended goals was to keep the routing model destination-based, to maintain design simplicity. Flow-based forwarding is just too low-level and hence gives too much freedom of implementation, and hence complexity :)

  9. Stumbled upon this article today a little late.

    Years ago I played around with a simple data center topology where each rack was a different confederation sub-AS. I know confederations are totally out of style these days, but it seems like they would work equally well in this instance.

    What you describe with regards to next-hop control, etc. is exactly how Internap's MIRO routing control works at their edge, and has since probably the late 90s. Slightly different scenario since they are using it to control outbound traffic paths on upstream transit providers. Their MIRO controller is internally written and has lots of knobs in order to optimize the routing and place constraints on paths, would be cool if they open sourced it.

    Juniper QFabric architecture is also based on BGP...

    While I'm agnostic about it, as we've seen in the comments, most feel source routing has to be a part of any SDN solution. In reality, on my network today I source-route high-priority traffic by using different RSVP LSPs on the network.

    I think in a DC context using Segment Routing along with a scalable distribution protocol like BGP may work extremely well. In the future a protocol like I2RS may be a better fit than BGP as a control protocol.

  10. Thanks for the write-up, Ivan. If the controller sends next hop C to B to use for D, do we assume the controller is taking source (or some service) information into account while computing the path? Otherwise, if B always used C as the next hop to reach D, we would lose the ability to multipath at B.

