Using BGP in Data Center Fabrics

While large data centers increasingly use BGP as the routing protocol within their fabrics, enterprise engineers tend to shy away from the idea because they think BGP is too complex/scary/hard-to-configure/obsolete/unknown/whatever.

It’s time to fix that.

A few reality checks first:

  • Regardless of what pundits claim on podcasts, BGP is alive and well. After all, the Internet is still working;
  • BGP tends to get very complex when you have to start playing with routing policies. There's no need to do that in a data center environment – all you need is propagation of routing information;
  • While BGP sessions have to be configured, that’s not necessarily a bad idea. After all, you want to know what’s going on in your fabric, right?
  • Traditional BGP implementations were deliberately slow to converge because widespread oscillations across a world-wide network are a bad idea. However, you can get convergence times comparable to OSPF or IS-IS with a bit of tuning and BFD magic;
  • BGP configuration might be longer than a similar OSPF configuration, but that’s easily solved with network automation.
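
To illustrate the automation point in the last bullet, here is a minimal Python sketch that derives every BGP session in a small leaf-and-spine fabric from two lists of device names. Everything in it is an assumption made for illustration: the device names, the private ASN values, the `Session` and `fabric_sessions` names, and the numbering scheme (one shared ASN for all spines, one ASN per leaf). It produces vendor-neutral data that a template could turn into device configurations; it is not any vendor's tooling.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Session:
    """One side of an EBGP session, as seen from `local` (hypothetical model)."""
    local: str
    local_asn: int
    peer: str
    peer_asn: int


def fabric_sessions(spines, leaves, spine_asn=65000, first_leaf_asn=65001):
    """Return both sides of every leaf-to-spine EBGP session.

    Assumption: all spines share `spine_asn`, and each leaf gets its own
    ASN, counting up from `first_leaf_asn`.
    """
    leaf_asn = {leaf: first_leaf_asn + i for i, leaf in enumerate(leaves)}
    sessions = []
    for leaf, spine in product(leaves, spines):
        # Every leaf peers with every spine; record the session from both ends.
        sessions.append(Session(leaf, leaf_asn[leaf], spine, spine_asn))
        sessions.append(Session(spine, spine_asn, leaf, leaf_asn[leaf]))
    return sessions


def neighbors_of(device, sessions):
    """List (peer name, peer ASN) pairs to configure on `device`."""
    return [(s.peer, s.peer_asn) for s in sessions if s.local == device]


if __name__ == "__main__":
    sessions = fabric_sessions(["s1", "s2"], ["l1", "l2", "l3"])
    for peer, asn in neighbors_of("l1", sessions):
        print(f"l1 peers with {peer} (AS {asn})")
```

The number of sessions grows with the product of leaves and spines, which is why the configuration gets long, but the whole fabric is still described by two short lists and a numbering rule.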

Furthermore, vendors like Cumulus Networks did a great job simplifying BGP configurations to the bare minimum necessary to make it work in a data center fabric.

Dinesh Dutt and I will talk about that later today in a free webinar organized by Cumulus Networks, and in an update session of my Leaf-and-Spine Fabrics webinar on March 3rd.

More information

Read the IETF draft authored by Petr Lapukhov if you want to know more about using BGP in large data center fabrics. I also mentioned his design in the BGP-Based SDN Solutions webinar.


  1. I totally agree with you, Ivan. I also believe that with a reliable design plus what you called automation magic, there won't be any complexity.
    This might even be helpful when connecting the DC block to the core and exchanging data with the WAN. Besides, nowadays more enterprises are becoming SPs and using BGP as the core routing protocol.
    The level of granularity and traffic engineering is incredible.
    The only drawback I see is the staff's knowledge of BGP, though they can be educated in that too; at the end of the day, they want to be network engineers!

    And of course we know that it's not for everywhere and business needs should be taken into account.

    Looking forward to the webinar today.
  2. One more example: Cisco relies on BGP as the routing protocol for its "Dynamic Fabric Automation" (DFA).
  3. This shouldn't come as a surprise to anyone. Petr Lapukhov had a NANOG presentation about BGP as the DC IGP a few years back - they implemented it at MS Bing.

    Brocade will be rolling out an IP fabric leveraging BGP (EVPN) in March.

    Personally, I'd like to see more literature on BGP timer tuning. I think that's often misunderstood, by me at least.
  4. I think the question is whether we need link-state or distance-vector routing in DCs today. Link state probably made sense when the fabric was asymmetric in terms of ECMP paths, link speeds, etc. With a symmetric leaf-and-spine fabric where all things are equal, does it make sense for a node to know every bit of fabric information, or does reachability suffice?

    As far as tooling is concerned, BGP/OSPF/IS-IS have been developed organically based on their role in the network, so we can always make one of them do what it was not originally intended to do.
  5. BGP was considered a WAN technology, and has been slowly making its way into DCs for some time now.

    It is starting to penetrate into servers as well. What are your thoughts on running BGP on the servers themselves?
  6. Good article Ivan,

    personally I'm a big fan of simple (KISS) and using BGP as the only routing protocol makes sense to me (even in deployments not fitting the "large DC" description in Petr's draft).

    Especially if we consider that products on the systems side (think chassis running NSX or Azure Stack) support BGP as a means of interconnecting to the fabric (whether injecting host routes from the chassis to allow for VM mobility is worth considering is probably another discussion, but at least BGP has proven itself able to scale to support this).


    Alan Wijntje
  7. Hi Ivan,

    Because BGP cannot use unnumbered interfaces, we must use tricks. Using IPv6 link-local addresses for peering, as proposed by Cumulus, is an interesting one.

    My question is: with Nexus switches (N9K) as spines and leaves, why couldn't we use a single SVI on each switch, with all the SVIs in the same subnet and the fabric ports configured as access switchports in the SVI VLAN? The BGP peerings would be established between the SVI IP addresses, and only one IPv4 address per device would be used.
    Why would this not be a suitable solution, given that a Nexus switch can act as a layer 3 and layer 2 interface on the same port?

    1. Of course you can do that - it would work on (almost?) any DC switch. However, you'd have to configure different BGP neighbors on different servers.
    2. Hi, is it true that you can't use unnumbered and BFD simultaneously?
    3. While I remember seeing something on BFD-over-MAC, I can't find a related Internet Draft, and I haven't seen anyone implementing something along these lines.

      You _might_ use multihop BFD between loopbacks (again, not sure whether anyone implemented it), but there's a simpler option used by the Cumulus implementation: use BFD between IPv6 LLAs.