Using BGP in Data Center Fabrics
While large data centers increasingly use BGP as the routing protocol within their fabrics, enterprise engineers tend to shy away from that idea because they think BGP is too complex/scary/hard-to-configure/obsolete/unknown/whatever.
It’s time to fix that.
A few reality checks first:
- Regardless of what pundits claim on podcasts, BGP is alive and well. After all, the Internet is still working;
- BGP tends to get very complex when you start playing with routing policies. There's no need to do that in a data center environment – all you need is propagation of routing information;
- While BGP sessions have to be configured, that's not necessarily a bad thing. After all, you want to know what's going on in your fabric, right?
- Traditional BGP implementations were very slow to converge because it's a bad idea to trigger widespread oscillations across a worldwide network. However, you can get convergence times comparable to OSPF or IS-IS with a bit of timer tuning and BFD magic (see the sketch after this list);
- BGP configuration might be longer than a similar OSPF configuration, but that’s easily solved with network automation.
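To show what "a bit of tuning and BFD magic" might look like, here's a minimal sketch in FRRouting/Cumulus-style syntax; the AS numbers, router ID, neighbor address and timer values are made up for illustration:

router bgp 65001
 bgp router-id 10.0.0.1
 ! shorten the keepalive/hold timers well below the traditional 60/180-second defaults
 timers bgp 3 9
 neighbor 10.1.0.2 remote-as 65100
 ! let BFD detect a failed neighbor in milliseconds instead of waiting for the hold timer
 neighbor 10.1.0.2 bfd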
Furthermore, vendors like Cumulus Networks did a great job simplifying BGP configurations to the bare minimum necessary to make it work in a data center fabric.
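As a rough illustration of how small such a configuration can get, here's a sketch of a Cumulus/FRRouting-style leaf configuration using BGP unnumbered; the interface names and AS number are again made up:

router bgp 65001
 ! unnumbered EBGP sessions on the fabric uplinks; no neighbor IP addresses needed
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 address-family ipv4 unicast
  ! advertise the locally connected prefixes (loopback and server-facing subnets)
  redistribute connected

The sessions are established over IPv6 link-local addresses discovered through router advertisements, so there's nothing to renumber when you add another leaf or spine.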
Dinesh Dutt and I will talk about that later today in a free webinar organized by Cumulus Networks, and in an update session of my Leaf-and-Spine Fabrics webinar on March 3rd.
More information
Read the IETF draft authored by Petr Lapukhov if you want to know more about using BGP in large data center fabrics. I also mentioned his design in the BGP-Based SDN Solutions webinar.
This might even be helpful when connecting the DC block to the core and exchanging traffic with the WAN. Besides, nowadays more enterprises are becoming SPs and using BGP as the core routing protocol.
The level of granularity and traffic-engineering control is incredible.
The only drawback I see is the staff's lack of knowledge in the BGP area, though they can be educated in that too; after all, at the end of the day they want to be network engineers!
And of course we know it's not a fit for every environment, and business needs should be taken into account.
Looking forward to the webinar today.
Brocade will be rolling out an IP fabric leveraging BGP (EVPN) in March.
Personally, I'd like to see more literature on BGP timer tuning. I think that's often misunderstood, by me at least.
As far as tooling is concerned, BGP, OSPF and IS-IS have each developed organically based on their role in the network, so can we really make one of them do something it was never intended to do?
It is starting to penetrate into the servers as well. What are your thoughts on having BGP run on the servers themselves?
Personally, I'm a big fan of keeping it simple (KISS), and using BGP as the only routing protocol makes sense to me (even in deployments not fitting the "large DC" description in Petr's draft).
Especially if we consider that products on the systems side (think chassis running NSX or Azure Stack) support BGP as a means of interconnecting with the fabric (whether injecting host routes from the chassis to allow for VM mobility is worth considering is probably another discussion, but at least BGP has proven it can scale to support this).
regards,
Alan Wijntje
Due to BGP's inability to use unnumbered interfaces, we have to resort to tricks, and the use of IPv6 link-local addresses for peering proposed by Cumulus is an interesting one.
My question is: with Nexus 9000 (N9K) switches used as spines and leaves, why couldn't we use a single SVI on each switch, with all the SVIs in the same subnet and the fabric ports configured as access switchports in the SVI VLAN? The BGP peerings would be established between the SVI IP addresses, and only one IPv4 address per device would be needed.
Why would this not be a suitable solution, given that a Nexus switch can provide layer-2 and layer-3 functionality on the same port?
Thanks.
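To make sure we're reading the question correctly, here's a rough NX-OS-style sketch of one switch's side of the design described above; the VLAN ID, IP addresses and AS numbers are invented and the snippet is only meant to illustrate the question:

feature bgp
feature interface-vlan

vlan 100

interface Vlan100
  ip address 172.16.0.1/24
  no shutdown

interface Ethernet1/1
  ! fabric port used as an access port in the shared peering VLAN
  switchport
  switchport mode access
  switchport access vlan 100
  no shutdown

router bgp 65001
  router-id 172.16.0.1
  ! EBGP peering with another fabric switch across the shared SVI subnet
  neighbor 172.16.0.2 remote-as 65002
    address-family ipv4 unicast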
You _might_ use multihop BFD between loopbacks (again, I'm not sure whether anyone implemented it), but there's a simpler option used by the Cumulus implementation: run BFD between the IPv6 link-local addresses.
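Assuming the Cumulus/FRRouting behaviour where an unnumbered EBGP session already runs between the interfaces' IPv6 link-local addresses, enabling BFD on that session is a single extra line (interface name and AS number are made up):

router bgp 65001
 neighbor swp1 interface remote-as external
 ! the BFD session follows the BGP session, so it runs between the
 ! IPv6 link-local addresses on the two ends of the link
 neighbor swp1 bfd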