BGP as a Better IGP? When and Where?

A while ago I helped a large enterprise redesign their data center fabric. They did a wonderful job optimizing their infrastructure, so all they really needed were two switches in each location.

Some vendors couldn’t fathom that. One of them proposed to build a “future-proof” (and twice as expensive) leaf-and-spine fabric with two leaves and two spines. On top of that they proposed to use EBGP as the only routing protocol because draft-lapukhov-bgp-routing-large-dc – a clear case of missing the customer’s needs.

Since then I’ve been wondering whether we’re overselling the benefits of BGP. It’s clearly the right choice if you have a huge fabric, or if you want to run IPv4, IPv6 and EVPN at the same time… but if all you need is IP transport across a half-dozen switches to run VMware NSX on top of it, BGP is clearly overkill.

I tried to summarize my thoughts on using BGP in a data center fabric in a short overview document, starting with how you would decide whether it makes sense to use BGP as the only routing protocol in your data center fabric.

Keep in mind that this is just an overview document. You’ll find more details in the leaf-and-spine fabric architectures webinar, and you can master data center design in the building next-generation data center online course.


10 comments:

  1. Hi Ivan

    A few quick thoughts:

    1. In your design you have suggested two top-of-rack (ToR) switches per rack, as opposed to the single ToR that many hardware vendors and service vendors use.

    With two ToR switches we need L2 multipathing toward the upstream links – ECMP or NIC teaming on the servers, vPC in Cisco terms – and we need a peer link between both ToRs if we decide to use vPC.

    Two ToRs connected to the same servers are always a bit tricky when it comes to load sharing across the two switches, and that's the reason vendors use a single ToR switch, with the downstream servers residing in a single VLAN maintained by spanning-tree port-fast edge ports and an SVI for the servers' default gateway.

    With two ToRs connected to the servers, the traffic forwarding path is based on hashing across the packet tuples on the LACP or vPC bundle (a rough sketch of that hashing is at the end of this comment).


    2. My second thought: why are we extending Layer 2 from the ToR to the aggregation layer even when we have fewer than about 200 servers?
    Every design we do is driven by the number of servers, the redundancy paths, and the fault domains maintained across the fabric.
    Whether we have fewer or more than 200 servers, we should prefer Layer 3 from ToR to aggregation to core, with Layer 2 staying downstream on VLAN interfaces (SVIs) that provide default gateway addressing on the ToRs.

    Use DHCP relay for IPv4 addressing and SLAAC for IPv6 addressing, and dynamic BGP peering downstream to maintain the content prefixes.

    3. Prefix limits and prefix aggregation need to be applied at the aggregation layer, based on the number of prefixes the TCAM can hold, especially when overlaying one network over the other.

    4. An equidistant forwarding topology (folded Clos or fat tree) is important for maintaining consistent end-to-end delays and RTTs. Scaling horizontally (scale-out) is preferred over scale-up for the simple reason that the failure domain is spread across more links and nodes while still maintaining the same equidistant hop count and delay, which nowadays is under 2 ms for VM-to-VM traffic.

    Customer needs are simple: they need predictable application delay with ever-improving goodput.
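
    Purely to illustrate the tuple hashing described above, here is a minimal Python sketch of how one of two uplinks might be picked by hashing a flow's 5-tuple; the uplink names and tuple fields are illustrative assumptions, not anything from the design being discussed.

      # Sketch of LACP/vPC-style load sharing: hash the 5-tuple, pick an uplink.
      # Uplink names and tuple fields are illustrative assumptions.
      import hashlib

      UPLINKS = ["ToR-A", "ToR-B"]   # the two bundle members toward the ToRs

      def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port):
          """Map a flow's 5-tuple onto one uplink; all packets of the flow
          follow the same link, so a single flow never uses both uplinks."""
          key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
          return UPLINKS[hashlib.md5(key).digest()[0] % len(UPLINKS)]

      print(pick_uplink("10.0.0.10", "10.0.1.20", "tcp", 49152, 443))
      print(pick_uplink("10.0.0.10", "10.0.1.21", "tcp", 49153, 443))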


    Replies
    1. Just one or two comments...

      "In your design , you have suggested Two tor of the rack [ TOR ] switches per rack which is opposed to single ToR which many H.W vendors and service vendors use."

      Approximately 95% of the deployments out there are different from what you seem to be seeing in your daily job.

      "Prefix Limit and Prefix Aggregation is something which is required to be done on the aggregation on number of prefixes based on TCAM sizing"

      Most environments will never hit the 100K+ MAC or ARP entries available in recent chipsets... unless they badly mangle container deployments.
    2. Totally agreed. Are L2 solutions preferred over L3? Can you give an L2 use case that I could take a look at?

      768K MAC entries are supported on recent chipsets, so that shouldn't be a problem.
  2. What is the use case for EVPN?

    Are we trying to run Ethernet VPN as an overlay between ToR and core? The infrastructure would have to support MPLS. An IP-based EVPN implementation is something that doesn't exist, if I am not wrong.

    EVPN bridged services always have to run on top of an infrastructure service such as MPLS, whereas IP services ride on top of GRE, IPsec or IP-in-IP, which doesn't require MPLS support.

    EVPN certainly requires MPLS as the infrastructure service.

    Can you say more about the EVPN use case in the DC fabric infrastructure?
    Replies
    1. Guess what - some people actually have to support L2 domains spanning a whole data center and VXLAN (with or without EVPN) is the least horrible option.

      Please don't tell me that makes no sense and get used to the idea that different environments have different requirements.
    2. Agreed. I would love to see the L2 use case where the environment absolutely needs Layer 2 aggregation, and why that problem cannot be solved with Layer 3.

      Can you describe the difference between L2 and L3 environments?
  3. Another important point to consider when designing is the oversubscription ratio from ToR to aggregation in your design, and also the oversubscription ratios in a three-stage or five-stage Clos fabric.

    ToR to cluster spine is 3:1, cluster spine to DC spine is 4:1, and DC spine to regional spine is 2:1 (a quick arithmetic sketch is at the end of this comment).

    2. Port density (10G, 25G, 40G, 50G, 100G) has to be scaled across the ports, with deep buffers for different sets of applications carrying mixed traffic: incast, anycast, elephant flows and mice flows.

    3. Burst, buffer and bandwidth considerations, flow mapping into individual queues, and strict prioritization across flows are the features required in the ToR. The ToR-to-spine design also has to handle bursts and failure points, as well as the redundancy model on the ToRs.
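
    To make the ratios above concrete, here is a small Python sketch that derives an oversubscription ratio from downlink and uplink bandwidth; the port counts and speeds are made-up examples, not figures from this design.

      # Oversubscription ratio = total downstream bandwidth / total upstream bandwidth.
      # Port counts and speeds are made-up examples.

      def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
          """Return the downlink:uplink oversubscription ratio."""
          return (down_ports * down_gbps) / (up_ports * up_gbps)

      # 48 x 25G server-facing ports and 4 x 100G uplinks -> 1200/400 = 3:1
      print(f"{oversubscription(48, 25, 4, 100):.0f}:1")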
    Replies
    1. And now please tell me how this set of considerations relates to the topic of this blog post (or my article). Thank you!
    2. BGP works on per-prefix topology and per-prefix capacity.
      Per-prefix capacity is determined by the advertisements received over a particular link/neighbor; the neighbor uses the link it is connected over, and that link has to have the burst/buffer considerations.

      BGP doesn't work on links as such – a prefix is installed with a next hop that resolves to the interface toward the neighbor it was received from – and that's the thought behind my point about oversubscription per prefix, right?

      In other words, oversubscription per prefix advertised by BGP.
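
      A rough way to picture that: each prefix ends up in the forwarding table with the set of next hops (neighbors) that advertised it, and the usable capacity toward that prefix is bounded by the links behind those next hops. The toy Python model below uses made-up neighbors and link speeds and is only meant to illustrate the idea.

        # Toy model: a prefix's usable bandwidth is the sum of the uplinks
        # behind the BGP next hops it was learned from. Names and speeds are
        # made-up examples.
        LINK_GBPS = {"spine-1": 100, "spine-2": 100}

        # RIB-like view: prefix -> neighbors (next hops) that advertised it
        RIB = {
            "10.1.1.0/24": ["spine-1", "spine-2"],  # learned from both spines -> ECMP
            "10.2.2.0/24": ["spine-1"],             # learned from one spine only
        }

        def per_prefix_capacity(prefix):
            """Aggregate uplink bandwidth (Gbps) available toward a prefix."""
            return sum(LINK_GBPS[nbr] for nbr in RIB[prefix])

        for prefix in RIB:
            print(prefix, per_prefix_capacity(prefix), "Gbps")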

      One more question:

      1. Do MLAG scenarios support a peer link between the two ToR switches?
  4. I've always thought "BGP as a better IGP" was such a silly statement if you don't have at least 1,000 network nodes in the fabric.

    A lot of times, it's vendor-centric goobers that will swear that EIGRP is never an option just because it was made by Cisco (even though their shops are mostly Cisco, LOL).

    Then you have the Cisco goons saying that EIGRP can route your entire worldwide network and scale huge (which is true, but they never stop to think if another solution is available).

    Then you have the BGP noobies who forget that BGP was created on a napkin and was never intended to be used on large scale DC fabrics (1000+ nodes).

    The Big 4 (IS-IS, OSPF, EIGRP, BGP) have their time and place, but it seems to me that this is the revolutionary time for a new IGP to be created, specifically tailored for NGDC fabrics.