
Is EBGP Really Better than OSPF in Leaf-and-Spine Fabrics?

Using EBGP instead of an IGP (OSPF or IS-IS) in leaf-and-spine data center fabrics is becoming a best practice (read: thing to do when you have no clue what you’re doing).

The usual argument defending this design choice is “BGP scales better than OSPF or IS-IS”. That’s usually true (see also: Internet), and so far, EBGP is the only reasonable choice in very large leaf-and-spine fabrics… but does it really scale better than a link-state IGP in smaller fabrics?

There are operators running single-level IS-IS networks with thousands of devices, and yet most everyone claims you cannot use OSPF or IS-IS in a leaf-and-spine fabric with more than a few hundred nodes due to the inordinate amount of flooding traffic caused by the fabric topology.

This is the moment when a skeptic should say “are you sure BGP works any better?” and the answer is (not surprisingly) “not exactly”, at least if you get your design wrong.

EBGP or IBGP?

Most everyone understanding how BGP really works agrees that it makes more sense to use EBGP between leaf and spine switches than trying to get IBGP to work without underlying IGP, so we’ll ignore IBGP as a viable design option for the rest of this blog post.

You can run IBGP without IGP, and we had to make it work a long while ago because customers, but it’s somewhat counter-intuitive and requires additional configuration tweaks.
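In case you're wondering what those tweaks look like: run the IBGP sessions between directly-connected interface addresses (without an IGP there's no way to reach remote loopbacks), make the spine switches route reflectors, and rewrite the BGP next hop even on reflected routes. Here's a heavily-simplified FRR-style sketch of a spine switch configuration (the AS number and addresses are made up; the force keyword is FRR syntax, other implementations spell the same knob differently):

    router bgp 65000
     ! IBGP session with a leaf switch across the directly-connected link
     neighbor 10.0.1.2 remote-as 65000
     !
     address-family ipv4 unicast
      ! reflect routes between the leaf switches...
      neighbor 10.0.1.2 route-reflector-client
      ! ...and set the BGP next hop to self even on reflected routes
      neighbor 10.0.1.2 next-hop-self force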

Let’s make it simple…

Not knowing any better, let’s assume a simplistic design where every switch has a different AS number:

If your first thought was “didn’t you tell us we shouldn’t do it this way in the Leaf-and-Spine Fabric Architectures webinar,” you’re absolutely right. There’s a reason why this is a bad idea. Keep reading…
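To make the discussion a bit more tangible, here's a minimal FRR-style sketch of a leaf switch configuration in such a design (AS numbers, addresses and the advertised prefix are made up; every other switch in the fabric would use a different local AS):

    router bgp 65001
     ! leaf-1 sits in its own AS and runs EBGP with every spine switch
     neighbor 10.0.1.1 remote-as 65100
     neighbor 10.0.2.1 remote-as 65101
     !
     address-family ipv4 unicast
      ! advertise the locally-attached server subnet
      network 192.168.1.0/24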

Now imagine a leaf switch advertising a prefix:

  • It advertises the prefix to all spine switches;
  • Spine switches advertise the prefix to all other leaf switches;
  • Properly-configured leaf switches use all equal-cost BGP paths for traffic forwarding, but still select one of them as the best BGP path (that’s how BGP works when you don’t configure add-path functionality);

It’s worth mentioning that by default some common BGP implementations don’t do ECMP across EBGP paths, and don’t accept paths coming from different autonomous systems as equal-cost paths. You need two nerd knobs to get it working (see the configuration sketch right after this list).

  • Leaf switches advertise their best BGP path to all spine switches apart from the one that advertised the best path to them. Some BGP implementations might actually advertise the best path to the router they got the best path from.
  • Every single spine switch installs all the alternate BGP paths received from all leaf switches in the BGP table… just in case they might be needed.
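Speaking of the two nerd knobs: here’s what they might look like in an FRR-style configuration (a minimal sketch; the keywords vary across implementations, and maximum-paths 16 is an arbitrary value):

    router bgp 65001
     ! knob #1: treat EBGP paths with different (but equally long) AS paths as equal cost
     bgp bestpath as-path multipath-relax
     !
     address-family ipv4 unicast
      ! knob #2: install more than one BGP path in the forwarding table
      maximum-paths 16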

To recap: on most spine switches you’ll see up to N entries for every single prefix in the BGP table (where N is the number of leaf switches): one from the leaf switch originating the prefix, and one from every other leaf switch that didn’t select the path through the current spine switch as its best BGP path.
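A quick back-of-the-envelope example (the numbers are made up): in a fabric with 32 leaf and 4 spine switches, a spine switch would receive the path from the originating leaf plus a path from roughly three quarters of the other 31 leaf switches (the ones whose best path points at another spine), ending up with over twenty BGP paths for every single prefix in the fabric.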

Figuring out what happens when a leaf switch revokes one of its prefixes and how many unnecessary BGP updates are sent is left as an exercise for the reader.

Compare that to how OSPF flooding works and you’ll see that there’s not much difference between the two. In fact, BGP probably uses even more resources in this setup than OSPF because it has to run the BGP path selection algorithm whenever the BGP table changes, whereas OSPF separates the flooding and path computation processes.

Fixing the mess we made…

Obviously, we need to do something to reduce the unnecessary flooding. There’s not much one could do in OSPF or IS-IS (don’t get me started on IS-IS mesh groups), which is the real reason why you can’t use them in larger fabrics, and why smart engineers work on RIFT and OpenFabric.

What about BGP? There are two simple ways to filter the unnecessary announcements:

  • Configure outbound update filtering on leaf switches (or inbound update filtering on spine switches) to discard paths that traverse more than one leaf switch;
  • Use the same AS number on all spine switches and let BGP’s default AS-path-based filtering do its job.

Now you know why smart network architects use the same AS number on all spine switches and why RFC 7938 recommends it.
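If you’re stuck with unique AS numbers on the spine switches, the first option boils down to a one-line AS-path filter: permit only the empty AS path (locally-originated prefixes) in outbound updates on the leaf switches. A minimal FRR-style sketch (the access-list name and neighbor addresses are made up):

    ! match locally-originated paths (empty AS path)
    bgp as-path access-list LOCAL-ONLY permit ^$
    !
    router bgp 65001
     address-family ipv4 unicast
      ! advertise nothing but our own prefixes toward the spine switches
      neighbor 10.0.1.1 filter-list LOCAL-ONLY out
      neighbor 10.0.2.1 filter-list LOCAL-ONLY out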

The Dirty Details…

The effectiveness of default AS-path-based filtering depends on the BGP implementation:

  • Some implementations (example: Cisco IOS) send BGP updates to all EBGP neighbors regardless of whether the neighbor AS is already in the AS path. The neighbor has to process the update and drop it;
  • Other implementations (example: Junos) filter BGP updates on the sender side and don’t send BGP updates that would be dropped by the receiver (as always, there’s a nerd knob to twiddle if you want those updates sent).
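(In case you’re wondering, the Junos nerd knob in question is advertise-peer-as, configured under the BGP group or neighbor.)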

Finally, it’s interesting to note that using IBGP without IGP, with spine switches being BGP route reflectors tweaking BGP next hops in outbound updates, results in exactly the same behavior – another exercise for the reader if you happen to be so inclined.


10 comments:

  1. To answer your question: yes

  2. I don't know if one is absolutely better than the other, but we chose eBGP because we leverage the underlay not only for announcing VTEP /32s but also for many other things (multicast, avoiding double VXLAN encapsulation, filtering some prefixes, providing local subnets per leaf for services that don't require stretched L2 => multicast for instance :) etc...). Our MAN is based on IGP/iBGP; it was simpler to interconnect our underlay with eBGP than to play with OSPF/BGP redistribution, and it keeps an end-to-end BGP-based network. Fabric scaling wasn't a requirement when it came to choosing the underlay protocol.

  3. Maybe for small, mid-sized and semi-large fabrics Open EIGRP could have an opportunity. The topology table has all the successors ready to go. Limit your DUAL boundaries or use different ASs. There are always options and nerd knobs.

    1. No, it's not an option. We want multi-vendor interoperability, so only OSPF and IS-IS are left.

  4. Hannes Gredler (05 June 2018, 05:41):

    BGP gives you some pacing for free as flow control is receiver-driven (you close your TCP window, the sender stops) - IS-IS and OSPF have no flow control, so if you assume a naive implementation you could say that BGP gives you some edge. However, if you know what you're doing (read: get the I/O module right), you can make it work - regardless of the protocol.

  5. BGP avoids flooding . Its all unicast based update policy mechanism and its withdrawn routes can be softly send to neighbors for unnecessary control plane B.W. wastages.
    It is policy based and it can scale well for more than 100k routes.
    Convergence can be improved with BFD with some non-aggressive timers.
    Reliability and flow control is taken care by TCP .

    1. "BGP avoids flooding" << You did read the blog post, right? It does almost exactly the same thing as OSPF with the additional drawback of recomputing BGP tables every now and then during the convergence process.

      "Its all unicast based update policy mechanism" << Which is relevant how exactly?

      "and its withdrawn routes can be softly send to neighbors for unnecessary control plane B.W. wastages." << Please explain in more details. Thinking about how OSPF or BGP would revoke a route, I can't figure out what you're trying to tell me.

      "It is policy based" << hopefully not relevant in data center fabric underlay.

      "and it can scale well for more than 100k routes" << if that's relevant in underlay fabric you shouldn't be reading my blog posts but hire an architect who knows what he's doing ;))

      "Convergence can be improved with BFD with some non-aggressive timers." << Ever heard of BFD for OSPF?

      "Reliability and flow control is taken care by TCP" << Now that's the only relevant one. However, as Hannes wrote, it's all the question of whether you get the I/O module right. You can't miss much if you're forced to use TCP, but there's an LSA acknowledgement mechanism in OSPF that you could use for pacing (not saying anyone does that).

    2. Missed one: "BGP avoids flooding" << it does avoid _periodic_ flooding caused by database refreshes. Not configurable in OSPF (because RFC authors know better than you do what you need in your network), configurable in IS-IS.

  6. Hi Ivan

    Regarding the dirty details: by default, Cisco IOS XR also prevents sending BGP updates containing the AS configured on the remote peer. The feature is called as-path-loopcheck and is enabled by default. One important detail: if several BGP neighbors belong to the same BGP update-group, this check is disabled.

    Here is the command reference:

    https://www.cisco.com/c/en/us/td/docs/routers/crs/software/crs_r4-2/routing/command/reference/b_routing_cr42crs/b_routing_cr42crs_chapter_01.html#wp3145726977

    If for some reason customers want to disable this optimization (I know some MacGyvers), they can use the "as-path-loopcheck out disable" command.

    Best
    Fred

  7. it will be interesting to see what comes of the dynamic flooding (https://tools.ietf.org/html/draft-li-dynamic-flooding-04) work that's currently underway in LSR. imho, this is the most interesting work in this space in recent memory. dense graphs are no longer constrained to DC Clos environments, and these problems will likely start to manifest themselves in some WAN environments.

    this has the added perk of not relying on specific topological behaviors as some of the aforementioned approaches do.


