The EVPN/BGP Saga Continues

Aldrin wrote a well-thought-out comment to my EVPN Dilemma blog post explaining why he thinks it makes sense to use Juniper’s IBGP (EVPN) over EBGP (underlay) design. The only problem I have is that I forcefully disagree with many of his assumptions.

He started with an in-depth explanation of why EBGP over directly-connected interfaces makes little sense:

The following blog post captures EVPN over EBGP on Junos in detail, dating back to 2016. You can see the multi-hop EBGP sessions for EVPN required to set the loopback as the VTEP IP (i.e. the BGP protocol next hop). [More about BGP next hop handling in the original comment].

I agree with Aldrin that the best way to get the desired BGP next hop in the Junos EVPN implementation is to run the BGP session between loopback interfaces (to be more precise, you have to use the same loopback interface for the BGP session and the VTEP IP)… but that’s only because they decided not to set the correct BGP next hop when inserting the local EVPN route into the local BGP table.

Setting BGP next hop to a specific value when originating a BGP route is not a novel idea - it took me about 30 seconds to find it in RFC 1403 (OK, I’m old enough to know where to look). Assuming an EVPN device wants to receive VXLAN packets with a specific destination IP address, I can’t figure out a good reason why it wouldn’t set that IP address as the BGP next hop when originating the route. Asking the network operator to run BGP sessions between loopback interfaces just so the BGP next hop would be set to the right value is stupid.
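
To illustrate the point: setting the next hop in an outbound routing policy is a decades-old capability. Here’s a minimal Junos-style sketch (names and addresses are made up, and I haven’t verified whether Junos would apply such a policy to locally-originated EVPN routes - that’s exactly the gap I’m complaining about):

    /* Hypothetical policy: set the BGP next hop of outgoing EVPN routes
       to the local VTEP (loopback) address */
    policy-options {
        policy-statement SET-VTEP-NEXT-HOP {
            term evpn-routes {
                from family evpn;
                then next-hop 192.0.2.11;
            }
        }
    }
    protocols {
        bgp {
            group fabric-ebgp {
                export SET-VTEP-NEXT-HOP;
            }
        }
    }

The point is not the exact syntax; the point is that the EVPN implementation could set the desired VTEP address as the BGP next hop at route origination without any extra configuration.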

There might be something in the Junos EVPN implementation that prevents them from setting the BGP next hop on route origination, but if that’s the case, please document that limitation and move on. Trying to persuade me that I should consider a workaround a sane design won’t work.

Moving on, Aldrin mentioned BGP next hop handling on EBGP sessions:

Hopefully folks also know that when using EBGP with EVPN, it’s important that intermediate EBGP hops do not rewrite the protocol next-hop set by the egress PE since we need the VXLAN tunnels to be addressed to the correct egress PE and not somewhere short of that.

Please make up your mind. Either your product marketing promotes EBGP as the intra-domain routing protocol and your engineering supports that assumption with sane defaults, or you tell people to stick with the more traditional IGP+IBGP design and get rid of all the complexities.

For example, the FRRouting implementation decided to go all-in and uses next-hop-unchanged as the default behavior on EBGP EVPN sessions (which nicely matches Cumulus’ EBGP-only design). Most other vendors behave like deer in headlights.
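
To show how simple that can be, here’s a minimal FRRouting-style sketch of a leaf switch running a single EBGP session per fabric link for both the underlay and the EVPN overlay (ASN, interface name and loopback address are made up; this is an illustration, not a verified configuration):

    ! Single-hop EBGP over a directly-connected (unnumbered) fabric link
    router bgp 65011
     neighbor swp1 interface remote-as external
     !
     ! Underlay: advertise the loopback (VTEP) address
     address-family ipv4 unicast
      network 192.0.2.11/32
     exit-address-family
     !
     ! Overlay: EVPN on the same session; FRRouting keeps the EVPN
     ! next hop unchanged on EBGP sessions by default
     address-family l2vpn evpn
      neighbor swp1 activate
      advertise-all-vni
     exit-address-family

With defaults like these, the VTEP (loopback) address set by the egress switch survives hop-by-hop EBGP propagation across the fabric.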

All considered, IMO it’s more straightforward to use IBGP with RRs for EVPN, rather than to force-fit EVPN into hop-by-hop EBGP route propagation.

If you want to use the same defaults you’ve used for the last 30 years, then YES, please use IBGP… but also stick to the designs we’ve been using during those 30 years, like running IBGP on top of OSPF or IS-IS. Those designs were good enough for the largest ISPs on this planet… and now all of a sudden they’re not good enough for enterprise data centers with a few dozen switches? Could someone please stop this lemming run before everyone hits the cliff?
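
For comparison, here’s what the traditional design looks like on a spine switch, again as a rough FRRouting-style sketch (IS-IS as the underlay IGP, IBGP EVPN sessions between loopbacks, spines acting as route reflectors; ASN, NET, addresses and interface names are made up):

    ! Underlay: IS-IS on the loopback and the fabric links
    router isis FABRIC
     net 49.0001.0000.0000.0001.00
     is-type level-2-only
    !
    interface lo
     ip router isis FABRIC
    !
    interface swp1
     ip router isis FABRIC
    !
    ! Overlay: IBGP EVPN between loopbacks, spine as route reflector
    router bgp 65000
     neighbor LEAVES peer-group
     neighbor LEAVES remote-as internal
     neighbor LEAVES update-source lo
     neighbor 192.0.2.11 peer-group LEAVES
     !
     address-family l2vpn evpn
      neighbor LEAVES activate
      neighbor LEAVES route-reflector-client
     exit-address-family

No next-hop tweaks are needed: the IGP provides reachability to the loopbacks, and IBGP leaves the next hop (the egress VTEP) alone by default.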

Anyway it has been a long-standing convention to use IBGP for overlays within an instance of a transport domain.

And it has been a long-standing convention to use OSPF or IS-IS as the underlying routing protocol in said transport domain. Unfortunately most data center switching vendors think they need to be hip, and try to persuade us how well star-shaped pegs fit into round holes (hint: they do if you have a big-enough hammer).

FYI, Junos leans toward being explicit about configuration, leaving it to automation to simplify the management of the network.

The best reply to this line of reasoning is a comment I got from an attendee of my recent VMware NSX, Cisco ACI or Standard-Based EVPN workshop:

For medium businesses (IT dept of 5-10 people) who have never heard of BGP, VNID, multicast… EVPN is simply scary as hell. If they have to choose, they go for ACI - presented as a single point of management that magically solves world hunger and is “anywhere” - versus BGP EVPN being marketed as complex (people don’t want to hear that they need to configure 70 lines of code per VXLAN per switch in the fabric versus a few clicks in ACI).

I would understand why someone within Cisco would have a vested interest to make EVPN configurations complex, but I fail to see why Arista and Juniper are playing along.

Long story short: I’ve seen two simple and successful ways of deploying EVPN in non-hyperscale VXLAN-based data center fabrics - running IBGP over OSPF/IS-IS, or running EBGP between directly-connected interfaces with sane BGP next-hop defaults on BGP route origination and propagation. Everything else is just piles of unnecessary complexity that scares people away from what is otherwise the best technology to use if you really want to build layer-2 fabrics (and your vendor doesn’t believe in SPB).

Final note: Want to get beyond vendor arguments and marketing? You might want to watch our EVPN Technical Deep Dive webinar - over 10 hours of (mostly) vendor-neutral fact-based material.

14 comments:

  1. I fully agree. The classic IGP+iBGP combination is well tested and works fine in many scenarios. Inventing new solutions when encountering new problems is fine, but inventing stuff to be "hip" is just asking for trouble.
  2. Just my personal opinion, but I fail to understand why we need to build the underlay with BGP. I get that manufacturers have gone to great lengths to improve BGP failover times, but everyone acts like OSPF and IS-IS are boring dead protocols that confuse people.

    I actually prefer IS-IS because its existence before IP pervasiveness means it doesn’t rely on IP to function. If VXLAN is the new layer 2, why burn a bunch of time addressing interfaces in your underlay? OSPFv3 a la IPv6 link-local SLAAC, or the simplicity of a loopback and ip unnumbered with IS-IS, makes Zero Touch way simpler. Less state for the fabric to maintain, and link-state protocols provide an explicit border, because when you touch another fabric or the border of your lab you use EBGP anyhow... but maybe I’m just old school.
    Replies
    1. This is just because you happen to have apps that need to be routed in the underlay (NSX and multicast apps, to name a few) and you don’t want to deal with IGP/BGP redistribution because it adds no value to the network infrastructure. It would be overkill to run Geneve on top of VXLAN for the former, and you might not be able to make the apps work for the latter (while waiting for a working EVPN RT-6 to RT-8 implementation).
  3. eBGP in the underlay (and overlay) makes a lot of sense; we could go into great depth on fast convergence, route resolution, and recursiveness.
    Should you want to run an IGP for the underlay, you don’t have to stick to vanilla link-state protocols.
    There’s RIFT (I’m biased - but this is really great work), and there are extensions to link-state protocols that reduce flooding (which is the real issue of using IGPs at scale) - draft-ietf-lsr-dynamic-flooding, implemented in NX-OS and EOS
    Replies
    1. Would love to hear why eBGP in the underlay would converge faster and/or better than IS-IS or OSPF (assuming similar quality of implementation)... and thanks for the dynamic flooding pointer!
  4. Hi there,
    I reckon Juniper has just applied its native SP knowledge and way of architecting L2/L3 transport services to the DC, possibly within the boundaries of some of the boxes' limitations. Others, such as Cumulus, had the ability and the chance/will to go beyond an SP-to-DC configuration adaptation and managed, if I am not mistaken, to ultra-simplify the fabric overlay and underlay configuration with nothing more than unnumbered eBGP sessions between IPv6 link-local addresses carrying multiple BGP address families (typically IPv4 for the underlay and VPNv4/EVPN for the overlay), which I find marvellously elegant, simple/compact, and therefore less expensive to manage and operate.

    Regarding eBGP building the underlay: it is just the WRONG protocol for that purpose, but it is currently the only viable option in big DCs (medium/small DCs are not excused, I reckon!!!). The underlay IS THE job for an IGP, but the industry needs an IGP thought out/designed for dense topologies such as those present in big fabrics. Furthermore, with an IGP in the DC/fabric, Segment Routing would follow naturally/natively too, and it would also be possible to think about deploying fabric topologies other than Clos that provide even denser connectivity for, say, even lower-latency requirements.

    Cheers and keep smiling
    Andrea
    Replies
    1. Hi Andrea,

      I wonder where "only viable option in big DCs" comes from? Well-designed OSPF was good enough for networks with tens of thousands of nodes, and as Dinesh Dutt explained in Software Gone Wild episode 92, there are very large environments happily running OSPF or IS-IS.

      Kind regards,
      Ivan
    2. Hello Ivan,
      To be honest with you, when I saw all that effort going into defining new routing protocols for ultra-dense topologies, I assumed there must have been a clear and strong requirement within the industry/academia fuelling it. My assumption got even stronger with the evidence of so many deployments using eBGP as the Clos underlay protocol. What I did not know (I have just listened to Dinesh on Software Gone Wild episode 92, as you pointed out) is that well-engineered multi-level/multi-area IS-IS and OSPF implementations build the underlay of pretty big/dense multi-pod/multi-area Clos production networks today. Well, this makes me even happier then, as I am a big fan of the IGP+BGP approach to the underlay/overlay L2/L3 transport layout!!

      Having said that, though, Ivan, and thinking of OSPF here, I am guessing that a better IGP for dense topologies would have allowed those very same networks Dinesh mentioned in the podcast to be built as single-area IGP fabrics, which is way easier than a multi-area design – not to mention how happy traditional TE and SR-TE would be too. I also wonder whether the OSPF implementations of some of the newer vendors in this space (i.e. not the usual 4-5 suspects) are mature enough for a production network, especially when acting as OSPF ABRs. So a ‘dense-topology’ IGP would, I guess, also make life easier (and thus cheaper) for the code in this regard, as it would allow a single-area design.
      The other aspect I am not sure about regarding eBGP as the underlay is that it has always seemed to me like yet another chance for vendor X to sell you an automation engine (culture and products) to deal with the additional proliferation of BGP policies and config….
      Having said that, a better IGP might not be needed right now (we should see what the guys working on it think... Jeff??), but I guess we should still take this as an opportunity to start working on a better IGP for future use?

      Cheers and keep smiling
      Andrea
  5. Ivan, It’s your blog and your narrative. I clearly didn’t think that part through. Wasn’t expecting this kind of vitriol.
    Replies
    1. Dear Aldrin,

      Rereading what I wrote, I have to admit that even though I waited a long while before writing the blog post (and then toned it down), I used more colorful language than usual... but it makes me immensely sad and bitter when I see excellent engineers making spurious technical arguments to justify (what seems to me to be) product marketing decisions that I simply can't agree with.

      I apologize if I offended anyone with a direct and somewhat harsh opinion (I was never known for my diplomatic skills), and hope we'll find a technical topic in the future where our opinions won't be so opposite, as I always enjoyed the technical discussions we had.

      Kind regards,
      Ivan
    2. There’s only so much of the picture that can fit in the comments section. The grander EVPN story starts with its background and continued evolution across multiple domains and use cases. What is being called out here is a very specific data center reference design doc — there’s a story behind that too, and it’s not marketing.

      For example, a key goal of that doc is to expose what our fabric controller is driving under the covers. It starts with the basic use case of a simple IP fabric. Some folks don’t need overlays. OSPF/ISIS is not ideal for very large fabrics, so EBGP was chosen to avoid deploying different solutions for different-size IP fabrics in the same company (think large enterprise or SP with many fabrics of different sizes geographically dispersed). Not perfect either, so meanwhile Tony P (Juniper) invented RIFT, which sparked other fabric-optimized IGP efforts in the IETF (Juniper was behind EVPN too — https://tools.ietf.org/html/draft-raggarwa-mac-vpn-01). When one of these lands we would swap out EBGP in the solution. However, operators are free to replace EBGP with OSPF or ISIS if they see fit (and understand the flooding inefficiencies in large dense topologies). I started my comment in the original blog with this statement, which you ignored in this blog.

      Then we added the overlay use case on top of the IP fabric use case solution. Many of our larger customers don’t want ANY overlay/tenant state in P routers (control or data plane). So instead of the controller pushing different solutions for different customer types and fabric sizes (this too is complexity) we chose to keep it consistent given the outcome is the same in every case. In fact some customers have hosts in overlays and hosts in the underlay simultaneously, since only a subset of their endpoints need to be segmented away from the larger set, and/or they are migrating to host-based overlays or cloud-native application models.

      The controller hides the verbosity (explicit config), but when an operator has to troubleshoot, the detail is there under the covers. No magic that leads to incorrect assumptions at a time of sheer panic. I ran a very large production network for 20 years and rejected special sauce for this reason. We provided a doc that exposes what we do under the controller. That’s all. It’s not marketing. You too have blogged many times about knowing your network. How can you know it if you can’t see it?

      There’s more to the story that I hope we can talk about first. It’s not all black and white.
  6. Let me pose a problem for you. This time I hope to hear your own recommendations.

    How does the network know that a VTEP is actually alive, (1) from the point of view of the control plane and (2) from the point of view of the data plane? And how do you ensure that the control-plane and data-plane liveness monitoring have the same view? BFD for BGP is a possible solution for (1), but it’s not meant for third-party next hops, i.e. it doesn’t address (2). For (2) there is BFD for VXLAN, which doesn’t address (1), i.e. the control plane doesn’t know of the failure and so hasn’t offered an alternative. Moreover, solutions for (2) tend to break down without ASIC assist when the number of tunnels is huge, so solutions can’t assume that the HW support exists.
  7. Ah, another episode of the never-ending saga. And again some mythical Junos limitations... But wait, what am I doing wrong here? http://jncie.tech/2020/02/26/juniper-evpn-bgp-options-ebgp-only-design/
    Replies
    1. It wasn’t me making those arguments... what do I know about what’s really going on...