The EVPN Dilemma

I got an interesting set of questions from a networking engineer who got stuck with the infamous “let’s push the **** down the stack” challenge:

So, I am a rather green network engineer trying to solve the typical layer two stretch problem.

I could start the usual “friends don’t let friends stretch layer-2” or “your business doesn’t need that” windmill fight, but let’s focus on how the vendors are trying to sell him the “perfect” solution:

One thing I hear over and over from everyone (vendors especially) is how EVPN will solve all of my problems.

Now and then vendors go on a lemming run promoting a miraculous technology. A few years ago, it was either TRILL or SPB (depending on which chipset you were trying to sell); now it’s EVPN… which is a shame because EVPN is a decent technology.

The “solving all your problems” is the necessary component of this fairy tale. You would never buy from a vendor who would drop by and say, “we can solve one of your problems, and you have to restructure your applications to get rid of the other 100”, right?

All I need to do is ditch my current IGP in favor of BGP

Another lemming run, this time along the lines of “if Petr Lapukhov did it at Microsoft it must be good”. While you could get a pretty minimalistic and simple design if you make BGP the only routing protocol in your fabric, you’d better do that with an implementation that was adapted to the new use of BGP, not a decades-old code base that needs a gazillion tweaks and just the right values of nerd knobs to make it work.

Oh, and some vendors messed up their implementation really badly, so they started promoting IBGP-over-EBGP (EVPN address family on IBGP sessions running between loopbacks advertised with IP address family on EBGP sessions running on point-to-point links) and using schizophrenic local-as mechanisms just to make it work. Oh, and then another vendor told the customers to run EBGP sessions on point-to-point links to exchange loopback prefixes and another set of multihop EBGP sessions between the loopback interfaces of the same boxes to exchange the EVPN prefixes.
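To make the difference between these designs a bit more tangible, here is a minimal Python sketch that lists the BGP sessions a small leaf-and-spine fabric would need. Everything in it is an illustrative assumption rather than any vendor's actual design: one point-to-point link per leaf-spine pair, overlay sessions running only between leaves and spines, and a hypothetical "shared" baseline in which the EVPN address family rides on the underlay EBGP session.

```python
from itertools import product


def bgp_sessions(leaves, spines, overlay):
    """List the BGP sessions of a leaf-and-spine fabric (illustrative only).

    overlay is one of:
      "shared"        - EVPN address family on the underlay EBGP session
      "ibgp"          - extra IBGP session between loopbacks for EVPN
      "ebgp-multihop" - extra multihop EBGP session between loopbacks for EVPN
    Assumes one link per leaf-spine pair and leaf-to-spine overlay peering.
    """
    if overlay not in ("shared", "ibgp", "ebgp-multihop"):
        raise ValueError(f"unknown overlay design: {overlay}")

    sessions = []
    for leaf, spine in product(leaves, spines):
        # Underlay: EBGP on the point-to-point link, advertising loopbacks.
        families = ("ipv4", "evpn") if overlay == "shared" else ("ipv4",)
        sessions.append((leaf, spine, "ebgp", "link", families))
        # Overlay: a second session between loopbacks, carrying only EVPN.
        if overlay != "shared":
            sessions.append((leaf, spine, overlay, "loopback", ("evpn",)))
    return sessions


leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
spines = ["spine1", "spine2"]
for design in ("shared", "ibgp", "ebgp-multihop"):
    print(design, len(bgp_sessions(leaves, spines, design)))
```

Under these assumptions, both two-session designs need twice as many sessions to configure and troubleshoot as the single-session baseline (16 versus 8 in this four-leaf, two-spine example).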

… and well, BGP is hard to configure, so I also need to invest in an automation solution.

That’s another thing vendors are really good at - promoting the right stuff for the wrong reasons. Network automation is the right way to go, but if it’s sold as the only way to build BGP configurations for your data center fabrics (because of the copious amount of nerd knob settings you need), you chose the wrong vendor.

There are vendors focusing on making data center EVPN+BGP+MLAG configurations as simple as possible, but they lack the marketing muscles of the big guys and glitzy customer events that CIOs love to mingle at. Just saying…

It's also worth mentioning that most open-source BGP products like BIRD and GoBGP support BGP configuration simplifications similar to those in FRR, so they are evidently not that hard to implement.

One other thing… EVPN doesn’t play well between vendors, so there’s probably going to be lock-in.

Well, the vendors are telling me they’re running interoperability workshops to make sure the least-common-denominator EVPN implementations interoperate. But honestly, why would you want to build your data center fabric with switches from two vendors?

Unless you’re a member of the FANG club (in which case you’d probably run your own software on top of standardized products from two sources anyway), you’ll probably lose more money dealing with the operational complexity of running two platforms with two operating systems than you saved (I would, however, avoid using proprietary vendor features as much as possible). It’s like mixing AIX, Solaris, and Linux on your servers. Who would ever want to do that unless a database company forces them to do it due to their licensing and litigation practices?

Oh and your current network equipment will need to be replaced as well.

Like when you’re trying to figure out whether to buy a new car, you have two options:

  • Stick with the old stuff and live with the lack of features available in the new models;
  • Invest in the new model and get the new features.

Funnily, if you happen to have a decent-sized installation under a vendor support contract, it might be cheaper to ditch the old stuff and buy the new switches. We had customers who would make money just on that swap in a few years due to cheaper boxes and (consequently) lower support costs.

What’s the problem with a solution like GRE? I can leverage my current IGP, all of my equipment already supports it… and it works between vendors.

While there are plenty of vendors doing whatever-over-GRE (but maybe not on recent data center switches), and I'm told at least some Broadcom ASICs support NVGRE (but how would we know), I’m not aware of anyone shipping bridging-over-GRE in hardware. If you plan to stretch your layer-2 domain over a 100 Mbps or 1 Gbps link (so you could use software-based forwarding), I have just one word for you: DON’T.

The question makes perfect sense once you replace GRE with VXLAN (see below).

Maybe trying to “tunnel” away all of our problems is the wrong solution to begin with. What are your thoughts on this?

There’s always RFC 1925 Rule 6A, but in the case of layer-2 segments artificially stretched beyond recognition (= beyond a single cable) tunneling makes perfect sense.

You could either try to bend the laws of physics and make bridging-with-STP work in an environment it was never designed for (what data center vendors tried to do with large-scale MLAG using proprietary technologies like VSS, vPC, IRF, VCF…), or you could give up, realize a routed fabric will always be more stable than a bridged hodgepodge, and start looking for a way to implement one.

In theory, you could build a routed fabric using MAC addresses (SPBM), yet another layer-3 protocol (TRILL), or IP (VXLAN). I would go for VXLAN, as we’ve been debugging IP routing protocols and IP forwarding for decades, and thus, they tend to work pretty well.

You could be smart and use VXLAN with preconfigured flooding lists and dynamic MAC learning (and I know people doing that in large-scale environments with great success), or you could buy into another vendor fairy tale that VXLAN with EVPN solves every problem you ever had.
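Here is a toy Python model of what “preconfigured flooding lists and dynamic MAC learning” means for a single VXLAN segment. It is only a sketch under stated assumptions: the `Vtep` class, its method names, and the addresses are made up for illustration, and real switches do all of this in hardware with aging timers and per-VNI tables.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Vtep:
    """Toy VTEP for one VXLAN segment (hypothetical model, not a real API)."""

    ip: str
    flood_list: list[str]  # statically configured remote VTEP IP addresses
    mac_table: dict[str, str] = field(default_factory=dict)  # MAC -> remote VTEP

    def forward(self, dst_mac: str) -> list[str]:
        """Return the remote VTEPs that get a copy of a locally sourced frame."""
        if dst_mac in self.mac_table:
            # Known unicast: a single VXLAN-encapsulated copy.
            return [self.mac_table[dst_mac]]
        # Broadcast or unknown unicast: head-end replication to the flood list.
        return list(self.flood_list)

    def receive(self, src_mac: str, src_vtep: str) -> None:
        """Data-plane learning: bind the inner source MAC to the sending VTEP."""
        self.mac_table[src_mac] = src_vtep


vtep = Vtep("10.0.0.1", flood_list=["10.0.0.2", "10.0.0.3"])
print(vtep.forward("00:00:00:00:00:bb"))   # unknown MAC, flooded to both VTEPs
vtep.receive("00:00:00:00:00:bb", "10.0.0.2")
print(vtep.forward("00:00:00:00:00:bb"))   # learned MAC, sent to one VTEP only
```

The only thing you have to configure per switch is the flood list; everything else is learned from traffic, which is why this approach needs no control-plane protocol beyond the IP routing protocol that makes the VTEP addresses reachable.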

Yet again, I’m not saying that EVPN is a flawed technology or that you wouldn’t benefit from using it (it might be very handy in larger fabrics or if you still insist on stretching the VLANs across WAN links), but in some cases the simplest solution is all you need, and VXLAN on top of whatever IP routing protocol you’re familiar with (even RIP would work) gets you pretty close to that goal.

More Information

You might find these webinars (part of ipSpace.net subscription) useful if you want to master the technologies I mentioned in this blog post:

These webinars and many more are included in our Building Next-Generation Data Center online course.


Many thanks to Dinesh Dutt and Nicola Modena for fact-checking and improving the blog post.


13 comments:

  1. And of course the issues with MTU mismatch when doing any sort of tunneling. All the VMs will need changes to communicate efficiently.
    Replies
    1. Nobody sane goes down that rabbit hole. The only realistic option is to increase the underlay (transport) MTU.
    2. Not sure if you meant jumbo frames:
      https://netcraftsmen.com/just-say-no-to-jumbo-frames/
    3. There are two reasons for jumbo frames:

      * Because some people believe they increase TCP/IP throughput (usually not true unless you're dealing with suboptimal TCP stacks)
      * Because you don't want to deal with client MTU size in tunneling environments.

      And yes, I completely agree with everything Peter wrote, but sometimes you have to choose the lesser of two evils.
    4. Yes, sure. It is much better than other tricks like configuring hosts with a smaller MTU, PMTUD, MSS adjustment, and so on. Like every other tool, it has its good and bad sides.
  2. Honestly the future looks to be running some IGP on the servers themselves so they can keep their IP and can move anywhere without issue. Of course if your stupid app needs L2 to another host you are always going to be screwed.
  3. Disagree, you are just changing a default value, and it doesn’t harm you in any way... if you can’t keep your configs in sync with your intent (larger but consistent MTU on EVERY link) you have got bigger problems...
  4. This is where routing protocols like IS-IS can shine. Leveraging it for link discovery, link MTU validation, etc. can be advantageous. I like having choices for IGPs, but if we build solutions that only converge on one routing protocol, isn't that harmful?
    Replies
    1. That's probably why it was used in FabricPath. Unfortunately, the best tech isn't always the prevailing candidate.
  5. Pretty much all EVPN implementations support multiple routing protocols. Including IS-IS, OSPF and EBGP as IGP. There’s a difference between a reference design and feature support. Also, with all due respect, there are statements here that are hearsay, or at best very old news, and so do unnecessary harm. This stuff is mostly software and software is not static.
    Replies
    1. Hi Aldrin,

      "Pretty much all EVPN implementations support multiple routing protocols." << Correct. That's not necessarily what the vendor SEs are telling the customers.

      "There’s a difference between a reference design and feature support." << Agree. One of the differences is how many bugs you'll encounter when using a supported feature that is rarely used.

      "there are statements here that are hearsay" << OK, tell me more. Would love to hear what you consider hearsay (is it hearsay if a customer tells me how badly he was burned?) and very old news.

      I'm guessing you know how to contact me directly if you want to take the discussion offline ;)

      Kind regards, Ivan
  6. Hi Ivan,

    The following blog post, dating back to 2016, captures EVPN over EBGP using Junos in detail.
    https://blog.noc.grnet.gr/2016/09/28/lab-on-evpn-vxlan-on-juniper-qfx5100-switches-3/
    You can see the multi-hop EBGP sessions for EVPN required to set the loopback as the VTEP IP (i.e. BGP protocol next-hop).

    If you want to set the BGP protocol next-hop to a physical port address you could share session with underlay EBGP. However, with IP-based overlays like VXLAN, having more than one uplink IP interface can be a problem for implementing EVPN-native active-active multi-homing which requires the source IP of a VXLAN tunnel from a PE to match the VTEP IP used on its EVPN ES routes to perform split-horizon filtering (as opposed to MLAG with anycast VTEP IP). Having more than one VTEP IP per PE makes EVPN-native multi-homing more complex to implement and requires more state to do the SH filtering.

    Hopefully folks also know that when using EBGP with EVPN, it's important that intermediate EBGP hops do not rewrite the protocol next-hop set by the egress PE since we need the VXLAN tunnels to be addressed to the correct egress PE and not somewhere short of that.

    All considered, IMO it's more straight-forward to use IBGP with RRs for EVPN, rather than to force-fit EVPN into hop-by-hop EBGP route propagation. Anyway it has been a long-standing convention to use IBGP for overlays within an instance of a transport domain. Messing with that means operators have to really understand what they are doing and why, and to me that isn't worth the fewer lines of configuration.

    FYI, Junos leans toward being explicit about configuration, leaving it to automation to simplify management of the network. We think it is best for operators to know their network.

    I'm told there was an issue with EVPN using EBGP prior to 14.x, but we're at 19.x now. We can take it offline from here if the discussion is not helpful for the community.

    Hopefully this sets the record straight.

    Cheers,
    Aldrin
    Replies
    1. Thank you for the extensive reply!

      While I agree with many of your arguments, I think it doesn't make sense to use the same default behavior for the EVPN address family as we did for the IPv4 address family. Even in the VPNv4 address family some vendors have an option of specifying the BGP next hop on route origination; having a sane default (loopback VTEP) in a VXLAN environment should be a no-brainer.

      More in a blog post sometime in January 2020 - too many other things to publish before Christmas break.

      Enjoy the holidays and wish you all the best in 2020!
      Ivan