The EVPN Dilemma
Got an interesting set of questions from a networking engineer who got stuck with the infamous “let’s push the **** down the stack” challenge:
So I am a rather green network engineer trying to solve the typical layer two stretch problem.
I could start the usual “friends don’t let friends stretch layer-2” or “your business doesn’t really need that” windmill fight, but let’s focus on how the vendors are trying to sell him the “perfect” solution:
One thing I hear over and over from everyone (vendors especially) is how EVPN will solve all of my problems.
Every now and then vendors go on a lemming run promoting a miraculous technology. A few years ago it was either TRILL or SPB (depending on which chipset you were trying to sell), now it’s EVPN… which is a shame because EVPN is a decent technology.
The “solving all your problems” is the necessary component of this fairy tale. You would never buy from a vendor who would drop by and say “we can solve one of your problems, and you have to restructure your applications to get rid of the other 100”, right?
All I need to do is ditch my current IGP in favor of BGP
Another lemming run, this time along the lines of “if Petr Lapukhov did it at Microsoft it must be good”. While you could get a pretty minimalistic and simple design if you make BGP the only routing protocol in your fabric, you’d better do that with an implementation that was adapted to the new use of BGP, not a decades-old code base that needs a gazillion tweaks and just the right values of nerd knobs to make it work.
Oh, and some vendors messed up their implementation really badly, so they started promoting IBGP-over-EBGP (EVPN address family on IBGP sessions running between loopbacks advertised with IP address family on EBGP sessions running on point-to-point links) and using schizophrenic local-as mechanisms just to make it work. Then there was another vendor telling customers to run EBGP sessions on point-to-point links to exchange loopback prefixes, and another set of multihop EBGP sessions between the loopback interfaces of the same boxes to exchange the EVPN prefixes.
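To make the difference between these designs a bit more tangible, here is a minimal Python sketch that just enumerates the BGP sessions each one needs in a leaf-and-spine fabric. The design labels, the function name, and the assumption that overlay sessions run between every leaf and every spine are all mine, not something a vendor documents this way. The third design (one EBGP session per link carrying both address families) is included only for comparison.

```python
from itertools import product


def fabric_sessions(leaves, spines, design):
    """Return (leaf, spine, transport, address_families) tuples.

    design is a hypothetical label:
      "ibgp-over-ebgp" - EBGP on links (IPv4) plus IBGP between loopbacks (EVPN)
      "ebgp-over-ebgp" - EBGP on links (IPv4) plus multihop EBGP between loopbacks (EVPN)
      "single-ebgp"    - one EBGP session per link carrying both address families
    Assumption: overlay sessions run between every leaf and every spine.
    """
    links = list(product(leaves, spines))
    if design == "single-ebgp":
        return [(l, s, "ebgp-link", ("ipv4", "evpn")) for l, s in links]

    overlay_transport = {
        "ibgp-over-ebgp": "ibgp-loopback",
        "ebgp-over-ebgp": "ebgp-multihop-loopback",
    }[design]
    underlay = [(l, s, "ebgp-link", ("ipv4",)) for l, s in links]
    overlay = [(l, s, overlay_transport, ("evpn",)) for l, s in links]
    return underlay + overlay


leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
spines = ["spine1", "spine2"]
for design in ("ibgp-over-ebgp", "ebgp-over-ebgp", "single-ebgp"):
    print(design, len(fabric_sessions(leaves, spines, design)))
```

Under these assumptions both two-layer designs need twice as many sessions as there are leaf-to-spine links, which is part of why their configurations grow so quickly.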
… and well BGP is hard to configure so I also need to invest in an automation solution.
That’s another thing vendors are really good at - promoting the right stuff for the wrong reasons. Network automation is the right way to go, but if it’s sold as the only way to build BGP configurations for your data center fabrics (because of the copious amount of nerd knob settings you need) you chose the wrong vendor.
There are vendors focusing on making data center EVPN+BGP+MLAG configurations as simple as possible, but they lack the marketing muscle of the big guys and the glitzy customer events that CIOs love to mingle at. Just saying…
One other thing… EVPN doesn’t play well between vendors so there’s probably going to be lock in.
Well, the vendors are telling me they’re running interoperability workshops making sure the least-common-denominator EVPN implementations interoperate… but honestly, why would you want to build your data center fabric with switches from two vendors?
Unless you’re a member of the FANG club (in which case you’d probably run your own software on top of standardized products from two sources anyway), you’ll probably lose more money than you saved dealing with the operational complexity of running two platforms with two operating systems (I would, however, avoid using proprietary vendor features as much as possible). It’s like mixing AIX, Solaris and Linux in your servers. Who would ever want to do that unless a database company forces them to do it due to their licensing and litigation practices?
Oh and your current network equipment will need to be replaced as well.
Like when you’re trying to figure out whether to buy a new car, you have two options:
- Stick with the old stuff and live with the lack of features available in the new models;
- Invest in the new model and get the new features.
Funnily, if you happen to have a decent-sized installation under vendor support contract, it might be cheaper to ditch the old stuff and buy the new switches. We had customers that would make money just on that swap in a few years' time due to cheaper boxes and consequently lower support costs.
What’s the problem with a solution like GRE? I can leverage my current IGP, all of my equipment already supports it… and it works between vendors.
While there are plenty of vendors doing whatever-over-GRE (but maybe not on recent data center switches), and I'm told at least some Broadcom ASICs support NVGRE (but how would we know), I’m not aware of anyone shipping bridging-over-GRE in hardware, and if you plan to stretch your layer-2 domain over a 100 Mbps or 1 Gbps link (so you could use software-based forwarding), I have just one word for you: DON’T.
The question does make perfect sense though once you manage to replace GRE with VXLAN (see below).
Maybe trying to “tunnel” away all of our problems is the wrong solution to begin with. What are your thoughts on this?
There’s always RFC 1925 Rule 6A, but in the case of layer-2 segments artificially stretched beyond recognition (= beyond a single cable) tunneling makes perfect sense.
You could either try to bend the laws of physics and make bridging-with-STP work in an environment it was never designed for (what data center vendors tried to do with large-scale MLAG using proprietary technologies like VSS, vPC, IRF, VCF…), or you could give up, realize a routed fabric will always be more stable than a bridged hodgepodge, and start looking for a way to implement one.
In theory, you could build a routed fabric using MAC addresses (SPBM), yet-another layer-3-protocol (TRILL), or IP (VXLAN). I would go for VXLAN as we’ve been debugging IP routing protocols and IP forwarding for decades and thus they tend to work pretty well.
You could be smart and use VXLAN with preconfigured flooding lists and dynamic MAC learning (and I know people doing that in large-scale environments with great success) or you could buy into another vendor fairy tale that VXLAN with EVPN solves every problem you ever had.
Yet again, I’m not saying that EVPN is a bad technology, or that you wouldn’t benefit from using it (it might come in very handy in larger fabrics, or if you still insist on stretching the VLANs across WAN links), but in some cases the simplest solution is all you need, and VXLAN on top of whatever IP routing protocol you’re familiar with (even RIP would work) gets you pretty close to that goal.
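The "preconfigured flooding lists and dynamic MAC learning" approach mentioned above is simple enough to sketch in a few lines of Python. This is a toy model for a single VXLAN segment, not anyone's actual implementation: the `Vtep` class, its method names, and the addresses are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class Vtep:
    """Toy VTEP for one VXLAN segment (hypothetical model)."""
    ip: str
    flood_list: List[str]  # statically configured remote VTEP IPs
    mac_table: Dict[str, str] = field(default_factory=dict)  # MAC -> remote VTEP IP

    def receive(self, src_vtep_ip: str, src_mac: str) -> None:
        """Data-plane learning: remember which VTEP a source MAC sits behind."""
        self.mac_table[src_mac] = src_vtep_ip

    def destinations(self, dst_mac: str) -> List[str]:
        """Return the remote VTEPs that get a copy of a frame sent to dst_mac."""
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]  # known unicast: one tunnel
        return list(self.flood_list)  # broadcast or unknown unicast: replicate


vtep = Vtep(ip="10.0.0.1", flood_list=["10.0.0.2", "10.0.0.3"])
print(vtep.destinations("00:00:00:00:00:bb"))  # unknown MAC, flooded to the whole list
vtep.receive("10.0.0.3", "00:00:00:00:00:bb")  # a frame from that MAC arrives
print(vtep.destinations("00:00:00:00:00:bb"))  # now sent to a single VTEP
```

There is no control plane in this model at all: the only configuration is the flood list, and everything else is learned from traffic, exactly like a traditional bridge.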
You might find these webinars (part of ipSpace.net subscription) useful if you want to master the technologies I mentioned in this blog post:
- Data Center Infrastructure for Networking Engineers
- EVPN Technical Deep Dive
- VXLAN Technical Deep Dive
- Data Center Interconnects
- Designing Active-Active and Disaster-Recovery Data Centers
- Data Center Fabrics
- Leaf-and-Spine Fabric Architectures (includes routing protocol selection)
- FRRouting Architecture and Features
All these webinars and much more are included in our Building Next-Generation Data Center online course.
Many thanks to Dinesh Dutt and Nicola Modena for fact-checking and improving the blog post.
* Because some people believe they increase TCP/IP throughput (usually not true unless you're dealing with suboptimal TCP stacks)
* Because you don't want to deal with client MTU size in tunneling environments.
And yes, I completely agree with everything Peter wrote, but sometimes you have to choose the lesser of two evils.
"Pretty much all EVPN implementations support multiple routing protocols." << Correct. That's not necessarily what the vendor SEs are telling the customers.
"There’s a difference between a reference design and feature support." << Agree. One of the differences is how many bugs you'll encounter when using a supported feature that is rarely used.
"there are statements here that are hearsay" << OK, tell me more. Would love to hear what you consider hearsay (is it hearsay if a customer tells me how badly he was burned?) and very old news.
I'm guessing you know how to contact me directly if you want to take the discussion offline ;)
Kind regards, Ivan
The following blogger captures in detail EVPN over EBGP using Junos, dating back to 2016.
You can see the multi-hop EBGP sessions for EVPN required to set the loopback as the VTEP IP (i.e. BGP protocol next-hop).
If you want to set the BGP protocol next-hop to a physical port address, you could share the session with the underlay EBGP. However, with IP-based overlays like VXLAN, having more than one uplink IP interface can be a problem for implementing EVPN-native active-active multi-homing, which requires the source IP of a VXLAN tunnel from a PE to match the VTEP IP used on its EVPN ES routes to perform split-horizon filtering (as opposed to MLAG with anycast VTEP IP). Having more than one VTEP IP per PE makes EVPN-native multi-homing more complex to implement and requires more state to do the SH filtering.
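A simplified Python sketch of the source-IP-based split-horizon check described above may help. It assumes each PE keeps, per Ethernet segment, the set of VTEP IPs that other PEs advertised in their ES routes. The function name, the data layout, and the addresses are hypothetical.

```python
def allowed_segments(src_vtep_ip, local_segments, es_peers):
    """Return the local Ethernet segments a flooded frame may be sent to.

    src_vtep_ip    - outer source IP of the received VXLAN packet
    local_segments - ESIs attached to this PE
    es_peers       - ESI -> set of VTEP IPs learned from other PEs' ES routes
    A segment is skipped when the sender is attached to it as well.
    """
    return [
        esi for esi in local_segments
        if src_vtep_ip not in es_peers.get(esi, set())
    ]


# The peer PE advertised VTEP IP 10.0.0.2 for the shared segment "esi-1".
es_peers = {"esi-1": {"10.0.0.2"}}
local = ["esi-1", "esi-2"]

# Tunnel sourced from the advertised VTEP IP: the shared segment is filtered.
print(allowed_segments("10.0.0.2", local, es_peers))

# Same peer, but the tunnel is sourced from an uplink address (172.16.0.2):
# the lookup misses and the frame is flooded back to the shared segment.
print(allowed_segments("172.16.0.2", local, es_peers))
```

The second call is the failure mode: unless the PE also tracks every additional address its peer might use as the tunnel source, the filter silently stops working.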
Hopefully folks also know that when using EBGP with EVPN, it's important that intermediate EBGP hops do not rewrite the protocol next-hop set by the egress PE since we need the VXLAN tunnels to be addressed to the correct egress PE and not somewhere short of that.
All considered, IMO it's more straightforward to use IBGP with RRs for EVPN, rather than to force-fit EVPN into hop-by-hop EBGP route propagation. Anyway, it has been a long-standing convention to use IBGP for overlays within an instance of a transport domain. Messing with that means operators have to really understand what they are doing and why, and to me that isn't worth the fewer lines of configuration.
FYI, Junos leans toward being explicit about configuration, leaving it to automation to simplify management of the network. We think it is best for operators to know their network.
I'm told there was an issue with EVPN using EBGP prior to 14.x, but we're at 19.x now. We can take it offline from here if the discussion is not helpful for the community.
Hopefully this sets the record straight.
While I agree with many of your arguments, I think it doesn't make sense to use the same default behavior for the EVPN address family as we did for the IPv4 address family. Even in the VPNv4 address family some vendors have an option of specifying the BGP next hop on route origination; having a sane default (loopback VTEP) in a VXLAN environment should be a no-brainer.
More in a blog post sometime in January 2020 - too many other things to publish before Christmas break.
Enjoy the holidays and wish you all the best in 2020!