BGP in EVPN-Based Data Center Fabrics

EVPN is one of the major reasons we’re seeing BGP used in small and mid-sized data center fabrics. In theory, EVPN is just a BGP address family and shouldn’t impact your BGP design. However, suboptimal implementations might invalidate that assumption.

I've described a few EVPN-related BGP gotchas in the BGP in EVPN-Based Data Center Fabrics section of the Using BGP in Data Center Leaf-and-Spine Fabrics article.

Alex raised several valid points in his comments to this blog post. While they don’t fundamentally change my view on the subject, they do warrant a more nuanced explanation.

23 comments:

  1. """In these cases, run EVPN over IBGP sessions assuming you can use an IGP as your routing protocol. Run away from vendors that try to sell you the idea of running EBGP between leaf and spine switches, and IBGP between leaf switches on top of intra-fabric EBGP."""

    Can you please explain your logic behind this statement?
    What's the difference between "underlay IGP + overlay iBGP" and "underlay eBGP + overlay iBGP" cases?
    In the latter case you just need to use two different AS numbers on the leaf switch - one unique eBGP AS per ToR and one single AS for the iBGP overlay.
    Or, in other words, you need two independent BGP sessions on the ToR switch - one for the underlay and another for the overlay.
    Replies
    1. Would it work? Probably.

      Is it sane? We probably disagree on this aspect.

      Is it supported? I would love to see how many vendors officially support this (apart from the one that I've seen using this design), but I won't waste my time investigating it.

      Is it easy to understand? Yet again, I would love to see how you explain this to the guy that has to do troubleshooting @ 2AM on Sunday morning.

      Will someone go and design a customer network this way? Sure.

      Will they be upset when the customer reads my blog post? Probably.
    2. In other words, Juniper isn't a big enough player in this market for you to pay attention? So sad...
    3. This design (eBGP + iBGP) looks confusing only in the "industry-standard CLI".
      In JunOS it is simple and logical - clear separation of underlay and overlay:

      alex@vQFX1# show protocols bgp
      group underlay {
          type external;
          export direct;
          local-as 65011;
          family inet {
              unicast;
          }
          multipath multiple-as;
          neighbor 192.168.0.0 {         ### SPINE1
              peer-as 65001;
          }
          neighbor 192.168.0.4 {         ### SPINE2
              peer-as 65002;
          }
      }
      group overlay {
          type internal;
          local-as 65000;
          local-address 11.11.11.11;     ### loopback
          family evpn {
              signaling;
          }
          multipath;
          neighbor 2.2.2.2;              ### EVPN RR1
          neighbor 1.1.1.1;              ### EVPN RR2
      }

      I don't think it's any more complex or confusing than the no-nexthop-change option in the eBGP-only design.


      I hope you'll find some time to look at the Juniper design options for EVPN fabrics; they have some pretty good stuff there.
      For example, this book https://www.juniper.net/us/en/training/jnbooks/day-one/data-center-technologies/data-center-deployment-evpn-vxlan/ - part 3 covers the eBGP+iBGP design.
      Apart from that, they managed to implement proper EVPN multihoming, not MLAG-dependent kludges.
    4. @Alex: "In other words, Juniper isn't big enough player in this market for you to pay attention? So sad..."

      Just because someone has a non-zero market share doesn't mean that

      (A) I'll actively track everything they do. There are too many more interesting things out there, and I never claimed to be an industry analyst. With many vendors I have friendly SEs who send me an occasional email saying "read this".

      (B) I agree with what they're doing just because they are Vendor X. Every vendor gets upset with my views every now and then. Now it looks like it's Juniper's turn ;)

      (C) What they're doing makes sense. Remember that we're talking about reasonably-sized data center fabrics... and even if you have a data center fabric use case that needs EVPN at scale where IGP is no longer a viable underlay option, I'd guess it's an outlier due to very peculiar circumstances.

      Finally, I was never talking only about configuration complexity (the related IOS or EOS configuration is about as complex as the Junos one) but about the complexity of what's going on behind the scenes. Also, you might have missed the "ignore AS-path check" tweak (or does Junos turn off BGP loop prevention logic by default?)

      As for "MLAG-dependent kludges", I'm pointing that out every time I talk about EVPN and MLAG, but just because they did one thing right doesn't mean that everything else they do makes equal sense.

      As I wrote above, I think we can agree that we disagree on whether this is sane and move on. At least I will.
    5. "Also, you might have missed the "ignore AS-path check" tweak (or is Junos turning off BGP loop prevention logic by default?)"

      I used to think about eBGP and iBGP in this design as two completely independent protocols that don't share routes with each other. So I don't really understand your concern about loop prevention here... Maybe I'm missing something.

      But anyway, thank you for the detailed explanation of your point of view. I really appreciate your hard work clarifying the (sadly) complex world of modern networking.
    6. Hmm, I thought you had this running in a live/PoC network.

      If you want to run IBGP between ToR switches, they all have to be in the same AS. If the spine is in a different AS, you have the "I'm receiving EBGP prefixes originating from my own AS" problem, which usually requires the "allowas-in" tweak or whatever it's called on a specific platform.
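
      On Junos, the rough equivalent would probably be a "loops" setting on the local AS - just a sketch with a made-up AS number:

      routing-options {
          autonomous-system 65000 loops 2;    ### tolerate our own AS in received AS paths
      }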

    7. We have this design running successfully for one of our customers, and I also played a lot with this config during my JNCIE-DC preparation. By "missing something" I meant something on the internal BGP implementation side.

      And, sorry, again I don't understand your logic... In this design the spine does not receive any EVPN routes at all. All the spine sees is eBGP IPv4 routes from the leaves, like in a simple L3-only fabric.

      The leaf switch doesn't need "to be" in one single AS - it uses one AS number for eBGP, and another completely independent AS number for iBGP. Look at the config example above - that's the complete BGP config, and the AS number is not configured under the routing-options stanza. So in this example the leaf switch is in AS 65011 for eBGP and in AS 65000 for iBGP simultaneously.
    8. I read the Junos documentation on "local-as", and it seems to me this feature is the usual "lie about what my AS number is", not "I can run two BGP instances with different AS numbers". I wouldn't say that masquerading your AS number reduces complexity ;)
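
      What I mean, as a sketch reusing the AS numbers from the config above: even with a "real" AS configured under routing-options, local-as only changes the AS presented on that group's sessions - it's still a single BGP instance:

      routing-options {
          autonomous-system 65000;            ### the single BGP instance's AS
      }
      protocols {
          bgp {
              group underlay {
                  type external;
                  local-as 65011;             ### masquerade as AS 65011 towards the spines
                  neighbor 192.168.0.0 {
                      peer-as 65001;
                  }
              }
          }
      }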
    9. Summarized my views regarding this subject in this blog post: http://jncie.tech/2018/01/28/bgp-design-options-for-evpn-in-data-center-fabrics/
    10. And we're mostly in agreement ;) Thank you - will make sure to add the link to your blog post to the article once I find the time to update it.
  2. IMHO the eBGP-only design is the one that should be avoided for one simple reason - why bother your spine switches with EVPN routes from all connected leaf switches (and EVPN generates a LOT of routes)?
    In your own words - "complexity should belong to edges".
    If you need an RR for iBGP (of course you do) - just use your DC GW routers or, even better, virtual routers like vMX. See the sketch below.
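    To illustrate, a rough sketch of the RR side, reusing the addresses from the config above (not a verified config) - it's just another iBGP group with a cluster ID:

    protocols {
        bgp {
            group overlay-rr {
                type internal;
                local-as 65000;
                local-address 1.1.1.1;        ### this RR's loopback
                family evpn {
                    signaling;
                }
                cluster 1.1.1.1;              ### act as route reflector for the leaves
                neighbor 11.11.11.11;         ### leaf 1 loopback
            }
        }
    }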
    Replies
    1. I agree with this part - but then you should go for IBGP + IGP. Running a single layer-2 domain across a data center fabric that is big enough to require EBGP as the routing protocol (= more than a few hundred switches) is asking for disaster.

      Most organizations that deal with networks of this size try to keep layer-2 domains as small as possible or move the problem to the real network edge - the hypervisors.
  3. Hi -

    I have read documentation from a Vendor that claims they offer two variants of EVPN:
    IGP underlay + iBGP Overlay
    eBGP underlay + eBGP Overlay

    The latter would seem neat as a single eBGP session from leaf to spine would support two "address families" (IPv4 for the underlay and EVPN for the overlay). However, as Alex suggests above, it would seem to result in bloated forwarding tables on the spines as they have to retain (and process) all the EVPN routes.
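
    Just to illustrate the idea in Junos-style syntax (a hypothetical sketch, addresses and AS numbers made up) - one EBGP group carrying both families:

    group fabric {
        type external;
        local-as 65011;
        family inet {
            unicast;              ### underlay routes
        }
        family evpn {
            signaling;            ### overlay on the same EBGP session
        }
        neighbor 192.168.0.0 {
            peer-as 65001;
        }
    }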

    However, the vendor claims to have a config "switch" that prevents this - does this ring true? If so, it would seem to suggest their eBGP model would be preferable?


    Regards,
    James
    Replies
    1. This magic config "switch" prevents EVPN routes from being installed in the forwarding table on the spines, but what about the routing tables on the spines and all the related memory/CPU load?
      And also think about the possibility of implementing an EVPN overlay in an already-running eBGP IP fabric - would you want to add another address family to already-established eBGP sessions?
    2. Fair point about adding a new AF to an existing BGP session - but I was thinking in the context of a new build.
      Do you think PBB-EVPN has a part to play within the DC?
  4. There's nothing wrong with eBGP in place of a traditional IGP, with iBGP for the EVPN overlay. This is consistent with using IS-IS or another IGP with iBGP for the EVPN overlay; here we just swap out IS-IS/OSPF for RFC 7938. Any other approach seems divergent.
  5. The context of the original blog post is incorrect. I've built multiple fabrics with eBGP underlays and iBGP overlays hosting 50K MACs and it works perfectly (assuming you know what you're doing). Disappointing to see people stating incorrect views/interpretations as fact.
    Replies
    1. In case you're referring to my "Run away from vendors that try to sell you the idea of running EBGP between leaf and spine switches, and IBGP between leaf switches on top of intra-fabric EBGP", you might have noticed it's my view or opinion, based not on the technical viability of the solution (yes, it works) but on its unnecessary complexity.

      You might consider it incorrect, and I have no problem with that. I might also consider anonymous comments saying "my concoction works perfectly assuming you know what you're doing" highly irrelevant, as it clearly applies to a zillion "just because you could doesn't mean that you should" ideas, and your "assuming you know what you're doing" remark only validates my opinion. Thanks for that ;)
  6. No worries. I have no idea who you are or what you know (Google brought me here in a search), but if you believe segregating the control and data planes via eBGP and iBGP is complex, then running an IP fabric isn't for you (and there's nothing wrong with that). Finding good talent is incredibly hard. There are way more mediocre network engineers out there with a limited working knowledge of BGP than truly talented ones who really understand the protocol inside and out. Apologies for the anonymous posts, but circumstances dictate. Best of luck for the future with your site.
    Replies
    1. Nicely played, Mr. Anonymous, but I'm not biting ;)
  7. Interesting discussion guys, and there's nothing "technically" wrong with using a single instance of eBGP in an IP fabric. But... with a link failure you could potentially have millions of MAC routes withdrawn and relearned (or at least we do at the scale we build/deploy to). When using eBGP underlays and iBGP overlays, link failures don't cause the iBGP overlay (with our ~1M MAC routes) to re-converge and impact performance. I understand it's added complexity on the front end, but BGP control/data plane segregation will save you issues down the road when dealing with large IP fabrics. Granted, our IP fabric networks are larger than most and not typical, but after looking at this problem space for 3 years this is the solution that works for us. As always, your mileage may vary.
    Replies
    1. Absolutely agree: at your scale it's either IBGP overlay on top of EBGP (I'm guessing you have hundreds of switches) or overlay networking in the hypervisor (which seems not to be an option).

      Would love to know more if you could share the details (offline).