Using 4-Byte BGP AS Numbers With EVPN on Junos

After documenting the basic challenges of using EBGP and 4-byte AS numbers with EVPN automatic route targets, I asked my friends working for various vendors how their implementations solve these challenges. This is what Krzysztof Szarkowicz sent me on the specifics of the Junos implementation:

Four-byte ASNs can be used with EVPN (from Junos 13.x), including within an EVPN RT community, where the AS number takes four bytes, leaving two bytes for the VPN ID. Here is an example of an EVPN advertisement with a four-byte ASN from a Junos EVPN PE:

root@R1# run show route advertising-protocol bgp 192.168.0.4 table RI-EVPN-1.evpn.0 detail    
W15: Sun 2018-04-15T16:32:18 CEST (UTC+0200)
 
RI-EVPN-1.evpn.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
* 1:192.168.0.1:201::0201222222000000::0/192 AD/EVI (1 entry, 1 announced)
 BGP group IBGP-TO-RR type Internal
     Route Distinguisher: 192.168.0.1:201
     Route Label: 37
     Nexthop: Self
     Localpref: 100
     AS path: [4200000000] I ← Local AS is 4200000000
     Communities: target:4200000000L:1111 ← RT with 4-byte ASN
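
To illustrate how the four-byte AS number and the two-byte VPN ID share the RT community, here is a minimal Python sketch that packs the route target from the above printout into the eight-byte extended community format from RFC 5668 (type 0x02, sub-type 0x02). The function name is hypothetical; this is not how Junos builds the community internally.

```python
import struct


def encode_rt_4byte_as(asn: int, vpn_id: int) -> bytes:
    """Pack a route target as a transitive four-octet-AS-specific extended community."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN must fit in four bytes")
    if not 0 <= vpn_id <= 0xFFFF:
        raise ValueError("only two bytes are left for the VPN ID")
    # 0x02 = transitive four-octet AS specific type, 0x02 = route target sub-type
    return struct.pack("!BBIH", 0x02, 0x02, asn, vpn_id)


# target:4200000000L:1111 from the printout above
print(encode_rt_4byte_as(4200000000, 1111).hex())  # 0202fa56ea000457
```

The last two bytes (0x0457 = 1111) are all that remains for the VPN ID once the AS number occupies four bytes.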

However, 4-byte ASNs don’t currently (18.1) work with automatic RT. When automatic RT is used together with a 4-byte ASN, the 16 least significant bits of the ASN are used to populate the automatic RT, e.g.:

* 2:192.168.0.1:201::201::00:11:11:11:11:11/304 MAC/IP (1 entry, 1 announced)
 BGP group IBGP-TO-RR type Internal
     Route Distinguisher: 192.168.0.1:201
     Route Label: 81
     ESI: 00:11:22:33:44:55:66:00:00:00
     Nexthop: Self
     Localpref: 100
     AS path: [4200000000] I ← Local AS is 4200000000
     Communities: target:59904:201 mac-mobility:0x1:sticky (sequence 0). ← 59904 is used in automatic RT

Where:

4200000000:    0xFA56EA00
     59904:        0xEA00
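
The truncation can be reproduced with a bit mask. The sketch below is plain arithmetic on the values from the printout; the function name is hypothetical, and the second ASN is an assumed example chosen only to show that two different four-byte ASNs can yield the same 16-bit value.

```python
def auto_rt_as_field(asn: int) -> int:
    """Keep only the 16 least significant bits of a four-byte ASN."""
    return asn & 0xFFFF


asn = 4200000000
print(hex(asn))                    # 0xfa56ea00
print(auto_rt_as_field(asn))       # 59904
print(hex(auto_rt_as_field(asn)))  # 0xea00

# Assumed second ASN: it differs only in the upper 16 bits,
# so it produces the same AS field in the automatic RT.
print(auto_rt_as_field(4200000000 + 65536))  # 59904
```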

More Details on Automatic Route Targets in EBGP Environment

Automatic RT has worked on MX platforms since Junos 18.1. The ASN part is taken from the global AS configuration (‘set routing-options autonomous-system <ASN>’).

In case of four-byte ASN, only 16 least significant bits are taken from global AS to generate automatic RT. In Junos, EBGP IP fabric configuration with automatic RT is quite simple, since we can define different local AS for iBGP (EVPN Overlay, RT autogeneration) and eBGP (Underlay).

Also, there is one more aspect that is important from the RT perspective. EVPN Type 4 routes have auto-generated RTs encoded as ES-Import RT communities (RFC 7432, Section 7.6). Here’s an example:

* 4:192.168.0.1:0::112233445566000000:192.168.0.1/296 ES (1 entry, 1 announced) ← encoded ESI: Type 0
 BGP group IBGP-TO-RR type Internal
     Route Distinguisher: 192.168.0.1:0
     Nexthop: Self
     Localpref: 100
     AS path: [4200000000] I
     Communities: es-import-target: 11-22-33-44-55-66 ← ES-Import RT

These ES-Import RTs are generated from the ESI, e.g.:

set interfaces ge-0/0/4 unit 201 esi 00:11:22:33:44:55:66:00:00:00

An ESI has 10 bytes; an ES-Import RT has space for a 6-byte payload (useful info). The first byte of the ESI is the ESI Type (e.g. Type 0: manually configured), so the six bytes that follow the Type byte (the most significant bytes of the ESI payload) are used to generate the ES-Import RT. In the above example, 11:22:33:44:55:66 is taken from the configured ESI and used to populate the advertised ES-Import RT. The ES-Import RT enables all the PEs connected to the same multihomed site to import the Ethernet Segment (Type 4) routes.
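
The derivation described above can be sketched in a few lines of Python. The function name is hypothetical, and the code only mirrors the rule stated in the text (skip the Type byte, keep the next six bytes); it makes no claim about the actual Junos code.

```python
def es_import_from_esi(esi: str) -> str:
    """Derive the ES-Import RT value from a colon-separated 10-byte ESI."""
    octets = esi.split(":")
    if len(octets) != 10:
        raise ValueError("an ESI has 10 bytes")
    for octet in octets:
        if len(octet) != 2:
            raise ValueError("each byte must be two hex digits")
        int(octet, 16)  # raises ValueError on non-hex input
    # octets[0] is the ESI Type; the next six bytes become the ES-Import RT
    return "-".join(octets[1:7])


print(es_import_from_esi("00:11:22:33:44:55:66:00:00:00"))  # 11-22-33-44-55-66
```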

Now, if I have the following ESIs configured on some interfaces:

00:00:00:00:11:22:33:44:55:66, connected to PE1/PE2
00:00:00:00:11:22:33:44:55:77, connected to PE2/PE3
00:00:00:00:11:22:33:44:55:88, connected to PE3/PE1 

the ES-Import RTs will be:

00:00:00:00:11:22:33:44:55:66 -> 00:00:00:11:22:33
00:00:00:00:11:22:33:44:55:77 -> 00:00:00:11:22:33
00:00:00:00:11:22:33:44:55:88 -> 00:00:00:11:22:33

That is, it is the same for all the above ESIs. All PEs will therefore import these Type 4 routes; later, each PE compares the ESI advertised in these Type 4 routes with the ESIs configured on its local interfaces, and if there is no match, the route will not be used.

So, while this is OK from the EVPN machinery perspective, it is not optimal. To optimize the control plane, differentiate the ESIs in the first 6 useful bytes (rather than in the last 3 bytes), so that the generated ES-Import RT prevents these routes from even being imported on PEs that don’t need them.
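
A small Python model makes the difference visible. It is a sketch under stated assumptions: function names are hypothetical, each PE is assumed to import exactly the ES-Import RTs derived from its locally configured ESIs, and the "renumbered" ESIs are made-up values that differ within the first six useful bytes.

```python
def es_import_rt(esi: str) -> str:
    """Six bytes that follow the ESI Type byte (the rule described in the text)."""
    octets = esi.split(":")
    if len(octets) != 10:
        raise ValueError("an ESI has 10 bytes")
    return ":".join(octets[1:7])


def unneeded_imports(segments: dict[str, set[str]]) -> int:
    """Count (ESI, PE) pairs where a PE imports Type 4 routes for an ESI it lacks."""
    all_pes = set().union(*segments.values())
    # Assumption: a PE imports the ES-Import RTs of its own ESIs, nothing else.
    imports = {
        pe: {es_import_rt(esi) for esi, pes in segments.items() if pe in pes}
        for pe in all_pes
    }
    return sum(
        1
        for esi, pes in segments.items()
        for pe in all_pes - pes
        if es_import_rt(esi) in imports[pe]
    )


# ESIs from the text: they differ only in the last byte
from_text = {
    "00:00:00:00:11:22:33:44:55:66": {"PE1", "PE2"},
    "00:00:00:00:11:22:33:44:55:77": {"PE2", "PE3"},
    "00:00:00:00:11:22:33:44:55:88": {"PE3", "PE1"},
}
# Hypothetical renumbering: the difference sits within the first six useful bytes
renumbered = {
    "00:11:22:33:44:55:66:00:00:00": {"PE1", "PE2"},
    "00:11:22:33:44:55:77:00:00:00": {"PE2", "PE3"},
    "00:11:22:33:44:55:88:00:00:00": {"PE3", "PE1"},
}

print(unneeded_imports(from_text))   # 3
print(unneeded_imports(renumbered))  # 0
```

With the ESIs from the text, every PE imports the Type 4 routes of the one segment it is not attached to; with the renumbered ESIs, the ES-Import RT filters them out before the ESI comparison is ever needed.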


15 comments:

  1. Juniper's solution seems to me like a bricolage with no duct tape.
  2. "In Junos, EBGP IP fabric configuration with automatic RT is quite simple, since we can define different local AS for iBGP (EVPN Overlay, RT autogeneration) and eBGP (Underlay)."
    It seems I've seen this before :)
    I still can't find any reason why EBGP-only EVPN design is viable. It looks like doing MPLS L3VPN Option B between all your PE routers instead of time-proven iBGP design. If you ever had to manage large-scale option-B peering, you know how fun this is.

    On a side note, I think it's worth pointing out that Type-4 ES-Import RT communities are a special case that applies only to Type-4 EVPN routes, don't really relate to the previous discussion, and don't use AS numbers.
    Replies
    1. "I still can't find any reason why EBGP-only EVPN design is viable" << and people who got it right (the FRR crowd) can't figure out why anyone would want to run IGBP on top of EBGP (potentially between the same boxes) and pretend each box belongs to two ASNs.

      In my personal (highly biased, as always) opinion, two designs make sense: IBGP-over-IGP, or EBGP only. There are corner cases that might require IGBP-over-EBGP (humongous number of prefixes on hundreds of switches), everything else seems like vendors fixing their implementation idiosyncrasies with "interesting" designs.
    2. s/IGBP/IBGP/gi - seems like I'm too old to get the acronyms right :(
    3. What exactly they got right in FRR that others didn't? (apart from not implementing proper multihoming yet)
    4. There's no right or wrong, it's always: it depends. His friends at this turtle company found a way to discover BGP neighbors through IPv6 RA messages. They also automatically derive RDs and RTs. Very revolutionary!
    5. Disclaimer: I made it my mission to help network engineers build stable networks, so I have a certain bias that's totally different from the perspective of people trying to sell hot new gizmos.

      "There's no right or wrong, it's always: it depends." << Agree. However, there is better or worse for a particular use case.

      "His friends at this turtle company found a way to discover BGP neighbors through IPv6 RA messages. They also automatically derive RDs and RTs. " << I prefer easy-to-understand configurations with as few parameters and data duplication as possible, and I keep wondering why nobody else bothered. You might prefer convoluted configurations and IP addresses that have to be managed somewhere. We might agree to disagree, but that means at least one of us is not completely rational (see https://en.wikipedia.org/wiki/Aumann%27s_agreement_theorem)

      "Very revolutionary!" << Revolutionary, Innovative and Disruptive is usually marketing bullshit, and it often means "you'll have to deal with a zillion bugs nobody found in our half-baked unicorn-powered **** yet". I prefer Useful and Reliable. As I said above, YMMV.
    6. After reading your comment I thought that FRR has really "got right" something special about EVPN that others didn't, which I'm not aware of. But unfortunately after digging into Cumulus EVPN documentation I didn't find any special magic sauce.

      BGP Unnumbered is just a fancy way of establishing a BGP session. We might argue about its usefulness - some people might say that it's the best thing since sliced bread, but others might object that it just hides complexity from the end user. "I would love to see how you explain this to the guy that has to do troubleshooting @ 2AM on Sunday morning" - I've been told before. I agree that this feature can really help you drastically simplify configuration ( == simplify your automation during initial deployment), but that's only half of the picture. Anyway, in the end it's just a simple BGP session, nothing special about it. I think it's a little aside from the e/i BGP design topic.

      "They also automatically derive RDs and RTs." - Nothing special here, according to CL3.6 EVPN documentation everything is done exactly the same as in Junos. All major vendors do this already.

      I keep wondering what you'll say when FRR guys finally implement iBGP+eBGP design option in their code and start presenting it as another revolution.
    7. Hi Alex,

      Between the lines your comments sound like "you became a Cumulus fanboy". As you might know, I practice equal-opportunity snarkiness (= treating every vendor doing something suboptimal the same way). This time Cumulus happens to have the cleanest implementation.

      More in a separate blog post. In the meantime, try to figure out what's really wrong with the IBGP-over-EBGP approach and why Juniper pushes it. As a bonus challenge: figure out why Cisco pushes running EVPN EBGP sessions between loopback interfaces.
    8. Hint: you'll find the answer to one of the questions in the blog post from May 2nd 2018
  3. "figure out why Cisco pushes running EVPN EBGP sessions between loopback interfaces."

    Is this not because you want to be able to use ECMP? If you use the interface address and your network looks like:

    A ----- B
    | \ / |
    ToR1 ToR2
    \ /
    Rack

    Then if A advertises the EVPN routes from ToR1 with the next hop as ToR1's interface to A, you won't be able to split traffic to ToR1 via both A and B (as sending it via A will be a shorter path). So instead, you use the loopback address of the nodes for the eBGP sessions.

    Is that not true?
    Replies
    1. Cisco is doing ECMP based on IGP (OSPF or IS-IS). But IGPs don't scale well (as you might have heard) except for RIFT and Openfabric. The others are trying to do ECMP based on BGP.
    2. trying and succeeding i might add
  4. Uhm, is it just me or is this whole "EVPN w/IBGP or EBGP" debate solved the same way you'd scale up any other large network?

    - each, say, metro area (but IGP area choice is arbitrary depending on the situation; maybe it's datacenters, maybe countries/regions/whatever) is an ASN, within which you have a single IGP area & IBGP "core full-mesh/edge RR client" hierarchy. IGP is the limit on scale, which to stay quick & reactive you want to keep comfortably under 1000 nodes per IGP domain (however well-tuned link-state IGPs will scale well beyond this, especially if the worst-case latency within the area is low).

    - between e.g. metro areas (again, IGP area choice depends on requirements) you run EBGP (or BGP confeds, which amounts to pretty much the same thing) between ASBRs.
    Replies
    1. Life would be as easy as you describe it (and you're absolutely right) if only:

      * People wouldn't start using BGP as an IGP replacement, and then inventing crazy schemes like EBGP-over-EBGP or even worse IBGP-over-EBGP when they hit a wall;
      * VXLAN ASICs would work like MPLS where you can stitch labels at the AS boundary (Inter-AS Option B). Only Cisco can do VXLAN-to-VXLAN bridging, everyone else requires the next hop to remain unchanged.

      Finally, there's the "small" problem of automatic RT/RD - great idea, until someone rips out IBGP-over-IGP (the scenario for which EVPN was designed) and plugs in hop-by-hop EBGP (where every PE-switch has a different AS).