Using 4-Byte BGP AS Numbers With EVPN on Junos
After documenting the basic challenges of using EBGP and 4-byte AS numbers with EVPN automatic route targets, I asked my friends working for various vendors how their implementations solve these challenges. This is what Krzysztof Szarkowicz sent me on the specifics of the Junos implementation:
Four-byte ASNs can be used with EVPN (from Junos 13.x), including within an EVPN RT community, where the AS number takes four bytes, leaving two bytes for the VPN ID. Here is an example of an EVPN advertisement with a four-byte ASN from a Junos EVPN PE:
root@R1# run show route advertising-protocol bgp 192.168.0.4 table RI-EVPN-1.evpn.0 detail
W15: Sun 2018-04-15T16:32:18 CEST (UTC+0200)
RI-EVPN-1.evpn.0: 4 destinations, 4 routes (4 active, 0 holddown, 0 hidden)
* 1:192.168.0.1:201::0201222222000000::0/192 AD/EVI (1 entry, 1 announced)
BGP group IBGP-TO-RR type Internal
Route Distinguisher: 192.168.0.1:201
Route Label: 37
Nexthop: Self
Localpref: 100
AS path: [4200000000] I ← Local AS is 4200000000
Communities: target:4200000000L:1111 ← RT with 4-byte ASN
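To make the byte layout concrete, here is a minimal Python sketch (not Junos code) that packs the RT shown above the way RFC 5668 describes a four-octet-AS-specific extended community: one type byte, one sub-type byte, four bytes of ASN, and two bytes of local administrator (the VPN ID). The function name is made up for illustration.

```python
import struct


def encode_rt_4byte_as(asn: int, local_admin: int) -> bytes:
    """Pack a route target as a 4-octet-AS-specific extended community.

    Hypothetical helper for illustration; layout follows RFC 5668.
    """
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN must fit in 4 bytes")
    if not 0 <= local_admin <= 0xFFFF:
        raise ValueError("local administrator must fit in 2 bytes")
    # Type 0x02 = transitive 4-octet-AS-specific, sub-type 0x02 = route target.
    # "!" selects network byte order without padding: 1 + 1 + 4 + 2 = 8 bytes.
    return struct.pack("!BBIH", 0x02, 0x02, asn, local_admin)


rt = encode_rt_4byte_as(4200000000, 1111)
print(len(rt), rt.hex())
```

With the four-byte ASN occupying most of the value field, anything larger than 65535 no longer fits into the local administrator part, which is why only two bytes remain for the VPN ID.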
However, 4B ASN doesn’t currently (18.1) work with automatic RT. When automatic RT is used together with a 4B ASN, the 16 least significant bits of the 4B ASN are used to populate the automatic RT, e.g.:
* 2:192.168.0.1:201::201::00:11:11:11:11:11/304 MAC/IP (1 entry, 1 announced)
BGP group IBGP-TO-RR type Internal
Route Distinguisher: 192.168.0.1:201
Route Label: 81
ESI: 00:11:22:33:44:55:66:00:00:00
Nexthop: Self
Localpref: 100
AS path: [4200000000] I ← Local AS is 4200000000
Communities: target:59904:201 mac-mobility:0x1:sticky (sequence 0) ← 59904 is used in automatic RT
Where:
4200000000: 0xFA56EA00
59904: 0xEA00
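The truncation is easy to reproduce. The following Python sketch only mimics the behavior observed in the output above (keep the low 16 bits of the ASN, append the VPN ID); the function name is hypothetical, and this is not a description of how Junos implements it internally.

```python
def auto_rt_observed(asn: int, vpn_id: int) -> str:
    """Mimic the automatic RT seen in the output above (illustrative only)."""
    as16 = asn & 0xFFFF  # keep only the 16 least significant bits of the ASN
    return f"target:{as16}:{vpn_id}"


print(hex(4200000000))                     # full 4-byte ASN in hex
print(auto_rt_observed(4200000000, 201))   # matches the RT in the output above
# Any two 4-byte ASNs that share their low 16 bits collapse to the same value:
print(auto_rt_observed(4200000000 + 65536, 201))
```

The last line is the part worth keeping in mind: as long as only the low 16 bits are used, different four-byte ASNs can produce identical automatic RTs.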
More Details on Automatic Route Targets in EBGP Environment
Automatic RT has worked on MX platforms since Junos 18.1. The ASN part is taken from the global AS configuration (‘set routing-options autonomous-system <ASN>’).
With a four-byte ASN, only the 16 least significant bits of the global AS are used to generate the automatic RT. In Junos, an EBGP IP fabric configuration with automatic RT is still quite simple, since we can define different local AS numbers for iBGP (EVPN overlay, RT autogeneration) and eBGP (underlay).
There is one more aspect that is important from the RT perspective. EVPN Type 4 routes have auto-generated RTs encoded as ES-Import RT communities (RFC 7432, Section 7.6). Here’s an example:
* 4:192.168.0.1:0::112233445566000000:192.168.0.1/296 ES (1 entry, 1 announced) ← encoded ESI: Type 0
BGP group IBGP-TO-RR type Internal
Route Distinguisher: 192.168.0.1:0
Nexthop: Self
Localpref: 100
AS path: [4200000000] I
Communities: es-import-target: 11-22-33-44-55-66 ← ES-Import RT
These ES-Import RTs are generated from the ESI, e.g.:
set interfaces ge-0/0/4 unit 201 esi 00:11:22:33:44:55:66:00:00:00
An ESI has 10 bytes; an ES-Import RT has space for 6 bytes of payload (useful info). The first byte of the ESI is the ESI Type (e.g. Type 0: manually configured), and the remaining 9 bytes are the ESI payload. The 6 most significant bytes of that payload are taken to generate the ES-Import RT. Therefore, in the above example, 11:22:33:44:55:66 is taken from the configured ESI and used to populate the advertised ES-Import RT. The ES-Import RT enables all the PEs connected to the same multihomed site to import the Ethernet Segment (Type 4) routes.
Now, if I have the following ESIs configured on some interfaces:
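As a quick illustration, the derivation described above can be sketched in a few lines of Python. This is a toy model of the rule (skip the type byte, keep the next six bytes), with a made-up function name and the output formatted the way Junos prints es-import-target in the example above.

```python
def es_import_rt(esi: str) -> str:
    """Derive the ES-Import RT value from a colon-separated 10-byte ESI."""
    octets = [int(part, 16) for part in esi.split(":")]
    if len(octets) != 10 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError("ESI must be 10 colon-separated hex bytes")
    # octets[0] is the ESI type; octets[1:7] are the six most significant
    # payload bytes that fit into the ES-Import RT.
    return "-".join(f"{o:02x}" for o in octets[1:7])


print(es_import_rt("00:11:22:33:44:55:66:00:00:00"))
```

For the ESI configured on ge-0/0/4 this yields the same 11-22-33-44-55-66 value that appears in the advertised Type 4 route.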
00:00:00:00:11:22:33:44:55:66, connected to PE1/PE2
00:00:00:00:11:22:33:44:55:77, connected to PE2/PE3
00:00:00:00:11:22:33:44:55:88, connected to PE3/PE1
the ES-Import RTs will be:
00:00:00:00:11:22:33:44:55:66 → 00:00:00:11:22:33
00:00:00:00:11:22:33:44:55:77 → 00:00:00:11:22:33
00:00:00:00:11:22:33:44:55:88 → 00:00:00:11:22:33
That is, the ES-Import RT is the same for all of the above ESIs. All PEs will therefore import these Type 4 routes, and only later compare the actual ESI advertised in each Type 4 route with the ESIs configured on local interfaces. If there is no match, the route will not be used.
So, while this is OK from the perspective of EVPN machinery, it is not optimal. To optimize the control plane, make the ESIs differ in the first 6 payload bytes (rather than only in the last 3 bytes), so that the generated ES-Import RT prevents these routes from even being imported on PEs that don’t need them.
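To see the difference between the two numbering schemes, here is a small self-contained Python sketch that works out which PEs would import Type 4 routes for each ES-Import RT. The first plan uses the ESIs from the example above; the second plan, with the distinguishing byte moved into the first six payload bytes, is a hypothetical alternative I made up for comparison. Function names are illustrative as well.

```python
from collections import defaultdict
from typing import Dict, Set


def es_import_rt(esi: str) -> str:
    """Skip the ESI type byte and keep the next six payload bytes."""
    octets = esi.split(":")
    if len(octets) != 10:
        raise ValueError("ESI must have 10 bytes")
    return "-".join(o.lower() for o in octets[1:7])


def importers_per_rt(plan: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each ES-Import RT to all PEs that would import routes carrying it."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for esi, pes in plan.items():
        result[es_import_rt(esi)] |= pes
    return dict(result)


# ESIs from the example above: they differ only in the last byte.
clashing = {
    "00:00:00:00:11:22:33:44:55:66": {"PE1", "PE2"},
    "00:00:00:00:11:22:33:44:55:77": {"PE2", "PE3"},
    "00:00:00:00:11:22:33:44:55:88": {"PE3", "PE1"},
}
# Hypothetical alternative: the ESIs differ within the first six payload bytes.
distinct = {
    "00:11:22:33:44:55:66:00:00:00": {"PE1", "PE2"},
    "00:11:22:33:44:55:77:00:00:00": {"PE2", "PE3"},
    "00:11:22:33:44:55:88:00:00:00": {"PE3", "PE1"},
}

for name, plan in (("clashing", clashing), ("distinct", distinct)):
    for rt, pes in sorted(importers_per_rt(plan).items()):
        print(name, rt, sorted(pes))
```

In the first plan all three PEs end up importing every Type 4 route through a single shared RT; in the second plan each RT is imported only by the two PEs attached to that Ethernet segment.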
It seems I've seen this before :)
I still can't find any reason why an EBGP-only EVPN design is viable. It looks like doing MPLS L3VPN Option B between all your PE routers instead of the time-proven iBGP design. If you ever had to manage large-scale Option B peering, you know how much fun that is.
On a side note, I think it's worth pointing out that the ES-Import RT community is a special case that applies only to Type-4 EVPN routes; it doesn't really relate to the previous discussion and doesn't use AS numbers.
In my personal (highly biased, as always) opinion, two designs make sense: IBGP-over-IGP, or EBGP only. There are corner cases that might require IBGP-over-EBGP (a humongous number of prefixes on hundreds of switches); everything else seems like vendors fixing their implementation idiosyncrasies with "interesting" designs.
"There's no right or wrong, it's always: it depends." << Agree. However, there is better or worse for a particular use case.
"His friends at this turtle company found a way to discover BGP neighbors through IPv6 RA messages. They also automatically derive RDs and RTs. " << I prefer easy-to-understand configurations with as few parameters and data duplication as possible, and I keep wondering why nobody else bothered. You might prefer convoluted configurations and IP addresses that have to be managed somewhere. We might agree to disagree, but that means at least one of us is not completely rational (see https://en.wikipedia.org/wiki/Aumann%27s_agreement_theorem)
"Very revolutionary!" << Revolutionary, Innovative and Disruptive is usually marketing bullshit, and it often means "you'll have to deal with a zillion bugs nobody found in our half-baked unicorn-powered **** yet". I prefer Useful and Reliable. As I said above, YMMV.
BGP Unnumbered is just a fancy way of establishing a BGP session. We might argue about its usefulness: some people might say that it's the best thing since sliced bread, while others might object that it just hides complexity from the end user. "I would love to see how you explain this to the guy that has to do troubleshooting @ 2AM on Sunday morning", as I've been told before. I agree that this feature can really help you drastically simplify configuration ( == simplify your automation during initial deployment), but that's only half of the picture. Anyway, in the end it's just a simple BGP session, nothing special about it. I think it's a bit off the topic of e/i BGP design.
"They also automatically derive RDs and RTs." - Nothing special here; according to the CL3.6 EVPN documentation, everything is done exactly the same way as in Junos. All major vendors do this already.
I keep wondering what you'll say when FRR guys finally implement iBGP+eBGP design option in their code and start presenting it as another revolution.
Between the lines your comments sound like "you became a Cumulus fanboy". As you might know, I practice equal-opportunity snarkiness (= treating every vendor doing something suboptimal the same way). This time Cumulus happens to have the cleanest implementation.
More in a separate blog post. In the meantime, try to figure out what's really wrong with the IBGP-over-EBGP approach and why Juniper pushes it. As a bonus challenge: figure out why Cisco pushes running EVPN EBGP sessions between loopback interfaces.
Is this not because you want to be able to use ECMP? If you use the interface address and your network looks like:
A ----- B
| \ / |
ToR1 ToR2
\ /
Rack
Then if A advertises the EVPN routes from ToR1 with the next hop as ToR1's interface to A, you won't be able to split traffic to ToR1 via both A and B (as sending it via A will be a shorter path). So instead, you use the loopback address of the nodes for the eBGP sessions.
Is that not true?
- each, say, metro area (but IGP area choice is arbitrary depending on the situation; maybe it's datacenters, maybe countries/regions/whatever) is an ASN, within which you have a single IGP area & IBGP "core full-mesh/edge RR client" hierarchy. IGP is the limit on scale, and to stay quick & reactive you want to keep it comfortably under 1000 nodes per IGP domain (however, well-tuned link-state IGPs will scale well beyond this, especially if the worst-case latency within the area is low).
- between e.g. metro areas (again, IGP area choice depends on requirements) you run EBGP (or BGP confeds, which amounts to pretty much the same thing) between ASBRs.
* People wouldn't start using BGP as an IGP replacement, and then inventing crazy schemes like EBGP-over-EBGP or even worse IBGP-over-EBGP when they hit a wall;
* VXLAN ASICs would work like MPLS where you can stitch labels at the AS boundary (Inter-AS Option B). Only Cisco can do VXLAN-to-VXLAN bridging, everyone else requires the next hop to remain unchanged.
Finally, there's the "small" problem of automatic RT/RD - great idea, until someone rips out IBGP-over-IGP (the scenario for which EVPN was designed) and plugs in hop-by-hop EBGP (where every PE-switch has a different AS).