Layer-3-Only EVPN: Behind the Scenes

In the previous blog post, I described how to build a lab to explore the layer-3-only EVPN design and asked you to do that and figure out what’s going on behind the scenes. If you didn’t find time for that, let’s do it together in this blog post. To keep it reasonably short, we’ll focus on the EVPN control plane and leave the exploration of the data-plane data structures for another blog post.

The most important thing to understand when analyzing a layer-3-only EVPN/VXLAN network is that the data plane looks like a VRF-lite design: each VRF uses a hidden VLAN (implemented with VXLAN) as the transport VLAN between the PE devices.

Don’t believe me? Let’s start the lab (using Arista cEOS containers) and check whether that’s true ;)

Device Configuration

Before starting our journey, let’s review the relevant parts of the Arista cEOS device configuration (taken from S1):

vrf instance blue
   rd 65000:2
!
vrf instance red
   rd 65000:1
!
interface Ethernet1
   description s1 -> s2
   mtu 1600
   mac-address 52:dc:ca:fe:01:01
   no switchport
   ip address 10.1.0.1/30
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet2
   description s1 -> h1 [stub]
   mac-address 52:dc:ca:fe:01:02
   no switchport
   vrf red
   ip address 172.16.0.1/24
!
interface Ethernet3
   description s1 -> h3 [stub]
   mac-address 52:dc:ca:fe:01:03
   no switchport
   vrf blue
   ip address 172.16.2.1/24
!
interface Loopback0
   ip address 10.0.0.1/32
   ip ospf area 0.0.0.0
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vrf blue vni 200001
   vxlan vrf red vni 200000
!
ip routing
ip routing vrf blue
ip routing vrf red
!
router bgp 65000
   router-id 10.0.0.1
   no bgp default ipv4-unicast
   bgp advertise-inactive
   neighbor 10.0.0.2 remote-as 65000
   neighbor 10.0.0.2 update-source Loopback0
   neighbor 10.0.0.2 description s2
   neighbor 10.0.0.2 send-community standard extended large
   !
   address-family evpn
      neighbor 10.0.0.2 activate
   !
   !
   vrf blue
      rd 65000:2
      route-target import evpn 65000:2
      route-target export evpn 65000:2
      redistribute connected
   !
   vrf red
      rd 65000:1
      route-target import evpn 65000:1
      route-target export evpn 65000:1
      redistribute connected
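
Here’s what the individual parts of the configuration do (line numbers count from the first line of the printout):

  • Lines 1-5 (vrf instance): We have two VRFs (red and blue).
  • Lines 7-14 (Ethernet1): The link between the switches is a P2P link with a larger MTU (to make room for the VXLAN overhead). We’re running OSPF in point-to-point mode on it to speed up convergence.
  • Lines 16-28 (Ethernet2 and Ethernet3): The host-to-switch links are layer-3 links in VRFs red and blue.
  • Lines 37-38 (vxlan vrf ... vni): We need a transit VNI for each VRF (blue and red).
  • Lines 44-51 (router bgp): We have an IBGP neighbor.
  • Lines 53-54 (address-family evpn): We’re exchanging EVPN routes with that IBGP neighbor.
  • Lines 57-67 (vrf blue and vrf red under router bgp): We’re defining RD/RT values for the two VRFs and redistributing connected subnets into BGP.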
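
If you want to sanity-check that every VRF has all the bits it needs (a transit VNI, an RD, and import/export route targets), a few lines of Python are enough. This is a minimal sketch, not an EOS parser: the summarize_vrfs function and the trimmed CONFIG string are made up for illustration, and the code only understands the handful of commands used in this lab.

```python
# Trimmed-down copy of the S1 configuration (only the VRF-related commands).
CONFIG = """\
vrf instance blue
vrf instance red
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan vrf blue vni 200001
   vxlan vrf red vni 200000
router bgp 65000
   vrf blue
      rd 65000:2
      route-target import evpn 65000:2
      route-target export evpn 65000:2
   vrf red
      rd 65000:1
      route-target import evpn 65000:1
      route-target export evpn 65000:1
"""


def summarize_vrfs(config):
    """Collect the transit VNI, RD, and route targets configured per VRF."""
    vrfs = {}

    def entry(name):
        return vrfs.setdefault(
            name, {"vni": None, "rd": None, "import": [], "export": []})

    in_bgp, current = False, None
    for line in config.splitlines():
        words = line.split()
        if not words or words[0] == "!":
            continue
        if not line.startswith(" "):                 # top-level command
            in_bgp = words[:2] == ["router", "bgp"]
            current = None
            if words[:2] == ["vrf", "instance"]:
                entry(words[2])
        elif len(words) == 5 and words[:2] == ["vxlan", "vrf"]:
            entry(words[2])["vni"] = int(words[4])   # vxlan vrf X vni N
        elif in_bgp and words[0] == "vrf":
            current = entry(words[1])                # vrf stanza under BGP
        elif in_bgp and current is not None and words[0] == "rd":
            current["rd"] = words[1]
        elif (in_bgp and current is not None and len(words) == 4
              and words[0] == "route-target"
              and words[1] in ("import", "export")):
            current[words[1]].append(words[3])
    return vrfs


vrfs = summarize_vrfs(CONFIG)
incomplete = [name for name, v in vrfs.items()
              if v["vni"] is None or v["rd"] is None
              or not v["import"] or not v["export"]]
print(vrfs)
print("incomplete VRFs:", incomplete)
```

An empty list of incomplete VRFs means every VRF has a transit VNI, an RD, and both route targets.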

Let’s Start Exploring

Using the show vlan command, it’s trivial to confirm that the switches use two VLANs (one per EVPN transit VNI) on the VXLAN interface:

s1>show vlan
VLAN  Name                             Status    Ports
----- -------------------------------- --------- -------------------------------
1     default                          active    Mt1
4093* VLAN4093                         active    Cpu, Vx1
4094* VLAN4094                         active    Cpu, Vx1

* indicates a Dynamic VLAN

Want to check that these are VXLAN-backed VLANs? Sure (notice how the VNI values match the values we defined in the vxlan vrf vni command):

s1>show vxlan vni
VNI to VLAN Mapping for Vxlan1
VNI       VLAN       Source       Interface       802.1Q Tag
--------- ---------- ------------ --------------- ----------

VNI to dynamic VLAN Mapping for Vxlan1
VNI          VLAN       VRF        Source
------------ ---------- ---------- ------------
200000       4094       red        evpn
200001       4093       blue       evpn
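
The dynamic part of that printout is the VRF-to-VNI-to-VLAN mapping the rest of this post relies on. Here’s a rough sketch that turns it into a Python dictionary; the dynamic_vlans function is an illustrative helper and assumes the column layout shown above.

```python
SHOW_VXLAN_VNI = """\
VNI to VLAN Mapping for Vxlan1
VNI       VLAN       Source       Interface       802.1Q Tag
--------- ---------- ------------ --------------- ----------

VNI to dynamic VLAN Mapping for Vxlan1
VNI          VLAN       VRF        Source
------------ ---------- ---------- ------------
200000       4094       red        evpn
200001       4093       blue       evpn
"""


def dynamic_vlans(output):
    """Return {vrf: (vni, vlan)} from the dynamic-VLAN section of the output."""
    mapping = {}
    in_dynamic = False
    for line in output.splitlines():
        if line.startswith("VNI to dynamic VLAN Mapping"):
            in_dynamic = True
            continue
        fields = line.split()
        # Data rows have four columns and start with a numeric VNI.
        if in_dynamic and len(fields) == 4 and fields[0].isdigit():
            vni, vlan, vrf, _source = fields
            mapping[vrf] = (int(vni), int(vlan))
    return mapping


print(dynamic_vlans(SHOW_VXLAN_VNI))
```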

Now we know what the data-plane topology looks like. Next, let’s focus on the forwarding tables:

s1>show ip route vrf red bgp
...

 B I      172.16.1.0/24 [200/0]
           via VTEP 10.0.0.2 VNI 200000 router-mac 00:1c:73:eb:d5:13 local-interface Vxlan1

As expected, the BGP (EVPN) route for 172.16.1.0/24 uses the VXLAN interface and the next-hop MAC address 00:1c:73:eb:d5:13. There is no next-hop IP address (apart from the remote VTEP), as the switches don’t assign IP addresses to the transit VLAN.

Where does S1 get the remote MAC address? Glad you asked ;) Let’s explore the EVPN routes for the red VRF:

s1#show bgp evpn rd 65000:1 detail
BGP routing table information for VRF default
Router identifier 10.0.0.1, local AS number 65000
BGP routing table entry for ip-prefix 172.16.0.0/24, Route Distinguisher: 65000:1
 Paths: 1 available
  Local
    - from - (0.0.0.0)
      Origin IGP, metric -, localpref -, weight 0, tag 0, valid, local, best, redistributed (Connected)
      Extended Community: Route-Target-AS:65000:1 TunnelEncap:tunnelTypeVxlan EvpnRouterMac:00:1c:73:ff:68:31
      VNI: 200000
BGP routing table entry for ip-prefix 172.16.1.0/24, Route Distinguisher: 65000:1
 Paths: 1 available
  Local
    10.0.0.2 from 10.0.0.2 (10.0.0.2)
      Origin IGP, metric -, localpref 100, weight 0, tag 0, valid, internal, best
      Extended Community: Route-Target-AS:65000:1 TunnelEncap:tunnelTypeVxlan EvpnRouterMac:00:1c:73:eb:d5:13
      VNI: 200000

The EVPN transit tunnel encapsulation (VXLAN) and the remote router MAC address are encoded as extended BGP communities in every EVPN RT5 (IP prefix) update, while the transit VNI is carried in the label field of the route itself.
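
To make it obvious which pieces of an RT5 route the ingress switch cares about, here’s a minimal sketch that pulls them out of the printout for 172.16.1.0/24. The parse_rt5 function is a made-up helper that assumes the exact text layout shown above; it is not an Arista API.

```python
import re

ROUTE_DETAIL = """\
BGP routing table entry for ip-prefix 172.16.1.0/24, Route Distinguisher: 65000:1
 Paths: 1 available
  Local
    10.0.0.2 from 10.0.0.2 (10.0.0.2)
      Origin IGP, metric -, localpref 100, weight 0, tag 0, valid, internal, best
      Extended Community: Route-Target-AS:65000:1 TunnelEncap:tunnelTypeVxlan EvpnRouterMac:00:1c:73:eb:d5:13
      VNI: 200000
"""


def parse_rt5(text):
    """Extract the forwarding-relevant attributes of a single RT5 route."""
    prefix = re.search(r"ip-prefix (\S+),", text).group(1)
    # The BGP next hop is the remote VTEP.
    vtep = re.search(r"^[ \t]+(\S+) from ", text, re.M).group(1)
    communities = re.search(r"Extended Community: (.*)", text).group(1).split()
    # Split every community on the first colon only: the values contain colons.
    attrs = dict(c.split(":", 1) for c in communities)
    return {
        "prefix": prefix,
        "vtep": vtep,
        "route_target": attrs["Route-Target-AS"],
        "encap": attrs["TunnelEncap"],
        "router_mac": attrs["EvpnRouterMac"],
        "vni": int(re.search(r"VNI: (\d+)", text).group(1)),
    }


print(parse_rt5(ROUTE_DETAIL))
```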

But why do we need the EVPN router MAC addresses on the transit VXLAN segment? Wouldn’t the VNI and remote VTEP be good enough? Unfortunately, VXLAN is nothing more than a transport mechanism that carries Ethernet frames over UDP, and every Ethernet frame must have a source- and a destination MAC address.
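
Here’s a minimal sketch of what ends up inside the UDP payload, following the VXLAN header layout from RFC 7348. The function names are illustrative, the outer IP and UDP headers are left out, and I’m assuming S1 uses its VLAN interface MAC address as the source MAC of the inner frame.

```python
import struct


def mac_bytes(mac):
    """Convert aa:bb:cc:dd:ee:ff into six raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def vxlan_header(vni):
    """Eight bytes: flags (0x08 = VNI present), 24 reserved bits,
    the 24-bit VNI, and 8 more reserved bits."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit into 24 bits")
    return struct.pack("!II", 0x08 << 24, vni << 8)


def inner_frame(dst_mac, src_mac, ip_packet):
    """Untagged Ethernet frame carrying an IPv4 packet (ethertype 0x0800)."""
    return (mac_bytes(dst_mac) + mac_bytes(src_mac)
            + struct.pack("!H", 0x0800) + ip_packet)


def vxlan_payload(vni, dst_mac, src_mac, ip_packet):
    """The UDP payload of a VXLAN packet: VXLAN header plus the inner frame."""
    return vxlan_header(vni) + inner_frame(dst_mac, src_mac, ip_packet)


# S1 -> S2 in VRF red: the destination MAC is the router MAC from the RT5 route.
dummy_ip_packet = b"\x45" + bytes(19)        # placeholder 20-byte IPv4 header
payload = vxlan_payload(200000, "00:1c:73:eb:d5:13",
                        "00:1c:73:ff:68:31", dummy_ip_packet)
print(payload[:8].hex(), payload[8:14].hex(), payload[14:20].hex())
```

Without the router MAC from the RT5 route, S1 would have nothing to put into the first six bytes of the inner frame.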

Want to see those MAC addresses in the forwarding tables? Let’s explore the VXLAN- and VLAN MAC address tables:

s1#show vxlan address-table
          Vxlan Mac Address Table
----------------------------------------------------------------------

VLAN  Mac Address     Type      Prt  VTEP             Moves   Last Move
----  -----------     ----      ---  ----             -----   ---------
4093  001c.73eb.d513  EVPN      Vx1  10.0.0.2         1       2:16:21 ago
4094  001c.73eb.d513  EVPN      Vx1  10.0.0.2         1       2:16:21 ago
s1#show mac address-table
          Mac Address Table
------------------------------------------------------------------

Vlan    Mac Address       Type        Ports      Moves   Last Move
----    -----------       ----        -----      -----   ---------
4093    001c.73eb.d513    DYNAMIC     Vx1        1       2:19:02 ago
4094    001c.73eb.d513    DYNAMIC     Vx1        1       2:19:01 ago

But where do the local MAC addresses come from? As you know, every VLAN (including the internal VLANs) has an associated VLAN interface on a layer-3 switch:

s1#show interfaces vlan4093
Vlan4093 is up, line protocol is up (connected)
  Hardware is Vlan, address is 001c.73ff.6831 (bia 001c.73ff.6831)
  No Internet protocol address assigned
  IPv6 link-local address is fe80::21c:73ff:feff:6831/64
  No IPv6 global unicast address is assigned
  IP MTU 9164 bytes (default)
  Up 2 hours, 20 minutes, 10 seconds
s1#show interfaces vlan4094
Vlan4094 is up, line protocol is up (connected)
  Hardware is Vlan, address is 001c.73ff.6831 (bia 001c.73ff.6831)
  No Internet protocol address assigned
  IPv6 link-local address is fe80::21c:73ff:feff:6831/64
  No IPv6 global unicast address is assigned
  IP MTU 9164 bytes (default)
  Up 2 hours, 20 minutes, 13 seconds
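
Both VLAN interfaces share the same MAC address, which is also why they have the same IPv6 link-local address: it’s derived from the MAC address using the modified EUI-64 procedure. A short sketch to verify that (link_local_from_mac is an illustrative name):

```python
import ipaddress


def link_local_from_mac(mac):
    """Build the fe80::/64 address from a MAC address (modified EUI-64)."""
    octets = bytearray(bytes.fromhex(mac.replace(".", "").replace(":", "")))
    octets[0] ^= 0x02                                 # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


print(link_local_from_mac("001c.73ff.6831"))
```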

To recap:

  • A layer-3 switch creates a VLAN for every VXLAN segment.
  • A local (switch) MAC address is assigned to every VLAN segment (VLAN interface) [1]. The same MAC address is usually used for all VLAN segments.
  • The local MAC address (as an extended BGP community) and the transit VNI are attached to every VRF IP prefix (RT5) EVPN route.
  • The remote MAC address and the associated transit VNI are used to build the forwarding entry on the ingress routers.
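
The recap can be sketched as a tiny data-structure model. Everything in it (Rt5Route, ForwardingEntry, install) is hypothetical and only mirrors the printouts in this post; it skips the route-target-based import into the VRF and has nothing to do with how EOS implements this internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rt5Route:
    prefix: str
    vtep: str          # BGP next hop, the remote VTEP
    vni: int           # transit VNI
    router_mac: str    # from the EvpnRouterMac extended community


@dataclass(frozen=True)
class ForwardingEntry:
    vlan: int          # local dynamic VLAN mapped to the transit VNI
    vtep: str
    vni: int
    dst_mac: str       # destination MAC address of the inner Ethernet frame


def install(route, vni_to_vlan, fib, mac_table):
    """Build the ingress forwarding state from a received RT5 route."""
    vlan = vni_to_vlan[route.vni]
    fib[route.prefix] = ForwardingEntry(vlan, route.vtep, route.vni,
                                        route.router_mac)
    # The remote router MAC also appears in the MAC table of the transit VLAN.
    mac_table[(vlan, route.router_mac)] = route.vtep


# Values taken from the S1 printouts.
vni_to_vlan = {200000: 4094, 200001: 4093}
fib, mac_table = {}, {}
install(Rt5Route("172.16.1.0/24", "10.0.0.2", 200000, "00:1c:73:eb:d5:13"),
        vni_to_vlan, fib, mac_table)
print(fib)
print(mac_table)
```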

Next: Using Multiple Transit VNIs per EVPN VRF


  1. And I have no idea what they use to make the lower 24 bits unique, particularly in a virtual environment where all the containers are cloned from the same image.

4 comments:

  1. how does this actually scale in larger environments or even existing environments where majority of the VLANs have already been consumed?

    ps. don't deploy vxlan, however i'd like to understand more

    Replies
    1. The only VLANs needed in this setup are the internal VLANs used for VXLAN-based transport (one per VRF). They are assigned dynamically (as needed) and don't have to match across the PE devices.

      However, if you deployed all 409x VLANs on every access switch, you have a bigger problem ;)

  2. Regarding "the transit VNI does not have to match across the PE devices" (which you are going to cover in a future post): I speculate there could be chipset limitations forcing the VNI to be the same, or "symmetric". This limitation is documented for Juniper's ACX7100, so is this the case for anything Jericho2-based?

    Replies
    1. The obvious answer is "We don't know" (thank you, Broadcom, we love your NDAs), but I suspect it might be more of a case of a vendor doing a less-than-optimal job (in the case of L3VPN, L2VPN is another story)

  3. AFAIK, not all the vendors require to have dynamic virtual VLANs to map a L3VNI.

    I.e., Aruba CX and Dell OS10 don't (and - despite I hope nobody's doing that - you can potentially use for your network all the VLAN IDs).

    Additionally, they also allow to statically set the router-mac to use on the L3VNI and RT5 announces (i.e. https://www.arubanetworks.com/techdocs/AOS-CX/10.08/HTML/vxlan/Content/Chp_EVPN/EVPN_cmds/vir-mac.htm)

    Also Linux (frr, vyos, ...) does not require dynamic vlans, but instead requires to allocate a dedicated bridge interface (with no bridge slaves, apart from the vxlan device which maps the vni id) to be used for r-macs.

    Replies
    1. > AFAIK, not all the vendors require to have dynamic virtual VLANs to map a L3VNI.

      Don't require or don't show them? Also, what they're doing in a VM might be different from what they're doing in ASIC.

      > Also Linux (frr, vyos, ...) does not require dynamic vlans, but instead requires to allocate a dedicated bridge interface

      Correct, but how does that map into ASIC setup on Cumulus Linux (or any other switch using FRRouting)?

    2. >> AFAIK, not all the vendors require to have dynamic virtual VLANs to map a L3VNI.

      > Don't require or don't show them? Also, what they're doing in a VM might be different from what they're doing in ASIC.

      Talking about (BCM) ASIC here.

      Let's put this way: they do not require a strict VLAN, allowing the user to use all the possible VLANs ID (I know this because I did a qualification testing using all the vlans at the same time plus vxlan distributed irb).

      But since you triggered my curiosity, I went into the BCM shell to check.

      It seems that BCM chipsets allow defining a VLAN ID > 4096, so the vendor uses those IDs for VXLAN operations. This is what I see from the BCM shell on a fabric leaf which has only L3 VNIs:

      SAI.0> l3 intf show
      Free L3INTF entries: 16369
      Unit  Intf  VRF Group VLAN    Source Mac     MTU TTL Tunnel InnerVlan  NATRealm
      ------------------------------------------------------------------
      0     1     -3    0     4095 1c:72:1d:bf:e8:80  16383 1    0     0     0
      0     2     -3    0     4095 1c:72:1d:bf:e8:99  1600 1    0     0     0
      0     3     -3    0     4095 1c:72:1d:bf:e8:9d  1600 1    0     0     0
      0     4     -3    0     4095 1c:72:1d:bf:e8:99  1600 1    0     0     0
      0     5     -3    0     4095 1c:72:1d:bf:e8:9d  1600 1    0     0     0
      0     4096  -3    0     28672 1c:72:1d:bf:e8:80  16383 1    0     0     0
      0     16376 2     0     28680 1c:72:1d:bf:e8:d3  9184 1    0     0     0
      0     16377 1     0     28679 1c:72:1d:bf:e8:d3  9184 1    0     0     0
      0     16378 5     0     28678 00:00:66:66:05:06  9184 1    0     0     0
      0     16379 4     0     28677 00:00:66:66:05:06  9184 1    0     0     0
      0     16380 3     0     28676 00:00:66:66:05:06  9184 1    0     0     0
      0     16381 2     0     28675 00:00:66:66:05:06  9184 1    0     0     0
      0     16382 1     0     28674 00:00:66:66:05:06  9184 1    0     0     0
      

      And 00:00:66:66:05:06 is the rMAC I statically defined on this leaf.

      So, a virtual interface, with VLAN id 28XXX is created for each VRF/L3 VNI, with the defined mac.

      But no VLAN space of the ASIC is used:

      SAI.0> vlan show
      vlan 1  ports xe1-xe23 (0x00000000000000000000000000000000000000000000000000007ff800001ffc), untagged xe1-xe23 (0x00000000000000000000000000000000000000000000000000007ff800001ffc) MCAST_FLOOD_UNKNOWN
      vlan 4095       ports ce,xe (0x0000000000000000000000000000000000000000000000000008fff800023ffe), untagged ce,xe (0x0000000000000000000000000000000000000000000000000008fff800023ffe) MCAST_FLOOD_UNKNOWN
      SAI.0>
      

      Then the virtual interface is used as egress operation for the VRF's routes. i.e.,

      SAI.0> l3 egress show 412291
      Entry  Mac                 Vlan INTF PORT MOD MPLS_LABEL ToCpu Drop RefCount L3MC
      412291  00:00:66:66:01:06 28675 16381     1    0        -1   no   no   37   no
      

      in the above case, a specific route is sent to the rmac 00:00:66:66:01:06, which is another leaf, using the virtual interface 16381.

      WRT Cumulus, if I'm not wrong it uses the Linux switchdev framework. Unfortunately it seems the documentation is not complete, so if a kernel programmer could have a look into it we can answer also that point ;)

    3. Wow. Thanks a million for that deep dive!

      As for Cumulus, it probably uses the Linux switchdev framework now that it only works on the Mellanox hardware 🤷‍♂️ (it used a proprietary translator between netlink and Broadcom SDK in the past)

  4. Hi Ivan,

    Minh Ha asked a question in a previous post I've linked at the bottom and I just happened to stumble into the same question. It comes from Hannes Gredler's "The Complete IS-IS Routing Protocol." The question specifically regards the potential loop that could be created if you actually ran a routing protocol over the tunnel interface. The author starts by saying,

    "Things behave really badly if the total IGP cost over the tunnel undermines the total topologies’ cost. What happens next is that the tunnel “wraps” around itself, ultimately causing a meltdown of the entire network."

    He finishes the paragraph with this: "Because no Hellos are sent down the tunnel there is no infinite recursion problem."

    I can't see why you couldn't run an IGP over the tunnel. But, of course, this creates redundant state, and with it more churn, so you shouldn't run a real IIH-based adjacency. The tunnel itself is based on other next-hops, so a forwarding adjacency forgoes a real IIH-based adjacency because we can tie the tunnel state to whether or not we have an IGP route to the endpoint.

    I've been sitting with pen and paper to try and make heads or tails of what he's saying here and I can't seem to wrap my brain around what he means.

    Any insight would be greatly appreciated!!!

    https://blog.ipspace.net/2020/08/worth-reading-default-isis-configuration-prefix-bloat/
