MUST READ: ARP Problems in EVPN

Decades ago there was a trick question on the CCIE exam exploring the intricate relationships between MAC and ARP table. I always understood the explanation for about 10 minutes and then I was back to I knew why that’s true, but now I lost it.

Fast forward 20 years, and we’re still seeing the same challenges, this time in EVPN networks using in-subnet proxy ARP. For more details, read the excellent ARP problems in EVPN article by Dmytro Shypovalov (I understood the problem after reading the article, and now it’s all a blur 🤷‍♂️).

3 comments:

  1. That is a really good thing to keep in mind.

    We ran into this a year ago in a VXLAN network and it took us a good week to troubleshoot why some connections were failing the way they were.

    Though we chose to simple increase the mac timeout timer, to potentially keep load lower, instead of decreasing the arp timeout timer.

  2. The linked to blog post seems to make two main points:
    1. ARP suppression can break ARP-based liveliness probes (i.e., when they need to be forwarded via VXLAN, but this is suppressed).
    2. Withdrawal of EVPN type 2 MAC/IP routes when the MAC table entry of a leaf times out results in missing ARP table entries on different leaf switches, which can affect asymmetric routing EVPN/VXLAN setups.

    Synchronizing ARP and MAC timeouts can help with point 2, but not point 1.

    The classic reason to synchronize ARP and MAC timeouts is to avoid unknown unicast flooding caused by ECMP to a pair of first hop routers when all traffic out of the subnet goes via one router (e.g., the VRRP master) and return traffic (also) arrives at another (e.g., VRRP backup) router.

    It does not matter much if the MAC timeout is increased to match the ARP timeout, or the ARP timeout decreased to that of the MAC timeout, or both set to matching non-default values.

    Matching here means that the ARP timeout should be slightly below the MAC timeout in order to refresh MAC tables entries with the ARP exchange.

    ExtremeXOS has a nice feature called "ARP refresh" that automatically generates ARP requests from the router before an ARP entry times out. If used in combination with appropriate MAC and ARP timeouts, this reliably avoids problems with not synchronized MAC and ARP tables, and avoids problems with the combination of so called silent devices, MAC authentication, and dynamic VLAN assignment.

    P.S. I'm no CCIE, so I don't know about CCIE trick questions. ;-)

    Replies
    1. Here's a nice summary of the original ARP/MAC timer issues:

      https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6000-series-switches/23563-143.html

      As for "ARP refresh", Cisco IOS does something similar when you use CEF switching:

      https://blog.ipspace.net/2007/06/ar.html

      No idea what NX-OS or IOS XE is doing.

    2. One difference between Cisco IOS and ExtremeXOS is that the latter allows disabling this functionality (disable iparp refresh, it is enabled by default).

  3. Dmytro Shypovalov is a master networker who has a sophisticated grasp of some of the most advanced topics in networking. He doesn't write often, but when he does, he writes exceptional content, both deep and broad. Have to say I agree with him 300% on "If an L2 network doesn’t scale, design a proper L3 network. But if people want to step on rakes, why discourage them." Hierarchical networking -- L3 routed network being an example – is the right way to scale. That L2 addressing was flat, created as a workaround from the start, and never meant to be a long-term solution, means all attempts to scale L2-based networks are attempts to make pigs fly, and therefore, by necessity, have to add a lot of moving parts, which increase complexity quickly.

    Over-optimization is a dumb process, and overclocking a L2-based network is a prime example of that. Due to the unscalable nature of L2 networking, kludges like VXLAN multihoming’s BUM flooding trick, have to be implemented, significantly complicating the network. If one adds another kludge, MLAG (mentioned in his linked post On Duplicates), on top of that, the whole thing quickly spirals out of control. With this kind of state explosion, how can anyone intuitively visualize the network and the data flow? Worse, how many combinations of error can show up, some already mentioned in his posts? Even without MLAG, the VXLAN BUM trick can get nasty proportional to the size and the degree of multi-homing in a network. And L2 network being unscalable is again clearly on display when one looks at asymmetric Inter-VXLAN routing, the way it works. Once again state explosion. So they have to come up with the symmetric IRB trick – essentially back to routing and hierarchy -- to reduce state and scale. Now how is all this complexity realized in the hardware FIB, and what kind of additional resources and latency it imposes? Is that why VXLAN in theory can scale to 2^24 segments, but not in actual hardware? I’d like to see the forwarding pipeline on a VTEP.

    The sheer complexity of EVPN and particularly of EVPN over VXLAN, creates a lot of subtlety. How many people have any hope of understanding all this, let alone applying the knowledge to efficiently design and build a big network? What happens if you have to troubleshoot this complex matrix of feature interaction, in a large network? The thing is, networking and networking devices are already very complex to begin with -- the ARP problem in async-routed network is complicated enough on its own -- so the goal should be more simplicity, less complexity. With EVPN over VXLAN, it's the other way around because there're more moving parts being added. And look at the nested encapsulation of VXLAN. Layering is harmful due to possible inter-layer dependency and added processing, and VXLAN just takes it to a class of its own.

    All of these dirty tricks, trying to make pigs fly, trying to scale the unscalable L2, come from people lacking a understanding of how addressing works, and how inadequate our current addressing model is (missing half of the structure), hence the creation of abominations like VXLAN and IPv6. Done correctly, and all of these become unnecessary.

    Btw Ivan, Dmytro mentioned this “if the whole process of re-discovering host B MAC and advertising it via BGP will take longer than 1 second, it will cause SW1/SW2 to send ICMP unreachables back to host A”. Not that I don’t trust him, but I tried to look for this sort of Arp response timeout on Cisco website, and couldn’t locate it. On the host side, looks like it’s gonna try for 3 seconds before giving up:

    https://unixpedia.blogspot.com/2017/06/no-route-to-host-errors-is-situation.html

    Is there a command you know of, to adjust this ARP setting for Cisco devices?

    Replies
    1. "Is there a command you know of, to adjust this ARP setting for Cisco devices?"

      Cisco IOS drops the packet(s) if there's no corresponding ARP entry without even sending back an ICMP reply. See https://blog.ipspace.net/2007/04/why-is-first-ping-lost.html for details.

      I would expect most other networking operating system behave the same way. It makes no sense to buffer a packet while waiting for an ARP reply when implementing an unreliable delivery mechanism (an IP network).

    2. There are networking operating systems that buffer some packets during ARP resolution, e.g., those using the Linux TCP/IP stack for packet forwarding, but also others. In my experience, the Cisco (at least IOS, IOS XE, and IOS XR) behavior of dropping packets unless an ARP cache entry exists is the exception (IMHO this Cisco behavior is sensible and usually preferable, but some buffering during ARP resolution works, too).

      Section 3.3.2 of RFC 1812 recommends this buffering behavior:

      "[The link layer] SHOULD queue up to a small number of datagrams breifly while performing the ARP request/reply sequence"

      It also specifies that "[t]he link layer MUST NOT report a Destination Unreachable error to IP solely because there is no ARP cache entry for a destination."

    3. And I was wrong... again. With CEF switching, the incoming packet is buffered until the ARP entry is complete. Here's the debugging printout of a single CEF-switched packet:

      • CEF input detailed debug message
      • ARP debug messages
      • CEF output detailed debug message
      *Oct 23 16:16:02.849: CEF-Debug: Packet from 10.1.0.1 (Gi0/1) to 10.1.0.5
      *Oct 23 16:16:02.849:   ihl=20, length=84, tos=0, ttl=63, checksum=58291, offset=0 DF
      *Oct 23 16:16:02.849:     ICMP type=8, code=0, checksum=42374
      *Oct 23 16:16:02.849:          ECHO
      *Oct 23 16:16:02.850: IP ARP: creating incomplete entry for IP address: 10.1.0.5 interface GigabitEthernet0/2
      *Oct 23 16:16:02.850: IP ARP: sent req src 10.1.0.6 5254.00d4.e045,
                       dst 10.1.0.5 0000.0000.0000 GigabitEthernet0/2
      *Oct 23 16:16:02.850: IP ARP: rcvd rep src 10.1.0.5 5254.0040.38b4, dst 10.1.0.6 GigabitEthernet0/2
      *Oct 23 16:16:02.851: CEF-Debug: Packet from 10.1.0.1 (Gi0/1) to 10.1.0.5 (Gi0/2)
      *Oct 23 16:16:02.851:   ihl=20, length=84, tos=0, ttl=63, checksum=58291, offset=0 DF
      *Oct 23 16:16:02.851:     ICMP type=8, code=0, checksum=42374
      *Oct 23 16:16:02.851:          ECHO
      
Add comment
Sidebar