… updated on Friday, May 5, 2023 05:18 UTC
Silent Hosts in EVPN Fabrics
The Dynamic MAC Learning versus EVPN blog post triggered tons of interesting responses describing edge cases, vendor bugs, and implementation details, including an age-old case of silent hosts described by Nitzan:
A few years ago, in an EVPN network, I saw drops on the multicast queue (ingress replication goes to that queue). After analyzing it, we found that the root cause was vMotion (the hosts in that VLAN are silent), which starts at a very high rate before the source leaf learns the destination MAC.
It turns out that the behavior they experienced was caused by a particularly slow EVPN implementation, so it’s not exactly the case of silent hosts, but let’s dig deeper into what could happen when you do have silent hosts attached to an EVPN fabric.
Let’s define silent hosts first. They are nodes that never send any traffic, so the switches cannot learn their MAC addresses and are forced to flood the traffic sent to those MAC addresses. Typical examples would be Syslog servers or traffic monitoring/inspection appliances; we’ll ignore monstrosities like Microsoft NLB for the moment.
Then there are what Someone called shy hosts in his comment – hosts that are completely quiet for a long time, so everyone’s ARP and MAC address caches time out before those hosts start chatting. However, if the communication with those hosts involves the usual initial exchange of ARP and TCP SYN packets, everything should be fine… unless the EVPN control plane takes “forever” to propagate the newly-rediscovered MAC address, in which case all communication with those hosts is flooded until the control plane gets its job done¹. That’s obviously a pathological scenario that should result in yelling at the vendor until they get their **** together, and never buying from them again, but we all know that’s not exactly how enterprise IT works.
Back to silent hosts. It’s worth noting that with a decent EVPN control plane, vMotion should fix the problem, not make it worse. ESXi servers send RARP packets on behalf of the moved virtual machines after completing vMotion to inform the switches that the VM MAC address has moved. Unfortunately, you could turn off the notify switches option, making the vMotion events invisible, but that would result in traffic blackholes² – the switches would think the VM MAC address is still present on the origin ESXi server – not flooding.
Now for the elephant in the room: whoever is sending the traffic to a silent host must know its MAC address. While one could use static ARP entries, we usually don’t, so the senders must send ARP queries now and then, and the silent hosts must respond to them, enabling the switches to learn all the MAC addresses in the VLAN.
Time to go back to first principles: the only way to solve the silent host challenge is to ensure MAC address entries time out later than ARP entries. That’s easy to do if the traffic is entering the VLAN through a router and a bit more cumbersome if you have to adjust the ARP timeouts on all hosts in the VLAN.
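If you do end up tuning the hosts, the relevant Linux knobs are the kernel’s neighbor-table sysctls. A sketch of a sysctl fragment – the values shown are simply the kernel defaults, listed here so you know which knobs to turn:

```
# /etc/sysctl.d/90-arp-timers.conf -- illustrative fragment; these are the
# kernel defaults. Keep ARP entries aging out well before the switches'
# MAC address table aging time.
net.ipv4.neigh.default.base_reachable_time_ms = 30000
net.ipv4.neigh.default.gc_stale_time = 60
```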
Fortunately, modern TCP/IP stacks use short ARP timeouts – default value on Linux is 30 seconds (randomized into 15-45 seconds), and the kernel removes stale entries (mappings without incoming traffic) every 60 seconds. ARP entries for silent hosts should become stale almost immediately and be refreshed in approximately two minutes.
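For the curious, here’s where those timers live and how the randomization works out. An ARP entry stays reachable for a random interval between half and one-and-a-half times the base value, which with the 30-second default gives the 15–45 second window mentioned above:

```shell
# Linux keeps its IPv4 neighbor (ARP) timers here (per-interface; the
# "default" directory is the template for new interfaces):
#   /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms  (default 30000)
#   /proc/sys/net/ipv4/neigh/default/gc_stale_time           (default 60)
# An entry stays REACHABLE for a random interval in [base/2, 3*base/2]:
base_ms=30000
echo "reachable window: $((base_ms / 2000))-$((base_ms * 3 / 2000)) seconds"
```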
Switches and routers have a different perspective. Cisco IOS and Arista EOS still age out ARP entries after four hours; Cisco Nexus OS does it after 1500 seconds (25 minutes). No wonder we get flooding in VLANs with silent hosts.
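The obvious fix on the router side is to lower the ARP timeout below the MAC aging time. The interface-level commands below are illustrative (syntax from memory – verify against your platform documentation, and pick a value that fits your environment):

```
! Cisco IOS: lower the ARP timeout to 5 minutes on the VLAN interface
interface Vlan10
 arp timeout 300
!
! Cisco Nexus OS uses a similar interface-level command:
!   ip arp timeout 300
! Arista EOS:
!   arp aging timeout 300
```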
Back to the comment:
The quick and ugly solution was to scan the vMotion VLAN with NMAP every few minutes so the leafs would have all of the MAC addresses in their EVPN database.
And now we know why that works (assuming the Linux host running NMAP is attached to the same VLAN): ARP entries in the Linux kernel would become stale between NMAP runs, triggering ARP requests and responses from silent hosts regardless of whether the silent hosts would answer NMAP probes.
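That workaround could look something like the cron job below (the subnet and schedule are made up). An nmap ping scan of the local subnet forces an ARP exchange with every responding host, which is all that’s needed:

```
# crontab fragment -- illustrative subnet and schedule
# -sn: host discovery (ping scan) only, -n: skip DNS lookups
*/5 * * * *  nmap -sn -n 192.168.10.0/24 >/dev/null 2>&1
```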
Long story short: the ancient challenge often used in vendor certification written exams did not disappear just because we replaced STP with EVPN. You might get flooded traffic whenever the ARP timeouts in your network are larger than the MAC address table timeouts.
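The rule of thumb boils down to a single comparison; plugging in the defaults mentioned above (four-hour ARP timeout, a typical 300-second MAC aging time) shows why the problem appears out of the box:

```shell
# Flooding risk exists whenever the ARP timeout exceeds the MAC aging time
arp_timeout=$((4 * 3600))   # Cisco IOS default ARP timeout, in seconds
mac_aging=300               # typical default MAC address aging time, in seconds
[ "$arp_timeout" -gt "$mac_aging" ] && echo "flooding risk"
```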
Want to know more about EVPN? Check out the EVPN Technical Deep Dive webinar.
- Rewrote the vMotion-related part of the blog post based on the comment describing the impact of a particularly slow EVPN control plane implementation.
1. vMotion could easily generate a 10 Gbps TCP stream. Now imagine flooding that across the whole vSphere cluster, and sending numerous copies of every packet over leaf-to-spine uplinks due to ingress replication. Fun times. ↩︎
2. The same thing would happen if the EVPN control plane takes too long to advertise the MAC move, but the black hole would disappear as soon as EVPN gets its act together, whereas without the notify switches traffic would be blackholed until the moved VM sends its first packet (plus whatever time it takes for everyone to learn the new location of the VM MAC address). ↩︎
What you’ve described in this article, and in the previous ones on this subject, should be the case, but is not always true. I’ve been suffering for some time from a problem related to MAC learning in EVPN from a big vendor; you can call it a faulty implementation or a feature, that’s up to anyone reading this.
The ESXi hosts are truly… if not silent, shy hosts; they don’t say a thing on the vmkernel vMotion interface until they have some VM to move. The ARP cache entry timeout is 20 minutes in VMware, the vendor switch’s MAC address aging time is less, and DRS is enabled at low sensitivity (few VM moves). It’s true that the ESXi hosts send out an ARP request, create some ICMP packets, and then the TCP session for the vMotion; they are truly polite. The problem I suffered is that in some implementations from one vendor, a specific silicon can take up to two seconds to realize it has a new MAC to advertise and then create the route type-2 advertisement. So ARP, ICMP, and TCP are flooded until the remote switch learns which VTEP has the ESXi host behind it, and with HER/ingress replication this is a huge problem.
Then we have the RARP problem after the vMotion: if we take the same random delay between zero and two seconds, the traffic destined for this VM is going to go back to the old VTEP instead of the current one until convergence is achieved.
The sad solution is to rate-limit unknown unicast traffic and increase the MAC aging time to reduce the likelihood of this problem.
Thanks a million, now it makes perfect sense. I should have seen that (oh, the "beauty" of perfect hindsight). Will rewrite the blog post accordingly.
As for "two seconds to report a new MAC address", that's plain ridiculous. I'm always amazed what vendors can get away with without anyone crucifying them in public.