Is Dynamic MAC Learning Better Than EVPN?
One of my readers worried about the control-plane-induced MAC learning lag in EVPN-based networks:
In all discussions about the advantages/disadvantages of VXLAN/EVPN, I can’t find any regarding the lag in learning new MACs when you use the control plane for MAC learning.
EVPN is definitely slower than data plane-based dynamic MAC learning (regardless of whether it’s done in hardware or software), but so is MLAG.
Aside: I had a customer that used an MLAG cluster with thousands of MAC addresses (VMs) reachable through an orphan (single-node) trunk link. When a link failure moved all those VMs to another orphan link, it took minutes for the network to converge due to the control-plane propagation of MAC reachability information.
That design was as awful as one could make it, but they inherited years of “organic growth” and that was the best they could do.
Anyway, while there’s a noticeable difference between data-plane and control-plane MAC learning, the real question is “does it matter?”. Back to my reader:
This has implications in how much BUM traffic is generated when someone starts a communication with a silent host, or when you do vMotion. While in data plane learning you measure this time in ms, in control plane learning it can take hundreds of ms or in some vendors seconds.
The BUM traffic concern is valid, but I wonder how often we see silent hosts these days. If nothing else, at least some network devices periodically refresh their ARP caches – if they’ve seen a host once, they’ll send it an ARP request every now and then. Furthermore, you’d need a silent host combined with a heavy burst of UDP traffic[1] to generate a noticeable amount of BUM traffic. Plausible, but not likely.
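To put the flooding concern in perspective, here’s a toy Python model of flood-and-learn forwarding (port names and MAC addresses are made up, nothing vendor-specific): anything sent toward a MAC address the switch hasn’t learned yet gets flooded, and a single packet from the silent host (for example, an answer to a periodic ARP refresh) stops the flooding.

```python
# A toy model of data-plane (flood-and-learn) forwarding, just to show when BUM
# traffic actually happens. Port names and MAC addresses are made up.
PORTS = ["eth1", "eth2", "eth3"]
mac_table = {}                      # source MAC -> port it was learned on

def forward(src_mac, dst_mac, in_port):
    mac_table[src_mac] = in_port    # data-plane learning: remember the sender
    if dst_mac in mac_table:
        return [mac_table[dst_mac]]                 # known unicast
    return [p for p in PORTS if p != in_port]       # unknown unicast: flood

# A silent host on eth3 never talks, so everything sent toward it is flooded ...
print(forward("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:03", "eth1"))  # ['eth2', 'eth3']
# ... until it answers (or a periodic ARP refresh makes it talk) and gets learned.
print(forward("aa:aa:aa:aa:aa:03", "aa:aa:aa:aa:aa:01", "eth3"))  # ['eth1']
print(forward("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:03", "eth1"))  # ['eth3']
```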
The vMotion argument is more worrying, until you realize the “at most one ping lost” claim people love to make is nonsense[2]. vMotion is neither instantaneous nor lossless, and while nobody wants to publish the measurements[3], the final step of the vMotion process (freeze-transfer-thaw-resume) takes milliseconds.
EVPN definitely adds extra delay to the vMotion process. After all, the target hypervisor sends the RARP packet[4] once the VM is ready. From the real-life anecdata perspective, I know plenty of organizations running either EVPN/VXLAN or Cisco ACI, and nobody ever complained about the connectivity problems following VM moves. If you have a counterexample, please write a comment.
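As an aside, the difference footnote [4] describes is easy to see if you build the two frames yourself. Here’s a minimal Python sketch (raw struct packing, made-up MAC and IP values) of the RARP frame ESXi broadcasts after a vMotion and the gratuitous ARP most other hypervisors would send:

```python
import struct

BCAST = b"\xff" * 6

def eth_frame(dst, src, ethertype, payload):
    """Ethernet II header followed by the payload."""
    return dst + src + struct.pack("!H", ethertype) + payload

def arp_body(oper, sha, spa, tha, tpa):
    """(R)ARP body: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4)."""
    return struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper) + sha + spa + tha + tpa

vm_mac = bytes.fromhex("005056abcdef")   # made-up VM MAC (VMware OUI)
vm_ip  = bytes([192, 0, 2, 42])          # made-up VM IP (TEST-NET-1)

# What ESXi broadcasts after vMotion: a RARP request (EtherType 0x8035, opcode 3).
# Both protocol-address fields are zero -- the hypervisor never needs to know
# the VM's IP address to send this frame.
rarp = eth_frame(BCAST, vm_mac, 0x8035,
                 arp_body(3, vm_mac, bytes(4), vm_mac, bytes(4)))

# What most other hypervisors send: a gratuitous ARP request (EtherType 0x0806)
# with sender IP == target IP, announcing the VM's IP-to-MAC binding.
garp = eth_frame(BCAST, vm_mac, 0x0806,
                 arp_body(1, vm_mac, vm_ip, bytes(6), vm_ip))

print("RARP:", rarp.hex())
print("GARP:", garp.hex())
```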
Long story short:
- Assuming the BGP update timers are tweaked down to zero, EVPN control-plane delay is probably not a big deal in networks with reasonable number of MAC addresses and reasonable amount of churn.
- When in doubt, don’t vMotion VoIP gateways or high-performance workloads. Your users might notice (if you’d rather measure than guess, see the probe sketch below).
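If you want hard numbers for your environment, here’s a minimal Python sketch (the target address, port, and thresholds are made-up values, and it assumes a UDP echo responder running on the VM) that probes a VM every 10 ms and reports any connectivity gap it observes while you vMotion it:

```python
# Minimal outage-duration probe: send a UDP probe every 10 ms and report any
# gap between replies that is longer than a few probe intervals.
import socket
import time

TARGET = ("192.0.2.10", 7777)    # hypothetical VM address and echo port
INTERVAL = 0.01                  # probe every 10 ms

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(INTERVAL)

last_reply = time.monotonic()
while True:
    try:
        sock.sendto(b"probe", TARGET)
        sock.recvfrom(64)
        gap = time.monotonic() - last_reply
        if gap > 3 * INTERVAL:   # anything beyond a few lost probes is a real gap
            print(f"connectivity gap of roughly {gap * 1000:.0f} ms")
        last_reply = time.monotonic()
    except socket.timeout:
        pass                     # lost probe; the gap shows up on the next reply
    time.sleep(INTERVAL)
```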
More Details
As always, you’ll find hours of relevant content in these ipSpace.net webinars:
- Switching, Routing and Bridging part of How Networks Really Work
- EVPN Technical Deep Dive
- vSphere 6 Networking Deep Dive
- Leaf-and-Spine Fabric Architectures
[1] … or TCP sessions with no idle detection.
[2] Ping packets are usually sent once per second. If a vMotion-induced outage takes half a second, you have a 50% chance of not losing a ping. Furthermore, the usual ping timeout is two seconds; if you do lose a packet, it’s likely you won’t lose the next one two seconds later.
[3] For whatever reason, Spirent didn’t want the actual vMotion performance part of their Networking Field Day presentation recorded.
[4] Because VMware never bothered to figure out how to find the IP address of the VM, and RARP was the only broadcast packet they could find that did not need an IP address in the payload. Everyone else uses gratuitous ARPs.
I had an issue with control-plane learning in an EVPN fabric, but it's a bit special. The old fabric was FabricPath, the new one an Arista VXLAN EVPN fabric using ESIs instead of MLAG. After everything was set up and tested, we connected the two fabrics together using a vPC/ESI, and everything was fine for a while. Then vMotion happened: most VMs were fine, but there were always some VMs that got stuck or had errors during vMotion and had to be vMotioned again. It took us some time to find the cause, because two issues combined to produce this behavior.

FabricPath doesn't learn the MAC address of a sender from the first packet, which is usually a broadcast (ARP). It only learns the MAC if the packet is NOT a broadcast. That means: Host A ARPs for Host B, nothing learned; Host B answers Host A, Host B's MAC learned; Host A sends a packet to Host B, Host A's MAC learned. That extra packet takes some time; in our troubleshooting it was less than 50 ms, but still more than it would usually be.

During that time, the Arista fabric also received the broadcast (ARP) and looped it back to the Cisco FabricPath fabric, because the Arista EVPN fabric wasn't fast enough to propagate the learned MAC to the ESI peer, so the L2 split horizon of the ESI didn't work. I can't remember anymore why the next part happened, but the Cisco fabric then learned the MAC coming from the Aristas and began sending the traffic toward that empty new fabric. That only lasted until the other host sent another packet and FabricPath learned that this MAC was on a Cisco switch and not behind the Arista fabric, but it was enough to break some vMotion actions.

The solution was easy: we only used one switch of the EVPN fabric during the migration, so there was no ESI anymore. But other than that? No issues.
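The learning rule described above boils down to something like this toy model (made-up MAC addresses):

```python
# Toy model of the learning rule described above: a FabricPath-style switch
# skips source-MAC learning on broadcast frames, so it takes a third, unicast
# packet before both hosts are in the table.
BROADCAST = "ff:ff:ff:ff:ff:ff"
learned = set()

def receive(src_mac, dst_mac):
    if dst_mac != BROADCAST:      # only learn from non-broadcast frames
        learned.add(src_mac)
    return sorted(learned)

print(receive("mac-A", BROADCAST))   # Host A ARPs for B  -> nothing learned
print(receive("mac-B", "mac-A"))     # Host B answers     -> B learned
print(receive("mac-A", "mac-B"))     # A's next packet    -> A finally learned
```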
Lovely. Thanks a million for sharing this one!
Interesting subject! I've also recently noticed some vendors claiming that data-plane MAC learning is so much better because it reduces the number of BGP updates in large-scale SP EVPN deployments. Apparently, some of them are working on IETF drafts to bring data-plane MAC learning "back" to EVPN. Not sure if this is really a relevant point - we know that BGP scales nicely, and it's relatively easy to deploy a virtualized RR with sufficient vCPU resources.
Control-plane MAC learning in a service provider environment never made sense to me. After all, you're selling bandwidth, and don't care (too much) how that bandwidth is used. Tracking customer MAC addresses is just asking for trouble and support calls when things go sideways.
However, when BGP is the answer no matter what the question is, or when you try to boil the ocean (hint: replacing MLAG with ESI), you get the current state of affairs ;)
I know some service providers and IXPs use VXLAN encapsulation (to provide layer-2 transport over IP network) without EVPN. There might be a reason for that ;)
Since you've mentioned ACI specifically and stated »nobody ever complained about the connectivity problems following VM moves«: We've encountered VM connectivity issues after VM movements from one vPC leaf pair to a different vPC leaf pair with ACI. The issue did not occur immediately (due to ACI's bounce entries) and only sometimes, which made it very difficult to reproduce synthetically, but due to DRS and a large number of VMs it occurred frequently enough that it was a serious problem for us.
The problem was that sometimes the COOP database entry (ACI's separate control plane for MACs and host addresses) was not updated correctly to point to the new leaf pair. After the bounce entry on the old leaf pair expired (630 seconds by default), traffic to the VM was mostly blackholed, since remote endpoint learning is disabled on border leafs and traffic is always forwarded to the spines' underlay IP address for proxying.
In the end we gave up and limited the VM migration domain to a single vPC leaf pair. VMware recommends a maximum of 64 hosts per cluster anyway.
This definitely looks like a bug in ACI to me, but it's ghastly. It's even worse than what we encountered in early NSX versions (NSX controller losing track of MAC addresses after vCenter SOAP API broke down).
However, limiting DRS to a single vPC leaf pair might not be the right answer (unless you use VM affinity within a subset of the HA cluster) -- you want the VMs to be restarted automatically even if both leaves blow up (for example, due to a vPC bug).
> This definitely looks like a bug in ACI to me
A colleague mentioned encountering this ACI problem as well, but they opened a case with the vendor and the bug was fixed (hearsay, I was not involved).
A few years ago, in an EVPN network, I saw drops on the multicast queue (ingress replication goes to that queue). After analyzing it, we found the root cause was vMotion (the hosts in that VLAN are silent), which starts at a very high rate before the source leaf learns the destination MAC. The quick and ugly solution was to scan the vMotion VLAN with NMAP every few minutes so the leafs would have all of the MAC addresses in their EVPN database.
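A sweep like that can be as simple as the following sketch (the subnet and interval are made-up values):

```python
# Periodically ping-sweep the vMotion VLAN so every host answers ARP and stays
# in the leafs' MAC/EVPN tables. Subnet and interval are made-up values.
import subprocess
import time

VMOTION_SUBNET = "192.0.2.0/24"   # hypothetical vMotion VLAN subnet
INTERVAL = 300                    # seconds between sweeps

while True:
    # "nmap -sn" does a host-discovery sweep (no port scan); on the local
    # segment it ARPs every address, which is exactly what we want here.
    subprocess.run(["nmap", "-sn", VMOTION_SUBNET], check=False)
    time.sleep(INTERVAL)
```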
Would love to know more about this one. So far, it looks more like a mismatch between ARP and MAC timeouts to me, potentially combined with weird behavior like "don't learn MAC addresses from broadcasts," because ESXi should announce the moved MAC address immediately after vMotion. Solving the problem with NMAP also points in that direction.
While there have been detection issues with vMotion, I've never heard any server admins complain about any network hiccups with vMotion when it's working correctly (and they're not shy about letting you know when the network isn't doing its job, even when it is). Sure, a lot more packets than we think may get dropped, but if they're not perceptible, it's effectively flawless (for most workloads).
It's kind of like a switch with 20 µs of port-to-port latency versus one with 1 µs. It's 20x higher latency, but for the vast majority of workloads, especially virtual workloads, the difference is imperceptible.
There are some applications that are more drop-sensitive, and they generally come with a vMotion prohibition. But those are rare.
> While there have been detection issues with vMotion, I've never heard any server admins complain about any network hiccups with vMotion when it's working correctly
That has been my experience as well, so it was even more interesting to see all the counterexamples.
I think the counterexamples mentioned were all control-plane issues. If the control plane is working correctly, I think most of the time no one notices any dropped packets (hence no complaints). When I learned vMotion as part of a VMware certification course (I had to be VCP certified to teach UCS back in the day), I was on a virtual desktop as it bounced back and forth between ESXi hosts, and there was nothing I could do to notice it was occurring. It seemed like sorcery!
I've definitely run into apps whose vendors say you can't vMotion them, such as Arista CloudVision with virtual CVP nodes. But those are pretty rare.