Optimal L3 Forwarding with VARP and Active/Active VRRP
I blogged about the need for optimal L3 forwarding across the whole data center in 2012, when I introduced it as one of the interesting requirements in the Data Center Fabrics webinar. Years later, the concept became one of the cornerstones of modern EVPN fabrics, but there are still only a few companies that can deliver this functionality in a more traditional environment.
Fabric solutions that appear as a single system to the outside world usually offer optimal L3 forwarding. These solutions include:
- Stacked ToR switches and similar solutions (including HP IRF and Juniper’s Virtual Chassis) definitely fall into this category (note: using stacked switches or virtual chassis architectures with ring-based interconnects in environments with heavy east-west traffic is NOT a good idea);
- Other architectures that present the whole fabric as a single layer-3 entity like Juniper’s QFabric (now mostly obsolete).
While optimal L3 forwarding with anycast first-hop gateways became table stakes for EVPN implementations, and most vendors offer active/active first-hop gateways in MLAG clusters, there are only a few companies I’m aware of that can implement anycast gateways across a traditional layer-2 fabric: Arista with Virtual ARP, Cumulus Linux with Virtual Router Redundancy, and Enterasys (now Extreme) with Fabric Routing.
Arista’s Virtual ARP is extremely simple[1] – it’s like VRRP without VRRP. You have to configure the same IP address (first-hop gateway) on a VLAN interface of all ToR switches with the ip virtual-router address interface configuration command, and associate a MAC address with the shared IP address with the ip virtual-router mac-address global configuration command.
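A minimal sketch of what that configuration could look like when it is generated from a script. The two ip virtual-router commands are the ones named above; the VLAN number, all addresses, the switch names, and the render_varp_config helper are hypothetical, and the placement of the commands (MAC address at the global level, virtual address under the VLAN interface) reflects my understanding of EOS and should be checked against the documentation of your release.

```python
def render_varp_config(vlan_id, switch_ip, prefix_len, virtual_ip, virtual_mac):
    """Return EOS-style configuration lines for one ToR switch (illustrative only)."""
    return [
        # Shared MAC address, identical on all ToR switches
        f"ip virtual-router mac-address {virtual_mac}",
        f"interface Vlan{vlan_id}",
        # Unique per-switch address on the VLAN interface
        f"   ip address {switch_ip}/{prefix_len}",
        # Shared first-hop gateway address, identical on all ToR switches
        f"   ip virtual-router address {virtual_ip}",
    ]


# Hypothetical values for illustration only
SWITCH_IPS = {"tor1": "10.0.10.2", "tor2": "10.0.10.3", "tor3": "10.0.10.4"}

configs = {
    name: render_varp_config(10, ip, 24, "10.0.10.1", "00:1c:73:00:00:99")
    for name, ip in SWITCH_IPS.items()
}

for name, lines in configs.items():
    print(f"! {name}")
    print("\n".join(lines))
```

Only the ip address line differs between switches; the shared IP and MAC addresses are the same everywhere, which is what makes every ToR switch an active first-hop gateway.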
The first switch that is hit with an ARP request for the shared virtual IP address will reply with the shared MAC address (I’m not sure about the details – it might well be that the ARP broadcast gets flooded to all switches, in which case the sender gets numerous replies). When a host sends an IP packet to that same shared MAC address, the first ToR switch that the packet hits intercepts the packet (because it’s listening to the shared MAC address), and performs L3 routing.
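The behavior described above can be modeled in a few lines. This is a toy model rather than a description of the actual EOS implementation: the TorSwitch class, its methods, and all addresses are made up for illustration, and the model ignores flooding and unicast-versus-broadcast details.

```python
from dataclasses import dataclass

VIRTUAL_IP = "10.0.10.1"           # hypothetical shared gateway address
VIRTUAL_MAC = "00:1c:73:00:00:99"  # hypothetical shared MAC address


@dataclass
class TorSwitch:
    name: str
    virtual_ip: str = VIRTUAL_IP
    virtual_mac: str = VIRTUAL_MAC

    def handle_arp_request(self, target_ip):
        """Reply with the shared MAC if the request is for the shared IP."""
        return self.virtual_mac if target_ip == self.virtual_ip else None

    def handle_frame(self, dst_mac, dst_ip):
        """Route frames sent to the shared MAC; bridge everything else."""
        if dst_mac == self.virtual_mac:
            return f"{self.name}: routed packet to {dst_ip}"
        return f"{self.name}: bridged frame to {dst_mac}"


# Two hosts, each attached to a different ToR switch
tor1, tor2 = TorSwitch("tor1"), TorSwitch("tor2")

# Both switches give the same answer, so it does not matter which one replies
gateway_mac_a = tor1.handle_arp_request(VIRTUAL_IP)
gateway_mac_b = tor2.handle_arp_request(VIRTUAL_IP)

# The first switch the packet hits performs L3 routing
print(tor1.handle_frame(gateway_mac_a, "10.0.20.5"))
print(tor2.handle_frame(gateway_mac_b, "10.0.20.5"))
```

Because every switch answers with the same MAC address and listens to it, the host never needs to know which switch replied.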
Things might get nasty if you have configuration mismatches – for example, missing ip virtual-router address configuration on one of the ToR switches. Make sure you use some sort of automation or orchestration system to configure the ToR switches.
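A simple consistency check is one way to catch such mismatches before they reach production. The sketch below assumes you already have the configuration of each ToR switch as a list of text lines; the find_varp_mismatches function and the sample data are hypothetical, and the check ignores which interface a line belongs to, so treat it as a starting point only.

```python
def find_varp_mismatches(switch_configs):
    """Map each switch name to the VARP lines it is missing.

    switch_configs: dict of switch name -> list of configuration lines.
    A line is expected on every switch if at least one switch has it.
    Simplification: interface context is ignored.
    """
    def varp_lines(lines):
        return {
            line.strip()
            for line in lines
            if line.strip().startswith("ip virtual-router ")
        }

    per_switch = {name: varp_lines(lines) for name, lines in switch_configs.items()}
    expected = set().union(*per_switch.values())
    return {
        name: sorted(expected - found)
        for name, found in per_switch.items()
        if expected - found
    }


# Hypothetical sample: tor2 is missing the virtual-router address
sample_configs = {
    "tor1": [
        "ip virtual-router mac-address 00:1c:73:00:00:99",
        "interface Vlan10",
        "   ip virtual-router address 10.0.10.1",
    ],
    "tor2": [
        "ip virtual-router mac-address 00:1c:73:00:00:99",
        "interface Vlan10",
    ],
}

mismatches = find_varp_mismatches(sample_configs)
print(mismatches)  # {'tor2': ['ip virtual-router address 10.0.10.1']}
```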
- Removed mentions of obsolete products/startups.
- Added a mention of Cumulus Linux VRR.
- Added a link to VARP deep dive blog post
[1] Cumulus Linux Virtual Router Redundancy is functionally equivalent to Arista’s Virtual ARP.
(support was added in EOS 4.11.3 if you wanted to look back to where it appeared).
IPv6 is most definitely not a 2nd class citizen on Arista.
Thanks for the clarification and apologies for the implied snark ;-)
Arista VARP uses a single common MAC address across all devices (more than 2 are supported), and in fact you can run it at different places in your network (e.g. both leaf and spine). Since every device is 'active', there is no need for any protocol and thus there is also no failover time period.
In the NX-OS vPC and HSRP implementation, both the active and standby HSRP gateways actively forward packets (the HSRP virtual MAC address is programmed with the G flag on both vPC switches).
This is still limited to a vPC pair of N7k or N5k, but Anycast FHRP on FabricPath should pop up in the next months.
Cisco's alternative behavior of running the standby as active for HSRP, VRRP and GLBP in vPC isn't really 'similar'. You still have a protocol, you still have a maximum of 2-way active, and you still have scale limitations imposed by the protocol scaling (e.g. see Cisco's published "maximum system scale" numbers for # of FHRP instances).
FabricPath doesn't solve this problem (nor does FabricPath address the inherent scale issues with MAC table size on F1/F2 modules on N7K).
Anycast FHRP would be a good thing, but then again I think I was talking about that 4 years ago, and it's still not there?
A small comment: you mention "Things might get nasty if you have configuration mismatches – for example, missing ip virtual-router address configuration on one of the ToR switches":
Actually, nothing 'bad' will happen if you did have a configuration mismatch like that. All that would happen is that you'd have more traffic flowing towards whichever switch owns the virtual MAC address that the host last heard a gratuitous ARP from. And that may oscillate.
I think (but haven't checked) that VARP even knows about that oscillation and will point it out - it's a neat aspect of gratuitous ARPs being broadcast: every switch will 'see' them.
makes 100% sense and was a silly thing to do, but easy mistake for a VARP rookie...
"The first switch that is hit with an ARP request for the shared virtual IP address will reply with the shared MAC address (I’m not sure about the details – it might well be that the ARP broadcast gets flooded to all switches, in which case the sender gets numerous replies). "
Just wanting to have an understanding of what should be expected data path wise. Can't be as simple as multiple replies to the ARPing host, can it?
The answer will obviously be implementation-dependent, but the short answer is that it could be any number of things and still work just fine.
What is required for ARP to work is that a device answers the ARP request. That 'reply' could either be a broadcast response (sort of like what GARP does) or unicast. If it's unicast, only the destination receives it.
The "implementation dependent" piece is what the initial 'hop' switch does with the broadcast ARP request: does it eat that broadcast and respond on its own, or does it forward the broadcast and potentially get multiple answers back from many distributed [independent] gateways?
Either is possible, via configuration.
There may be merits in localizing ARP response but nothing bad happens if there are duplicate responses.
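A quick way to see why duplicate responses are harmless: every gateway answers with the same MAC address, so the host's ARP cache ends up in the same state whether it hears one reply or several. The sketch below is a simplified model of a host ARP cache, with a hypothetical apply_arp_replies helper and made-up addresses.

```python
VIRTUAL_IP = "10.0.10.1"           # hypothetical shared gateway address
VIRTUAL_MAC = "00:1c:73:00:00:99"  # hypothetical shared MAC address


def apply_arp_replies(arp_cache, replies):
    """Update a host ARP cache from (ip, mac) replies; the last reply wins."""
    for ip, mac in replies:
        arp_cache[ip] = mac
    return arp_cache


# One reply from the local switch versus three replies from three gateways
one_reply = apply_arp_replies({}, [(VIRTUAL_IP, VIRTUAL_MAC)])
three_replies = apply_arp_replies({}, [(VIRTUAL_IP, VIRTUAL_MAC)] * 3)

# The resulting cache is identical in both cases
print(one_reply == three_replies)  # True
```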