Why Is Stretched ACI Infinitely Better than OTV?
Eluehike Chedu asked an interesting question after my explanation of why stretched ACI fabric (or alternatives, see below) is the least horrible way of stretching a subnet: What about OTV?
Time to go back to the basics. As Dinesh Dutt explained in our Routing on Hosts webinar, there are (at least) three reasons why people want to see stretched subnets:
- Service or node discovery based on broadcasts;
- Multicast cluster heartbeats;
- Assumptions of servers being in the same subnet (including IP address mobility and VM mobility).
While there’s not much one could do about the first two (apart from enabling IP multicast in the data center fabric), there are two ways of solving the third one:
- The wrong way: stretched VLAN;
- The right way: admitting that the IP subnet paradigm doesn’t fit all environments and going back to routing based on host identifiers (hint: CLNS got there decades ago).
What’s the difference between the two approaches? The stretched VLAN approach uses the wrong forwarding paradigm (panic-and-flood when you don’t know) that was invented to emulate a yellow coax cable. The routing on host identifiers approach is still routing (drop when you don’t know), but uses a more granular forwarding table.
You might have noticed I said host identifiers and not IP addresses. It really doesn’t matter that much if you do routing based on MAC or IP addresses as long as it’s deterministic and there’s no flooding. Figuring out why it still matters whether you use MAC or IP addresses is left as an exercise for the reader ;)
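To make the difference concrete, here’s a toy sketch of the two lookup behaviors; the tables, names, and values are invented for illustration and have nothing to do with any real switch OS:

```python
# Toy model of the two forwarding paradigms; all tables and values are invented.

mac_table = {"00:50:56:aa:bb:01": "port-3"}   # dynamically learned MAC entries (bridging)
host_routes = {"10.1.1.10": "leaf-2"}         # per-host (/32) routes (routing fabric)

def bridge_forward(dst_mac, all_ports):
    """Bridging: unknown destination means flood out every port (panic-and-flood)."""
    return [mac_table[dst_mac]] if dst_mac in mac_table else all_ports

def route_forward(dst_ip):
    """Host routing: unknown destination means drop (deterministic, no flooding)."""
    return host_routes.get(dst_ip)            # None means drop

print(bridge_forward("00:50:56:de:ad:01", ["port-1", "port-2", "port-3"]))  # floods everywhere
print(route_forward("10.9.9.9"))                                            # None: dropped
```

The only structural difference is what happens on a lookup miss, and that single difference is where all the flooding-related problems come from.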
Various VLAN extension approaches like OTV are just lipstick on a pig. They have to use all sorts of tricks to fix the problems caused by using the wrong forwarding behavior (bridging):
- First-hop gateway selection (otherwise you get traffic trombones; see the toy example after this list);
- Suboptimal ingress traffic flow;
- Excessive flooding across lower-speed links;
- Unnecessary unicast flooding.
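Here’s the toy example of the first problem (first-hop gateway selection); the sites and the function are made up, and the only point is to show where the extra DCI crossing comes from:

```python
# Toy illustration of the traffic-trombone problem with a stretched VLAN:
# the VM moved to DC2, but the active first-hop gateway stayed in DC1,
# so every routed packet crosses the DCI before going anywhere else.

def first_hop_path(vm_site, gateway_site):
    """Return the path from the VM to its first-hop gateway and count DCI crossings."""
    path = [f"vm@{vm_site}", f"gateway@{gateway_site}"]
    dci_crossings = 0 if vm_site == gateway_site else 1
    return path, dci_crossings

# Stretched VLAN: the active gateway stayed in DC1 after the VM moved to DC2.
print(first_hop_path("DC2", "DC1"))   # (['vm@DC2', 'gateway@DC1'], 1) -> trombone over the DCI

# Routing on the first-hop device: the leaf the VM is attached to is the gateway.
print(first_hop_path("DC2", "DC2"))   # (['vm@DC2', 'gateway@DC2'], 0) -> no trombone
```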
You don’t get any of these problems when using routing based on host identifiers:
- First-hop gateway is always the first network device.
- Forwarding fabric already contains host routes, which can be redistributed into an external routing protocol to get optimal ingress traffic flow (note: I’m not saying that’s a good idea).
- There’s no flooding, and ARP/ND/IGMP requests are terminated on the first-hop network device (see the sketch below).
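Here’s a rough sketch of what that last point means in practice, assuming a leaf switch whose host table is populated by the fabric control plane; the data structures and the function are invented for this example:

```python
# Rough sketch of ARP termination on the first-hop device. The host table
# (populated by the fabric control plane in a real fabric) is invented here.

host_table = {
    "10.1.1.20": {"mac": "00:50:56:aa:bb:02", "leaf": "leaf-3"},
}

def handle_arp_request(target_ip):
    """Answer ARP locally from the host table; never flood the request across the fabric."""
    entry = host_table.get(target_ip)
    if entry is not None:
        return {"op": "arp-reply", "ip": target_ip, "mac": entry["mac"]}
    return None  # unknown host: drop (or trigger controlled discovery), don't flood

print(handle_arp_request("10.1.1.20"))   # answered by the leaf itself
print(handle_arp_request("10.1.1.99"))   # None: dropped, not flooded
```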
The “only” problem left for host routing fabrics to solve: identifying the correct host identifiers (there’s a reason CLNS had the ES-IS protocol). Most solutions misuse ARP requests to identify host IP addresses, or glean host IP addresses straight from data packets. VMware makes it even more interesting with their incredibly shortsighted decision to use RARP instead of ARP to signal a VM move.
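For illustration only, here’s a minimal ARP-gleaning sketch using scapy; the interface name is a placeholder, and real fabrics do this in the switch OS or forwarding hardware, not in a Python script:

```python
# Minimal ARP-gleaning sketch (illustration only): learn IP-to-MAC bindings
# from ARP packets seen on a server-facing port and treat them as host routes.
from scapy.all import sniff, ARP

learned = {}  # ip -> mac; what a fabric would turn into /32 host routes

def glean(pkt):
    if pkt.haslayer(ARP) and pkt[ARP].op in (1, 2):   # ARP request or reply
        ip, mac = pkt[ARP].psrc, pkt[ARP].hwsrc
        if ip != "0.0.0.0" and learned.get(ip) != mac:
            learned[ip] = mac
            print(f"learned host route {ip}/32 -> {mac}")
            # a real fabric would now install/advertise this host route

# "eth1" is a placeholder server-facing interface. Silent hosts never send ARP,
# so they never show up here (the silent-host problem).
sniff(filter="arp", iface="eth1", prn=glean, store=False)
```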
Is Cisco ACI the only fabric that works this way? Absolutely not. You have plenty of choices:
- Avaya fabric
- Cisco ACI
- Cisco DFA
- EVPN with symmetrical IRB (asymmetrical IRB still uses too much bridging), for example on Cisco Nexus switches
- Cumulus Linux Redistribute ARP
- Enterasys (now Extreme Networks) host routing
We covered this idea in detail in the Leaf-and-Spine Fabric Designs webinar, but if you need just an overview, watch my IPv6 microsegmentation Troopers talk or IPv6 microsegmentation webinar.
However, there is work in progress in the IETF to address all these requirements when interconnecting EVPN data centers, i.e. "Multi-Site EVPN": https://tools.ietf.org/html/draft-sharma-multi-site-evpn-01.
Appreciate your insight on issues like these, as always. Of note to me was your comment that "Most solutions misuse ARP requests to identify host IP addresses...", which is something I have struggled with myself when deploying these things.
For example, a limitation of LISP ESM for me in the past has been silent hosts, and I believe Cumulus redistribute ARP suffers from a similar pain point. Speak-when-spoken-to hosts (cluster IPs, VIPs), for example, require contingency plans and sometimes painful workarounds when deploying these solutions.
I personally have not heard of any efforts to deal with this through alternate fabric discovery mechanisms, but I would be curious to hear of any. For me, this is one of the major stumbling blocks to these solutions being viable rather than a nightmare to deploy.
However, other hosts need to reach those VIP addresses, so they'll ARP for them, and the fabric can capture the ARP reply (not sure which solutions do that, though).
For me, if I have two hosts on the same subnet in both data centers, the failure domain is still the same, no matter how I achieve it technically (stretch a VLAN or use "routed L2" ACI). The reason is that any host1 NIC failure/misconfiguration will result in flooding to host2.
Or... are you saying that unknown dst mac does not get flooded? We cannot have that, half of the apps would stop working...
"are you saying that unknown dst mac does not get flooded." <-- ideally NOTHING gets flooded.
"We cannot have that, half of the apps would stop working" <-- I don't believe that any more
Also, as I wrote, I was focused only on IP address mobility, not on supporting even-more-broken stupidities.
"how is the routing-l2 forwarding behavior different from having switchport block unicast on all server ports"
Is this an example of the lipstick-on-a-pig?
A challenge I still encounter regularly is that for many mid-size and smaller companies, the cost/complexity of building a fabric like you're describing often ends the conversation before it's really begun. I still find cases where OTV, for example, is certainly better than just trunking L2 across a DCI, and somewhat more approachable than the technologically superior alternatives.
Which technologies would you consider most appropriate when operational complexity is taken into consideration?