Stretched ACI Fabric Is Sometimes the Least Horrible Solution
One of my readers sent me a lengthy email asking my opinion about his ideas for a new data center design (yep, I pointed out there’s a service for that while replying to his email ;). He started with:
I have to design a DR solution for a large enterprise. They have two data centers connected via FabricPath.
There’s a red flag right there…
While it’s definitely better to use FabricPath (or Avaya’s SPB fabric, or Brocade’s Metro VCS Fabric) than the MLAG-over-WAN kludges, extending bridging across two data centers makes them a single failure domain, as some people found out the hard way.
Most of the applications run in an HA manner in both locations.
I wonder why people still think that’s a good idea. Loss of the DCI link will probably just break every application running across both locations (if the applications were written correctly, they wouldn’t need the L2 extension anyway).
Services run in an HA manner - one service device is active in one location and standby in the other. They communicate via Layer 2.
Stretched firewalls never made much sense, but the vendors are making sure the myth persists. However, let’s not go there…
Anyway, the reader’s idea was to replace FabricPath with ACI:
According to Cisco and to your webinars, ACI is a good candidate for a DR solution.
I don’t remember ever saying ACI is a good candidate for a DR solution ;), but it’s definitely the least horrible one. In fact, any solution that replaces bridging with host routing (ACI, DFA or Cumulus Linux redistribute ARP) is infinitely better than stretched VLANs because it removes the uncontrolled flooding behavior that’s the root cause of many catastrophic network failures.
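To illustrate the principle (and just the principle; ACI, DFA and Cumulus Linux each implement host routing very differently), here's a minimal Python sketch that turns the local IPv4 neighbor (ARP) table into the /32 host routes a routing protocol could advertise, so nobody has to flood across a stretched VLAN to find a host:

#!/usr/bin/env python3
"""Conceptual sketch of 'redistribute ARP'-style host routing.

Reads the local IPv4 neighbor (ARP) table and prints the /32 host
routes a routing daemon could redistribute. Illustration only; this
is not how ACI, DFA or Cumulus Linux actually implement the feature.
"""
import subprocess

def resolved_neighbors():
    """Yield (ip, interface) tuples for resolved IPv4 neighbor entries."""
    out = subprocess.run(["ip", "-4", "neigh", "show"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Typical entry: '192.0.2.10 dev eth1 lladdr 52:54:00:ab:cd:ef REACHABLE'
        if "lladdr" in fields and fields[-1] in ("REACHABLE", "STALE", "PERMANENT"):
            yield fields[0], fields[fields.index("dev") + 1]

if __name__ == "__main__":
    # Every directly attached host becomes a /32 route that can be
    # advertised into the fabric instead of being discovered by flooding.
    for ip, dev in resolved_neighbors():
        print(f"host route: {ip}/32 (attached via {dev})")

Run it on any Linux box to see which /32 routes would replace the ARP entries; a real implementation would obviously install and withdraw the routes dynamically instead of just printing them.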
Finally, as my reader was talking about disaster recovery, I advised him to go back and discuss the real business needs. Once you get into that discussion, you often realize you don’t need stretched layer-2 fabrics, because other parts of the infrastructure (storage, for example) don’t support fully automated recovery.
Want to Know More?
Watch the Building Active-Active and Disaster Recovery Data Centers webinar or the Building Next-Generation Data Center online course.
The problem with all L2 DCIs, as you mentioned, is the typical set of L2 challenges: broadcasts, ARPs, unicast floods, STP, and wrong choice of FHRP gateways.
To the best of my knowledge, OTV was built from the ground up with all of these in mind and has built-in mechanisms to stop those control- and data-plane issues.
But as you also said, most of these are just hacks around the real issue: applications need to be written correctly. If developers and network engineers went through the software development lifecycle together, providing input where necessary, most of these hacks wouldn't be needed.
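To make those built-in mechanisms a bit more concrete, here's a conceptual Python sketch of the ARP-suppression idea an OTV-like edge device relies on: answer ARP requests from a cache populated by the overlay control plane instead of flooding them across the DCI. The cache contents and the cache-miss behavior are simplified for illustration:

# Conceptual sketch of ARP suppression at a DCI edge device.
# The cache would be populated by the overlay control plane; the
# addresses below come from documentation ranges and are made up.

ARP_CACHE = {
    "192.0.2.10": "52:54:00:aa:bb:01",
    "192.0.2.20": "52:54:00:aa:bb:02",
}

def handle_arp_request(target_ip: str) -> str:
    """Decide what the DCI edge does with an incoming ARP request."""
    mac = ARP_CACHE.get(target_ip)
    if mac is not None:
        # Answer locally; the broadcast never crosses the DCI link.
        return f"proxy reply: {target_ip} is-at {mac}"
    # Cache miss: a real edge device would consult the control plane or
    # forward this single request, but it would never blindly flood.
    return f"cache miss for {target_ip}: nothing flooded across the DCI"

if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.99"):
        print(handle_arp_request(ip))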
The country is pretty small, and the customer has 2 independent 10G circuits between the DCs.
They can actually have whatever circuits they want, as they control the fibres between the DCs.
They still want to stretch L2 across the DCs, but using OTV.
I was preaching layer 3, and I even got their attention for a short while, but they had a few good arguments:
1. They have good experience with OTV from previous deployments.
2. Going with L3 requires at least 1/3 more equipment (assuming a FW cluster in the main site and a single FW in the DR site).
3. The main reason was managing FW rules between two independent FWs.
They are concerned that independent FW policies will become inconsistent and thus things won't work on D-day.
I couldn't find good enough arguments to counter theirs, so we will give their approach a try.
I'm interested to hear your opinions on this scenario.
Anonymous and pessimistic
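A note on argument #3 above: one way to keep two independent firewalls consistent without clustering them across a stretched VLAN is to generate both policies from a single source of truth. Here's a minimal, purely illustrative Python sketch (the rule format, device names and rendered syntax are all made up; a real deployment would use a proper automation or templating tool):

# Sketch: render the same abstract rule set for two independent firewalls
# so the active and DR policies cannot drift apart.

RULES = [
    {"name": "web-in", "action": "permit", "src": "any",
     "dst": "192.0.2.0/24", "port": 443},
    {"name": "db-in", "action": "permit", "src": "192.0.2.0/24",
     "dst": "198.51.100.10/32", "port": 5432},
]

FIREWALLS = ["fw-main-dc", "fw-dr-dc"]   # hypothetical device names

def render(rule):
    """Turn one abstract rule into a (made-up) CLI line."""
    return (f"access-rule {rule['name']} {rule['action']} "
            f"src {rule['src']} dst {rule['dst']} port {rule['port']}")

if __name__ == "__main__":
    for fw in FIREWALLS:
        print(f"! policy for {fw}, generated from the shared rule set")
        for rule in RULES:
            print(render(rule))
        print()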