Stretched ACI Fabric Is Sometimes the Least Horrible Solution

One of my readers sent me a lengthy email asking my opinion about his ideas for a new data center design (yep, I pointed out there’s a service for that while replying to his email ;). He started with:

I have to design a DR solution for a large enterprise. They have two data centers connected via Fabric Path.

There’s a red flag right there…

While it’s definitely better to use Fabric Path (or Avaya’s SPB fabric, or Brocade’s Metro VCS Fabric) than the MLAG-over-WAN kludges, extending bridging across two data centers makes them a single failure domain, as some people found out the hard way.

Most of the applications run in a HA manner in both locations.

I wonder why people still think that’s a good idea. Loss of the DCI link will probably break every application running across both locations (and if the applications were written correctly, they wouldn’t need L2 extension anyway).

Services run in a HA manner - one service device is active in one location and standby in the other. They communicate via Layer 2.

Stretched firewalls never made much sense, but the vendors are making sure the myth persists. However, let’s not go there…

Anyway, the reader’s idea was to replace Fabric Path with ACI:

According to Cisco and to your webinars ACI is a good candidate for a DR solution.

I don’t remember ever saying ACI is a good candidate for a DR solution ;), but it’s definitely the least horrible one. In fact, any solution that replaces bridging with host routing (ACI, DFA, or redistribute ARP in Cumulus Linux) is infinitely better than stretched VLANs because it removes the uncontrolled flooding behavior that’s the root cause of many catastrophic network failures.
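
To make the flooding argument a bit more tangible, here's a toy Python model. It's a deliberately simplified sketch, not how ACI, DFA, or Cumulus Linux actually forward traffic: the site names, addresses, and function names are all made up, and it ignores details like excluding the sender from a flood or gleaning for silent hosts. The only point is the difference in blast radius when the destination is unknown.

```python
# Toy model: who receives traffic sent to an unknown destination?
# All names and addresses below are hypothetical.

SITES = {
    "dc1": ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
    "dc2": ["10.0.0.21", "10.0.0.22"],
}


def bridged_delivery(mac_table, dst):
    """Stretched VLAN: a known destination gets one copy,
    an unknown one is flooded to every host in every site."""
    if dst in mac_table:
        return [dst]
    return [host for hosts in SITES.values() for host in hosts]


def routed_delivery(host_routes, dst):
    """Host routing: forward on an exact host-route match, otherwise drop."""
    if dst in host_routes:
        return [dst]
    return []


known = {"10.0.0.11", "10.0.0.21"}
unknown_dst = "10.0.0.99"

flooded = bridged_delivery(known, unknown_dst)
routed = routed_delivery(known, unknown_dst)

print(len(flooded), "hosts receive the flooded frame")
print(len(routed), "hosts receive the routed packet")
```

In the bridged case the unknown destination reaches every host in both data centers, which is exactly how a problem in one site becomes a problem in the other. In the routed case the packet simply goes nowhere.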

Finally, as my reader was talking about disaster recovery, I advised him to go back and talk about the real business needs. Once you get into that discussion, you often realize you don’t need stretched layer-2 fabrics, because the other infrastructure (for example, storage) doesn’t support fully automated recovery.

Want to Know More?

Watch the Building Active-Active and Disaster Recovery Data Centers webinar or the Building Next-Generation Data Center online course.

6 comments:

  1. I agree with your thoughts 100%. I really cringe when I hear about L2 DCI.
  2. no mention of otv :)
    problem with all l2 dcis as you mentioned are the typical l2 challenges: broadcasts, arps, unicast floods, stp, wrong choice of fhrp gateways.
    to the best of my knowledge otv was built from ground up with all these in mind and has in built mechanisms to stop those control/data plane issues.
    but as you also said most of these are just hacks around the real issues: applications need to be written correctly. If developers and network engineers go through the software dev lifecycle together and input where necessary most of these hacks would be reduced.
  3. i too am stuck at a place where they stretch layer 2. no amount of education will change their minds. when asked why they designed it this way, their answer is always "because". i cry myself to sleep every night
  4. I'm in a similar situation with a customer of mine, though with a few different factors:
    The country is pretty small, and the customer has 2 independent 10G circuits between the DCs.
    They can actually have whatever circuits they want, as they control the fibres between the DCs.
    They still want to stretch L2 across the DCs but using OTV.
    I was preaching layer 3 and I even got their attention for a short while, but they had a few good arguments:
    1. They have good experience with OTV from previous deployments.
    2. Going with L3 requires at least 1/3 more equipment (assume FW cluster in main site and single FW in DR site).
    3. The main reason was managing FW rules between two independent FWs.
    They are concerned that independent FW policies will be inconsistent and thus things won't work on D-day.
    I couldn't find good enough arguments to counter their arguments, so we will give their approach a try.
    I'm interested to hear your opinions on this scenario
    Replies
    1. When (and not if) their solution fails I bet they will still blame it on you!

      Anonymous and pessimistic
    2. To address point #3, this is why you should use a firewall management system so your policies and objects are consistent across different firewalls.