Design Challenge: Multiple Data Centers Connected with Slow Links

One of my readers sent me this question:

What is best practice to get a copy of the VM image from DC1 to DC2 for DR when you have subrate (155 Mbps in my case) Metro Ethernet services between DC1 and DC2?

The slow link between the data centers effectively rules out any ideas of live VM migration; to figure out what you should be doing, you have to focus on business needs.

In this particular case, you have to figure out what the Recovery Point Objective (RPO) is, or (to put it bluntly) how fresh the data should be. It’s also nice to know what the Recovery Time Objective (RTO) is, or how long it can take to restore the services.

With well-designed applications that store data in a database and not on local disk, it’s good enough to copy VM images every night or so, and have database replication enabled between data centers. Alternatively, you can use database log shipping and recover the original data from backups if that’s acceptable from the RTO standpoint.

There are plenty of backup solutions that allow you to ship VM images to a standby location every night; you can also use vSphere Replication to run continuous data synchronization in the background.
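To get a feel for whether a nightly copy fits on a 155 Mbps link, a back-of-the-envelope calculation helps. The sketch below is only that: the function names, the 80% link efficiency, the image sizes, and the 8-hour backup window are all assumptions made for illustration, not measurements from the reader's network.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over a link_mbps link.

    efficiency is an assumed fraction of the nominal rate left after
    protocol overhead and competing traffic; 0.8 is a guess, not a measurement.
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link rate > 0, efficiency in (0, 1]")
    bits_to_send = size_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


def fits_window(size_gb: float, link_mbps: float, window_hours: float,
                efficiency: float = 0.8) -> bool:
    """True if the transfer completes within the backup window."""
    return transfer_hours(size_gb, link_mbps, efficiency) <= window_hours


if __name__ == "__main__":
    # Hypothetical image sizes and an assumed 8-hour nightly window.
    for size_gb in (100, 1000):
        hours = transfer_hours(size_gb, link_mbps=155)
        verdict = "fits" if fits_window(size_gb, 155, window_hours=8) else "does not fit"
        print(f"{size_gb} GB: {hours:.1f} h, {verdict} in an 8 h window")
```

Under these assumptions, 100 GB takes about 1.8 hours and 1 TB takes roughly 18 hours, which is why shipping only the changes (incremental backups or replication) matters far more than the raw link speed.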

In any case, almost anything is way better (from a bandwidth utilization or goodput perspective) than the dumb block-level disk replication promoted by storage vendors. Database replication or vSphere Replication is aware of the actual content and can ship the modifications in an optimized format, whereas disk replication transfers every change to the disk as it happens (including continuous updates to database indices as records are changed).

To learn more about these topics, watch the Designing Active-Active and Disaster Recovery Data Centers webinar.

The reply I got back from the reader made me sad because it’s so typical of the state of the industry:

As expected, it is just managing expectations. Our problem is probably more Sales and Marketing. Trying to engineer a solution after promises have been made is always going to be a challenge.

I’ve been talking about that for years, but of course I’m always preaching to the choir, and nobody else seems to care.

5 comments:

  1. Answering a different chunk of the question.

    "subrate (155 Mbps in my case) Metro Ethernet"

    Screams *ancient* SONET/SDH transmission system, in any major city the telco (assuming even a half decent one) would likely be very happy to upgrade you significantly if you'd talk to them, full gig for the same price may well be an option, not that it helps much.

    SONET/SDH is dead, most equipment that talks it as anything other than a P2P link is long obsolete, much like ATM transmission (vs ATM framing on ADSL & GPON for example). It's possible that's just a rate limit on a modern transport platform, but unless it was a direct migration from a SONET platform it's an odd rate to use.

    Dark fibre plus a passive xWDM mux is often not that expensive either, possibly cheaper than a 10g metro circuit would be.
  2. We have used SRM combined with Recoverpoint over leased 10Gb links, 350+ miles between data centers with enough success to make it our current standard for "DR" apps. Always looking at other data replication products as they emerge. Beyond that, the app owners sometimes also utilize database replication, but we don't do much more than provide the plumbing for that. So far SRM has worked pretty well, especially considering that the DR site uses different IPs than the primary data center, so there is a re-IP. Also, a layer of global-DNS to point users to the right location after a DR move has occurred. YMMV.
  3. Is he using vSphere? I'm assuming so.

    Storage replication depends on whether you are talking about realtime replication or on-demand replication. ZFS does block-level differential replication, but it's not realtime. You take a snapshot and it will just sync the changes to a remote host. Tools like Veeam will do incremental VM backups as well and are highly integrated into ESX. There are obviously the VMware DR tools already mentioned, which work as well. ESX has the ability to track just the changed blocks so you aren't backing up the entire file.

    Of course if you have a VM writing like 500GB a day to disk or something you are SOL. :)

  4. 155Mb sounds like good old SDH/SONET. Remember, you only get 140Mb of payload.

    Julien Goodwin, don't be too hard on this well proven tech. :-)
    It still delivers reliability and a fail-over response time that any routed/switched network can only dream of. The expression "telecom reliability" was not invented for fun, and it delivers something yet to be surpassed.

    It all depends on what is needed.
    If you can live with slow-reacting, sloppy, "we hope for the best" delivery, then routed/switched networks are the way to go. :-)
    Replies
    1. Yes and no, while it's a nice enough tech and the features can make it nice and easy to debug (as long as the problem isn't clocking) it pretty much died at OC192, with many of the transmission systems never making it past OC48.

      And just because the protocol works doesn't mean the equipment still will, old equipment really does start to get unreliable in aggregate, and older gear often has fixed optics which will be wearing out by now.

      But, as reminded by someone out-of-band, this sort of transmission system may well still be reasonable outside of major markets. I'm perhaps a little too used to my world where a single 10g is barely worth caring about.