Long-distance vMotion for disaster avoidance? Do the math first

The proponents of inter-DC layer-2 connectivity (required by long-distance vMotion) inevitably cite disaster avoidance (along with buzzword-bingo-winning business agility) as one of the primary requirements after they figure out stretched clusters might not be such a good idea (and there’s no way to explain the dangers of split subnets to some people). When faced with the disaster avoidance “requirement”, ask them to do some basic math first.

However, before starting this third-grade math exercise, let’s define disaster avoidance. The idea is simple: an iRene is approaching, you know you’ll have to shut down one of the data centers, and want to migrate the workload to another data center. Major obstacle: maximum round-trip time supported by vMotion with vSphere 4.x is 5 ms (some other documents cite 200 km), extended to 10 ms in vSphere 5.x. Is that enough to bring you out of harm’s way? It depends; you might want to check the disaster avoidance requirements against the distance limits first.
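A rough sanity check of those RTT limits, as a minimal Python sketch. It assumes light travels through fiber at roughly 200,000 km/s (about two thirds of c) and ignores all queuing, processing, and serialization delay as well as circuitous fiber paths, so the result is an optimistic upper bound, not a design figure. The function and constant names are made up for illustration.

```python
# Assumption: ~200,000 km/s propagation speed in fiber, i.e. 200 km per ms.
FIBER_KM_PER_MS = 200.0


def max_fiber_km(rtt_ms: float) -> float:
    """Upper bound on one-way fiber path length for a given round-trip time."""
    one_way_ms = rtt_ms / 2
    return one_way_ms * FIBER_KM_PER_MS


for rtt in (5, 10):
    print(f"{rtt} ms RTT -> at most {max_fiber_km(rtt):.0f} km of fiber")
```

The distance you can actually cover will be shorter, because every intermediate device and every detour in the physical fiber path eats into the RTT budget.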

Now let’s focus on the inter-DC bandwidth. To simplify the math, assume you have two 1Gbps links (that seems to be a common inter-DC link speed these days) between your data centers.

If you have a single link and someone starts talking about disaster avoidance, tell them to buy the second link first.

The inter-DC links are probably not empty (or your boss should be asking you some tough questions), so let’s say we have 1Gbps of bandwidth available for disaster avoidance-induced vMotion. Assuming perfect link utilization, no protocol overhead, and no repeat copies of dirty memory pages, you can move a GB of RAM in 8 seconds ... or completely saturate a 1Gbps link to vacate a physical server with 256 GB of RAM in just over half an hour ... or 2 hours for a single quarter-rack UCS chassis full of B230 M1 blade servers. Obviously the moved VMs will cause long-distance traffic trombones, further increasing the utilization of the inter-DC link and reducing the effective migration rate.
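The best-case arithmetic can be sketched in a few lines of Python. This is an illustration under the same idealized assumptions as above (perfect link utilization, no protocol overhead, no repeat copies of dirty pages); the function name is hypothetical.

```python
def transfer_seconds(ram_gb: float, link_gbps: float) -> float:
    """Best-case time to copy ram_gb gigabytes of RAM over a link_gbps link."""
    gigabits_to_move = ram_gb * 8  # 8 bits per byte
    return gigabits_to_move / link_gbps


print(transfer_seconds(1, 1.0))         # seconds for 1 GB of RAM
print(transfer_seconds(256, 1.0) / 60)  # minutes for a 256 GB server
```

With 1Gbps available, this gives 8 seconds per GB and roughly 34 minutes for the 256 GB server. Real migrations will take longer.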

So, the next time someone drops by with a great disaster avoidance scheme, make sure you ask the following questions (after politely listening to all the perceived business benefits):

How much workload do we need to migrate? Remember the VM RAM utilization only tends to increase (and the few mission-critical VMs they talk about today will likely explode as soon as you implement inter-DC vMotion), so you might want to multiply today’s figures by a healthy safety margin.

How quickly do we need to evacuate the data center? Don’t forget that this business objective includes the time required for all preparatory and cleanup operations, including fixing the storage configuration and IP routing at the end of the migration.

How often do you think we’ll do that? Of course the answer will be “we don’t know” but good disaster recovery planners should always know whether they’re trying to protect the assets against frequently recurring events (like yearly floods) or 100-year events.

From the answers to these questions, you can compute the extra bandwidth needed to migrate the workload. Add another safety margin for protocol overhead and repeat copies. 20-30% seems reasonable, but you might get way more repeat copies if you move large VMs over low-speed links; if you disagree, please add your comment.

Next, investigate how much spare capacity you have on the inter-DC links during the peak period (unless someone gives you 48 hours to evacuate the data center, you have to assume the worst-case scenario). Subtract that from the required migration bandwidth. Now you know how much extra bandwidth you need to support the disaster avoidance solution.
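The last few steps boil down to simple arithmetic, sketched here in Python. All input figures in the example (workload size, evacuation window, overhead margin, spare capacity) are hypothetical placeholders; plug in your own numbers.

```python
def extra_bandwidth_gbps(workload_gb: float, window_hours: float,
                         overhead: float, spare_gbps: float) -> float:
    """Extra inter-DC bandwidth needed to evacuate the workload in time.

    workload_gb  -- total VM RAM to migrate, including your safety margin
    window_hours -- time left for the actual copy, after subtracting
                    preparatory and cleanup operations
    overhead     -- margin for protocol overhead and repeat copies (0.3 = 30%)
    spare_gbps   -- spare capacity on existing links during the peak period
    """
    raw_gbps = workload_gb * 8 / (window_hours * 3600)
    required_gbps = raw_gbps * (1 + overhead)
    return max(0.0, required_gbps - spare_gbps)


# Hypothetical example: 4 TB of RAM, 4-hour window, 30% margin, 0.5 Gbps spare
print(f"{extra_bandwidth_gbps(4096, 4, 0.30, 0.5):.2f} Gbps extra needed")
```

In this made-up scenario you would need roughly 2.5 Gbps on top of what you already have.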

Last step: ask your service provider for a quote and multiply incremental monthly costs by the expected period between disaster avoidance events (if you want to be extra fancy, compute the present value of a perpetual expense).

A simplified data point: present value of a $10,000 perpetual monthly expense is just over one and a quarter million dollars ($1,254,054 to be precise) assuming 10% discount rate.
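For those who want to check the figure, here is a minimal Python sketch of the perpetuity calculation. It assumes the yearly discount rate is converted into a compounded monthly rate and that payments fall at the end of each month; the function name is made up for illustration.

```python
def perpetuity_present_value(monthly_payment: float, annual_rate: float) -> float:
    """Present value of a perpetual monthly expense, paid at month end."""
    # Convert the yearly discount rate into a compounded monthly rate.
    monthly_rate = (1 + annual_rate) ** (1 / 12) - 1
    return monthly_payment / monthly_rate


print(round(perpetuity_present_value(10_000, 0.10)))
```

Dividing the yearly rate by 12 instead of compounding it would give exactly $1,200,000, which is why rounded back-of-the-envelope results differ slightly.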

The present value of the future WAN link expenses is the minimum implementation cost of the disaster avoidance idea; add the additional equipment/licensing expenses you need to support it (for example, OTV licenses for Nexus 7000) and the costs caused by increased operational complexity. Icing on the cake: add the opportunity cost of a once-in-a-decade two-DC meltdown caused by a broadcast storm or a bridging loop.

Expected final result: the disaster avoidance idea just might lose a bit of its luster after having to face the real-life implementation costs. Disagree? Please write a comment.

I would like to thank Jeremy Filliben, Matthew Norwood and Ethan Banks who helped me verify the basic assumptions of this blog post.

More information

You’ll find a long list of L2 DCI caveats and descriptions of major applicable technologies (including VPLS, OTV, vPC/VSS and F5’s EtherIP) in my Data Center Interconnects webinar (recording). They might come in handy if someone forces you to implement a layer-2 inter-DC link.

You’ll find big-picture perspective as well as in-depth discussions of various data center and network virtualization technologies in two other webinars: Data Center 3.0 for Networking Engineers (recording) and VMware Networking Deep Dive (recording).

All three webinars are also available as part of the yearly subscription and as a stand-alone Data Center Trilogy.

12 comments:

  1. Dmitri Kalintsev, 30 September, 2011 07:09

    Not disagreeing, just a couple of comments. :)

    > Obviously the moved VMs will cause long-distance traffic trombones, further increasing the utilization of inter-DC link and reducing effective migration rate.

    Apologies for stating the obvious, but if the traffic has to trombone back to the first datacenter (which is supposedly in harm's way), you're not really avoiding the disaster, are you?

    > Major obstacle: maximum round-trip time supported by vMotion with vSphere 4.x is 5 ms (some other documents cite 200 km), extended to 10 ms in vSphere 5.x.

    5ms RTT is about 500km away. Should be plenty, no?

    > Assuming perfect link utilization, no protocol overhead, and no repeat copies of dirty memory pages, you can move a GB of RAM in 8 seconds

    I wonder if WAN accelerators can make any difference?

    > Last step: ask your service provider for a quote

    This, all in all, sounds like an *excellent* opportunity for a service provider to offer you a bandwidth on demand (putting aside the number of SP's products/marketing people murdered by network engineers on the grounds of all challenges that Bandwidth on Demand presents, especially when talking about any significant bandwidths). ;)

    I guess the summary of my thoughts on the subject is this: yes, there are plenty of challenges; so how about coming up with a checklist of sorts for "how do I find out if Disaster Avoidance makes sense in my circumstances"? :-$

  2. Excellent points Ivan. LD vMotion en masse sounds kinda silly. Why vMotion at all? There is something to be said for push button application restart. Just stop and start the vApp in the other DC with all the same network configuration. No scripting or other complicated tools to reconfigure IP & DNS addresses, no painfully long vMotions to wait for. Rather, the app starts up in the new DC and just works. Assuming of course that you have a functional implementation of LISP :-)

    Cheers,
    Brad

  3. WAN accelerators will help if the data stored in memory is redundant and compressible (i.e. not encrypted)

  4. I agree long-distance virtual machine migration is unworkable, but the sad fact is that the majority of datacenter (read: financials, ERP, operations) applications are not architected to support load balancing and clustering across WAN distances, especially at the database layer. Fixing that problem is much harder, and in fact decades of research have not produced a database system which can do distributed transactions and replication across high-latency links.

    Like the speed of light for network engineers, the CAP theorem is a bitch for application engineers. So what is your alternate solution Ivan? Active/passive systems with asynchronous state replication (which means data loss in the event of failover)?

  5. "I wonder if WAN accelerators can make any difference?" From my experience with WAN optimizers, you can see an improvement in traffic transport in 95% of the cases, so I don't see why they would not make a difference in the case of vMotion.

  6. Is anyone really vmotioning across data centers or are people just talking about it? I keep hearing that it's not supported to vmotion across different Nexus 1000V switches.

  7. Hi Ivan,

    two points:
    - could physical transfer of data be faster? I.e., taking all the data to be migrated into a briefcase and driving this briefcase away to safety?
    - my calculation of NPV for perpetuity leaves me at 1.2m$, no decimals. How come your result is not rounded, are you calculating in monthly increments?

    Cheers,
    Gregor

  8. #2 - I converted yearly discount rate into compounded monthly discount rate (1.10^(1/12)-1)

  9. I can't see a reason why you couldn't vMotion between two Nexus 1000V switches ... might be a VMware/vDS limitation.

  10. It's so nice to see we're in agreement 8-)

    You might not need LISP if you control the L3 network between the two data centers. Host routes also work, but of course LISP (or MPLS) scales better as the intermediate nodes don't have to keep track of migrated IP addresses.

    However, I don't think the current NX-OS releases support a mechanism that would create host routes on-demand (like LAM did decades ago), whereas LISP with VM mobility is available.

  11. Trombones: to complete the disaster avoidance exercise, you have to shut down the subnets in the first data center or fix the IP routing in some other way. As long as you have split subnets, traffic will flow in somewhat unpredictable directions.

    RTT: you have to take into account the queuing/processing/serialization delay in all intermediate devices as well as the circuitous ways in which your lambdas might go over the physical fiber infrastructure. Just heard a great example yesterday: a carrier was not willing to commit to 5ms delay within London.

    WAN acceleration: it does help. F5's EtherIP is a great solution that provides vMotion traffic compression and bridging-over-IP at the same time. Search their web site for vMotion/EtherIP.

    Bandwidth-on-demand: might be useful for maintenance/migration purposes. Not sure I want to rely on that feature being available when a major disaster is heading my way; everyone would probably want to get more bandwidth at that time.

  12. Obviously you're exposed to some data loss if you can't afford synchronous replication. The question you have to ask is: how much loss is acceptable.

    If you want to retain true transactional integrity with roll-forward to the exact point of failure (which sounds great, but is not always as mandatory as people think it is), you cannot rely on asynchronous block storage replication, but there are other database-level mechanisms like transaction logs.

    If you're willing to accept loss of the transactions that were completed just prior to the failure, life becomes way simpler - for example, you can use read-only replicas.

    Disclaimer: I know absolutely nothing about relational databases ... apart from the syntax of the SELECT statement :-P



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.