Long-distance vMotion for Disaster Avoidance? Do the Math First
The proponents of inter-DC layer-2 connectivity (required by long-distance vMotion) inevitably cite disaster avoidance (along with buzzword-bingo-winning business agility) as one of the primary requirements after they figure out stretched clusters might not be such a good idea (and there’s no way to explain the dangers of split subnets to some people). When faced with the disaster avoidance “requirement”, ask them to do some basic math first.
However, before starting this third-grade math exercise, let’s define disaster avoidance. The idea is simple: an iRene is approaching, you know you’ll have to shut down one of the data centers, and want to migrate the workload to another data center. Major obstacle: maximum round-trip time supported by vMotion with vSphere 4.x is 5 ms (some other documents cite 200 km), extended to 10 ms in vSphere 5.x. Is that enough to bring you out of harm’s way? It depends; you might want to check the disaster avoidance requirements against the distance limits first.
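A rough way to turn the RTT limits into distance is sketched below. It assumes light travels through fiber at roughly 200,000 km/s (about 200 km per millisecond) and that the fiber follows a straight line. The function name and the `device_delay_ms` parameter are made up for illustration; real paths add queuing, processing, and detours, so treat the result as an optimistic upper bound.

```python
# Assumption: propagation speed in fiber is ~200,000 km/s, i.e. ~200 km per ms.
FIBER_KM_PER_MS = 200.0


def max_distance_km(rtt_ms: float, device_delay_ms: float = 0.0) -> float:
    """Upper bound on one-way fiber distance for a given round-trip budget.

    device_delay_ms is a hypothetical allowance for the total round-trip delay
    added by intermediate devices (queuing, processing, serialization).
    """
    usable_rtt_ms = rtt_ms - device_delay_ms
    if usable_rtt_ms <= 0:
        raise ValueError("device delay uses up the whole RTT budget")
    # Half of the round-trip time is available for the one-way trip.
    return usable_rtt_ms / 2 * FIBER_KM_PER_MS


if __name__ == "__main__":
    for rtt in (5.0, 10.0):
        print(f"{rtt} ms RTT -> at most {max_distance_km(rtt):.0f} km of fiber")
```

Any delay budgeted for intermediate devices comes straight out of the distance: with 1 ms of device delay, a 5 ms budget shrinks to at most 400 km of fiber.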
Now let’s focus on the inter-DC bandwidth. To simplify the math, assume you have two 1Gbps links (that seems to be a common inter-DC link speed these days) between your data centers.
The inter-DC links are probably not empty (or your boss should be asking you some tough questions), so let’s say we have 1Gbps of bandwidth available for disaster avoidance-induced vMotion. Assuming perfect link utilization, no protocol overhead, and no repeat copies of dirty memory pages, you can move a GB of RAM in 8 seconds ... or completely saturate a 1Gbps link to vacate a physical server with 256 GB of RAM in just over half an hour ... or 2 hours for a single quarter-rack UCS chassis full of B230 M1 blade servers. Obviously the moved VMs will cause long-distance traffic trombones, further increasing the utilization of the inter-DC link and reducing the effective migration rate.
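The arithmetic behind these figures fits in a few lines. This is a minimal sketch under the same idealized assumptions (perfect link utilization, decimal gigabytes); the function name and the optional `overhead` fraction are illustrative only.

```python
def evacuation_seconds(ram_gb: float, link_gbps: float, overhead: float = 0.0) -> float:
    """Seconds needed to copy ram_gb gigabytes of RAM over a link_gbps link.

    overhead is a hypothetical fraction (0.25 = 25%) added for protocol
    overhead and repeat copies of dirty memory pages.
    """
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    gigabits = ram_gb * 8  # 1 GB = 8 gigabits
    return gigabits * (1 + overhead) / link_gbps


if __name__ == "__main__":
    print(evacuation_seconds(1, 1))          # one GB of RAM over 1 Gbps
    print(evacuation_seconds(256, 1) / 60)   # minutes for a 256 GB server
```

With zero overhead, a 256 GB server takes 2048 seconds over a 1Gbps link, which is the "just over half an hour" mentioned above.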
So, the next time someone drops by with a great disaster avoidance scheme, make sure you ask the following questions (after politely listening to all the perceived business benefits):
How much workload do we need to migrate? Remember the VM RAM utilization only tends to increase (and the few mission-critical VMs they talk about today will likely explode as soon as you implement inter-DC vMotion), so you might want to multiply today’s figures by a healthy safety margin.
How quickly do we need to evacuate the data center? Don’t forget that this business objective includes the time required for all preparatory and cleanup operations, including fixing the storage configuration and IP routing at the end of the migration.
How often do you think we’ll do that? Of course the answer will be “we don’t know” but good disaster recovery planners should always know whether they’re trying to protect the assets against frequently recurring events (like yearly floods) or 100-year events.
From the answers to these questions, you can compute the extra bandwidth needed to migrate the workload. Add another safety margin for protocol overhead and repeat copies. 20-30% seems reasonable, but you might get way more repeat copies if you move large VMs over low-speed links; if you disagree, please add your comment.
Next, investigate how much spare capacity you have on the inter-DC links during the peak period (unless someone gives you 48 hours to evacuate the data center, you have to assume the worst-case scenario). Subtract that from the required migration bandwidth. Now you know how much extra bandwidth you need to support the disaster avoidance solution.
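The bandwidth calculation described above might be sketched like this. All names and sample numbers are hypothetical; the 25% default overhead is simply the midpoint of the 20-30% margin suggested earlier, and the preparation/cleanup time is subtracted from the evacuation window before computing the rate.

```python
def required_gbps(ram_gb: float, window_hours: float,
                  prep_cleanup_hours: float = 0.0, overhead: float = 0.25) -> float:
    """Bandwidth needed to move ram_gb of RAM within the evacuation window.

    prep_cleanup_hours is the part of the window spent on preparatory and
    cleanup work (storage configuration, IP routing), when no RAM is moving.
    """
    usable_seconds = (window_hours - prep_cleanup_hours) * 3600
    if usable_seconds <= 0:
        raise ValueError("no time left for the actual migration")
    return ram_gb * 8 * (1 + overhead) / usable_seconds


def extra_gbps(required: float, spare_at_peak: float) -> float:
    """Additional inter-DC bandwidth to buy, given spare capacity at peak."""
    return max(0.0, required - spare_at_peak)


if __name__ == "__main__":
    # Hypothetical input: 2 TB of VM RAM, 5-hour window, 1 hour of prep/cleanup.
    need = required_gbps(2048, window_hours=5, prep_cleanup_hours=1)
    print(f"required: {need:.2f} Gbps, extra: {extra_gbps(need, 0.5):.2f} Gbps")
```

The sample inputs are placeholders; plug in your own workload size, evacuation deadline, and peak-hour spare capacity.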
Last step: ask your service provider for a quote and multiply incremental monthly costs by the expected period between disaster avoidance events (if you want to be extra fancy, compute the present value of a perpetual expense).
The present value of the future WAN link expenses is the minimum implementation cost of the disaster avoidance idea; add the additional equipment/licensing expenses you need to support it (for example, OTV licenses for Nexus 7000) and the costs caused by increased operational complexity. Icing on the cake: add the opportunity cost of a once-in-a-decade two-DC meltdown caused by a broadcast storm or a bridging loop.
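Both cost estimates are easy to compute. The sketch below assumes a constant monthly charge and, for the perpetuity, a nominal annual discount rate divided evenly into twelve monthly periods; the function names and sample figures are hypothetical and not taken from any real quote.

```python
def cost_between_events(monthly_cost: float, years_between_events: float) -> float:
    """Simple estimate: incremental WAN cost paid between two avoidance events."""
    return monthly_cost * 12 * years_between_events


def perpetuity_present_value(monthly_cost: float, annual_rate: float) -> float:
    """Present value of paying monthly_cost forever, discounted monthly.

    Assumes the monthly rate is the nominal annual rate divided by 12;
    for a level perpetuity, PV = payment / periodic rate.
    """
    if annual_rate <= 0:
        raise ValueError("discount rate must be positive")
    monthly_rate = annual_rate / 12
    return monthly_cost / monthly_rate


if __name__ == "__main__":
    # Hypothetical quote: 5,000 per month extra, event expected once a decade.
    print(f"{cost_between_events(5000, 10):,.0f}")
    print(f"{perpetuity_present_value(5000, 0.08):,.0f}")
```

With these made-up inputs the ten-year total is 600,000 and the perpetuity is worth about 750,000; equipment, licensing, and operational costs come on top of either figure.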
Expected final result: the disaster avoidance idea just might lose a bit of its luster after having to face the real-life implementation costs. Disagree? Please write a comment.
You’ll find a long list of L2 DCI caveats and descriptions of major applicable technologies (including VPLS, OTV, vPC/VSS and F5’s EtherIP) in my Data Center Interconnects webinar.
You’ll find big-picture perspective as well as in-depth discussions of various data center and network virtualization technologies in two other webinars: Data Center 3.0 for Networking Engineers and VMware Networking Deep Dive.
> Obviously the moved VMs will cause long-distance traffic trombones, further increasing the utilization of inter-DC link and reducing effective migration rate.
Apologies for stating the obvious, but if the traffic has to trombone back to the first datacenter (which is supposedly in harm's way), you're not really avoiding the disaster, are you?
> Major obstacle: maximum round-trip time supported by vMotion with vSphere 4.x is 5 ms (some other documents cite 200 km), extended to 10 ms in vSphere 5.x.
5ms RTT is about 500km away. Should be plenty, no?
> Assuming perfect link utilization, no protocol overhead, and no repeat copies of dirty memory pages, you can move a GB of RAM in 8 seconds
I wonder if WAN accelerators can make any difference?
> Last step: ask your service provider for a quote
This, all in all, sounds like an *excellent* opportunity for a service provider to offer you bandwidth on demand (putting aside the number of SP's products/marketing people murdered by network engineers on the grounds of all the challenges that Bandwidth on Demand presents, especially when talking about any significant bandwidths). ;)
I guess the summary of my thoughts on the subject is this: yes, there are plenty of challenges; so how about coming up with a checklist of sorts for "how do I find out if the Disaster Avoidance makes sense in my circumstances"? :-$
Like the speed of light for network engineers, the CAP theorem is a bitch for application engineers. So what is your alternate solution Ivan? Active/passive systems with asynchronous state replication (which means data loss in the event of failover)?
- could physical transfer of data be faster? I.e., taking all the data to be migrated into a briefcase and driving this briefcase away to safety?
- my calculation of NPV for perpetuity leaves me at 1.2m$, no decimals. How come your result is not rounded, are you calculating in monthly increments?
You might not need LISP if you control the L3 network between the two data centers. Host routes also work, but of course LISP (or MPLS) scales better as the intermediate nodes don't have to keep track of migrated IP addresses.
However, I don't think the current NX-OS releases support a mechanism that would create host routes on-demand (like LAM did decades ago), whereas LISP with VM mobility is available.
RTT: you have to take into account the queuing/processing/serialization delay in all intermediate devices as well as the circuitous ways in which your lambdas might go over the physical fiber infrastructure. Just heard a great example yesterday: a carrier was not willing to commit to 5ms delay within London.
WAN acceleration: it does help. F5's EtherIP is a great solution that provides vMotion traffic compression and bridging-over-IP at the same time. Search their web site for vMotion/EtherIP.
Bandwidth-on-demand: might be useful for maintenance/migration purposes. Not sure I want to rely on that feature being available when a major disaster is heading my way; everyone would probably want to get more bandwidth at that time.
If you want to retain true transactional integrity with roll-forward to the exact point of failure (which sounds great, but is not always as mandatory as people think it is), you cannot rely on asynchronous block storage replication, but there are other database-level mechanisms like transaction logs.
If you're willing to accept loss of the transactions that were completed just prior to the failure, life becomes way simpler - for example, you can use read-only replicas.
Disclaimer: I know absolutely nothing about relational databases ... apart from the syntax of the SELECT statement :-P
Silly typo, but the "or" there broke the momentum of the argument.