Is Anyone Using Long-Distance VM Mobility in Production?

I had fun participating in a discussion focused on whether it makes sense to deploy OTV+LISP in a new data center deployment. Someone quickly pointed out the elephant in the room:

How many LISP VM mobility installs has anyone on this list been involved with or heard of being successfully deployed? How many VM mobility installs in general, where the VMs go at least 1,000 miles? I'm curious as to what the success rate for that stuff is.

I think we got one semi-qualifying response, so I made it even simpler ;)

Let’s start with a way simpler target: "How many VM mobility installs in general… that actually get used"

So far, I haven’t seen a single one, apart from the case where a DC was split across two buildings 100 m apart with tons of dark fiber in between.

I see lots of people building stretched VLANs for all sorts of crazy reasons (most common: “Because the VMware consultants told us to do so”) and think they solved the disaster recovery use case.

OTOH, I haven’t seen anyone actually shifting workloads across the DCI link (excluding migrations) because performance tends to suck after you increase latency by orders of magnitude.
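To put “orders of magnitude” into perspective, here’s a minimal back-of-the-envelope sketch. The round-trip times, query count, and server processing time below are illustrative assumptions, not measurements; the point is simply that a transaction issuing its back-end calls sequentially slows down roughly in proportion to the RTT once the application and database tiers end up in different data centers after a migration.

```python
# Back-of-the-envelope estimate (illustrative assumptions, not measurements):
# how a chatty application slows down when its tiers get split across a
# long-distance link after a VM moves to the other data center.

def transaction_time_ms(sequential_round_trips: int, rtt_ms: float,
                        server_time_ms: float = 50.0) -> float:
    """Completion time for a transaction whose back-end calls run one
    after another (e.g. app server -> database), plus fixed server time."""
    return sequential_round_trips * rtt_ms + server_time_ms

scenarios = [
    ("same rack (~0.2 ms RTT)", 0.2),     # assumed LAN latency
    ("metro DCI (~2 ms RTT)", 2.0),       # assumed short-distance DCI
    ("1,000-mile DCI (~25 ms RTT)", 25.0) # assumed long-distance DCI
]

for label, rtt in scenarios:
    # Assume a web transaction triggering 100 sequential SQL queries --
    # a common pattern in poorly layered enterprise applications.
    print(f"{label:30s} -> {transaction_time_ms(100, rtt):8.1f} ms")
```

With the assumed ~25 ms RTT, the same transaction that completes in about 70 ms when both tiers sit in the same rack now takes over 2.5 seconds.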

In summary, a lot of people spend a significant amount of money (OTV licenses), time, and mental energy to create a ticking bomb (a broadcast storm or a DCI failure can take down both data centers) that adds zero real-life value.

Hot tip: If you can’t persuade your peers what a bad idea stretched VLANs are, tell them Gartner said so ;)

7 comments:

  1. Similar to what you already commented, the only case where I have seen this working is a Tier-4 "twin data center": two buildings connected with many layer-2 links. In fact, in such a design, hosts from the same pod were distributed between both buildings, and the two uplinks from each server connected to access switches in both buildings. In any case, I don't believe such a design counts as two independent DCs.
  2. My first question about LISP is always "who's going to operate it and how much will they cost?". Does someone really want their highly experienced (i.e. costly) staff to get involved in daily operations because L1/L2 support can't even understand the technology?
  3. Working in the public sector, I always thought the DCs I manage were 'behind the times' because we don't have any L2 DCI. The truth, however, is that there has NEVER been a compelling use case, and it has NEVER been requested. I am glad we never went down that rabbit hole. It's also encouraging to see that our lack of interest in L2 DCI didn't leave us out in any way.
  4. Another from the public sector. I've been running OTV (without LISP) since late 2011 for a small set of data center VLANs. The initial driver was a migration from an old data center, for which OTV worked very well. What remains are two data centers, about 30 miles apart, still linked (standard L3, plus OTV).

    Amazingly, I was able to hold the server team to no L2 for storage or vMotion (so moves are done with SRM or Hyper-V replication). Further, because I'm not using LISP to steer host traffic, the OTV extension is only there for "DR lite".

    So, another data point for not using long-distance VM mobility. I continue to push toward applications that don't depend on L2 or static IPs.
  5. Tried VMware's SRM, along with storage replication and DHCP for the VMs. It isn't perfect, but it works well enough to call it a cold BC/DR strategy. More importantly, it worked well enough to obviate the need for the long-distance DCI that leadership was pushing for.
  6. A Gold/Premium partner of HP/EMC/... told us a few months ago to lease fiber between two data centers to run Fibre Channel for synchronizing two 3Par SANs...
    I don't think he has ever heard of DCI at all.
  7. Eight years ago we demonstrated the feasibility of DR using live migration for enterprise workloads (paper: http://dl.acm.org/citation.cfm?id=1555346). However, the practicality of the setup and the requirements for it to work made it highly risky to run. You added so many new points of failure along the path that hardening the whole system was an order of magnitude more expensive than other, more sensible solutions.