Every time I complain about the stupidities $vendors are trying to sell us, someone from vendorland climbs onto a high horse and starts telling me how I got it all wrong, for example:
It is a duty of a pre-sales, consultant, vendor representative to inform the customer about the risk.
When you stop laughing (maybe it was just an April Fools’ joke ;), here’s what the reality of that process looks like (straight from one of my readers):
I remember when the VM guys and their managers were telling me (like they had discovered the solution to all of our problems) about “with VXLAN we can move a machine from one country to another, and keep having service with the same IP” … while looking at me with the “I’m so smart” face… and me thinking shit… I’m doomed :) … I don’t even want to start explaining … but in the long run I had to anyway.
In response to my Live vMotion into VMware-on-AWS Cloud blog post, Nico Vilbert pointed me to his blog post explaining the details of cross-Atlantic vMotion into AWS.
Today I will not go into yet another rant pointing out all the things that can go wrong, but focus on a minor detail: “no ping was dropped in the process.”
The “vMotion is instantaneous and lossless” myth has been propagated since the early days of vMotion, when sysadmins proudly demonstrated what seemed to be pure magic to amazed audiences… including the now-traditional terminal window running ping and not losing a single packet.
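Here’s a quick back-of-envelope look at why a clean ping stream proves very little; the freeze window, interval and timeout values below are purely illustrative assumptions, not measurements:

```python
# Back-of-envelope: how much can a once-per-second ping tell you about
# a short vMotion switchover freeze? All numbers are illustrative
# assumptions, not measured values.

ping_interval = 1.0   # seconds between ICMP echo requests (typical default)
freeze_window = 0.5   # assumed VM stun/switchover time in seconds
timeout = 1.0         # assumed per-packet ping timeout in seconds

# Chance that a given echo request even lands inside the freeze window
p_hit = min(freeze_window / ping_interval, 1.0)
print(f"Chance a given ping lands inside the freeze: {p_hit:.0%}")

# Even when it does, the request is usually just delayed and answered once
# the VM resumes - it only counts as "lost" if the freeze outlasts the timeout.
print("A reply delayed by less than the timeout still shows up as success, "
      "so 'no ping was dropped' says little about the switchover.")
```

In other words, a ping running at one packet per second is a very coarse instrument for measuring a sub-second traffic interruption.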
Considering VMware’s infatuation with vMotion, the following news (reported by Salman Naqvi in a comment to my blog post) was clearly inevitable:
I was surprised to learn that LIVE vMotion is supported between on-premises and VMware on AWS Cloud
What’s more interesting is how they managed to do it.
How many VM moves per second or per minute do you see in a medium-sized data center environment, and how many in a large one? What would be a reasonable maximum?
Obviously the answer to the first part is it depends (please share your experience in the comments), so we’ll focus on the second one. It’s time for another Fermi estimate.
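To get the ball rolling, here’s a minimal sketch of such a Fermi estimate; every input number is an assumption you should replace with figures from your own environment:

```python
# Fermi estimate: how many vMotion events per second might a data center
# generate? All inputs are rough assumptions - plug in your own numbers.

vms_per_dc = 10_000           # assumed VM count in a large environment
moves_per_vm_per_day = 2      # assumed: DRS rebalancing plus maintenance moves
seconds_per_day = 24 * 3600

moves_per_second = vms_per_dc * moves_per_vm_per_day / seconds_per_day
moves_per_minute = moves_per_second * 60

print(f"~{moves_per_second:.2f} moves per second")
print(f"~{moves_per_minute:.0f} moves per minute")
# Roughly 0.2 moves/second even with generous assumptions - hardly the
# constant churn some designs seem to be built for.
```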
Nuno wrote an interesting comment to my Stretched Firewalls across L3 DCI blog post:
You're an old-school, disciplined networking leader who architects networks based on rock-solid, time-tested designs. But it seems that the prevailing fashion in network design and availability goes against your traditional design principles: inter-site firewall clustering, inter-site vMotion, DCI, etc.
Not so fast, my young padawan.
Let’s define prevailing fashion first. You might define it as Kool-Aid peddled by snake oil salesmen, or as cool network designs by people who know what they’re doing. If we stick with the first definition, you’re absolutely right.
Now let’s look at the second camp: how the people who know what they’re doing build their networks (Amazon VPC, Microsoft Azure or Bing, Google, Facebook, and a number of other large-scale networks). You’ll find L3 down to the ToR switch (or even the virtual switch), and absolutely no inter-site vMotion or clustering – because they don’t want to bet their service, ads or likes on the whims of a technology that was designed to emulate a thick yellow cable.
Want to know how to design an application to work over a stable network? Watch my Designing Active-Active and Disaster Recovery Data Centers webinar.
This isn't the first time that readers have asked you about these technologies, and it won't be the last. Vendors will continue to market them despite their shortcomings, and customers will continue to eat them up.
As long as there are people willing to believe in fairy tales and Santa Claus, there will be someone dressed in a red coat and fake beard yelling “Ho, Ho, Ho!”
Enterprise IT managers sometimes act like small kids: they don’t want to hear that they have people and process problems, and they love to believe that the next magical bit of technology will solve whatever it is that bothers them. Vendors obviously love to exploit these cravings and sell them ever-more-complex solutions.
I'd like to think that vendors will also continue to work out the kinks and over time the technology will become rock solid and time-tested.
I am positive you can make any technology almost-rock-solid. You can also make pigs fly (see RFC 1925 sect. 2.3). However, have you included the fuel costs in your TCO?
Also, the more complex a technology is, the likelier it is to crash down like a house of cards, and you’ll be left with an incomprehensible mix of bits and pieces that will be impossible to put back together (see also: You can’t reformat your data center).
Nuno concluded his comment with a question:
Are you too stuck on past, traditional designs and not being open to new ways of building IT? I get that IT is very cyclical, and these new trends may die in the future...or thrive, and the customers may either fail...or succeed.
I am very open to new ways of building IT. I preach the need for meaningful SDN (not the centralized control plane crap), network automation, and proper application architecture. I just refuse to believe in fairy tales or in solving non-technical problems with technology.
I had fun participating in a discussion focused on whether it makes sense to deploy OTV+LISP in a new data center deployment. Someone quickly pointed out the elephant in the room:
How many LISP VM mobility installs has anyone on this list been involved with or heard of being successfully deployed? How many VM mobility installs in general, where the VMs go at least 1,000 miles? I'm curious as to what the success rate for that stuff is.
I think we got one semi-qualifying response, so I made it even simpler ;)
One of my readers recently pointed me to a blog post written by Andrew Lerner from Gartner describing the drawbacks of stretched VLANs.
TL&DR: He’s saying more-or-less the same things I’ve been preaching for years. Now I can put Blessed by Gartner logo on my blog posts ;), and you can use the report to sway your CIO.
I expect to hear a lot about the “wonderful” idea of moving running VMs 100 msec away (across the continent) in the upcoming weeks. I would recommend you read a few of my older blog posts before considering it… and don’t waste time trying to persuade the true believers with technical arguments – talk with whoever will foot the bill or walk away.
A few months ago I described how bandwidth limitations shatter the dreams of spread-out application stacks with elements residing (or being dynamically migrated) between data centers. Today let’s focus on bandwidth’s ugly cousin: latency.
TL&DR Summary: Spreading the server components of an application across multiple locations (multiple data centers or hybrid cloud deployments) can easily result in dismal performance even when there’s plenty of bandwidth available.
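To illustrate, here’s a trivial calculation showing how chatty request-response traffic falls apart once part of the stack moves to another location (all numbers are illustrative assumptions):

```python
# Latency, not bandwidth, is what kills spread-out application stacks:
# a chatty app making sequential calls across a DCI link. All numbers
# are illustrative assumptions.

queries_per_page = 100    # assumed sequential DB calls to render one page
local_rtt_ms = 0.5        # assumed RTT within a single data center
dci_rtt_ms = 10.0         # assumed RTT between data centers

print(f"All components local:     ~{queries_per_page * local_rtt_ms:.0f} ms per page")
print(f"DB moved to the other DC: ~{queries_per_page * dci_rtt_ms:.0f} ms per page")
# 50 ms versus 1000 ms: same bandwidth, 20x worse response time.
```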
In his The Case for Hybrids blog post Mat Mathews described the Hotel California effect of public clouds as: “One of the most oft mentioned issues with public cloud is the difficulty in getting out.” Once you start relying on cloud provider APIs to provide DNS, load balancing, CDN, content hosting, security groups, and a plethora of other services, it’s impossible to get out.
Interestingly, the side effects of public cloud deployments extend into the realm of application programming, as I was surprised to find out during one of my Expert Express engagements.
VMware announced several vMotion enhancements in vSphere 6, ranging from “finally” to “interesting”.
vMotion across virtual switches. Finally. The tricks you had to use previously were absolutely bizarre.
A reader sent me this question:
My company will have 10GE dark fiber across our DCs with possibly OTV as the DCI. The VM team has also expressed interest in DC-to-DC vMotion (<4ms). Based on your blogs it looks like overall you don't recommend long-distance vMotion across DCI. Will the "Data Center trilogy" package be the right fit to help me better understand why?
Unfortunately, long-distance vMotion seems to be a persistent craze that peaks with a predictable period of approximately 12 months, and while it seems nothing can inoculate your peers against it, having technical arguments at hand might help.
A while ago I wrote “vMotion over VXLAN is stupid and unnecessary” in a comment to a blog post by Duncan Epping, assuming everyone knew the necessary background details. I was wrong (again).
Moving a running VM into a foreign subnet is Mission Impossible due to stale ARP entries (anyone telling you otherwise is handwaving over a detail or two - maybe their VM doesn't communicate with other VMs in the same subnet), but it's entirely feasible to migrate a cold VM into a foreign subnet if you can fix IP routing. Here's how you can do the trick with Enterasys switches.
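In case you’re wondering what fix IP routing means, here’s a minimal conceptual sketch (purely illustrative, and not the Enterasys-specific mechanism the post describes): advertise a host route for the moved VM from its new location and let longest-prefix match do the rest.

```python
# Conceptual sketch of the "fix IP routing" idea for a cold VM move into a
# foreign subnet. The VM keeps 10.1.1.10 but boots in DC-B; a /32 host
# route advertised from DC-B makes the old address reachable there.
# All addresses and site names are made up for illustration.
import ipaddress

routing_table = {
    ipaddress.ip_network("10.1.1.0/24"): "DC-A core",   # VM's original subnet
    ipaddress.ip_network("10.2.2.0/24"): "DC-B core",   # subnet it boots into
}

def lookup(dst: str) -> str:
    # Standard longest-prefix match over the routing table.
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    return routing_table[max(matches, key=lambda net: net.prefixlen)]

print(lookup("10.1.1.10"))     # DC-A core - before the move
routing_table[ipaddress.ip_network("10.1.1.10/32")] = "DC-B core"
print(lookup("10.1.1.10"))     # DC-B core - the host route wins
```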
Every time someone mentions how awesome new technologies solve live VM mobility across WAN networks, I start muttering unmentionables. Live VM mobility across disjoint layer-2 subnets works great in demos, but usually fails in real life due to stale ARP caches. The only way to solve this problem for good is to implement EC2-like layer-3 forwarding in hypervisor soft switches.
Update: LISP Host Mobility seems to be a potential exception; see the comment from Nico.
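If you’re wondering what EC2-like layer-3 forwarding in the soft switch buys you, here’s a conceptual toy model (my own sketch, not any vendor’s implementation): the hypervisor switch answers all ARP requests with its own MAC address, so neighbors never cache anything that could go stale, and a VM move becomes a single mapping-table update.

```python
# Toy model of L3 forwarding in a hypervisor soft switch. Not a real
# implementation - just the concept: proxy-ARP everything, route at L3,
# keep VM location as a mapping entry. All values are made up.

VROUTER_MAC = "02:00:00:00:00:01"   # MAC the soft switch presents to all VMs

# Assumed central mapping: VM IP -> hypervisor currently hosting it
location = {
    "10.1.1.10": "hypervisor-A",
    "10.1.1.20": "hypervisor-B",
}

def handle_arp_request(target_ip: str) -> str:
    # Proxy-ARP for everything: VMs only ever learn the virtual router MAC,
    # so there is no per-VM ARP entry to go stale after a move.
    return VROUTER_MAC

def forward_packet(dst_ip: str) -> str:
    # L3 lookup in the mapping table instead of L2 flooding and learning.
    return f"tunnel packet for {dst_ip} to {location[dst_ip]}"

# VM 10.1.1.20 moves: update one mapping entry, no ARP caches to chase
location["10.1.1.20"] = "hypervisor-C"
print(handle_arp_request("10.1.1.20"))   # 02:00:00:00:00:01
print(forward_packet("10.1.1.20"))       # tunnel packet for 10.1.1.20 to hypervisor-C
```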
For more details, watch the VM Mobility Requirements video (part of Enterasys-sponsored DCI webinar), read the Hot and Cold VM Mobility blog post or watch the recording of NFD4 session with Cisco’s Victor Moreno.