Data Has Mass and Gravity

A while ago, while listening to a CloudCast podcast (my second favorite podcast - the best one out there is still the Packet Pushers), I stumbled upon an interesting idea: “Data has gravity”. The podcast guest used that idea to explain how data agglomerates in larger and larger chunks and why it makes sense to move the data processing (application) closer to the data.

Look up the data gravity formula on datagravity.org. It’s almost identical to Newton’s law of universal gravitation, with latency adjusted for bandwidth playing the part of distance.
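For reference, here is Newton's law next to a rough sketch of the analogous form described above. The symbols on the right-hand side (data mass, application mass, latency L, request size S, bandwidth B) are my own illustrative assumptions and not necessarily the notation or the exact terms used on datagravity.org.

```latex
F = G\,\frac{m_1 m_2}{r^2}
\qquad\longrightarrow\qquad
F_{\text{data}} \;\propto\; \frac{M_{\text{data}}\, M_{\text{app}}}{\left(L + \dfrac{S}{B}\right)^{2}}
```

The denominator is the "latency adjusted for bandwidth" term: the time a request spends on the wire (S/B) is added to the latency L, and the sum plays the part of the distance r.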

Having spent the last few months as an occasional “physics consultant” for my high-school daughter, I loved the idea, quickly mentally transformed it to “Data has mass” (which then generates gravity) and immediately started drawing additional parallels that you should keep in mind every time someone starts blabbering about long-distance VM migration or cloudbursting.

Cloudbursting (noun) - the "visionary" idea of dynamically moving workloads into the cloud, while forgetting the dirty details like IP routing, access to data, latency, bandwidth and a few other minor details. Works best in the reality-distortion field of PowerPoint presentations.

Data is hard to move. The energy needed to move an object is proportional to the object's mass and to the square of the speed at which you want to move it. In networking terms: you need more bandwidth (more energy) or more time (because you're moving data at lower speed). The follow-up CloudCast podcast has a great example of someone who spent four days moving the data so he could run a 15-minute computation on it.
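The bandwidth-versus-time tradeoff is easy to put into numbers. The following is a minimal back-of-the-envelope sketch, not a measurement: the function name, the 10 TB data size, the 1 Gbps link and the 80% efficiency factor are all hypothetical values chosen for illustration (the podcast example above does not give its data size or link speed).

```python
def transfer_time_seconds(data_bytes: float, link_bps: float,
                          efficiency: float = 0.8) -> float:
    """Estimate the time needed to push data_bytes over a link of link_bps.

    efficiency is an assumed fudge factor covering protocol overhead and
    imperfect link utilization; it is not a measured value.
    """
    if link_bps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_bps must be positive and 0 < efficiency <= 1")
    # Convert bytes to bits, then divide by the usable bandwidth.
    return data_bytes * 8 / (link_bps * efficiency)


# Hypothetical example: 10 TB over a 1 Gbps WAN link.
seconds = transfer_time_seconds(10e12, 1e9)
print(f"{seconds / 3600:.1f} hours")
```

With these made-up numbers the copy takes more than a day, and halving the available bandwidth doubles it, which is how you end up spending days on data movement for minutes of computation.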

Data creates a gravity well. You have to reach a pretty high speed (and spend a significant amount of energy) if you want to escape from a large mass. In application terms: once an application (and its users) becomes accustomed to low-latency access to data, it takes an enormous amount of energy to move that application somewhere else, sometimes requiring a total rewrite of the application (see also: going from Cessna to Space Shuttle).

Here’s the more formal explanation of a gravity well. As is usually the case, xkcd provides a more down-to-Earth explanation.

A similar phenomenon (although on a smaller scale, because vMotion limits latency to 10 msec) happens when you vMotion a live VM away from its data. In the famous white paper where Cisco, EMC and VMware managed to move a running SQL server into another data center, the performance dropped by ~30% until they made the LUN closer to the VM active. If you've ever had the "privilege" of talking with someone over a satellite connection, you know how that poor SQL server felt.
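To see why a few milliseconds matter, consider a simplified model in which every storage operation has to wait for the previous one to complete. This is only a sketch under stated assumptions: the function name, the 5 msec service time and the 0.2 msec local round-trip time are hypothetical, and the model does not attempt to reproduce the white paper's measurement.

```python
def serialized_ops_per_second(service_time_s: float, rtt_s: float) -> float:
    """Operations per second when each operation waits for the previous one.

    Every operation costs its service time plus one network round trip
    between the VM and its storage.
    """
    return 1.0 / (service_time_s + rtt_s)


# Hypothetical numbers: 5 msec service time per operation.
local = serialized_ops_per_second(0.005, 0.0002)  # storage in the same data center
remote = serialized_ops_per_second(0.005, 0.010)  # storage 10 msec away
drop = 1 - remote / local
print(f"local: {local:.0f} ops/s, remote: {remote:.0f} ops/s, drop: {drop:.0%}")
```

Real workloads overlap some of their I/O and cache part of the data, so a measured drop can be smaller than this worst-case model suggests, but the direction is the same: the added round-trip time is paid on every operation that cannot be parallelized.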

In more humorous terms, the further away from the data the application is, the lower the data’s gravity force is … until the application gets (v)motion sickness from lack of gravity ;)

Data attracts more data. An obvious corollary to the previous two facts: as you don't move the data (because it's too bulky) and the applications using the data live close to it (to enjoy the low-latency environment), those applications generate even more data. The bulky data heap only grows, generating an even deeper gravity well.

Disclaimer: I'm not saying that "moving workloads into the cloud" is hogwash. What I'm saying is that a move to the cloud must be a conscious, well-planned decision and that you have to figure out what to do with the data before moving the compute workload … a fact that is well obscured by point-and-click eye candy offered by several vendors (you know, you just click on a VM … and **unicorn magic** it runs in the cloud).

3 comments:

  1. One possible set of circumstances when "cloudbursting" may be feasible is when you also use your cloud provider as your backup/archive for your data stash. In that case most if not all of your data is available close at hand for the workloads you spin in the cloud.
  2. Ivan, thank you for this, location does indeed matter. We all know it and this is a nice analogy to articulate the relationship between application and data, application and application, and data and data. The relationships exist and as such (virtual) distance will greatly impact behavior.
  3. really nice analogy!