The Impact of Data Gravity: a Campfire Story

Here’s an interesting story illustrating the potential pitfalls of multi-DC deployments and the impact of data gravity on application performance.

A long, long time ago, on a cloudy planet far, far away, a multinational organization decided to centralize their IT operations and move all workloads into a central private cloud.

They knew what they were doing (after all, it wasn’t their first workload migration): they carefully prepared the infrastructure, analyzed the dependencies in the application stack, and moved the whole stack (including the underlying database) to the central private cloud.

The results were dismal: transactions that took milliseconds before the migration now took several seconds.

The enlightened gurus quickly identified the only possible culprit: it must be The Network. After all, the users were in the same campus network as the application prior to the migration, and had to traverse the Internet (or private WAN) to get to the new private cloud after the migration. Nothing else changed but the underlying network infrastructure. Case closed.

It’s almost impossible to prove the poor Network innocent once the judgment in absentia has been passed. No amount of traceroutes, latency measurements or other probes will ever persuade non-networking people that the network is not the problem (but try using Thousand Eyes – the jury might be swayed by nice-looking diagrams and graphs).

The situation on the cloudy planet was no different. The networking engineers did their measurements, which clearly showed there was no network problem: latencies and round-trip times were in the expected range, there was plenty of bandwidth, and packet drops were negligible. Still, nobody believed those measurements until the whole problem exploded and forced the application and database teams to start troubleshooting their respective silos.

As it turned out, the application itself used just one database, which was moved together with the application. However, there was another small database holding user credentials and other user details, and that one wasn’t moved to the central private cloud due to local privacy protection laws. The database server running in the central private cloud thus continuously queried the remote database, adding tens of milliseconds to the transaction processing time with each query.
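The mechanism behind the slowdown is worth spelling out: it’s not bandwidth but the accumulation of sequential round trips. Here’s a minimal back-of-the-envelope sketch (all numbers are illustrative assumptions, not figures from the story) showing how per-query WAN latency turns millisecond transactions into multi-second ones:

```python
# Back-of-the-envelope estimate of how per-query latency piles up.
# All numbers below are made-up assumptions for illustration.

LAN_RTT_MS = 0.5              # round trip to a database in the same data center
WAN_RTT_MS = 40.0             # round trip to the remote user-credentials database
QUERIES_PER_TRANSACTION = 50  # sequential lookups made while processing one transaction

def db_wait_ms(rtt_ms: float, queries: int) -> float:
    """Time spent waiting on sequential database round trips."""
    return rtt_ms * queries

before = db_wait_ms(LAN_RTT_MS, QUERIES_PER_TRANSACTION)
after = db_wait_ms(WAN_RTT_MS, QUERIES_PER_TRANSACTION)

print(f"before migration: ~{before:.0f} ms of DB wait per transaction")
print(f"after migration:  ~{after:.0f} ms ({after / 1000:.1f} s) per transaction")
```

Once the per-transaction query count is non-trivial, the WAN round-trip time (not bandwidth) dominates transaction latency, which is exactly why applications tend to gravitate toward the data they depend on.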

6 comments:

  1. So, it still was the network. The added latency between the authentication server and the application is delaying processing. :P
    Replies
    1. It's not the network team's job to identify every dependency that every application requires. In this example the network conditions were found to be fine, but someone on the application team neglected to test the effects of putting some distance between the main DB and the user DB.
    2. It is the network engineers' responsibility to understand applications' packet flow.
    3. Completely impractical in a large enterprise environment where there are literally thousands of applications. There is always a need to balance responsibilities between the network engineer and the application's technical owner.
  2. This is gonna be my new bedtime story. I'm raising a "network" guy.
  3. Blaming the network is the simplest way out, but it is not a solution: network guys will not break the laws of physics. In a good organization everybody (net, admins and devs) should work together toward one common target. And everybody should understand the laws of physics.