Monitor Public SaaS Providers with ThousandEyes

If you’ve ever tried to troubleshoot web application performance issues, you’ve probably seen it all – browser waterfall diagrams, visual traceroute tools, network topologies produced by network management systems … but I haven’t seen them packaged in a comprehensive, easy-to-use and visually compelling package before. Welcome to ThousandEyes.

What is it?

Short summary: cloud-based application performance monitoring system. Nothing new.

Here’s an elevator pitch summary of how it all works:

  • Create an account on their web site;
  • Deploy their agent software on Linux servers (or download a mini VM in OVA format) at your sites;
  • Configure URLs you want to monitor;
  • Agents run periodic application-level probes (TCP or HTTP(S)) combined with smart traceroute (using TCP instead of ICMP, so the probes get the same treatment as the actual traffic). Web probes use the Chrome browser for their timing, so you get HTTP connection time as well as DOM ready time (at which point the JavaScript code usually starts running) and total page load time (including all the graphics); see the timing sketch after this list.
  • Agents report their results to ThousandEyes cloud-based servers, which alert you when the agents experience performance problems;
  • The agent reports are combined with public BGP feeds (example: RIPE feeds) to create a visual representation of the state of the global Internet at the time the agent(s) encountered performance problems, allowing you to identify the true root cause of the problem.
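
To make the probing part a bit more concrete, here's a minimal Python sketch of an HTTP(S) timing probe: it measures DNS lookup, TCP connect and total fetch time for a single URL, which is roughly the network-level part of what an agent test collects (DOM ready and full page load timing obviously need a real browser, which is why ThousandEyes uses Chrome). This is an illustration under my own assumptions, not ThousandEyes agent code, and the URL is just a placeholder:

  # Minimal HTTP(S) timing probe; an illustration, not ThousandEyes agent code.
  # Measures DNS lookup, TCP connect and total fetch time for a single URL.
  import socket
  import ssl
  import time
  from urllib.parse import urlparse

  def probe(url):
      parts = urlparse(url)
      host = parts.hostname
      port = parts.port or (443 if parts.scheme == "https" else 80)

      t0 = time.monotonic()
      socket.getaddrinfo(host, port)                            # DNS resolution (timed separately)
      t_dns = time.monotonic()

      sock = socket.create_connection((host, port), timeout=5)  # TCP handshake
      t_conn = time.monotonic()

      if parts.scheme == "https":
          sock = ssl.create_default_context().wrap_socket(sock, server_hostname=host)  # TLS handshake

      request = "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n".format(parts.path or "/", host)
      sock.sendall(request.encode())
      while sock.recv(4096):                                    # drain the whole response
          pass
      sock.close()
      t_done = time.monotonic()

      return {
          "dns_ms": round((t_dns - t0) * 1000, 1),
          "connect_ms": round((t_conn - t_dns) * 1000, 1),
          "total_ms": round((t_done - t0) * 1000, 1),
      }

  print(probe("https://www.example.com/"))                      # placeholder URL

A real agent would obviously repeat this on a schedule, add the TCP-based traceroute, and ship the results to the collector.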

Sounds boring, right? We’ve been doing all this one way or another. You have to test the product or watch the NFD6 videos to grasp the true magic of ThousandEyes. Here are the videos (watch all of them; these guys weren’t boring):

And now for the crazy ideas

We mentioned several obvious ideas (monitoring internal web servers) and a few crazy ones while talking with ThousandEyes during NFD6 (so go watch the videos). Most of them would require additional functionality, so if you find them interesting (and you’re big enough), go talk to ThousandEyes and push them in this direction.

  • Package the ThousandEyes server as an appliance that could be deployed within a security-conscious enterprise environment. One of the major hurdles we experienced when deploying a similar solution was the interaction between internal probes and our management system – every decent CISO gets upset initially.
  • Interact with internal BGP routing to get BGP-based visibility into internal network as well as global Internet.
  • It would be really cool if they could (somehow) import data from your xVPN-over-Internet configurations to get the transport endpoints and then map those into their BGP-based visualization to give you true hop-by-hop path analysis.
  • Implement an automatic baselining system. You can configure absolute HTTP connection and page load timeouts, but it would be great to be able to get reasonable thresholds (per agent location) automatically; a toy sketch of the idea follows this list.
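
On the last idea, here's a toy sketch of per-location baselining (my own assumptions, not how ThousandEyes does it; see Ricardo's comment below for what they actually implement): keep a rolling window of recent page-load measurements for each agent location and raise an alert when a new measurement exceeds the rolling mean by a few standard deviations.

  # Toy per-location baselining sketch; an assumption of how it could work, not ThousandEyes code.
  # Flags a measurement that exceeds the rolling mean by K standard deviations.
  from collections import defaultdict, deque
  from statistics import mean, stdev

  WINDOW = 100        # samples kept per agent location
  MIN_SAMPLES = 20    # wait for a minimal baseline before alerting
  K = 3               # how many standard deviations above baseline trigger an alert

  history = defaultdict(lambda: deque(maxlen=WINDOW))

  def record(location, page_load_ms):
      samples = history[location]
      alert = False
      if len(samples) >= MIN_SAMPLES:
          threshold = mean(samples) + K * stdev(samples)
          alert = page_load_ms > threshold
      samples.append(page_load_ms)
      return alert

  # record("ljubljana", 1870) returns True once 1870 ms is well above that location's baseline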

Would I use them?

Absolutely. If you’re a business relying heavily on SaaS products (Salesforce, Dropbox, Google Docs, Gmail …), something like ThousandEyes is a must-have. Even if you can’t do a thing when an ISP two hops down the road scrambles their BGP configs, you’ll at least have an insurance policy when unhappy users start shouting at you … and a plausible reason why it might be a good idea to switch ISPs and pay someone else a bit more for a more reliable service.

Disclosure

ThousandEyes was a sponsor of Networking Tech Field Day 6 and so indirectly covered some of my travel expenses.

1 comment:

  1. Hi Ivan,

    We didn’t cover this in the presentation in detail, but:

    #2: We're actually already developing this feature; we're going to start with private eBGP feeds and then move to iBGP. This will be a nice complement to our view, since it will tell our customers how they can reach the rest of the world (vs. the opposite) and provide more details about their networks.

    #3: We already detect VPN tunnel entry points in path visualization by looking at MTU decrements; we don't do the exit point, though, and your suggestion is worth exploring once we get our hands on device configs.

    #4: We spend a lot of time making sure we reduce alert fatigue and that each alert we send is relevant (versus noise). To your point, we already do baselining of several metrics, including page load, HTTP response time, etc. In addition, we filter alerts for cases when the public agent is experiencing local problems (e.g. lost connection to its gateway); I'm actually writing a blog entry about this last feature soon.

    Cheers,

    --Ricardo