Streaming Telemetry: View from the Trenches

I asked David Gee to review my streaming telemetry blog posts to make sure I didn’t make too many blunders, and he sent me a nice summary of his view on the topic in return.

The only thing I could do after reading it was to ask him for permission to do a copy-paste. Here it is:

  • A higher number of data consumers beyond basic Network Management Systems (NMS) now require instrumentation data. The eye of Sauron now looks at the network;
  • CPUs in network devices are usually pathetic and weak by consumer standards, therefore they cannot handle the load of being queried constantly by multiple sources;
  • Where possible, we should use the ASICs to stream this data over UDP and leave the CPU alone. Not possible for generic health telemetry, but we're good for the most part;
  • The CPU should stream using a pub/sub mechanism rather than keep being hammered by polls;
  • It's better to stream the data, store it in a Time-Series Database (TSDB), and query it there than to keep hammering a mission-critical CPU that keeps the network control plane alive;
  • By having this data in a TSDB, data integrity for correlation is preserved. With staggered polling, data collection skews can cause issues.
  • With Model-Driven Telemetry (MDT) and Google Protocol Buffers (GPB), we can use the data structures natively in applications that integrate with and surround the network without worrying about pesky SNMP or some crappy API on a crappy and outdated NMS;
  • Streaming telemetry represents a convergence point between the world of application developers and networking. Despite it being a 'rinse and repeat', let’s make developers’ lives easier by using an approach they can relate to. In the long run, customers have to be happy.
  • Our silo has been smashed and now data that has been "our precious" is required elsewhere. MDT makes this whole thing much easier on everyone involved.
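
To make the polling-versus-streaming point in the list above concrete, here is a minimal Python sketch. It is a toy simulation, not a real telemetry stack: the `Device` class, the `poll` and `stream` helpers, and the made-up traffic rate are all hypothetical names and values chosen for illustration. It only counts how often the device CPU has to collect counters when three consumers poll independently, compared with collecting once per interval and fanning the sample out to subscribers.

```python
from typing import Callable, Dict, List

Sample = Dict[str, int]


class Device:
    """Hypothetical device whose counters are expensive for the CPU to read."""

    def __init__(self) -> None:
        self.reads = 0  # how many times the CPU had to collect counters

    def read_counters(self, now: int) -> Sample:
        self.reads += 1
        # Made-up traffic pattern: 1000 octets per tick.
        return {"ts": now, "octets": now * 1000}


def poll(device: Device, consumers: int, ticks: int) -> int:
    """Polling: every consumer queries the device itself on every tick."""
    for now in range(ticks):
        for _ in range(consumers):
            device.read_counters(now)
    return device.reads


def stream(device: Device, subscribers: List[Callable[[Sample], None]], ticks: int) -> int:
    """Pub/sub: the device collects once per tick and fans the sample out."""
    for now in range(ticks):
        sample = device.read_counters(now)
        for deliver in subscribers:
            deliver(sample)
    return device.reads


tsdb: List[Sample] = []    # stand-in for a time-series database
alerts: List[Sample] = []  # stand-in for a second data consumer

polled_reads = poll(Device(), consumers=3, ticks=10)
streamed_reads = stream(Device(), [tsdb.append, alerts.append, lambda s: None], ticks=10)

print(f"device reads when polled:   {polled_reads}")
print(f"device reads when streamed: {streamed_reads}")
```

In this sketch every subscriber receives the same sample with the same timestamp, which is the property the list relies on for correlation; with independent pollers each consumer would see the counters at a slightly different moment.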

2 comments:

  1. Great summary and reasons for Telemetry instead of polling. This powerful MDT has to be supplemented by visualization and query tools to make best use of the data. This is where intent-driven assurance systems play a vital role. Operators need a flexible interface to express their requirements which includes conditions & actions and the software has to perform the heavy lifting of processing telemetry data at acceptable throughput.
  2. Good overview. I am working on trying to show the value in telemetry to a group that only knows SNMP.