Use nProbe and ELK Stack to Build a Netflow Solution on Software Gone Wild

How do you capture all the flows entering or exiting a data center if your core Nexus 7000 switch cannot do it in hardware? You take an x86 server, load nProbe on it, and connect the nProbe to an analysis system built with ELK stack… at least that’s what Clay Curtis did (and documented in a blog post).

Obviously I wanted to know more about his solution and invited him to the Software Gone Wild podcast. In Episode 39 we discussed:

  • Differences between European and US vacations;
  • Hardware limitations you face when building a Netflow solution;
  • What server you need to capture 10Gbps of traffic;
  • How to connect nProbe with ELK stack using JSON (or not);
  • What is ELK stack and what are its components (Logstash, Elasticsearch and Kibana);
  • The wonderful things you can do with data once it’s within the ELK stack.

For more details, read Clay’s blog posts:

And here are the relevant Software Gone Wild episodes we mentioned during our discussion:

4 comments:

  1. A few comments:

    1) Ivan mentioned an in-memory copy for 500GB of data. Due to issues with JVM heap size, individual Elasticsearch nodes don't scale well beyond 64GB of RAM. After reaching 64GB of RAM (with 31GB allocated to the Java heap), you should scale horizontally rather than vertically. It's possible to run multiple Elasticsearch nodes on one OS instance (e.g., 2 nodes with 64GB each running on one instance of Linux or Windows), but this adds complexity. You can also run multiple 64GB nodes in separate VMs on one hypervisor. In my experience this works fine, but it depends on the use-case. Very large Elasticsearch clusters typically run on bare metal with 64GB RAM per node and a mixture of SSD and spinning disk.

    2) Elasticsearch has a lot of optimizations built around fast retrieval from disk, and a lot of knobs you can tweak to ensure that the most frequently searched indices live on SSD.

    3) With respect to the concern about high-volume indexing causing search performance problems: if this is a problem you can use index routing to help by ensuring that data is indexed on nodes with the fastest disk (say SSD in RAID 0), then moved to nodes with spinning disk. If your cluster is search-heavy you could also increase the number of replica shards, which requires more storage but decreases search time.

    4) With respect to the question about aggregating flow data over time: Ivan is right that you would need to write custom code to do this; you could either do this as a batch job that reads a time slice of data from an index, aggregates it, and writes it to a new index, or you could use so-called "entity-centric indexing" to create indices with different data models at approximately the same time:

    https://www.elastic.co/videos/entity-centric-indexing-mark-harwood
  2. One more comment: obviously, if you put multiple ES nodes on the same hypervisor or OS instance, you need to be really careful to make sure that you have anti-affinity rules or some other mechanism to ensure that a hypervisor failure doesn't destroy ES's cluster-resiliency model.
  3. Finally, I forgot to mention that Logstash has native NetFlow v5 and v9 codecs. It can't handle high volume (I'm guessing no more than a few hundred flows per second), but it might be worth trying for smaller use cases.
  4. There's a SaaS solution for this problem that is way easier than setting up your own ELK stack--http://www.kentik.com.
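
The batch-job approach to aggregating flow data over time (point 4 of the first comment) can be sketched roughly as follows. This is only an illustration of the roll-up step, not code from Clay's setup or from the podcast: the record layout (`timestamp`, `src`, `dst`, `bytes`, `packets`), the function name `aggregate_flows`, and the 5-minute bucket size are all assumptions, and reading the time slice from one Elasticsearch index and writing the result to another is left out.

```python
from collections import defaultdict


def aggregate_flows(flows, bucket_seconds=300):
    """Roll up flow records into fixed time buckets per (src, dst) pair.

    `flows` is an iterable of dicts with hypothetical keys:
    timestamp (epoch seconds), src, dst, bytes, packets.
    """
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for flow in flows:
        # Align the timestamp to the start of its bucket.
        bucket_start = flow["timestamp"] - flow["timestamp"] % bucket_seconds
        key = (bucket_start, flow["src"], flow["dst"])
        totals[key]["bytes"] += flow["bytes"]
        totals[key]["packets"] += flow["packets"]
        totals[key]["flows"] += 1
    # One summary document per bucket and address pair, ordered by key.
    return [
        {"bucket_start": bucket, "src": src, "dst": dst, **counters}
        for (bucket, src, dst), counters in sorted(totals.items())
    ]
```

In a real batch job the input would be a time slice read from the raw-flow index and each returned document would be written to a separate summary index, so that long-term dashboards query far fewer documents than the raw data holds.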