In March I explained why it’s unrealistic to expect to use machine learning to solve unknown problems in today’s snowflake networks… but are there other problems that could be solved?
- The training data is available (public DNS domain names) and does not depend on specific network design (everyone is querying the same public DNS records);
- The project uses two sets of training data: DNS names that are known to be good, and DNS names that are known to be malicious. Both sets come from trusted sources.
So far so good. The sample neural network got to 98% accuracy, and I’m positive it’s pretty easy to make it a bit better with a larger training set and larger neural net.
What I’m struggling with is whether that’s good enough. Like any other pretty-reliable test, this one has to deal with:
- False negatives - how often a malicious domain name is not identified as such;
- False positives - how many times a valid domain name is blocked because it’s identified as malicious.
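To make the two error types concrete, here is a minimal sketch that computes them from a confusion matrix. All counts are made up for illustration (they are not results from the project), and the `ConfusionMatrix` class is a hypothetical helper. The point it tries to show: a single accuracy figure does not tell you how the errors split between false positives and false negatives.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # malicious names correctly flagged
    fn: int  # malicious names missed (false negatives)
    fp: int  # valid names wrongly flagged (false positives)
    tn: int  # valid names correctly passed

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total

    @property
    def false_positive_rate(self) -> float:
        # Share of valid names that would be blocked
        return self.fp / (self.fp + self.tn)

    @property
    def false_negative_rate(self) -> float:
        # Share of malicious names that would slip through
        return self.fn / (self.fn + self.tp)


# Hypothetical test sets: 5000 malicious and 5000 valid names each.
balanced = ConfusionMatrix(tp=4900, fn=100, fp=100, tn=4900)
skewed = ConfusionMatrix(tp=4990, fn=10, fp=190, tn=4810)

for name, cm in (("balanced", balanced), ("skewed", skewed)):
    print(
        f"{name}: accuracy={cm.accuracy:.3f} "
        f"FPR={cm.false_positive_rate:.3f} "
        f"FNR={cm.false_negative_rate:.3f}"
    )
```

Both hypothetical classifiers are 98% accurate, yet the second one blocks almost twice as many valid names as the first. To judge a DNS blocker you would want the false positive rate measured on a realistic mix of queries, not just the headline accuracy.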
Considering that one (hopefully) wouldn’t use a DNS blocker as the only security tool, I would be more worried about false positives than false negatives. If the 2% error rate means roughly 2% of valid domain names get blocked, that seems a bit high to me… but then I have no baseline to compare it to. Pointers would be highly appreciated.