What’s the Big Deal with Validation?

This blog post was initially sent to subscribers of my mailing list. Subscribe here.

In his Intent-Based Networking Taxonomy blog post, Saša Ratković mentioned real-time change validation as one of the requirements for a true intent-based networking product.

Old-time networking engineers would instinctively say “sure, we need that” while most everyone else might be totally flabbergasted. After all, when you create a VM, the VM is there (or you’d get an error message), and when you write to a file and sync the file system the data is stored, right?

As is often the case, networking is different.

Let’s start with the real challenges. A network is a tightly coupled distributed system built from unreliable components (I’m talking about links, not the software we have to deal with – that’s a topic for another time). Validating that all components are still operating as expected makes perfect sense.

At this point it’s worth mentioning that most other IT systems ignore the reality of unreliable components (see also: fallacies of distributed computing) or at least don’t handle it well. Servers crash when they lose connectivity to storage or encounter a parity error, and software crashes when it encounters an exception that might be triggered by cosmic rays.

Then there’s the other kind of validation – the WTF one. In January 2018 I was sitting in a presentation where a major networking vendor explained how their network assurance engine validates that the configuration requests made through their SDN controller are properly implemented on their proprietary closed hardware fabric. The whole thing became so ridiculous that we had to ask the obvious question: “Are you telling us we should buy additional software to check whether your software is bug-free?”

Why would we ever have to do something as crazy as that? Ignoring the subpar software quality for the moment, the root cause is often the unreliable mechanism we have to use to configure network devices – from a CLI that was designed for the hands and eyes of an operator, not automation scripts, to NETCONF implementations that don’t support candidate configurations or can’t even roll back on error. Finally, there are always edge cases where the device software tries to squeeze too much into device hardware and fails without reporting the failure.
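To make the NETCONF part a bit more tangible, here is a minimal sketch of checking whether a device even claims to support the two mechanisms mentioned above. The capability URIs are the ones defined in RFC 6241; the function name `automation_gaps` and the idea of feeding it the capability list from the device's hello message are assumptions made for illustration, not a description of any particular tool.

```python
# Capability URIs as defined in RFC 6241 (NETCONF).
CANDIDATE = "urn:ietf:params:netconf:capability:candidate:1.0"
ROLLBACK_ON_ERROR = "urn:ietf:params:netconf:capability:rollback-on-error:1.0"


def automation_gaps(capabilities):
    """Return the safety mechanisms the device did NOT advertise.

    `capabilities` is an iterable of capability URI strings, for example
    the list a NETCONF client collected from the device's hello message
    (hypothetical input; how you obtain it depends on your client).
    """
    # Drop any "?param=value" suffix so only the bare URI is compared.
    advertised = {cap.split("?", 1)[0].strip() for cap in capabilities}

    gaps = []
    if CANDIDATE not in advertised:
        gaps.append("candidate")
    if ROLLBACK_ON_ERROR not in advertised:
        gaps.append("rollback-on-error")
    return gaps
```

An empty result only means the device advertises both capabilities; whether they work as advertised is exactly the kind of thing you still have to verify.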

As long as customers are not willing to vote with their wallets and buy gear that properly implements the mechanisms needed for somewhat-reliable network automation, there’s little we can do apart from:

  • Use the Trust but Verify approach – every time you make a change to a networking device, use show commands (not device configuration) to validate that the changes were implemented. Bonus points for using real traffic instead of show commands.
  • Minimize the changes made to networking devices – configuring a gazillion features on the network edge to solve higher-level incompetence is a sure recipe for disaster.
  • Minimize the blast radius – old-time service providers built separate transport and services infrastructure for a good reason. That lesson somehow got lost in various converged crazes. Hardware networking vendors fighting tooth and nail to avoid irrelevance don’t help either.
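
The Trust but Verify idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two-column "Neighbor / State" output format is made up, the function names are hypothetical, and a real script would collect the text from the device with whatever tooling you already use. The point is that the check compares what you intended against operational state, not against the configuration you just pushed.

```python
def parse_neighbor_states(show_output):
    """Parse hypothetical 'neighbor state' lines into a dict."""
    states = {}
    for line in show_output.splitlines():
        fields = line.split()
        # Skip blank or malformed lines and the header row.
        if len(fields) != 2 or fields[0].lower() == "neighbor":
            continue
        states[fields[0]] = fields[1]
    return states


def verify_change(intended_neighbors, show_output):
    """Return a list of mismatches; an empty list means the change verified."""
    observed = parse_neighbor_states(show_output)
    problems = []
    for neighbor in sorted(intended_neighbors):
        state = observed.get(neighbor)
        if state is None:
            problems.append(f"{neighbor}: missing from operational state")
        elif state != "Established":
            problems.append(f"{neighbor}: session is {state}")
    return problems


# Made-up sample of what a device might print after the change.
SAMPLE_OUTPUT = """\
Neighbor     State
10.0.0.1     Established
10.0.0.2     Idle
"""
```

Calling `verify_change({"10.0.0.1", "10.0.0.2", "10.0.0.3"}, SAMPLE_OUTPUT)` flags the idle session and the neighbor that never showed up, even though the configuration push itself might have returned no error.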

We covered the first topic in some detail in the Building Network Automation Solutions online course, and discussed various aspects of the third one in the virtualization webinars (all of them available with an ipSpace.net subscription) and the Building Next-Generation Data Center online course.
