Event-Driven Network Automation in the Network Automation Online Course

Event-driven automation (changing network state and/or configuration based on events) is the holy grail of network automation. Imagine being able to change routing policies (or QoS settings, or security rules) based on changes in the network.

We have been able to automate simple responses with on-box solutions like Embedded Event Manager (EEM) on Cisco IOS for years. Modern network automation tools allow you to build robust solutions that pick significant events out of the noise generated by syslog messages, SNMP traps and, more recently, streaming telemetry, and trigger centralized responses that can change the behavior of the whole network.
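To make "identifying significant events from the noise" a bit more concrete, here is a minimal Python sketch that extracts interface up/down events from a stream of syslog lines and ignores everything else. The message pattern is modeled on the Cisco IOS `%LINK-3-UPDOWN` message; the `LinkEvent` type and `parse_link_event` function are hypothetical names used for illustration only, and a real deployment would have to handle many more message types and vendor formats.

```python
import re
from typing import NamedTuple, Optional


class LinkEvent(NamedTuple):
    """A significant event extracted from a raw syslog line (hypothetical type)."""
    interface: str
    state: str  # "up" or "down"


# Assumed message shape, modeled on Cisco IOS %LINK-3-UPDOWN messages.
# Other platforms and message types would need their own patterns.
LINK_UPDOWN = re.compile(
    r"%LINK-3-UPDOWN: Interface (?P<interface>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)


def parse_link_event(line: str) -> Optional[LinkEvent]:
    """Return a LinkEvent for link up/down messages, None for everything else."""
    match = LINK_UPDOWN.search(line)
    if match is None:
        return None  # noise, as far as this filter is concerned
    return LinkEvent(match.group("interface"), match.group("state"))
```

Anything that returns `None` is dropped; the remaining events can then be handed to whatever decides how the network should respond.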

As with any complex solution, it’s hard to get event-driven automation right. You have to define what significant events are, how to identify them based on reports sent by network devices, and what the response should be. Just to give you an example:

  • A major link has gone down. Should we react immediately or should we wait? How long should we wait?
  • A few seconds later the link is reestablished. Should we close the incident or wait to see what happens next?
  • The link has flapped several times in the last 10 minutes. How often should it flap before we take it out of service? Do we have redundancy in place so we can take the link out of service, or should we continue using the obviously-broken link because there is no alternative?
  • Let’s say we decide to take the link out of service and report the problem to the carrier. What should we do if we get a subsequent failure on a redundant path?
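The questions above can be turned into a small decision function. The following Python sketch is only one possible way to encode them: the `FlapTracker` class, its thresholds (5-second hold-down, three flaps in 10 minutes) and the action names are all assumptions made for illustration, not recommended values. Note that it deliberately hands the "no redundancy" case to a human instead of trying to decide automatically.

```python
from collections import deque
from typing import Deque, Optional


class FlapTracker:
    """Tracks link-down events for a single link (hypothetical helper)."""

    def __init__(self, window_seconds: float = 600.0, max_flaps: int = 3,
                 hold_down_seconds: float = 5.0) -> None:
        # All three thresholds are illustrative assumptions.
        self.window_seconds = window_seconds
        self.max_flaps = max_flaps
        self.hold_down_seconds = hold_down_seconds
        self._down_times: Deque[float] = deque()
        self._last_down: Optional[float] = None

    def _expire(self, now: float) -> None:
        # Forget link-down events that fell out of the sliding window.
        while self._down_times and now - self._down_times[0] > self.window_seconds:
            self._down_times.popleft()

    def record_down(self, now: float) -> None:
        """Register a link-down event observed at time `now` (in seconds)."""
        self._down_times.append(now)
        self._last_down = now
        self._expire(now)

    def decide(self, now: float, link_is_up: bool, has_redundancy: bool) -> str:
        """Return the name of the action to take at time `now`."""
        self._expire(now)
        if len(self._down_times) >= self.max_flaps:
            # Too many flaps: remove the link only if there is an alternative.
            return "take_out_of_service" if has_redundancy else "escalate_to_human"
        if not link_is_up:
            recently_failed = (self._last_down is not None
                               and now - self._last_down < self.hold_down_seconds)
            # Wait out short glitches before opening an incident.
            return "wait" if recently_failed else "open_incident"
        return "no_action"
```

A tracker like this would be fed by link-down events from syslog or telemetry, with `decide` called whenever an event arrives or a timer fires. What happens after a second failure on the redundant path is intentionally left out; that is the part that probably still needs a human.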

As you can see, it’s relatively easy to define some of these elements (for example, link flap dampening was implemented in Cisco IOS quite a while ago), while others (what to do after a failure on a redundant link) probably require human intervention. Event-driven automation is thus always a mix of human-machine interactions that’s hard to turn into a shrink-wrapped product that you could deploy and forget.

The only alternatives are to (A) ignore this challenge or (B) build a solution from smaller building blocks. If you decide to go down the latter path, check out the event-driven network automation section of the Building Network Automation Solutions course, as well as all the other course modules that will help you design and implement the overall network automation solution you need.

The next live session of the course starts in February 2019, but you don’t have to wait that long. Register now and start your automation journey immediately.

3 comments:

  1. It looks similar to the discussion I had when I was assessing SDN/OpenFlow centralised controller solutions to handle local link failure.
    Is this fast enough to reprogram the OpenFlow-based switches to use the new path?

    As a result we still use a BFD-based solution to trigger OSPF changes...
    In other words we handle the local failure locally without a centralised controller...
    Replies
    1. I had much the same discussions when the whole centralized control plane stupidity came out - fast feedback loop executed across unreliable relatively-high-latency links is a recipe for disaster.

      However, the problem with BFD (or any other similar mechanism) is that it can only trigger vendor-predefined actions. EEM (and equivalents) can bring you to custom actions, but they still cannot be coordinated across more than one node (my "should we bring the flapping link down" example).
    2. Yes, exactly. I am in the phase of moving from one vendor to another, and need to preserve failure detection (3x 300ms BFD timer) and propagate the change throughout the network. So moving to the new vendor is always a challenge. Not to mention how different implementations of the same RFC can be (so many grey areas implemented differently).