Router reload after 15 minutes of failed pings

Jeroen sent me an interesting challenge: he would like to reload the router when the 3G WAN interface gets stuck (I thought my Nokia phone was the only one exhibiting this problem, but obviously I was wrong). The reload-on-failed-ping EEM applet I’ve published would be a perfect solution, but it uses track delay and the maximum delay timeout is three minutes, while Jeroen would like to wait 15 minutes before reloading the router.

I had two off-the-cuff ideas: execute reload in X command when SLA fails and reload cancel when SLA recovers, or use a second EEM applet with event timer watchdog that is triggered (and stopped) by the SLA-tracking applets. Both options are pretty messy so I was not really happy with either one ... and then Jeroen managed to find a third, totally unexpected solution.
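For the record, the first idea could look roughly like the sketch below. It is untested. The track object number, the applet names and the 15-minute delay are placeholders, and it assumes that IP SLA probe 10 exists and that the IOS release supports the pattern keyword on action cli command. The reload in command asks for confirmation (and may first ask whether to save a modified configuration, which this sketch does not handle), and that is part of what makes the approach messy.

```
! Track the reachability state of IP SLA probe 10 (assumed to exist)
track 10 ip sla 10 reachability
!
! Hypothetical applet: schedule a delayed reload when the probe fails
event manager applet SLA_DOWN_SCHEDULE_RELOAD
 event track 10 state down
 action 1.0 cli command "enable"
 ! "reload in" prompts for confirmation; wait for the prompt and answer it
 action 2.0 cli command "reload in 15" pattern "confirm"
 action 3.0 cli command "y"
 action 4.0 syslog msg "SLA 10 down, reload scheduled in 15 minutes"
!
! Hypothetical applet: cancel the pending reload when the probe recovers
event manager applet SLA_UP_CANCEL_RELOAD
 event track 10 state up
 action 1.0 cli command "enable"
 action 2.0 cli command "reload cancel"
 action 3.0 syslog msg "SLA 10 recovered, scheduled reload cancelled"
```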

He decided to use the SNMP value event detector to detect SLA failure (each SLA measurement has its own MIB variables) and combined it with a trigger saying “execute this applet if the OID value is below the threshold X times within a period of Y seconds.” Here’s his SLA definition (he gets extra bonus points for starting SLA measurements 30 minutes after power up) ...

ip sla 10
 icmp-echo 10.255.251.64 source-interface Loopback0
 request-data-size 16384
 frequency 10
ip sla schedule 10 life forever start-time after 00:30:00

... and the EEM applet (the last number in the OID string has to match the ip sla entry number, and the polling interval should match the ip sla frequency):

event manager applet vodafone_down_RELOAD
 event snmp oid 1.3.6.1.4.1.9.9.42.1.2.9.1.6.10 get-type exact entry-op lt entry-val "2" poll-interval 10
 trigger occurs 179 period 1790
 action 01.0 syslog msg "No ping response last 30 min."
 action 02.0 syslog msg "Reloading now to see if things get better..."
 action 03.0 reload
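
The values above add up to roughly 30 minutes: with a 10-second polling interval, 179 failed samples fit into a 1790-second window. If you'd rather stay with the 15-minute wait mentioned at the beginning, the same pattern should scale down as sketched below (900 seconds / 10 seconds = 90 polls, so 89 occurrences in 890 seconds). The applet name and the 89/890 values are my assumptions derived from that arithmetic, not something Jeroen tested.

```
! Hypothetical 15-minute variant of the applet above (untested sketch)
event manager applet wan_down_RELOAD_15min
 ! Same OID as above; the trailing .10 must match the ip sla entry number
 event snmp oid 1.3.6.1.4.1.9.9.42.1.2.9.1.6.10 get-type exact entry-op lt entry-val "2" poll-interval 10
 ! 89 matching samples within 890 seconds, roughly 15 minutes of failed pings
 trigger occurs 89 period 890
 action 01.0 syslog msg "No ping response last 15 min."
 action 02.0 reload
```

If you change the ip sla frequency, adjust poll-interval and recalculate both trigger values accordingly.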

11 comments:

  1. Awesome!!!!!!!!!

  2. Jónatan Natti 19 May, 2011 12:26

    Just a thought. But in my experience it's usually enough to do a shut/no shut on the cellular interface to get the 3G back up and running.
    I got this same request a while ago, to reload the router if 3G has been down for a few minutes.
    (This was based on the customer's experience with other 3G solutions, so it seems common that 3G users have to reload their equipment...)
    But this ended with using EEM/TCL and doing shut/no shut on the cellular interface before reloading the router. (different timers). So if shut/no shut fixed the problem, SLA recovered and the router didn't have to reload. (And we preserve the logging buffers, and the recovery is quicker, etc.)

    There's also another issue regarding 3G.
    Most 3G equipment can fall back to GPRS/EDGE if the 3G signal is too weak or unavailable, and this can happen automatically.
    However, from what I've heard*, the 3G equipment will not try to go back to 3G even if the 3G signal is available, if there is any data flowing. It will wait until there's no data transfer going on before going from GPRS/EDGE back to 3G.
    (* I've not verified this myself, but I heard this from someone who's more familiar with 3G equipment than I am.)

  3. You can also just reboot the cellular modem using "test cellular 0 2 modem-power-cycle".

  4. A provider we hired to configure our 3G DMVPN OOB routers had this problem as well. He got in contact with TAC, and after some troubleshooting they provided him with a working IOS. Don't know about public release though...

  5. Ivan,

    I have done similar EEM scripts in my role. But I don't reload the router, I only reload the 3G-HWIC instead, and I do it after I miss 8 consecutive IP SLA pings at 1-minute intervals with the default ping timeout of 5s.

    I can share my config if you wish, let me know.

    Cheers,
    Joe.

  6. Ivan Pepelnjak 05 June, 2011 18:52

    That would be fantastic. Just paste it as a comment or post a link to somewhere.

    Thank you! Ivan

  7. is it necessary to have this on your conf:

    snmp-server enable traps ipsla

  8. Joe,
    please share

  9. i would appreciate any help with this one:

    i have an ipsla that pings a host .
    if syslog message "%TRACKING-5-STATE: 222 ip sla 333 reachability Up->Down" has happened 2 times in 3 minutes, it installs a null route.

    what i would like to know is how can i make sure this null route is removed only if it's been 30 minutes since the last syslog message "%TRACKING-5-STATE: 222 ip sla 333 reachability Down->Up"?

    the thing is i need to know i can have a reliable backup link with a mechanism to verify it [the 30minutes safe period].

    track 222 ip sla 223 reachability
    ip sla 223
    icmp-echo x.x.x.x source-ip y.y.y.y
    threshold 500
    frequency 5
    ip sla schedule 223 life forever start-time now
    ip sla reaction-configuration 223 react timeout threshold-type xOfy 2 5 action-type trapOnly
    !
    event manager applet IPSEC_TUNNEL_2_FAIL
    event syslog pattern "%TRACKING-5-STATE: 222 ip sla 223 reachability Up->Down"
    trigger occurs 2 period 180
    action 1.0 cli command "enable"
    action 2.0 cli command "config t"
    action 3.0 cli command "ip route 192.168.255.5 255.255.255.255 Null0 name NULL_WHEN_IPSLA223_FAIL"
    action 3.1 cli command "exit"
    action 4.0 syslog msg "IPSEC_VPN_TUNNEL2 TIMEOUT - MOVING TO IPSEC_TUNNEL1"

    i was thinking of using a watchdog timer, but i understand it counts down from the time of a trigger. that's great, but if the sla is flapping and i get two "Down->Up" messages, i think it would start that eem applet multiple times, no? if yes, then in case of continuous flapping i'll get into trouble ...

    Thank you

  10. Until Joe C shares his complete config with ip sla, I can share with you 2 useful EEM applets to reload the Cellular module.

    ! 1) Manual reload, using "event manager run reload.3g.module"
    ! You can use this if you still have access to the router.

    event manager applet reload.3g.module
    event none
    action 1.0 cli command "enable"
    action 1.1 cli command "configure terminal"
    action 1.2 cli command "service internal"
    action 1.3 cli command "end"
    action 2.0 cli command "test cellular 0 modem-power-cycle"
    action 3.1 cli command "configure terminal"
    action 3.2 cli command "no service internal"
    action 3.3 cli command "end"
    action 4.0 syslog msg "Cellular0 module has been rebooted. Reason: unknown Cisco bug."


    ! 2) Automatic reload (based on a syslog message; the HWIC usually throws an error when it is faulty).
    ! You can adapt this to a tracked object and execute it.

    event manager applet auto.reload.3g.module
    event syslog pattern "CISCO800-2-MODEM_REMOVAL_DETECTED: Cellular0 modem is now REMOVED"
    action 1.0 cli command "enable"
    action 1.1 cli command "configure terminal"
    action 1.2 cli command "service internal"
    action 1.3 cli command "end"
    action 2.0 cli command "test cellular 0 modem-power-cycle"
    action 3.1 cli command "configure terminal"
    action 3.2 cli command "no service internal"
    action 3.3 cli command "end"
    action 4.0 syslog msg "Cellular0 module has been rebooted. Reason: unknown Cisco bug."
    !

    All the best,
    CR



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.