Detect CPU spikes with Embedded Resource Manager

David Winter wanted to detect high-CPU spikes and act on them. The first part (high CPU utilization) could be done with SNMP, but since IOS release 12.3(14)T, the right tool for the job is the Embedded Resource Manager (ERM).

The ERM syntax is a bit baroque (and not well documented), so let's work through an example. This is the configuration you need to detect high overall CPU utilization on the main CPU in the box:

resource policy
 policy HighGlobalCPU global
  system
   cpu total
    critical rising 95 falling 70 interval 10
    major rising 75 falling 50 interval 10
 !
 user global HighGlobalCPU

And here are the usage/configuration guidelines:

  • The whole ERM subsystem is configured under the resource policy section;
  • You always have to configure a policy and a user to which the policy applies. In our example, the user is global (as we're measuring the global CPU load);
  • The policy we're defining must have the global keyword to indicate we're measuring overall utilization (otherwise you can't attach it to the global user);
  • We're measuring the load on the main CPU, so we're configuring the system subsection of the policy (on distributed platforms you could specify a slot instead to measure utilization on a specific linecard);
  • The cpu section selects CPU load measurements. You could measure interrupt load, process load or total CPU load;
  • Within each resource section in the policy (in our example, total CPU load on the main system) you can define minor, major and critical thresholds (a syslog message is generated whenever one of these thresholds is crossed);
  • After the policy is defined, it's applied to the global user.
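
The rising/falling pairs in the policy behave like a hysteresis band: once a level has fired, it stays quiet until the load drops below its falling value. Here is a minimal Python sketch of that logic, using the major and critical values from the example policy. It is only a model for reasoning about the thresholds, not how IOS implements them: the `Threshold` class and `simulate` function are hypothetical names, the `interval` parameter is ignored, and the behavior at exactly the threshold value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """One severity level with rising/falling hysteresis (values in percent)."""
    name: str
    rising: int
    falling: int
    active: bool = False  # True once the rising value has been exceeded

    def update(self, load):
        """Return an event string when the state changes, otherwise None."""
        # Assumption: "rising" fires when the load exceeds the rising value,
        # "falling" fires when the load drops below the falling value.
        if not self.active and load > self.rising:
            self.active = True
            return f"{self.name} RISING at {load}%"
        if self.active and load < self.falling:
            self.active = False
            return f"{self.name} FALLING at {load}%"
        return None


def simulate(samples, thresholds):
    """Feed CPU load samples through every threshold and collect the events."""
    events = []
    for load in samples:
        for threshold in thresholds:
            event = threshold.update(load)
            if event is not None:
                events.append(event)
    return events


# Values taken from the HighGlobalCPU policy above
policy = [Threshold("major", 75, 50), Threshold("critical", 95, 70)]
for event in simulate([40, 80, 96, 85, 72, 60, 45], policy):
    print(event)
```

With these samples the sketch reports a major rise at 80%, a critical rise at 96%, a critical fall at 60% and a major fall at 45%. The samples at 85% and 72% produce nothing, because both levels are already active and the load is still above their falling values.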

With the CPU load measurement policy defined, the router will generate syslog messages (SYS-4-CPURESRISING) every time the overall CPU load exceeds the specified rising thresholds. When the utilization falls below the falling threshold, the SYS-4-CPURESFALLING syslog message is generated.
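
If you collect these messages on a syslog server and want to act on them, you first have to pick the numbers out of the message text. The following Python sketch does that with a regular expression modelled on the SYS-4-CPURESRISING sample quoted in the comments to this post. The function name `parse_cpures` is hypothetical, and the assumption that SYS-4-CPURESFALLING messages have the same shape is untested, as no falling message appears in this post.

```python
import re

# Pattern modelled on the sample message quoted in the comments:
#   %SYS-4-CPURESRISING: System is seeing global cpu util 87% at total level
#   more than the configured major limit 35 %
# Assumption: CPURESFALLING messages follow the same layout.
CPURES_RE = re.compile(
    r"%SYS-4-(?P<event>CPURESRISING|CPURESFALLING): "
    r".*?cpu util (?P<util>\d+)\s*%"
    r".*?configured (?P<level>minor|major|critical) limit (?P<limit>\d+)\s*%"
)


def parse_cpures(line):
    """Return a dict describing an ERM CPU message, or None for other lines."""
    match = CPURES_RE.search(line)
    if match is None:
        return None
    return {
        "event": "rising" if match["event"] == "CPURESRISING" else "falling",
        "level": match["level"],
        "util": int(match["util"]),
        "limit": int(match["limit"]),
    }


sample = (
    "003009: Nov 18 22:11:00.588: %SYS-4-CPURESRISING: System is seeing "
    "global cpu util 87% at total level more than the configured major limit 35 %"
)
print(parse_cpures(sample))
```

Lines that don't match (for example the SYS-1-CPURISINGTHRESHOLD messages generated by the process cpu threshold command) return None, so the function can be used as a filter on a mixed log stream.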

This article is part of the You've asked for it series.

8 comments:

  1. Interesting article. I tried on 2 different routers and could see the CPURESRISING logs, but not the falling logs. Any ideas? If the fall is within the interval, will the fall not be logged?

  2. The FALL should be logged when the CPU load goes below the falling value ... and please note that the falling value should be less than the rising value.

  3. Hi!

    Please help with ERM configuration below:


    !
    resource policy
    policy C881W-CPU global
    system
    cpu total
    critical rising 50 interval 30 falling 20 interval 10
    major rising 35 interval 15 falling 15 interval 20
    !
    !
    !
    user global C881W-CPU
    !
    !
    !

    The router is a Cisco 881 running IOS 15.0.1M7. This policy doesn't generate a syslog message when the CPU load rises to 55-60%.
    SNMP value for the last 5 minutes: $ snmpwalk -v2c -c String 1.1.1.1 1.3.6.1.4.1.9.2.1.58.0
    SNMPv2-SMI::enterprises.9.2.1.58.0 = INTEGER: 59
    Thanks.

  4. Could be a bug. Why don't you open a case with Cisco TAC?

  5. Ivan, no SYS-4-CPURESRISING or SYS-4-CPURESFALLING messages were registered in my log. After I added this line to my configuration:
    process cpu threshold type total rising 75 interval 30 falling 40 interval 10

    and the CPU load rose to 82%, these messages appeared in syslog:
    270601: Nov 18 14:10:34.068: %SYS-1-CPURISINGTHRESHOLD: Threshold: Total CPU Utilization(Total/Intr): 82%/76%, Top 3 processes(Pid/Util): 75/4%, 151/0%, 98/0%
    274323: Nov 18 14:21:04.060: %SYS-1-CPUFALLINGTHRESHOLD: Threshold: Total CPU Utilization(Total/Intr) 8%/2%.

    The process cpu threshold command works fine and generates syslog messages for both rising and falling CPU values. The resource policy doesn't generate any messages.

    With the best regards, Alexey

  6. At this time I don't have an active service contract for this router :(
    I'll probably change the IOS version.

    Alexey

  7. This feature is not available on all IOS releases/platforms. Cisco Feature Navigator lists full ERM support for the 7200/7600 platforms.
    After I changed the IOS on my 881 router to version 12.4.20T4, the resource policy generated rising syslog messages:
    003009: Nov 18 22:11:00.588: %SYS-4-CPURESRISING: System is seeing global cpu util 87% at total level more than the configured major limit 35 %
    004169: Nov 18 22:13:05.596: %SYS-1-CPURISINGTHRESHOLD: Threshold: Total CPU Utilization(Total/Intr): 93%/65%, Top 3 processes(Pid/Util): 81/22%, 63/4%, 217/0%
    004232: Nov 18 22:13:10.616: %SYS-4-CPURESRISING: System is seeing global cpu util 91% at total level more than the configured critical limit 50 %
    004745: Nov 18 22:14:15.606: %SYS-1-CPUFALLINGTHRESHOLD: Threshold: Total CPU Utilization(Total/Intr) 0%/64%.
    011972: Nov 18 22:30:10.620: %SYS-4-CPURESRISING: System is seeing global cpu util 41% at total level more than the configured major limit 35 %

    but it doesn't generate falling syslog messages. For SMB and branch routers, use the "process cpu threshold" configuration.

    Ivan, thanks for your post and questions.
    Alexey

  8. Ivan, thanks for your post, questions and reply!
    Alexey

    :)



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.