Inaccurate CPU usage by top & webui?

Discussion in 'Tomato Firmware' started by cloneman, Jul 2, 2013.

  1. cloneman

    cloneman Addicted to LI Member

    I just installed optware on my E4200v1 with shibby 1.08.

    When I initiate 25 Mbps of traffic I get about 50% CPU usage as reported by htop, but the webui doesn't report any significant CPU usage, and neither does the regular top or procps-top.

    Which one is accurate? When I stop the download, htop goes back to 5-6%.
  2. Malitiacurt

    Malitiacurt Networkin' Nut Member

  3. Planiwa

    Planiwa Network Guru Member

    Here's what it looks like on:
    Tomato v1.28.9013 MIPSR2-RAF-V1.1v K26 USB VLAN-NGINX-64K

    Creating traffic from a LAN host:
    ping -q -f -s1472 router
    CPU Utilization on the LAN host:
    iostat -c10 -w10 -d -n0 -C
    us sy id
    9 26 64
    ... load averages: 1.38 0.85 0.78
      netstat  -I en1 -w1
                input          (en1)          output
      packets  errs      bytes    packets  errs      bytes colls
          4158    0    6295212      4164    0    6305810    0
    On the router:
    CPU Load    80.02%
    CPU Load (1 / 5 / 15 mins)    0.01 / 0.01 / 0.00
    top -n1 -b |head -3|tail +2
    CPU:  0% usr  7% sys  0% nic  21% idle  0% io  0% irq  71% sirq
    Load average: 0.00 0.00 0.00 1/37 14441
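
    For scale, the netstat counters above can be turned into a throughput figure. This is only a rough sketch: the header sizes are the standard ICMP, IPv4 and Ethernet ones, and treating the byte counter as including the Ethernet header is an assumption, though the bytes-per-packet ratio in the input column matches that sum.

```python
# Rough arithmetic on the netstat input column quoted above.
# Assumption: the byte counter includes the 14-byte Ethernet header.
ICMP_PAYLOAD = 1472     # from "ping -s1472"
ICMP_HEADER = 8
IPV4_HEADER = 20
ETHERNET_HEADER = 14

frame_bytes = ICMP_PAYLOAD + ICMP_HEADER + IPV4_HEADER + ETHERNET_HEADER

packets_in, bytes_in = 4158, 6295212   # one-second netstat sample
bytes_per_packet = bytes_in / packets_in
mbit_per_s = bytes_in * 8 / 1e6        # one direction only

print(frame_bytes, bytes_per_packet, round(mbit_per_s, 1))
```

    So the flood ping is pushing roughly 50 Mbit/s in each direction through the router while the CPU figures were captured.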
  4. Elfew

    Elfew Network Guru Member

    I saw a fix for this in the last build from Victek.
  5. koitsu

    koitsu Network Guru Member

    I clearly see 71% sirq and 7% sys in top output, which when combined is pretty damn close to 80% shown by the GUI. Plus, 21% idle clearly shows that 79% of the CPU is being used at that time for something. So what Planiwa shows is accurate/correct. :) Also just a key:

    usr = Userland (i.e. applications, daemons, etc.)
    sys = System (i.e. kernel time)
    nic = Nice (i.e. time spent running niced userland processes; not NIC as in network interface card)
    idle = Idle (i.e. remaining/unused portion of CPU time, 100 - {whatever is used} = idle)
    io = I/O wait (i.e. time spent idle while waiting for I/O to complete)
    irq = Hardware IRQ (as in being generated from a device or via APICs)
    sirq = Software IRQ

    I never quite understood the softirq stuff. Highly technical, just a warning. :)
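
    The arithmetic on Planiwa's top line can be checked with a small sketch. The helper name parse_top_cpu_line is made up for illustration, and the regex assumes the BusyBox-style "N% name" layout quoted earlier in the thread.

```python
import re

def parse_top_cpu_line(line):
    """Return {field: percent} from a BusyBox-style 'CPU:' summary line.

    Hypothetical helper; assumes each field looks like '71% sirq'.
    """
    return {name: int(pct) for pct, name in re.findall(r"(\d+)%\s+(\w+)", line)}

line = "CPU:  0% usr  7% sys  0% nic  21% idle  0% io  0% irq  71% sirq"
fields = parse_top_cpu_line(line)

# Two ways to get "busy": add up everything except idle, or subtract idle from 100.
busy_by_sum = sum(v for k, v in fields.items() if k != "idle")
busy_by_idle = 100 - fields["idle"]

print(busy_by_sum, busy_by_idle)
```

    The two results differ by one point only because top rounds each field to a whole percent; both sit next to the 80.02% shown by the GUI.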
  6. cloneman

    cloneman Addicted to LI Member

    So if top reports 0/0/0 load averages, but there's 100% sirq utilization, the CPU is overburdened?
  7. koitsu

    koitsu Network Guru Member

    "overburdened" is not the word I would use. The CPU load should be showing 1.00 in that case. If it isn't, that's a problem.

    However, what Planiwa shows quite clearly is that the CPU Load line looks correct, the CPU line looks correct, but the Load average line does not look correct.

    But I myself cannot reproduce this problem using tomato-K26USB-1.28.0502.8MIPSR2Toastman-RT-N-Ext.trx and the procps-top package (i.e. /opt/bin/top) from Entware, nor using stock Busybox top (i.e. /usr/bin/top).

    It's important to know that the load average is calculated over time, so a simple test running for a few seconds may not cause the load average to hit non-zero values. It may require one to run something intensive for 30-60 seconds.
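
    To illustrate why a short test barely moves the number, here is a minimal sketch of an exponentially damped average of the kind Linux uses for the 1-minute figure. It assumes a sample every 5 seconds and a 60-second window; the function name and inputs are made up. Note that the input is the count of runnable tasks at each sample, not a CPU percentage, which may be one reason a high sirq figure does not have to show up in the load average.

```python
import math

SAMPLE_INTERVAL_S = 5.0   # assumption: run queue sampled about every 5 seconds

def simulate_load(runnable_per_sample, window_s=60.0, start=0.0):
    """Feed a sequence of runnable-task counts through a damped average."""
    decay = math.exp(-SAMPLE_INTERVAL_S / window_s)
    load = start
    for n in runnable_per_sample:
        load = load * decay + n * (1.0 - decay)
    return load

short_burst = simulate_load([1] * 2)    # one busy task for about 10 seconds
long_burst = simulate_load([1] * 12)    # one busy task for about 60 seconds

print(round(short_burst, 2), round(long_burst, 2))
```

    Under these assumptions a 10-second burst only lifts the 1-minute figure to about 0.15, and even a full minute gets it to about 0.63, not 1.00.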

    Also remember: if this turns out to be a kernel issue, it's very likely it cannot be fixed. The kernel cannot be upgraded on these firmwares due to use of binary blob wireless drivers from Broadcom.

    I cannot help past this point in time.
  8. mstombs

    mstombs Network Guru Member

    There's potentially a big overhead in collating the stats - how would you feel filling out a timesheet that explains how you spend your day - in hours, minutes or seconds?

    It's not unusual for the different monitors to change with kernel or busybox version as they strive for better info with lower overhead. And when Broadcom binary kernel modules are in control they can and probably do what they like!

    In a device I studied in detail the load averages were from polls of the process task queue, which is a measure of how many tasks are waiting to have some CPU time when checked. The device was working with load averages of 2 or 3, which I suspect was due to badly written code with 'busy waits' rather than sleeps. Embedded devices often don't have proper processor sleeps, so the CPU must be doing something at all times. When completely overloaded (externally observed) the internal counters went down - presumably a monitoring task was low priority?

    My advice is to use the info with care, look for deltas when you make small changes and not treat any number as absolute!
  9. Victek

    Victek Network Guru Member

    The CPU Load calculation is correct; it was modified in my last git contribution.
    sscanf (buff, "%s %u %u %u %u %u %u %u", o[n].name, &o[n].user, &o[n].nice, &o[n].system, &o[n].idle, &o[n].io, &o[n].irq, &o[n].sirq);

    Unfortunately the screenshot shows a single capture without any time indication, and we don't know the GUI refresh time either. Maybe the 80% CPU Load was a peak of 0.2 seconds that, due to the standard refresh rate of the GUI, persisted on screen for 3 seconds. The average CPU load over time is correct, and the % CPU Load is also correct when the GUI refresh time is set properly (1 second rather than the default of 3).
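
    A rough sketch of how a load percentage can be derived from two samples of the counters that the sscanf line reads. This is not the firmware's actual code: the function names and sample values are made up, and counting only the idle field as "not busy" is an assumption.

```python
# Field order follows the sscanf call above: user nice system idle io irq sirq.
FIELDS = ("user", "nice", "system", "idle", "io", "irq", "sirq")

def parse_cpu_line(line):
    """Parse a '/proc/stat'-style 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(x) for x in parts[1:1 + len(FIELDS)])))

def cpu_load_percent(before, after):
    """Percent of ticks between two samples that were not idle."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return 0.0          # no ticks elapsed between samples
    return 100.0 * (total - deltas["idle"]) / total

# Made-up samples, chosen so the deltas resemble the top output in this thread.
t0 = parse_cpu_line("cpu 1000 0 500 8000 0 0 500")
t1 = parse_cpu_line("cpu 1000 0 507 8021 0 0 571")

print(round(cpu_load_percent(t0, t1), 1))
```

    Because the result depends only on the counter deltas, the interval between the two samples decides whether a short spike dominates the figure or gets averaged away.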
  10. cloneman

    cloneman Addicted to LI Member

    Thanks for the responses guys...
    Mostly I'm asking as a Linux noob rather than as a router question.

    My firmware doesn't have a CPU load setting in the web GUI. I was merely wondering if sirq load should normally be factored into load averages. If I have 50% sirq load (in top), then my CPU usage (in htop) would be similar (about 60%). However, my load in top remains at 0 or 0.1 even if I run this type of scenario indefinitely.

    TL;DR: is it normal that sirq utilization is not factored into the 1/5/15 min load?
    What happens during 100% sirq usage - the cpu is waiting and other activities are queued?
  11. Monk E. Boy

    Monk E. Boy Network Guru Member

    I think what's been explained so far is that you may see high sirq utilization, and that may be true for the time of the snapshot, but in-between snapshots (updates) the system is still being utilized and it's possible that, in-between snapshots, values could be wildly different. The load averages aren't calculated at large intervals like this, they're calculated over all the very, very small timeslices and averaged out to produce load figures.

    top just gives you the value at the moment it updates and doesn't average over time. And keep in mind top itself can be rather, um, CPU intensive... 20% or more of those sirqs could be just from top.

    Oh, and it's entirely possible the load figures could be wrong, don't get me wrong... but it's not uncommon for CPU utilization to be higher when top is running vs. normal cpu utilization...
  12. koitsu

    koitsu Network Guru Member

    Real simple answer: I don't know. Maybe you should ask folks on the LKML (Linux Kernel Mailing List)? Be sure to tell them you're using Linux 2.6.22; that ought to get a couple of smirks out of them. The reason I don't know is because softirq is this bizarre Linux concept that makes no sense to me whatsoever (we do not have such a thing on FreeBSD).