
WRT54GL + Tomato 1.25 restarts

Discussion in 'Tomato Firmware' started by albinkroken, Nov 11, 2009.

  1. albinkroken

    albinkroken Addicted to LI Member

    First of all, english is not my native language.

    I have two WRT54GLs set up one after the other, like this:

    DSL---router (DHCP enabled) wifi ---router (DHCP disabled) wifi

    Works like a charm, except that the DHCP-enabled router restarts spontaneously 1-2 times a day at various times, for no reason that is obvious to me.
    QoS is enabled and works perfectly, but each restart loses the statistics.

    Anyone got a clue what to do?
    Anyone with the same problems?
  2. albinkroken

    albinkroken Addicted to LI Member

    Maybe someone here can interpret the following log entries, which appear just when the router restarts:

    Jan 1 01:00:13 ? daemon.notice miniupnpd[99]: received signal 15, good-bye
    Jan 1 01:00:14 ? daemon.notice miniupnpd[166]: HTTP listening on port 5000
    Nov 11 22:43:51 ? user.err hotplug[55]: Unable to find nas
    Nov 11 22:43:51 ? user.info rcheck[193]: Activating rule 1
  3. Planiwa

    Planiwa LI Guru Member

    monitor system before crash

    Logs after a crash contain no information about the conditions before the crash that caused it.

    You might consider ...

    1. keeping a session open that tails the log on another system
    (tail -f /var/log/messages) -- beware of rollover!

    2. maintaining syslog on another system

    3. writing vital statistics to a non-volatile place (JFFS FS?).

    I'd be sure to monitor number of connections, as well as free memory.

    You might consider vit or idle -- I'll post the scripts if there is interest.
    # vit
    VIT: 2 37 5968-1552 .1 1/23 2d 2d 2d ppp0:2315-1108 eth1:3961-2663
    # vit -t
    WiFiClients Connections TotMem-Free Load Processes FWUptime WANUptime SysUptime Interface:Tot-TX ...
    VIT: 2 49 5968-1552 .1 1/23 2d 2d 2d ppp0:2315-1108 eth1:3962-2663
    # vit -h
    Syntax		vit [-h] [-t] [-k] [-p] [-d] [-dir] [-all] [Interface ...]
    -h		this help
    -t		Title headers
    -k		kiloBytes instead of MegaBytes, or kiloPackets instead of Packets
    -p		Packets instead of Bytes
    -d		prepend Datestamp
    -dir DIR	write output to file HHMM in directory DIR.
    		NB: Use CRU in INIT script to control exact Minute!  (Scheduler is unpredictable.)
    		eg: cru a Vit '01,16,31,46 * * * * /jffs/vit -d -p -dir /jffs/vitlog'
    -all		all Interfaces
    Interface ...	one or more interfaces.  Default: ppp0 eth1
    # vit -d -p -k
    VIT: 091113_12:42 2 29 5968-1556 .0 1/23 2d 2d 2d ppp0:10047-4556 eth1:12485-6782
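    For anyone who wants to try the idea by hand before installing a full script, here is a minimal sketch of a probe that records number of connections and free memory. It is not part of vit. The function names are made up for illustration, and it assumes that /proc/net/ip_conntrack exists and that /proc/meminfo contains a MemFree: line.

```shell
# Sketch only: sample connection count and free memory once.
# Default paths are assumptions; both can be overridden by arguments.

conn_count() {
    # $1: conntrack table file (one line per tracked connection)
    f=${1:-/proc/net/ip_conntrack}
    if [ -r "$f" ]; then
        # Reading from stdin makes wc print only the number;
        # tr strips the padding that some wc builds add.
        wc -l < "$f" | tr -d ' '
    else
        echo 0
    fi
}

free_kb() {
    # $1: meminfo file containing a line such as "MemFree:  1552 kB"
    f=${1:-/proc/meminfo}
    awk '$1 == "MemFree:" { print $2 }' "$f"
}

vitals_line() {
    # One timestamped line, ready to append to a log file.
    echo "$(date '+%Y-%m-%d %H:%M:%S') conn=$(conn_count "$1") free_kb=$(free_kb "$2")"
}
```

    Appending the output of vitals_line to a file on JFFS (or on another machine) every minute or so would give a crude record of the conditions before a crash.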
  4. albinkroken

    albinkroken Addicted to LI Member


    First of all, thanks very much for your answer.

    After I looked over the connection limit and the TCP and UDP timeouts, the router has been up and running for almost 48 h. Perhaps that's what was causing the restarts after all.

    About the external log solution, that's very interesting. Unfortunately I'm a total n00b on Linux.

    Is it even possible for a noob to set up an external log?
    Is there a guide available somewhere?
  5. TexasFlood

    TexasFlood Network Guru Member

    Simplest thing is to set up Wall Watcher on a Windows box. There are some Tomato specific configuration instructions with pictures over at the Link Logger site which should be the same for any syslog setup.
  6. Planiwa

    Planiwa LI Guru Member

    You're very welcome. I've posted the latest version of vit in a separate thread.
    You may find it useful. It will give more information than the log, and all you need is an ssh or telnet terminal session.

    Let me know if you need help with that.
  7. Razor512

    Razor512 Addicted to LI Member

    Just wanted to add that I also get random restarts with my WRT54GL when using Tomato 1.25, but going back to 1.23 fixes the problem.
  8. Planiwa

    Planiwa LI Guru Member

    We experience events as random when we are unaware of patterns, observable factors, and relations.

    Problem solving responses may include:

    Retreating to the familiar tried and true,
    Trying another panacea,
    Making random changes in the hope of chancing upon a fix,
    Asking "what should I do?",
    Observing, learning, gathering data, deriving information, looking for patterns,
    Making controlled experiments,
    Learning about the problem situation, factors, actions, and relationships,
    Making tools, monitoring, instrumenting,
    Building better mousetraps,
    And more. (Any significant omissions?)

    Tools such as the vit command can be used to observe the conditions before a crash, and help find patterns, causes, and solutions.

    vit -- Vital Statistics Monitoring Command:
  9. rkloost

    rkloost Addicted to LI Member

    I would suggest logging to KIWI syslogd (Windows) or another remote syslog daemon. For analysis we need logs.

  10. Planiwa

    Planiwa LI Guru Member

    While syslogs are the obvious first place to look, there's a bigger context.

    We need to monitor relevant statistics.
    We need to measure them.
    We need to record the measurements. (logging)
    We need to preserve the records.
    We need to evaluate the records.

    Default Tomato logs do not contain relevant vital statistics, such as:

    Number of users
    Number of connections
    Number of processes
    CPU Load
    Free memory

    The common "random reboot" crash pre-conditions are not known to be logged by the system. If the system were aware enough of imminent crash conditions to log them, the system could prevent them, or at least provide information that would help diagnose the problem. I know of no body of data from external syslogs that contains pre-crash circumstantial clues. Does anyone else here?

    Still, it is worth trying to get at pre-crash log data, whether through external logging, logging to permanent storage such as JFFS, or external monitoring.

    And as we learn more about what factors might be causing crashes, it is worth monitoring and recording those factors.

    If those people who say that their router crashes several times a day would only monitor their routers for a few hours, we would be learning very quickly what the problems are. But, even though the tools are made available, we learn nothing unless the tools are applied and results are shared.

    Unfortunately most people with problems don't say:

    "Since I have this problem I can help everyone by monitoring and sharing my findings."

    Rather they tend to say:

    "I don't care what causes the problems, I just want a quick fix".

    Even those who have used the tools don't necessarily provide feedback. That makes it harder to learn from their findings, and to make better tools, which can help gain better understanding, and so on.

    So, we learn very slowly, while the "mysterious" "random" problems continue. :)
  11. Toastman

    Toastman Super Moderator Staff Member Member

    Planiwa is absolutely right. The normal logs are almost useless in this case and the use of these scripts will provide better information. Whether anyone here will be able to pick up on anything and effect any kind of cure is unknown, but first we have to start with information.

    We need a little more help to enable us to do this. Most of the people on the forum are not Linux users and don't know how to actually run and utilize the scripts - but with a little help would probably be happy to try it.

    Planiwa, can you provide us with a "noob's" abc tutorial of how to install and set up your vit script to log to an external file on our PC, which would be the easiest method for us to cut and paste the data into a post or email ?

    I have a little more time lately and this time perhaps I can do something useful to help.
  12. Planiwa

    Planiwa LI Guru Member

    cflp -- JFFS-based monitor

    While I think about that tutorial, ...

    I made something that is minimal, self-contained, and accessible to all :)
    It only assumes that 200k are available for a JFFS2 file system.
    I have not been able to field-test it, (from INIT script and with huge numbers of connections), so for now I am requesting tech-savvy users to check it out and critique it. Thanks. I would ask other users not to try it yet. :)

    Here it is (updated):

    # cflp -- minimalist Vital Stats # Planiwa 2009-11-19 [ Not recommended before V1.25]
    # 1. Enable JFFS and format it. (Administration > JFFS2)
    # 2. Copy this to (the end of) your Tomato INIT Script
    #    ( or copy it to /jffs/cflp.sh  and put /jffs/cflp.sh in your INIT Script )
    #    To restart, run this from the command line: nohup /jffs/cflp.sh &
    # 3. After post-crash reboot, examine the recent entries in /jffs/CFLP
    # Takes up 150k of space: 1 file per hour, 4 lines per minute.  
    # TODO CFLP is an awkward name ... tweetlog ?
    FSDIR=/cifs1; DIR=CFLP; SUF=".txt" # put this line after the other one, to use CIFS
    FSDIR=/jffs;  DIR=CFLP; SUF=""
    logger -p info -t $DIR "Starting in 5 minutes $$"
    sleep 300 # 5 minute delay for fail safe, and to prevent recent log removal on quick reboot
    cd $FSDIR; if [ $? -ne 0 ];then logger -p error -t $DIR "Cannot cd to $FSDIR $$";exit;fi
    mkdir -p $DIR 
    cd $DIR; if [ $? -ne 0 ];then logger -p error -t $DIR "Cannot cd to $FSDIR/$DIR $$";exit;fi
    while :;do
      if [ -r /tmp/${DIR}SLEEPSEC ];then SLEEPSEC=$(cat /tmp/${DIR}SLEEPSEC);else SLEEPSEC=15;fi
      set -- $(cat /proc/meminfo); set -- $((${10}+${13})) ${10}; FRE=${1%???}-${2%???} # kB
      set -- $(cat /proc/loadavg)
      PRO=${4#*/} # process count; this assignment was missing from the posted copy and is a guess
      LOADL=${1%.*}; LOADL=${LOADL#0}; LOADR=${1#*.}; LOADR=${LOADR#0};
      LOADR=$(((LOADR+5)/10)); case $LOADR in 10)LOADR=9;;esac; LOA=$LOADL.$LOADR
      USE=$(arp | wc -l) # set -- $(wl assoclist); USE=$(($#/2))
      if [ -r /proc/sys/net/ipv4/ip_conntrack_count ]
      then CON=$(cat /proc/sys/net/ipv4/ip_conntrack_count) ### Thanks, Teddy Bear!
      else CON=$(cat /proc/net/ip_conntrack | wc -l)        ### This is very costly
      fi
      MMSS=$(date +%M%S)
      HH=$(date +%H)
      # NB: the closing fi/else lines and the LASTHH update below were lost from
      # the posted copy; they are restored here by inference from the log output.
      if [ "$HH" != "$LASTHH" ];then # first pass, or a new hour has begun
        TIMESTAMP="$(date '+%Y-%m-%d %H:%M:%S')"
        set -- $TIMESTAMP; YMD=$1; HMS=$2
        if [ "$LASTHH" = "" ];then # first pass after a (re)start
          case $TIMESTAMP in 19*) sleep $SLEEPSEC; continue;; esac # clock not set yet
          sleep 3;  if [ -r /tmp/${DIR}$$ ];then ID=$(cat /tmp/${DIR}$$);else ID=$$;fi
          # check for duplicate probe daemon: if file exists and is current, kill other
          if [ -r $HH$SUF ];then
            FILESEC=$(date -r $HH$SUF '+%s'); NOWSEC=$(date '+%s'); FILESECS=$((NOWSEC-FILESEC))
            if [ $FILESECS -lt 300 ];then
              set -- $(grep $YMD $HH$SUF|tail -1)
              logger -p error -t $DIR "Duplicate Process ($3) Terminated by $ID"
              kill $3 || logger -p error -t $DIR "Failed to kill $3 [$*] by $ID"
            fi
          fi
          echo "$TIMESTAMP $ID RESTART" >> $HH$SUF
        else
          echo "$TIMESTAMP $ID" > $HH$SUF ## erase 24h old hourlog
        fi
        echo "MMSS PR U LA FTOT-FMEM CON" >> $HH$SUF
        logger -p info -t $DIR "$PRO $USE $LOA $FRE $CON (${SLEEPSEC}s) $ID"
        LASTHH=$HH
      fi
      echo "$MMSS $PRO $USE $LOA $FRE $CON" >> $HH$SUF
      sleep $SLEEPSEC
    done &
    logger -p info -t $DIR "$$ ==> $!"
    echo "$!" >/tmp/$DIR$$
    ### end CFLP
    Here's how it looks:

    # l -tr /jffs/CFLP
    -rw-r--r--    1 root     root         2442 Nov 15 22:59 22
    -rw-r--r--    1 root     root         3406 Nov 15 23:59 23
    -rw-r--r--    1 root     root         6134 Nov 16 00:59 00
    -rw-r--r--    1 root     root         5944 Nov 16 01:59 01
    -rw-r--r--    1 root     root         6171 Nov 16 02:59 02
    -rw-r--r--    1 root     root         5999 Nov 16 03:59 03
    -rw-r--r--    1 root     root         5916 Nov 16 04:59 04
    -rw-r--r--    1 root     root         5922 Nov 16 05:59 05
    -rw-r--r--    1 root     root         5897 Nov 16 06:59 06
    -rw-r--r--    1 root     root         5921 Nov 16 07:59 07
    -rw-r--r--    1 root     root         5922 Nov 16 08:59 08
    -rw-r--r--    1 root     root         6158 Nov 16 09:59 09
    -rw-r--r--    1 root     root         6076 Nov 16 10:59 10
    -rw-r--r--    1 root     root         5959 Nov 16 11:59 11
    -rw-r--r--    1 root     root         6062 Nov 16 12:59 12
    -rw-r--r--    1 root     root         5634 Nov 16 13:59 13
    -rw-r--r--    1 root     root         5660 Nov 16 14:59 14
    -rw-r--r--    1 root     root         3006 Nov 16 15:30 15
    # head /jffs/CFLP/15
    2009-11-16 15:00:04
    0004 21 3 .0 6787-1536 38
    0020 21 3 .1 6787-1531 36
    0037 21 3 .1 6787-1531 39
    0053 21 3 .1 6787-1531 36
    0110 21 3 .1 6787-1531 37
    0126 21 3 .1 6787-1531 36
    0143 21 3 .1 6787-1531 36
    0159 21 3 .0 6787-1531 39
    # tail /jffs/CFLP/15
    2923 23 3 .1 6488-1232 203
    2939 23 3 .1 6488-1232 171
    2956 23 3 .1 6488-1232 155
    3012 23 3 .1 6488-1232 173
    3029 23 3 .1 6488-1232 207
    3045 23 3 .2 6488-1232 216
    3102 26 3 .2 6275-1019 219
    3118 23 3 .1 6488-1232 176
    3135 23 3 .1 6488-1228 159
    3151 23 3 .1 6488-1228 143
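    When scanning an hour file after a crash, the busiest moment is usually the first thing to look for. A minimal sketch, assuming the six-column layout shown above (MMSS PR U LA FTOT-FMEM CON); the function name is hypothetical and not part of CFLP:

```shell
# Sketch: report the sample with the most connections in one CFLP hour file.
peak_connections() {
    # Only lines whose first field is a 4-digit MMSS stamp are samples;
    # the date header, RESTART lines and the column titles are skipped.
    awk '$1 ~ /^[0-9][0-9][0-9][0-9]$/ && NF == 6 && $6 + 0 > max { max = $6 + 0; at = $1 }
         END { if (at != "") print at, max }' "$1"
}
```

    For the ten tail lines shown above, peak_connections /jffs/CFLP/15 would print 3102 219.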
  13. Toastman

    Toastman Super Moderator Staff Member Member

    Great stuff. I'll wait until it's finished, check back later!
  14. Planiwa

    Planiwa LI Guru Member

    My biggest concern was that since this goes into the INIT script, something might get stuck.

    I did manage to catch and fix something -- it appears that the INIT script is re-executed under some conditions, without rebooting. This would have resulted in multiple copies of our script running.

    I think it's safe enough to try out, for someone who is able to ssh or telnet to the router and display files. Do make sure you have more than 150k available on the JFFS.

    To list the available files:

    l -tr /jffs/CFLP
    To show the beginning of the log for hour 21:

    head /jffs/CFLP/21
    To show the end:

    tail /jffs/CFLP/21
    To show all of it: (about 240 lines)

    cat /jffs/CFLP/21
    To monitor the live probes for (the current) hour 22, as they happen (but this will not continue into the next hour):

    tail -f /jffs/CFLP/22
    . . .

    To look for restarts within the last 24 hours:

    grep RESTART /jffs/CFLP/*
    I think it's most useful for frequent "random reboots".
    Since it (re)writes about 150k/day, it may not be so good to just start it up and totally forget about it. Especially on older NVRAM. I'm sure we'll get comments on that. :)
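    Since each file is named after the two-digit hour, the current file can be worked out with date instead of typing the hour by hand. A small sketch (the helper name is hypothetical, not part of CFLP; the second argument is the optional suffix used on CIFS):

```shell
# Sketch: print the path of the hour file that is being written right now.
cflp_current_file() {
    # $1: log directory (default /jffs/CFLP), $2: optional suffix such as .txt
    dir=${1:-/jffs/CFLP}
    echo "$dir/$(date +%H)${2:-}"
}
```

    Usage: tail -f "$(cflp_current_file)" on JFFS, or tail -f "$(cflp_current_file /cifs1/CFLP .txt)" on a CIFS share. As noted above, tail still has to be restarted when the hour changes.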
  15. Toastman

    Toastman Super Moderator Staff Member Member

    That may be a problem but let's see. I am at home today, so I am going to put it on a noncritical line and try it first.

    I just made a CIFS share on my Windows 7 box, it mounts and displays several hundred gigabytes free. How do I get it to write to the CIFS share? I've tried several things but nothing is appearing in that directory. Still scratching my head.
  16. Toastman

    Toastman Super Moderator Staff Member Member

    Tried to enable JFFS2 on 3 routers, all of them give the same message. What am I doing wrong?

    "Error mounting JFFS2. Check the logs to see if they contain more details about this error."

    The log contains the same message but nothing more.

    EDIT: OK now. I was using a version of 1.23 on which JFFS2 and CIFS did not seem to work. Changing to 1.25.8515.2 Victek worked immediately. However, the message about JFFS not being mounted is still there, despite my being able to list that directory. Is this a bug?

    Planiwa, some feedback. On one busy v 1.23 router tried so far, 24 hour rstats ceased working. Upgrading to 8515.2 still the same but when I restored my config backup all was OK. Looks like 1.23 had some problems.

    Back to my test setup now, all seems working, logging hour 18 now. All of my sites have quite severe limiting of total TCP and UDP connections, I am averaging about 270 at the moment, would you like this turned off to gather information?

    What version of Teddy Bear's firmware has ip-conntrack count? I take it this has much less overhead than the next line?

    I have one remote site that has severe administration problems, seem to be several reasons for restarts, interference by residents. I'd like to run this on that site later if it's stable.
  17. Planiwa

    Planiwa LI Guru Member

    FWIW, I'm trying to focus on JFFS for now, because it is self-contained, I have no way to test it (the only Tomato site I have is remote), and I don't have MS PC's in any case.
  18. Planiwa

    Planiwa LI Guru Member

    Did you format the JFFS first? (Maybe I should say "Enable and Format JFFS"?)

    Thanks. I'll say "works on 1.25 and later".

    that would be great. :)
    It would be good to know how long it takes (and other impacts) if there are very many conntrack entries.
    Of course that is not an issue with Teddy Bear's instant count. :)

    I don't know. I put the hopeful code there to use it when it finds it. :)

    That sounds like a good idea.
  19. Toastman

    Toastman Super Moderator Staff Member Member

    Haha like the bit about T/B code. I'll look later but at the moment it seems not too bad as it is with Victek's RAF 8515.2.
    OK. The script is running on my test system but nothing really odd logged yet. So I have now put it online on one of the busy routers logging to my PC via CIFS share.

    (one question, to enable me to quickly check these files, I'd like it to write them as say 20.txt - should that work?)

    After I post this I will disable the limits which I normally use on UDP/TCP packets. My Conntrack settings are also pretty severely cut down, as are yours. Anyway, here we go with a small sample cut and pasted from cifs2 shared directory.

    2009-11-17 21:00:13

    0410 24 0 .2 5189-524 727
    0420 24 0 .0 5619-753 280
    0426 24 0 .2 5189-524 687
    0435 24 0 .0 5619-753 286
    0513 24 0 .1 5169-503 856
    0521 24 0 .1 5619-753 292
    0528 24 0 .2 5136-471 801
    0537 24 0 .0 5619-753 275
    0544 24 0 .2 5136-471 824
    0552 24 0 .0 5619-753 284
    0600 24 0 .3 5136-471 858
    0607 24 0 .0 5619-753 277

    OK about the CIFS share. Here is how I just did it. This is for Windows 7 but I think XP was pretty much the same last time I did it. Anyone wants to try it, remember some versions of Tomato may not work properly, some 1.23 versions particularly seem bugged.

    In windows, create your directory for the router files. Then right click, go to properties/sharing/advanced/check "share this folder" - and give it a share name e.g. routerdata

    Then click permissions/full control for EVERYONE. That should enable that directory share.

    In Tomato, in Administration/CIFS check the enable box, enter the IP and share name, username and password (same as your log on to Windows). Click SAVE and after a few moments the page refreshes, you should now see that the share directory on the PC has been accessed and the size is now shown.

    Now, in the script, replace all mentions of jffs with cifs1
  20. Toastman

    Toastman Super Moderator Staff Member Member

  21. Planiwa

    Planiwa LI Guru Member

    I've updated the source.
    Just flip the two lines where it says so ... :)
  22. Planiwa

    Planiwa LI Guru Member

    Some analysis ...

    First, there are two of these running!!
    (The duplicate detector is not working.)
    Can you please run the ps command, and post its output?
    And also the output from: ls /tmp
    On JFFS this will produce twice the data and run out of space.

    Second, we can already see some small connection storms and their effect on free space:

    For example, from 0521 to 0528 (7 seconds) connections rise from 292 to 801.
    This is accompanied by a drop in MemFree from 753 to 471, and a reduction in total memory by 483kB!

    That's pretty drastic.
  23. Toastman

    Toastman Super Moderator Staff Member Member

    With no limiting on TCP/UDP this is pretty much normal here. I am sure if the conntrack timeouts were put back to the original defaults things would get pretty unstable.

    # ps
    1 root 1768 S init noinitrd
    2 root 0 SW [keventd]
    3 root 0 RWN [ksoftirqd_CPU0]
    4 root 0 SW [kswapd]
    5 root 0 SW [bdflush]
    6 root 0 SW [kupdated]
    7 root 0 SW [mtdblockd]
    26 root 1720 S buttons
    28 root 0 SWN [jffs2_gcd_mtd3]
    66 root 1720 S redial
    69 root 856 S pppoecd vlan1 -u XXXXXXXXXXXXXXXXX -p XXXXXXXX -r 146
    70 root 1940 S telnetd -p 23
    75 root 1952 S syslogd -R -L -s 50
    76 root 1532 S dropbear -p 22 -a
    78 root 1932 S klogd
    96 root 1960 S crond
    101 nobody 896 S dnsmasq
    102 root 1256 S rstats
    106 root 1664 S httpd
    107 root 0 SW [cifsoplockd]
    144 root 0 SW [cifsd]
    198 root 996 S miniupnpd -f /etc/upnp/config
    448 root 1968 S /bin/sh /tmp/script_init.sh
    5160 root 1972 S -sh
    5184 root 1932 S sleep 15
    5185 root 1948 R ps

    # ls /tmp
    etc mnt script_fire.sh var
    home ppp script_init.sh

    Can you see the CFLP directory on my webserver? It should start filling up with .txt files shortly.
  24. Planiwa

    Planiwa LI Guru Member

    Is this the same system as the one from which the data came?
    It only has one probe running:

    448 root 1968 S /bin/sh /tmp/script_init.sh
    5184 root 1932 S sleep 15

    Maybe multiple systems are writing to the same CIFS share?
    That could be quite amusing. :)
  25. Toastman

    Toastman Super Moderator Staff Member Member

    Perhaps it is possible that the test system was also writing to the same place. I thought I had wiped all the stuff off that and rebooted before I posted those files. Never mind. 23.txt has just appeared and should be viewable on the webserver now... if it's still duplicated let me know.

    Currently there are about 8 P2P online clients and they are downloading at about 4Mbps. The worst and most evil downloader (myself) is on another router. Most of the residents here are unaware that their downloads will be slowed down greatly if they do not stop seeding. Which is lucky :) Many of them also have DHT enabled, the uTorrent defaults.
  26. Planiwa

    Planiwa LI Guru Member

    This looks good. No dup. (The way to tell a duplicate is that there are two results in every 15-second interval.)
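    That check can be roughed out with awk. A sketch only, assuming the default 15-second interval, so a single probe writes at most 4 lines per minute; the function name and the threshold of 5 are my own choices for illustration:

```shell
# Sketch: list minutes that contain more samples than one probe should produce.
busy_minutes() {
    # $1: hour file, $2: most samples expected per minute (default 5)
    awk -v max="${2:-5}" '
        $1 ~ /^[0-9][0-9][0-9][0-9]$/ { n[substr($1, 1, 2)]++ }   # key on the MM part
        END { for (m in n) if (n[m] > max) print m, n[m] }' "$1" | sort
}
```

    Any minute it prints (as "MM count") is worth a look with ps, to see whether two probes are writing to the same directory.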

    I'm curious as to why the (WiFi) User count is always 0.

    does the command
    wl assoclist
    not return anything?

    # wl assoclist
    assoclist 00:19:D2:4C:aa:bb
    assoclist 00:23:CD:BE:cc:dd
    Or are they all on ether?
  27. Toastman

    Toastman Super Moderator Staff Member Member

    Yes, these routers are all in locked steel cabinets to keep them safe from the resident twiddlers. So the wireless is always off. This building currently has 15 AP's. Some have a lot more. This is one of the 2 routers, residents get assigned to one of them as gateway by DHCP. It's crude load balancing by MAC address. The number of people online on this router is rarely less than 12, can be up to about 65 or so.
  28. Planiwa

    Planiwa LI Guru Member

    Maybe I should be counting something other than wifi users.

    I'll track arp instead of wl assoclist.
  29. Toastman

    Toastman Super Moderator Staff Member Member

    I am off to bed but will be up in 5 hours, if you have anything extra to try I will update the script. Since it's on CIFS share on a 1TB disk log as much as you like. CPU load is still under .1.
  30. Planiwa

    Planiwa LI Guru Member

    The latest version is now pretty robust:


    Those who don't want to run it from the INIT Script can put it in a file, and then run it like this: (assuming that they call the file /jffs/cflp.sh):

    nohup sh /jffs/cflp.sh &
    "nohup" will make it persist, after you log out from telnet or ssh.
    & means you don't have to wait while it waits 5 minutes to get started.

    Any feedback, questions, problem reports, suggestions, much appreciated.
  31. Toastman

    Toastman Super Moderator Staff Member Member

    Updated the script at about 16:14 p.m log time. There is a big storm going on now at 16:20 - checked out - seems to be due to one resident running uTorrent, probably DHT enabled. Your detection of no. of clients on the LAN is working correctly.
  32. Planiwa

    Planiwa LI Guru Member

    Very nice:

    2009-11-18 16:00:00
    1745 24 7 .1 6012-806 117
    1801 24 7 .1 5971-765 544
    1817 24 6 .2 5423-741 1629
    1835 27 6 .2 4591-557 2644
    1853 24 6 .3 4341-831 3781
    1914 24 6 .6 3731-741 4999
    1932 24 6 .9 3723-733 4999
    1949 24 6 .8 3698-708 4990
    2006 24 6 1.0 3674-573 4514
    2023 24 6 .9 3674-569 3592
    2039 24 6 .7 3649-528 2673
    2055 24 6 .6 3649-528 1791
    2111 24 6 .5 3645-503 911
    2126 24 6 .4 3653-643 161
    2141 24 6 .3 3649-638 115
    3706 24 6 .0 3645-872 29
    This is a nice picture of a connection storm.
    If one had looked every 4 minutes, say at 16:17:45 and at 16:21:45, one would never have seen it at all.

    From 16:17:45 to 16:19:14 (89 seconds), connections shot up from 117 to 4999.
    That's a rate of about 55 connections/second.

    Note that the clock time to count the conntrack table entries increases only slightly (from less than 1s to 2 or 3 seconds).

    (This results in slightly shorter log files for busier periods.)

    The Load increases from .1 to 1.0, with 6 users and 5000 connections.

    Compare with ...

    22:1407 24 28 .1 3584-946 902 -- 28 users, load of .1, 900 connections.

    Note also the points at which free space drops.
    And note how the free space does not return, once the storm is over.

    This suggests that the Conntrack table is allocated in chunks (as connections rise).

    This suggests that there is particular risk when number of connections exceeds the previous maximum.
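    The growth rate during a storm can be worked out straight from the hour file. A rough sketch, assuming the six-column CFLP layout and that both samples fall within the same hour; the function name is hypothetical:

```shell
# Sketch: connection growth between two samples of one hour file.
conn_rate() {
    # $1: hour file, $2: first MMSS stamp, $3: second MMSS stamp
    awk -v a="$2" -v b="$3" '
        # MMSS -> seconds into the hour (awk reads "04" as decimal 4, not octal)
        function secs(s) { return substr(s, 1, 2) * 60 + substr(s, 3, 2) }
        $1 == a { ta = secs($1); ca = $6 }
        $1 == b { tb = secs($1); cb = $6 }
        END {
            if (tb > ta)
                printf "%d s, %d connections, %d per second\n", tb - ta, cb - ca, int((cb - ca) / (tb - ta))
        }' "$1"
}
```

    For the rows stamped 1745 and 1914 in the listing above, conn_rate FILE 1745 1914 prints: 89 s, 4882 connections, 54 per second.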
  33. Planiwa

    Planiwa LI Guru Member

    How to kill a running CFLP and start a new one

    The INIT script is re-run, for example, when you update DHCP.

    Because of that, I had to make sure that this will not start an additional CFLP process.
    When it sees that one is running already, it suicides.

    I may change that, so that the new one kills the old one, or else the old one suicides.

    But, in the meantime, here is how to terminate a running CFLP process:

    At the head of each hourly log file, and at each (RE)START, there is the Process-ID:

    2009-11-18 22:00:05 451
    2009-11-17 19:23:16 24032 RESTART

    This also appears in the log, every hour:

    Nov 18 00:00:09 ROUTER user.info CFLP: 21 5 .0 6877-2306 85 (15s) 24032

    To terminate this process:

    kill 24032
    Reminder -- if you have the script in a file called /jffs/cflp.sh, then, to run it:

    nohup sh /jffs/cflp.sh &
    You can use ps (or top -- use 'q' to stop top) to show the processes.

    If it's run from the command line, as shown above, it looks like this:

    24032 root      1964 S    sh /jffs/cflp.sh
    24032     1 root     S     1964  14%   0% sh /jffs/cflp.sh
    if it's run from the INIT script, look for something that contains:

    BTW, here is a comparison between VIT running from the Scheduler and CFLP running nohup:

    Nov 18 11:00:07 ROUTER user.info CFLP: 23 2 .1 6303-1753 52 (15s) 24032
    Nov 18 11:01:02 ROUTER user.info VIT: 2  64 6004-1458 .1 1/26 2d 2d 4d ...
    As you can see, CFLP takes 3 fewer processes and 300 kB less RAM.


    I have now updated the source.

    If you start a new process, it terminates the old process if there is one.
    The syslog looks like this:

    Nov 18 22:51:44 ROUTER user.info CFLP: Starting in 5 minutes 31100
    Nov 18 22:56:44 ROUTER user.info CFLP: 31100 ==> 31429
    Nov 18 22:56:48 ROUTER user.err CFLP: Duplicate Process (28601) Terminated by 31429
    Nov 18 22:56:48 ROUTER user.info CFLP: 31 6 .1 5025-897 114 (15s) 31429
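    The Process-ID can also be pulled out of the hour file rather than read by eye. A minimal sketch, assuming the header and RESTART lines look like the two examples above (date, time, Process-ID); the function name is hypothetical:

```shell
# Sketch: print the Process-ID from the most recent header or RESTART line.
cflp_last_pid() {
    # Header lines start with a YYYY-MM-DD date; field 3 is the Process-ID.
    # Sample lines start with a 4-digit MMSS stamp and are ignored.
    awk '$1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ && $3 ~ /^[0-9]+$/ { pid = $3 }
         END { if (pid != "") print pid }' "$1"
}
```

    Usage: cflp_last_pid /jffs/CFLP/22. It is safer to confirm with ps that the number really belongs to a running CFLP before passing it to kill, since a file left over from before a reboot will contain a stale Process-ID.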
  34. Toastman

    Toastman Super Moderator Staff Member Member

    As of 23:30 I have this script running on another remote site which is mostly (badly) administered by the owner and reboots regularly. I am hoping this will finally reveal whether the reboot is due to the router or by someone switching off the UPS! I'll post anything interesting, maybe it'll take 24 hours.
  35. Planiwa

    Planiwa LI Guru Member

    Hm. If the modem restarts every time the router does, that might give an additional clue, depending on power topology and switch placement.:)
  36. Toastman

    Toastman Super Moderator Staff Member Member

    Yes. The big problem with that site is that there are too many people playing with it. Last time I went there, the UPS was only supplying one AP and they had pulled all the others and the router off the UPS. They're like monkeys, I despair ... anyway, today so far nothing wrong, there are not many users in the building.
  37. Planiwa

    Planiwa LI Guru Member

  38. Toastman

    Toastman Super Moderator Staff Member Member

    Just a note - Teddy Bear v8739 Lite gives additional room for JFFS = 704k
  39. Planiwa

    Planiwa LI Guru Member

  40. Toastman

    Toastman Super Moderator Staff Member Member

    Still logging - apologies for any break while I upgraded to 64 bit OS and software.

    results for past 24 hours:

    http://firmware.mooo.com/router files/CFLP/

    The Hitachi NAS product is cute.

    EDIT: Discontinued logging and share on 13 December 2009
