
Kernel panic upon enabling IPv6 (Shibby 110-VPN-EN, Asus RT-N66U)

Discussion in 'Tomato Firmware' started by maleadt, Jul 5, 2013.

  1. maleadt

    maleadt Networkin' Nut Member

    Hi all,

    I wanted to enable IPv6 in Tomato, which my ISP provides via DHCPv6 with prefix delegation, but a short time after my router gets its IPv6 address it crashes and reboots immediately.

    I bought a Serial TTL cable to debug this, and when the router crashes I get to see the following panic on my console:
    Code:
    CPU 0 Unable to handle kernel paging request at virtual address 00000014, epc == 8020ff04, ra == 802106b4
    Oops[#1]:
    Cpu 0
    $ 0   : 00000000 00000000 00000000 00000000
    $ 4   : 87f55760 87cdbc90 87f55760 00000000
    $ 8   : 0000003c 80005034 0000000a 00000000
    $12   : 00000000 00000000 000000a8 00000000
    $16   : 879bdf20 879bdf00 00000000 00000000
    $20   : 00000000 87f55760 8133b000 87cdbc90
    $24   : 00000000 c01208ec                  
    $28   : 87cda000 87cdbbd0 00000000 802106b4
    Hi    : 0000071f
    Lo    : b321b800
    epc   : 8020ff04     Tainted: P      
    ra    : 802106b4 Status: 11007c03    KERNEL EXL IE 
    Cause : 00000008
    BadVA : 00000014
    PrId  : 00019749
    Modules linked in: tun xt_layer7 ip6table_filter ip6table_mangle xt_length xt_recent xt_IMQ imq nf_conntrack_ipv6 ehci_hcd sdhci mmc_block mmc_core vfat fat ext2 ext3 jbd mbcache usb_storage sd_mod scsi_wait_scan scsi_mod leds_usb led_class ledtrig_usbdev usbcore nf_nat_pptp nf_conntrack_pptp nf_nat_proto_gre nf_conntrack_proto_gre nf_nat_ftp nf_conntrack_ftp nf_nat_sip nf_conntrack_sip nf_nat_h323 nf_conntrack_h323 wl(P) dnsmq(P) et(P) igs(P) emf(P)
    Process vpnserver1 (pid: 1090, threadinfo=87cda000, task=87b77800)
    Stack : 8133b000 00000001 801a924c 801a924c 00000028 00000001 81383d80 00000000
            00000000 802be164 00000005 87cdbc48 8133b000 80192e6c 00000000 00000024
            00000028 87cdbc40 0ff00009 879bdf00 879bdf20 879bdf00 87cdbc90 00000000
            00000000 00000001 8133b000 802c01b0 80000000 802106b4 879bdf20 87817780
            87817780 00000000 87817780 8025528c 000000a8 879bdf00 803189f0 c018d000
            ...
    Call Trace:[<c0133154>][<c01330f8>][<c01545bc>][<c0154e50>][<c04d49ac>][<c04c434c>][<8000150c>][<80001640>]
    
    Code: 8fb00050  03e00008  27bd0078 <8e420014> 30420002  1440ffef  3c038031  24636b50  8c620004 
    Kernel panic - not syncing: Fatal exception in interrupt
    Rebooting in 3 seconds..
    Since it mentions 'process vpnserver' I tried disabling OpenVPN, but it still crashes, now in a different process (swapper, pid 0):
    Code:
    CPU 0 Unable to handle kernel paging request at virtual address 00000014, epc == 8020ff04, ra == 802106b4
    Oops[#1]:
    Cpu 0
    $ 0   : 00000000 00000001 00000000 00000000
    $ 4   : 87f55760 802a9b80 87f55760 00000000
    $ 8   : 0000003c 80005034 00000003 00000000
    $12   : 00000000 00000000 000000a8 00000000
    $16   : 87846e80 87846e60 00000000 00000000
    $20   : 00000000 87f55760 80336000 802a9b80
    $24   : 00000000 c01208ec                  
    $28   : 802a8000 802a9ac0 00000000 802106b4
    Hi    : 000005f2
    Lo    : e53b7800
    epc   : 8020ff04     Tainted: P      
    ra    : 802106b4 Status: 11007c03    KERNEL EXL IE 
    Cause : 00000008
    BadVA : 00000014
    PrId  : 00019749
    Modules linked in: xt_layer7 ip6table_filter ip6table_mangle xt_length xt_recent xt_IMQ imq nf_conntrack_ipv6 ehci_hcd sdhci mmc_block mmc_core vfat fat ext2 ext3 jbd mbcache usb_storage sd_mod scsi_wait_scan scsi_mod leds_usb led_class ledtrig_usbdev usbcore nf_nat_pptp nf_conntrack_pptp nf_nat_proto_gre nf_conntrack_proto_gre nf_nat_ftp nf_conntrack_ftp nf_nat_sip nf_conntrack_sip nf_nat_h323 nf_conntrack_h323 wl(P) dnsmq(P) et(P) igs(P) emf(P)
    Process swapper (pid: 0, threadinfo=802a8000, task=802aa188)
    Stack : 80336000 00000001 801a924c 801a924c 00000028 00000001 81383d80 00000000
            00000000 802be164 00000005 802a9b38 80336000 80192e6c 00000000 00000024
            00000028 802a9b30 0ff00009 87846e60 87846e80 87846e60 802a9b80 00000000
            00000000 00000001 80336000 802c01b0 80000000 802106b4 87846e80 87b05880
            87b05880 00000000 87b05880 8025528c 000000a8 87846e60 803189f0 c018d000
            ...
    Call Trace:[<c0133154>][<c01330f8>][<c01576f0>][<8000150c>][<802c2b58>][<802c2224>]
    
    Code: 8fb00050  03e00008  27bd0078 <8e420014> 30420002  1440ffef  3c038031  24636b50  8c620004 
    Kernel panic - not syncing: Fatal exception in interrupt
    Rebooting in 3 seconds..
    Full logs are attached. Note that the second crash, with OpenVPN disabled, happened a bit faster. This is typical: sometimes I get the time to browse to the Tomato web interface, other times I don't even get to the point of hearing back from the DHCP server. As soon as I unplug my WAN connection, the router stops crashing.
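    Even without symbols, the instruction word that the oops marks with angle brackets in the "Code:" line can be decoded by hand. The following is a minimal sketch, assuming the standard MIPS32 I-type encoding (6-bit opcode, 5-bit base register, 5-bit target register, 16-bit signed offset). The helper names are made up for illustration. It only shows which register held the bad pointer; it says nothing about which function is involved.

```python
import re

# Standard MIPS32 load/store opcodes (top 6 bits of the instruction word).
LOAD_STORE_OPCODES = {
    0x20: "lb", 0x21: "lh", 0x23: "lw", 0x24: "lbu",
    0x25: "lhu", 0x28: "sb", 0x29: "sh", 0x2B: "sw",
}


def decode_itype(word):
    """Split an I-type instruction word into (mnemonic, base, target, offset)."""
    opcode = (word >> 26) & 0x3F
    base = (word >> 21) & 0x1F
    target = (word >> 16) & 0x1F
    offset = word & 0xFFFF
    if offset & 0x8000:  # sign-extend the 16-bit immediate
        offset -= 0x10000
    mnemonic = LOAD_STORE_OPCODES.get(opcode, "op%#x" % opcode)
    return mnemonic, base, target, offset


def parse_register_dump(text):
    """Map register numbers to values from '$16   : aaaa bbbb ...' lines."""
    regs = {}
    for line in text.splitlines():
        match = re.match(r"\s*\$\s*(\d+)\s*:\s*(.*)", line)
        if match:
            first = int(match.group(1))
            for i, value in enumerate(match.group(2).split()):
                regs[first + i] = int(value, 16)
    return regs


# The bracketed word and the $16 row, copied from the first dump above.
FAULTING_WORD = 0x8E420014
REGISTER_ROW = "$16   : 879bdf20 879bdf00 00000000 00000000"

mnemonic, base, target, offset = decode_itype(FAULTING_WORD)
regs = parse_register_dump(REGISTER_ROW)
effective = (regs[base] + offset) & 0xFFFFFFFF

print("%s $%d, %d($%d)" % (mnemonic, target, offset, base))  # lw $2, 20($18)
print("effective address: %08x" % effective)                 # 00000014
```

    The decoded load reads from offset 0x14 of register $18, which is zero in both dumps. That matches BadVA 00000014, so this looks like a NULL pointer dereference of a structure field at offset 0x14.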

    Does anybody have any idea what could be the cause of this, and how I could further debug or even fix it?
    Thanks
     


  2. koitsu

    koitsu Network Guru Member

    I have an idea what could be the cause of it:

    A bug in the kernel or some related kernel module. :)

    This is the first time I've seen a kernel panic on TomatoUSB. (Note: I don't mean this is the first time I've seen Linux panic; I mean this is the first time I've had the opportunity to see the output on TomatoUSB.)

    Sadly none of the information is useful, particularly the stack trace ("Call Trace"). All the hexadecimal means jack squat without debugging symbols. If the kernel (and, I imagine, the rest of the system: modules, etc.) were built with debugging symbols enabled, that stack trace would contain function names and possibly kernel module names -- then someone who was familiar with the kernel code could figure out where the bug lies.

    The problem with enabling debugging symbols is that they greatly increase the size of the firmware -- to the point where it might even be too big to fit on a router (even ones with larger flash).

    The settings to do this are CONFIG_DEBUG_INFO=y and CONFIG_DEBUG_KERNEL=y in the Linux kernel configuration file release/src-rt/linux/linux-2.6/config_base. I would also recommend CONFIG_NETFILTER_DEBUG=y since the issue may be there. You obviously would have to build your own firmware to enable these features.
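    Before kicking off a long firmware build it may be worth checking that the three options really ended up enabled. Below is a rough sketch, assuming the usual kernel config syntax (`CONFIG_FOO=y` for enabled, `# CONFIG_FOO is not set` for disabled). The function name and the sample text are made up for illustration and are not taken from the Tomato tree.

```python
import re

WANTED = ("CONFIG_DEBUG_INFO", "CONFIG_DEBUG_KERNEL", "CONFIG_NETFILTER_DEBUG")


def option_states(config_text, wanted=WANTED):
    """Return {option: value} where value is 'y', 'n', another string, or 'absent'."""
    states = {name: "absent" for name in wanted}
    for line in config_text.splitlines():
        line = line.strip()
        enabled = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if enabled and enabled.group(1) in states:
            states[enabled.group(1)] = enabled.group(2)
            continue
        disabled = re.match(r"# (CONFIG_\w+) is not set$", line)
        if disabled and disabled.group(1) in states:
            states[disabled.group(1)] = "n"
    return states


# Hypothetical excerpt of a config file, for illustration only.
SAMPLE_CONFIG = """\
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_INFO is not set
CONFIG_NETFILTER=y
"""

for name, value in sorted(option_states(SAMPLE_CONFIG).items()):
    print("%-24s %s" % (name, value))
```

    To check a real tree, read config_base (path as given above) and pass its contents to `option_states`.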

    What I can tell from the calling stack is that the last two (most-recently-called) functions in kernel space are the same in both crashes, which may or may not be useful (those could be the kernel panic/trap handler functions). Take a look at this for what a kernel with debugging symbols gives for the stack trace (this is also on a 64-bit platform, but that doesn't matter):

    http://codeascraft.com/2012/03/30/kernel-debugging-101/

    Quite a bit different/much more useful, yes? :)

    Also be aware that enabling kernel debugging often can have a substantial effect on performance, but that's too much to go into -- it varies depending on the underlying debugging code written by Linux folks (kernel, modules/drivers, etc.).

    If none of this can be done feasibly, then I would suggest rolling back to an older firmware release and seeing if the problem goes away.
     
  3. maleadt

    maleadt Networkin' Nut Member

    Ah, I was afraid this would be the answer :) Compiling my own version as we speak, I'll report back as soon as I have a human-readable stack trace.
     
  4. maleadt

    maleadt Networkin' Nut Member

    I compiled the kernel with those debug options turned on, but the trace still only contained addresses and no symbol names. Loading the (unprocessed) vmlinux in gdb however indicated it was loading symbols, which AFAIS confirms the symbols are there. But then I discovered the final MIPS kernel image Makefile used objcopy to thoroughly strip the binary (-S -R .mdebug in arch/mips/brcm-boards/bcm947xx/compressed/Makefile), yet after removing these I still don't get to see any additional information...

    Am I missing something? Could it be that the mdebug sections aren't mentioned in the bcm947xx vmlinux.lds, so they aren't laid out in the final image (just guessing)?
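    Since the unstripped vmlinux does contain symbols, the addresses from the oops could in principle be resolved offline instead of on the router. Here is a minimal sketch, assuming a System.map in the usual three-column `address type name` form that a kernel build writes next to vmlinux. The symbol names and addresses in the sample are placeholders invented for illustration and are NOT the real symbols behind this crash. It also only covers addresses inside the core kernel image (8xxxxxxx); module addresses (cxxxxxxx) would additionally need each module's load address.

```python
import bisect


def load_system_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the closest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return "%s+%#x" % (name, address - start)


# HYPOTHETICAL map contents: placeholder names and addresses only.
SAMPLE_MAP = """\
80001000 T example_entry
8020fe00 T example_func_a
80210600 T example_func_b
"""

symbols = load_system_map(SAMPLE_MAP)
for label, address in (("epc", 0x8020FF04), ("ra", 0x802106B4)):
    print(label, resolve(symbols, address))
```

    With the real System.map from the same build as the crashing image, the same lookup would name the functions containing epc and ra without needing symbols in the flashed firmware.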
     
  5. maleadt

    maleadt Networkin' Nut Member

    Some progress: disabling QoS prevents the crash.
     
  6. maleadt

    maleadt Networkin' Nut Member

    Update on the debug info issue: not only does the Makefile strip the kernel as mentioned before, it also extracts the machine instructions and continues processing that raw binary, which obviously has all the interesting ELF sections (such as debug_info) removed...
     
  7. koitsu

    koitsu Network Guru Member

    Thanks for the digging (you got way further than I would have, honestly -- I have more familiarity with FreeBSD in this regard than Linux). I'm not too surprised by your findings, as the modus operandi of these firmwares is to "be as small as possible", making blind assumptions that they're rock solid (...right...) and nobody would need to debug a kernel panic (then again 99.8% of the customer demographic wouldn't know how to do this anyway).

    It makes me wonder how folks at Asus and Linksys actually debug kernel issues during testing. If I had to take a guess, I'd say they probably have a PXE boot setup where the router PXE boots and gets a debug-aware firmware. I can't imagine someone sitting around all day flashing a firmware to a router 30 or 40 times just for testing... (Brings me back to the days of EPROMs. :) )

    Knowing that disabling QoS prevents the issue is a good piece of info. Maybe Shibby can use that and correlate things back to a particular git commit.
     
  8. maleadt

    maleadt Networkin' Nut Member

    Yeah I've given up :( I really wanted to find the underlying issue, but I'm not going to spend time converting that Makefile to a more canonical one (although it should be possible to look at OpenWRT here) just to get debugging symbols in my kernel. Especially now that I discovered that Toastman's releases don't crash when combining IPv6 and QoS -- maybe this information can help Shibby spot the issue in his code?

    About Asus, I really hope they left some code out of the source drop; if not, debugging must have been one hell of a painful experience :)
     
  9. RMerlin

    RMerlin Network Guru Member


    I do. That's how I debugged the ipt_account module, and the nvram corruption issue on the RT-AC56U. For that one, I ended up enabling almost every single memory debugging option of the kernel, and on first boot - bam! The serial console output a buffer overrun report at a specific location in the kernel code.

    The kernel has a lot of tools that help in debugging - if you know how to use them. I mostly don't :(
     
  10. Merlyn_3D

    Merlyn_3D Network Guru Member

    Is there a way to notify Shibby of the issue? It'd be nice not to have to choose between IPv6 and QoS.
     
  11. shibby20

    shibby20 Network Guru Member

    You don't have to. I'm here :)
     
  12. Merlyn_3D

    Merlyn_3D Network Guru Member

    Ah...cool...well hey, let me know if I can offer up any help :D
     
  13. LinkyPete

    LinkyPete Network Newbie Member

    Shibby et al, FYI this issue still seems to happen: enabling QoS together with IPv6 (in my case, prefix delegated) causes the router to reboot in an endless loop. Using v123 for ASUS RT-N66U - Tomato Firmware 1.28.0000 MIPSR2-123 K26AC USB AIO-64K. Sounds like it is just a kernel issue, which is really too bad.
    Has anybody been able to get this to work?
     
    Last edited: Oct 13, 2014
  14. underpickled

    underpickled New Member Member

    I'm having this same issue with firmware: tomato-K26USB-1.28.RT-N5x-MIPSR2-132-AIO-64K.

    QoS works, IPv6 (DHCPv6 with prefix delegation) works, but the combination causes the router (Asus RT-N66U) to continuously reboot. It's actually quite difficult to recover from this state... a combination of 30-30-30 (to get it into recovery mode) and using the WPS button to clear NVRAM will get it back into a usable state, but sometimes it takes multiple attempts. Are there any plans to address the issue? This thread started over 2.5 years ago...
     
  15. Elfew

    Elfew Addicted to LI Member

    Contact Shibby by PM.
     
