arrmo / shibby multiwan unable to resolve DNS after failover

Discussion in 'Tomato Firmware' started by ajtish, Dec 15, 2017.

  1. ajtish

    ajtish Connected Client Member

    I have been seeing a strange bug in my customized firmware, which was based on Shibby and is now based on Arrmo. After chasing the problem for a few days in my custom build, I compiled vanilla firmware and found the same issue. I am currently working with the latest release of the arrmo-RT-AC branch.

    I was hoping to take advantage of the multiwan feature to make the internet at several locations more reliable with a second internet connection.

    Issue: When the primary WAN still has local network connectivity but no route to the internet, watchdog runs and changes the default route to the secondary connection, but I cannot resolve DNS after the routes have changed. Occasionally DNS starts working again after some time (15+ minutes), but I haven't determined why; most of the time it does not recover.

    Preferred resolution: have the DNS servers for both WAN and WAN2 picked up for forwarding by dnsmasq. I am certainly open to other solutions, with the exception of using the ISP-provided DNS servers, as they are unreliable.

    Configuration:
    • ASUS RT-N12
    • 2 WANs
      • Primary - vlan1, DHCP on subnet 192.168.3.0/24
      • Secondary - vlan12, DHCP on subnet 192.168.0.0/24
    • OpenVPN client connects back to the central office
    • WAN DNS set to 8.8.8.8 8.8.4.4
    • WAN2 DNS set to 208.67.222.222 208.67.220.220
    • primary WAN has a WAN weight of 1
    • secondary WAN has a WAN weight of 0 (failover only)
    • Watchdog set to run every 60 minutes. For testing on the vanilla build I have been executing it manually. In my custom firmware, a script runs every minute to check whether the VPN is up and runs watchdog as part of its troubleshooting. Otherwise watchdog runs on the hour so that the connection fails back to the primary once it has come back online

    Troubleshooting steps:
    • Turned on mwan, vpn, and dnsmasq debugging and can see that mwanrouting is making route changes and has successfully failed over
    • Tried changing WAN weights, but carrying anything besides pings and failover traffic on the backup connection results in large data fees, since it is an LTE-to-Ethernet device. Having the secondary connection in anything other than failover mode also causes undesirable instability in our VPN
    • I can perform ping -I vlan12 8.8.8.8 but not a plain ping 8.8.8.8, which indicates to me that there is some sort of routing issue
    • I see no routes for any DNS server in route -n or ip route list
    • Tried different DNS servers for the secondary WAN, but they never appear in /etc/resolv.dnsmasq, meaning they are not picked up for use; confirmed in the dnsmasq entries in /var/log/messages
    • Searched through the source code to see if I could figure out where the DNS servers were getting picked up and whether a static route was being set, but I couldn't find anything. I'm a handy sysadmin, not a developer :)
    • Restarted services: dnsmasq, wan, wan2
    • Killed mwanrouting and started a new instance

    User stormy reported the exact issue I have and says they got it "a bit better", but I don't understand the improvement they made: http://www.linksysinfo.org/index.php?threads/tomato-multiwan.71978/page-4#post-283499

    Relevant information from /var/log/messages:
    (every minute) Dec 15 08:20:12 router1 daemon.err openvpn[1711]: RESOLVE: Cannot resolve host address: vpn.company.net:1194 (Name or service not known)
    (every 30 seconds) Dec 15 08:20:35 router1 user.info mwanroute[31044]: mwan_status_update, failover in action - WAN2
    (upon dnsmasq service starting)
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: started, version 2.76 cachesize 4096
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: compile time options: IPv6 GNU-getopt no-RTC no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset Tomato-helper auth no-DNSSEC loop-detect no-inotify
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: asynchronous logging enabled, queue limit is 5 messages
    Dec 15 08:32:53 router1 daemon.info dnsmasq-dhcp[12199]: DHCP, IP range LANSUBNET.100 -- LANSUBNET.200, lease time 1d
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: reading /etc/resolv.dnsmasq
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: using nameserver 8.8.8.8#53
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: using nameserver 8.8.4.4#53
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: read /etc/hosts - 5 addresses
    Dec 15 08:32:53 router1 daemon.info dnsmasq[12199]: read /etc/dnsmasq/hosts/hosts - 5 addresses
    Dec 15 08:32:53 router1 daemon.info dnsmasq-dhcp[12199]: read /etc/dnsmasq/dhcp/dhcp-hosts

    root@router1:/tmp# cat state_wan*
    0
    1

    root@router1:/tmp/home/root# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    --- 8.8.8.8 ping statistics ---
    4 packets transmitted, 0 packets received, 100% packet loss

    root@router1:/tmp/home/root# ping -I vlan12 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=55 time=202.351 ms
    64 bytes from 8.8.8.8: seq=1 ttl=55 time=60.884 ms
    64 bytes from 8.8.8.8: seq=2 ttl=55 time=40.566 ms
    64 bytes from 8.8.8.8: seq=3 ttl=55 time=61.868 ms
    64 bytes from 8.8.8.8: seq=4 ttl=55 time=40.656 ms
    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 5 packets received, 0% packet loss
    round-trip min/avg/max = 40.566/81.265/202.351 ms

    root@router1:/tmp/home/root# ping 208.67.222.222
    PING 208.67.222.222 (208.67.222.222): 56 data bytes
    64 bytes from 208.67.222.222: seq=0 ttl=53 time=183.863 ms
    64 bytes from 208.67.222.222: seq=1 ttl=53 time=72.003 ms
    64 bytes from 208.67.222.222: seq=2 ttl=53 time=53.747 ms
    64 bytes from 208.67.222.222: seq=3 ttl=53 time=72.903 ms

    --- 208.67.222.222 ping statistics ---
    4 packets transmitted, 4 packets received, 0% packet loss
    round-trip min/avg/max = 53.747/95.629/183.863 ms

    root@router1:/tmp/home/root# route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    192.168.0.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan12
    192.168.3.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan1
    10.20.0.1       0.0.0.0         255.255.255.255 UH    0      0        0 tun11
    192.168.3.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan1
    10.2.1.0        0.0.0.0         255.255.255.0   U     0      0        0 br0
    192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan12
    127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
    0.0.0.0         10.20.0.1       128.0.0.0       UG    0      0        0 tun11
    0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 vlan12

    root@router1:/tmp/home/root# ip route list
    192.168.0.1 dev vlan12 scope link
    192.168.3.1 dev vlan1 scope link
    10.20.0.1 dev tun11 proto kernel scope link src 10.20.1.1
    192.168.3.0/24 dev vlan1 proto kernel scope link src 192.168.3.39
    10.2.1.0/24 dev br0 proto kernel scope link src 10.2.1.1
    192.168.0.0/24 dev vlan12 proto kernel scope link src 192.168.0.10
    127.0.0.0/8 dev lo scope link
    0.0.0.0/1 via 10.20.0.1 dev tun11
    default via 192.168.0.1 dev vlan12
     
    Last edited: Dec 18, 2017
  2. ajtish

    ajtish Connected Client Member

    Additionally, I have confirmed that I can run nslookup against the WAN2 DNS servers. The shell output below shows that the resolver is working and that traffic to the DNS server is routed out WAN2.

    root@router1:/tmp/home/root# nslookup google.com 208.67.222.222
    Server: 208.67.222.222
    Address 1: 208.67.222.222 resolver1.opendns.com

    Name: google.com
    Address 1: 2607:f8b0:4000:80f::200e dfw25s16-in-x0e.1e100.net
    Address 2: 172.217.2.238 dfw28s01-in-f14.1e100.net

    root@router1:/tmp/home/root# traceroute 208.67.222.222
    traceroute to 208.67.222.222 (208.67.222.222), 30 hops max, 38 byte packets
    1 192.168.0.1 (192.168.0.1) 2.420 ms 0.404 ms 0.357 ms
    2 172.26.96.161 (172.26.96.161) 195.026 ms 41.009 ms 39.316 ms
     
  3. ajtish

    ajtish Connected Client Member

    I did some more troubleshooting today. I did not get past the problem, but I may be getting closer to identifying where it lies.

    These commands were run on a router that had a working primary connection at startup but no longer has a routable connection to the Internet over the primary WAN (vlan1).

    What I am seeing is that the route to the primary WAN's DNS server (8.8.8.8) always comes back from the route cache via the vlan1 gateway. It stays that way regardless of whether I flush the cache or add a static route sending it over the failover interface (vlan12).

    Which brings me to the question: where is network routing actually being handled?
    It does not seem to be handled by the routing tables shown to the user.
    Is there something within the kernel, BusyBox, or elsewhere that overrides the routing tables presented by route -n and ip route list, and that shows up in the route cache instead?


    root@router1:/tmp/home/root# route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    192.168.3.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan1
    192.168.0.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan12
    192.168.3.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan1
    10.2.1.0        0.0.0.0         255.255.255.0   U     0      0        0 br0
    192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan12
    127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
    0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 vlan12

    root@router1:/tmp/home/root# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes

    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 0 packets received, 100% packet loss

    root@router1:/tmp/home/root# ip route get 8.8.8.8
    8.8.8.8 via 192.168.3.1 dev vlan1 src 192.168.3.39
    cache mtu 1500 advmss 1460 hoplimit 64

    root@router1:/tmp/home/root# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes

    --- 8.8.8.8 ping statistics ---
    3 packets transmitted, 0 packets received, 100% packet loss

    root@router1:/tmp/home/root# ping -I vlan12 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=55 time=74.229 ms
    64 bytes from 8.8.8.8: seq=1 ttl=55 time=134.144 ms
    64 bytes from 8.8.8.8: seq=2 ttl=55 time=63.827 ms
    64 bytes from 8.8.8.8: seq=3 ttl=55 time=53.526 ms
    64 bytes from 8.8.8.8: seq=4 ttl=55 time=88.484 ms
    64 bytes from 8.8.8.8: seq=5 ttl=55 time=53.550 ms

    --- 8.8.8.8 ping statistics ---
    6 packets transmitted, 6 packets received, 0% packet loss
    round-trip min/avg/max = 53.526/77.960/134.144 ms

    root@router1:/tmp/home/root# ip route replace 8.8.8.8/32 via 192.168.0.1 dev vlan12

    root@router1:/tmp/home/root# route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    192.168.3.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan1
    192.168.0.1     0.0.0.0         255.255.255.255 UH    0      0        0 vlan12
    8.8.8.8         192.168.0.1     255.255.255.255 UGH   0      0        0 vlan12
    192.168.3.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan1
    10.2.1.0        0.0.0.0         255.255.255.0   U     0      0        0 br0
    192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 vlan12
    127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
    0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 vlan12

    root@router1:/tmp/home/root# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes

    --- 8.8.8.8 ping statistics ---
    4 packets transmitted, 0 packets received, 100% packet loss

    root@router1:/tmp/home/root# ip route flush cache

    root@router1:/tmp/home/root# ip route get 8.8.8.8
    8.8.8.8 via 192.168.3.1 dev vlan1 src 192.168.3.39
    cache mtu 1500 advmss 1460 hoplimit 64

    root@router1:/tmp/home/root# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes

    --- 8.8.8.8 ping statistics ---
    4 packets transmitted, 0 packets received, 100% packet loss

    root@router1:/tmp/home/root# ping -I vlan12 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=55 time=41.116 ms
    64 bytes from 8.8.8.8: seq=1 ttl=55 time=55.724 ms
    64 bytes from 8.8.8.8: seq=2 ttl=55 time=37.905 ms
    64 bytes from 8.8.8.8: seq=3 ttl=55 time=103.212 ms
    64 bytes from 8.8.8.8: seq=4 ttl=55 time=54.365 ms

    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 5 packets received, 0% packet loss
    round-trip min/avg/max = 37.905/58.464/103.212 ms
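
    One possible explanation, offered as an assumption rather than something confirmed in this thread: Linux consults policy rules (listed by ip rule list) before any routing table, while route -n and ip route list show only the main table. If the multiwan code installs a rule that sends traffic for each WAN's DNS servers to a per-WAN table, ip route get 8.8.8.8 would keep answering vlan1 no matter what the main table says. The toy model below sketches that lookup order. The table ids, the rule, and every name in it are hypothetical and are not taken from the Tomato source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One route: destination network, netmask, and outgoing device name. */
struct route { uint32_t dst, mask; const char *dev; };

/* One routing table: a numeric id plus its routes. */
struct table { int id; const struct route *routes; size_t n; };

/* One policy rule: destinations matching to/mask consult table `table`.
 * Rules are assumed to be sorted by ascending priority, the order in
 * which the kernel evaluates them. */
struct rule { uint32_t to, mask; int table; };

/* Build a host-order IPv4 address from four octets. */
uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (uint32_t)((a << 24) | (b << 16) | (c << 8) | d);
}

/* Longest-prefix match inside one table; NULL when nothing matches.
 * For contiguous netmasks a numerically larger mask is a longer prefix. */
const char *table_lookup(const struct table *t, uint32_t dst)
{
    const struct route *best = NULL;
    for (size_t i = 0; i < t->n; i++) {
        const struct route *r = &t->routes[i];
        if ((dst & r->mask) != r->dst)
            continue;
        if (best == NULL || r->mask > best->mask)
            best = r;
    }
    return best ? best->dev : NULL;
}

/* Walk the rules in order; the first rule whose table yields a route wins.
 * A rule whose table has no matching route falls through to the next rule. */
const char *policy_lookup(const struct rule *rules, size_t nrules,
                          const struct table *tables, size_t ntables,
                          uint32_t dst)
{
    assert(tables != NULL || ntables == 0);
    for (size_t i = 0; i < nrules; i++) {
        if ((dst & rules[i].mask) != rules[i].to)
            continue;
        for (size_t t = 0; t < ntables; t++) {
            if (tables[t].id != rules[i].table)
                continue;
            const char *dev = table_lookup(&tables[t], dst);
            if (dev != NULL)
                return dev;
        }
    }
    return NULL;
}
```

    In this model, a rule for 8.8.8.8/32 that points at a table whose default route is vlan1 makes policy_lookup return vlan1 even after an 8.8.8.8/32 route via vlan12 is added to the main table, which matches the transcript above. Running ip rule list and ip route show table all on the router, if the BusyBox ip applet supports them, would show whether such rules actually exist.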
     
  4. ajtish

    ajtish Connected Client Member

    Solved the WAN2 DNS problem by editing the dns_to_resolv function in the source file services.c.

    The modification pulls the DNS servers for all WANs. The only catch is that the DNS servers have to be different for each WAN, as I have not determined how the apparently static routes to each WAN's DNS servers are created, or how to remove or flush them. I am not confident in how solid the code is for all users, but this solution works for our configuration and keeps our site-to-site VPN up if the primary WAN fails.

    The function starts on line 606 in the original release/src/router/rc/services.c file.

    Original:
    Code:
    void dns_to_resolv(void)
    {
        FILE *f;
        const dns_list_t *dns;
        int i;
        mode_t m;
        char wan_prefix[] = "wanXX";
        int wan_unit,mwan_num;
    
        mwan_num = nvram_get_int("mwan_num");
        if(mwan_num < 1 || mwan_num > MWAN_MAX){
            mwan_num = 1;
        }
        for(wan_unit = 1; wan_unit <= mwan_num; ++wan_unit){
            get_wan_prefix(wan_unit, wan_prefix);
            if(check_wanup(wan_prefix) && get_dns(wan_prefix)->count) break;
        }
    
        m = umask(022);    // 077 from pppoecd
        if ((f = fopen(dmresolv, "w")) != NULL) {
            // Check for VPN DNS entries
            if (!write_pptpvpn_resolv(f) && !write_vpn_resolv(f)) {
    #ifdef TCONFIG_IPV6
                if (write_ipv6_dns_servers(f, "nameserver ", nvram_safe_get("ipv6_dns"), "\n", 0) == 0 || nvram_get_int("dns_addget"))
                    write_ipv6_dns_servers(f, "nameserver ", nvram_safe_get("ipv6_get_dns"), "\n", 0);
    #endif
                dns = get_dns(wan_prefix);    // static buffer
                if (dns->count == 0) {
                    // Put a pseudo DNS IP to trigger Connect On Demand
                    if (nvram_match("ppp_demand", "1")) {
                        switch (get_wan_proto()) {
                        case WP_PPPOE:
                        case WP_PPP3G:
                        case WP_PPTP:
                        case WP_L2TP:
                            fprintf(f, "nameserver 1.1.1.1\n");
                            break;
                        }
                    }
                }
                else {
                    for (i = 0; i < dns->count; i++) {
                        if (dns->dns[i].port == 53) {    // resolv.conf doesn't allow for an alternate port
                            fprintf(f, "nameserver %s\n", inet_ntoa(dns->dns[i].addr));
                        }
                    }
                }
            }
            fclose(f);
        }
        umask(m);
    }
    
    Modified:
    Code:
    void dns_to_resolv(void)
    {
        FILE *f;
        const dns_list_t *dns;
        int i;
        mode_t m;
        char wan_prefix[] = "wanXX";
        int wan_unit,mwan_num;
    
        mwan_num = nvram_get_int("mwan_num");
        if(mwan_num < 1 || mwan_num > MWAN_MAX){
            mwan_num = 1;
        }
        
        m = umask(022);    // 077 from pppoecd
        if ((f = fopen(dmresolv, "w")) != NULL) {
            // Check for VPN DNS entries
            if (!write_pptpvpn_resolv(f) && !write_vpn_resolv(f)) {
    #ifdef TCONFIG_IPV6
                if (write_ipv6_dns_servers(f, "nameserver ", nvram_safe_get("ipv6_dns"), "\n", 0) == 0 || nvram_get_int("dns_addget"))
                    write_ipv6_dns_servers(f, "nameserver ", nvram_safe_get("ipv6_get_dns"), "\n", 0);
    #endif
                for(wan_unit = 1; wan_unit <= mwan_num; ++wan_unit){
                    get_wan_prefix(wan_unit, wan_prefix);
                    if(check_wanup(wan_prefix) && get_dns(wan_prefix)->count) {
                        dns = get_dns(wan_prefix);    // static buffer
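                        if (dns->count == 0) {    // never true here: the enclosing if already requires count > 0, so the Connect On Demand fallback below is skipped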
                            // Put a pseudo DNS IP to trigger Connect On Demand
                            if (nvram_match("ppp_demand", "1")) {
                                switch (get_wan_proto()) {
                                case WP_PPPOE:
                                case WP_PPP3G:
                                case WP_PPTP:
                                case WP_L2TP:
                                    fprintf(f, "nameserver 1.1.1.1\n");
                                    break;
                                }
                            }
                        }
                        else {
                            for (i = 0; i < dns->count; i++) {
                                if (dns->dns[i].port == 53) {    // resolv.conf doesn't allow for an alternate port
                                    fprintf(f, "nameserver %s\n", inet_ntoa(dns->dns[i].addr));
                                }
                            }
                        }
                    }
                }
            }
            fclose(f);
        }
        umask(m);
    }
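
    The change boils down to writing the nameservers of every WAN that is up instead of stopping at the first one. A minimal self-contained sketch of that loop follows. struct wan_dns, MAX_WANS, MAX_DNS and write_all_wan_dns are hypothetical stand-ins for check_wanup(), get_dns() and MWAN_MAX, not firmware code. The sketch also skips duplicate addresses, which the patch above does not do. Skipping duplicates only keeps the file tidy; it does not lift the requirement that each WAN use different DNS servers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WANS 4   /* hypothetical limit, standing in for MWAN_MAX */
#define MAX_DNS  3   /* hypothetical per-WAN nameserver limit */

/* Hypothetical stand-in for what check_wanup() and get_dns() report per WAN. */
struct wan_dns {
    int up;                      /* non-zero when the WAN is up */
    int count;                   /* number of valid entries in addr[] */
    const char *addr[MAX_DNS];   /* dotted-quad nameserver addresses */
};

/*
 * Write one "nameserver" line for each distinct address found on any WAN
 * that is up. Returns the number of lines written.
 */
int write_all_wan_dns(FILE *out, const struct wan_dns *wans, int nwans)
{
    const char *seen[MAX_WANS * MAX_DNS];
    int nseen = 0;

    assert(out != NULL && wans != NULL);
    if (nwans > MAX_WANS)
        nwans = MAX_WANS;

    for (int w = 0; w < nwans; w++) {
        if (!wans[w].up)
            continue;                       /* skip WANs that are down */
        for (int i = 0; i < wans[w].count && i < MAX_DNS; i++) {
            int dup = 0;
            for (int s = 0; s < nseen; s++) {
                if (strcmp(seen[s], wans[w].addr[i]) == 0) {
                    dup = 1;                /* already written for another WAN */
                    break;
                }
            }
            if (dup)
                continue;
            seen[nseen++] = wans[w].addr[i];
            fprintf(out, "nameserver %s\n", wans[w].addr[i]);
        }
    }
    return nseen;
}
```

    Called with one entry per configured WAN and the open resolv file as out, this writes the primary WAN's servers first and the failover WAN's servers after them, so dnsmasq sees both sets.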
    
     