
Connection Storms Crash Routers, Challenge Programmers

Discussion in 'Tomato Firmware' started by Planiwa, Sep 18, 2009.

  1. Planiwa

    Planiwa LI Guru Member

    Most people don't even know of the existence of Connection Storms. That includes the programmers who write the P2P applications that unleash them, as well as the programmers who write the router software that crashes when faced with Connection Storms, rather than containing them.

    Tomato's "Advanced > Conntrack/Netfilter" WebGUI page displays Current Connection Counts, by Connection State, ephemerally (and incompletely!).

    Currently it is extremely costly to obtain these counts from shell scripts, as this involves reading and processing in detail the entire Conntrack Table.

    I would like to issue an invitation to capable programmers to make the Current Connection Counts available in /proc pseudofiles, in order to enable detection, measurement, and control of Connection Storms, by sys admins, with shell scripts.
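
    A minimal sketch of the expensive approach described above: one full pass over the Conntrack Table to get counts by state. It assumes the 2.4-kernel /proc/net/ip_conntrack line layout (protocol name first, TCP state in the fourth field, UDP flags shown as [UNREPLIED] or [ASSURED]). The function name count_by_state and the output labels are illustrative, not part of Tomato.

```shell
#!/bin/sh
# Count conntrack entries by protocol and state in a single awk pass.
# Assumed line layout (2.4-era ip_conntrack):
#   tcp 6 <ttl> <STATE> src=... dst=... ...
#   udp 17 <ttl> src=... dst=... [UNREPLIED] ...   (or [ASSURED], or no flag)
# $1 = file to read (defaults to /proc/net/ip_conntrack)
count_by_state() {
    awk '
        $1 == "tcp" { n["tcp " $4]++; next }
        $1 == "udp" {
            if (index($0, "[UNREPLIED]"))    n["udp UNREPLIED"]++
            else if (index($0, "[ASSURED]")) n["udp ASSURED"]++
            else                             n["udp (none)"]++   # the "U-" entries
            next
        }
        { n[$1 " other"]++ }
        END { for (k in n) print n[k], k }
    ' "${1:-/proc/net/ip_conntrack}" | sort -k2
}
```

    Every call reads and parses the whole table, which is exactly the cost that a /proc counter would avoid.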

    # # #

    I would like to direct the reader's attention to the attachment.

    We see that near the start of the chart, at 19:44:26, the number of connections jumped by 150 or more in just 5 seconds, to 381. During the next 10 seconds it continued to rise slowly to 437, levelled off for 10 seconds, and then dropped by 300, 5 seconds later. What had happened?

    Here is a record of these connections, by idle time and connection state:

    Code:
    HH:MM:SS  0s  1s  2s  3s  >3s Total  State -180 E1200 S120 R60 F120 T120 C10 CW60 LA30 L120 UU/-30 UA180
    19:44:30   0   2   4   0  374   380  U-:316 (DNSe):159 (DNSi):158 E:40 T:21 UA:1 Cl:1 CW:1                   
    19:44:38   0   4   1  19  414   438  U-:371 (DNSe):188 (DNSi):184 E:40 T:25 UA:1 Cl:1                   
    19:44:47   0   3   2   2  419   426  U-:368 (DNSe):186 (DNSi):183 E:38 T:19 UA:1                        
    19:44:53   0   1   0   1  124   126  U-:95 (DNSe):54 (DNSi):41 E:23 T:8                                 
    19:44:59   0   4   6   1  108   119  U-:60 E:37 (DNSi):31 (DNSe):30 T:21 UA:1                           
    19:45:06   0   4   0   0   56    60  E:37 T:20 U-:2 (DNSi):2 UA:1 (DNSe):1                              
    
    Almost all the connections are idle (i.e. defunct) and waiting to time out (i.e. be removed from the Conntrack Table).

    The entire 300+ connection surge is UDP. Unclassified UDP -- the kind that the abovementioned GUI page forgets to count. These are DNS queries. Half are to the Router, the other half are their relays, from router to NS. 30 seconds later they are gone -- timed out.

    This is just one kind of Connection Storm.

    # # #

    Other kinds of Connection Storms start up numerous SYN Sent or UDP connections that may or may not convert into short-lived Established connections, and then immediately wait to time out in Time-Wait state or UDP assured or UDP "stateless".

    The system under study uses Victek's default timeouts, i.e.:
    -180 E1200 S120 R60 F120 T120 C10 CW60 LA30 L120 UU/-30 UA180

    I invite those who understand the connection mechanisms in this context to explain why the timeouts for the States that sustain Connection Storms should not be reduced greatly, namely thus:

    Code:
    SS: 120s -> 5s   
    TW: 120s -> 5s
    UU:  30s -> 5s (== U-)
    UA: 180s -> 10s
    Once we understand that a Connection Storm involves sharp surges of hundreds, if not thousands of "connections" that can crash the router, that typically last for less than a minute, and are made up almost exclusively of defunct "connections" that will time out in 2 or 3 minutes, it makes sense to get rid of these defunct "connections" before they can build up a storm that crashes the router.

    Thus, if the connections are created at a rate of 150 per second, and expire in 120 seconds, we risk a storm of magnitude 18,000. But if we reduce the timeouts (of those already defunct "connections") to 5s, the maximum storm size is 750.
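
    The arithmetic behind those two figures can be sketched as follows: entries that live for a fixed timeout accumulate to at most rate times timeout. The function name max_storm_size is hypothetical.

```shell
#!/bin/sh
# Upper bound on tracked entries: creation rate (per second) x timeout (seconds).
# $1 = new connections per second, $2 = timeout in seconds
max_storm_size() {
    echo $(( $1 * $2 ))
}

max_storm_size 150 120   # prints 18000
max_storm_size 150 5     # prints 750
```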

    The problem is not (as the conventionally-wise coders have been assuming), that Established TCP connections linger for 5 days and slowly fill up the Conntrack Table. The problem is that some P2P (and other reckless) applications will attempt to create thousands of connections instantly, and while most of those connections never materialize, they create a huge connection storm during the attempt.

    What I have suggested in this message could significantly alleviate this problem that may be the most serious one affecting Tomato users.

    I am not calling for developers to feature-solve the problem (which they can not be expected to understand), but for a capable programmer to make available those Current Connection Counts, to enable sys admins to detect, measure, understand, and control these Connection Storms.

    Edit:

    1. It would be ideal if, in addition to the Total Current Connection Count, the Current Connection Counts-by-State would be available, just as the Timeouts-by-State are available.

    2. Ideally, such counts should include the count for "UDP (None) -- U-", in addition to the counts for "UDP Assured -- UA" and "UDP Unresponsive -- UU". "UDP (None)" is missing from the "Advanced > Conntrack/Netfilter" WebGUI page. (As a result, the counts on that page fail to add up, when there are U- connections!)

    (Even though the vast majority of UDP connections are U-, there is no separate Timeout for U-. The Timeout for U- is implicitly inherited from UU. This confusion (and omission) may be rooted in the challenge to "Connection-Track" the "Connection-less" UDP Flows.)
  2. teddy_bear

    teddy_bear LI Guru Member

    This looks like it could be useful, and is easy to do. I'll include it into the next build of my mod, and if others want to implement it before I push it into git, here's the patch:
    Code:
    diff --git a/release/src/linux/linux/net/ipv4/netfilter/ip_conntrack_core.c b/release/src/linux/linux/net/ipv4/netfilter/ip_conntrack_core.c
    index b856357..25c186c 100644
    --- a/release/src/linux/linux/net/ipv4/netfilter/ip_conntrack_core.c
    +++ b/release/src/linux/linux/net/ipv4/netfilter/ip_conntrack_core.c
    @@ -61,7 +61,7 @@ LIST_HEAD(protocol_list);
     static LIST_HEAD(helpers);
     unsigned int ip_conntrack_htable_size = 0;
     int ip_conntrack_max = 0;
    -static atomic_t ip_conntrack_count = ATOMIC_INIT(0);
    +atomic_t ip_conntrack_count = ATOMIC_INIT(0);
     struct list_head *ip_conntrack_hash;
     static kmem_cache_t *ip_conntrack_cachep;
     static LIST_HEAD(unconfirmed);
    @@ -1415,6 +1415,7 @@ static struct nf_sockopt_ops so_getorigdst
     #define NET_IP_CONNTRACK_MAX 2089
     #define NET_IP_CONNTRACK_TCP_TIMEOUTS  2090
     #define NET_IP_CONNTRACK_UDP_TIMEOUTS  2091
    +#define NET_IP_CONNTRACK_COUNT 2092
     #define NET_IP_CONNTRACK_MAX_NAME "ip_conntrack_max"
     
     #ifdef CONFIG_SYSCTL
    @@ -1423,6 +1424,8 @@ static struct ctl_table_header *ip_conntrack_sysctl_header;
     static ctl_table ip_conntrack_table[] = {
           { NET_IP_CONNTRACK_MAX, NET_IP_CONNTRACK_MAX_NAME, &ip_conntrack_max,
             sizeof(ip_conntrack_max), 0644,  NULL, proc_dointvec },
    +      { NET_IP_CONNTRACK_COUNT, "ip_conntrack_count", &ip_conntrack_count,
    +        sizeof(ip_conntrack_count), 0444,  NULL, proc_dointvec },
           { NET_IP_CONNTRACK_TCP_TIMEOUTS, "ip_conntrack_tcp_timeouts",
               &sysctl_ip_conntrack_tcp_timeouts,
               sizeof(sysctl_ip_conntrack_tcp_timeouts),
    
    The connection count will be available in /proc/sys/net/ipv4/ip_conntrack_count.
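
    A minimal sketch of how a script might use that pseudofile, assuming the patch is applied and the file holds a single integer. The function name check_storm, the threshold values and the output format are hypothetical.

```shell
#!/bin/sh
COUNT_FILE=/proc/sys/net/ipv4/ip_conntrack_count   # path given with the patch

# Print "STORM <n>" if the current count exceeds the threshold, else "ok <n>".
# $1 = threshold (illustrative value), $2 = file to read (defaults to COUNT_FILE)
check_storm() {
    threshold=$1
    file=${2:-$COUNT_FILE}
    count=$(cat "$file") || return 1   # one small read, no table walk
    if [ "$count" -gt "$threshold" ]; then
        echo "STORM $count"
    else
        echo "ok $count"
    fi
}
```

    A monitoring loop could then be as simple as "while sleep 5; do check_storm 1000; done", with the threshold chosen to suit the router.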
  3. Toastman

    Toastman Super Moderator Staff Member Member

    This may be a big step forward to preventing the instability we all experience from time to time. Thanks to Planiwa for his work on revealing some of the "skeletons" in the cupboard, and to T/B for doing that mod! I look forward to putting it to use.

    Would it be possible to add the output in a visible form in the "conntrack" page?
  4. Planiwa

    Planiwa LI Guru Member

    The mod makes it possible for scripts to answer the question "how many connections are in the table now?", without having to pay a huge price in processing. This makes it possible to write scripts to detect, measure, track, and (eventually) control storms, without totally bogging down the router.

    The Conntrack page already gives a (fleeting) answer to that question, but there is no way to use that answer to automate anything.


    The one thing that should be added to that page is a count of Unqualified UDP connections, i.e. UDP connections other than UDP Unreplied and UDP Assured.
    These may be the vast majority of connections during a storm, yet they are missing from the page.

    These unqualified UDP connections have no explicit timeout of their own, but inherit the timeout from "UDP Unreplied".

    While this is almost completely unknown, it greatly affects connection storms and thus router stability. It is related to the mystery of Unclassified connections in QoS.
  5. Toastman

    Toastman Super Moderator Staff Member Member

    Interesting. If these can be counted and displayed, would it be also possible to add the function to time them out quickly? Or am I prematurely jumping one step ahead ? :biggrin: You see, I'm excited by the investigation Planiwa has been doing. It's going to my head :eek:

    I've been monitoring these connection storms at five sites for a few days now. There have been many connection storms (hundreds) involving thousands of UDP connections opening in very short periods, about which the present Tomato can do nothing - it is really pot luck whether the routers survive it or crash.

    It would be really something if Tomato could be made more reliable than any of the competition!
  6. RonWessels

    RonWessels LI Guru Member

    Perhaps I'm not understanding the full problem here. But isn't this solvable by some simple IPTABLES rules, something along the lines of
    Code:
    iptables -I INPUT -i br0 -m state --state NEW -m recent --set
    iptables -I INPUT -i br0 -m state --state NEW -m recent --update --seconds 1 --hitcount 10 -j DROP
    If I understand correctly, the above two rules will limit incoming new connections to the router from individual sources to 10/second, and thus limit the connection storm.
  7. Toastman

    Toastman Super Moderator Staff Member Member

    Thanks Ron. I haven't seen this particular script format. I see it is for all connections not just UDP, so it will somewhat slow down complex but legitimate web pages. I will try it out now.

    The one I tried is the usual one posted on the forums:

    #Limit UDP opens from all users to 4 per second
    iptables -A FORWARD -p UDP -s 192.168.1.0/24 -m limit --limit 4/s -j ACCEPT
    iptables -A FORWARD -p UDP -s 192.168.1.0/24 -j DROP

    Which doesn't seem to limit storms, with either -A or -I. (I was told the second line is necessary, but it's not usually posted).
  8. i1135t

    i1135t Network Guru Member

    Are you trying to limit inbound or outbound UDP connections?
  9. Toastman

    Toastman Super Moderator Staff Member Member

    Outbound mostly is the problem.
  10. Planiwa

    Planiwa LI Guru Member

    But we should keep in mind that while "ultimately" all connections originate with a LAN host, "technically" the tracked connections may be:

    [1] LAN -> WAN
    [2] LAN -> Router-LANIF (DNS Query)
    [3] Router-WANIF -> WAN (DNS Relay)
    [4] WAN -> Router-WANIP (Port-forwarded external source (P2P-callback))
  11. ntest7

    ntest7 Network Guru Member

    I wonder if maybe there's a more appropriate place for this rule than the FORWARD chain. Maybe the INPUT chain? Or maybe both?

    And after further review, I don't think the second line in your above rules is necessary. Looks as if it's only needed in chains that default to ACCEPT.
  12. mstombs

    mstombs Network Guru Member

    I think INPUT could be a good call. I wonder if the router crash can be caused by dnsmasq trying to service all the DNS requests and simply using up all the free ram? Did the problem get worse after Tomato 1.22? There were some pretty hasty patches to dnsmasq to fix the great dns potential vulnerability - later versions use source port randomization which probably requires more processing to support?
  13. Toastman

    Toastman Super Moderator Staff Member Member

    I can't be certain but I think it's been like this for a long time. Only back then, we often blamed the crashes and reboots on the wireless driver. Now that has been largely fixed with the ND drivers, it shows up more obviously.

    Certainly free RAM disappears so quickly during a connection storm you often don't even get any log messages. Decreasing the maximum no. of connections in Conntrack assists, increasing it makes the problem worse.

    The existing conntrack rules seem to have been designed assuming all applications to be well behaved. But they aren't...

    Thanks ntest7 ....
  14. i1135t

    i1135t Network Guru Member

    Now, I'm not sure if this would work, but it may be worth a try:

    Code:
    iptables -t nat -I PREROUTING -i br0 -p udp -m state --state NEW -m recent --set --name UDP
    iptables -t nat -I PREROUTING -i br0 -p udp -m state --state NEW -m recent --update --seconds 1 --hitcount 100 --rttl --name UDP -j DROP
    iptables -I INPUT -p udp -m state --state NEW -m recent --set --name UDP
    iptables -I INPUT -p udp -m state --state NEW -m recent --update --seconds 1 --hitcount 100 --rttl --name UDP -j DROP
    iptables -t nat -I POSTROUTING -o br0 -p udp -m state --state NEW -m recent --set --name UDP
    iptables -t nat -I POSTROUTING -o br0 -p udp -m state --state NEW -m recent --update --seconds 1 --hitcount 100 --rttl --name UDP -j DROP
    iptables -I FORWARD -p udp -m state --state NEW -m recent --set --name UDP
    iptables -I FORWARD -p udp -m state --state NEW -m recent --update --seconds 1 --hitcount 100 --rttl --name UDP -j DROP
    It should start blocking for one second after 100 UDP hits, regardless of what chain the packets are travelling... I think? :)
  15. RonWessels

    RonWessels LI Guru Member

    As I understand the iptables traversal rules, your entries in the nat table/PREROUTING chain and nat table/POSTROUTING chain are redundant and inappropriate for the purpose of the nat table. This sort of, well, "filtering" should be done in the (surprise) "filtering" table.

    Your entries in the filtering table/INPUT chain will take care of packets destined for the router itself, while the entries in the filtering table/FORWARD chain will take care of packets to be forwarded.

    Also, if I understand the functionality correctly, the same filtering can be done using the "recent" module (-m recent) or the "limit" module (-m limit).
  16. i1135t

    i1135t Network Guru Member

    That's true, but I never stated that I was an expert with iptables... The tables provided were modified from existing code that I already use and are known to work against brute force attacks for open ports. Planiwa suggested that there were many instances where these storms could form, so I did my best to provide a "fix" for the problem. This is a first attempt, so there may be some redundant data, but I appreciate someone, who has more knowledge in this area, as yourself, to point it out. :)
  17. Toastman

    Toastman Super Moderator Staff Member Member

    I appreciate any help - being completely useless with scripts. Anyway, the following suggested by Ron was installed on one of my routers, but a connection storm happened this afternoon opening thousands of connections again. Assuming the syntax is right, can anyone comment on why it may not be working, or perhaps it just does not prevent these UDP storms? Planiwa?

    iptables -I INPUT -i br0 -m state --state NEW -m recent --set
    iptables -I INPUT -i br0 -m state --state NEW -m recent --update --seconds 1 --hitcount 10 -j DROP
  18. RonWessels

    RonWessels LI Guru Member

    Unfortunately, I'm not an expert with iptables either, so I'm not sure what's going wrong. Do you know if the connection storm was packets destined for the router or packets destined for the outside world? Those two lines will only stop packets destined for the router.

    By the way i1135t, I should apologize. On re-reading my previous post, it comes across as condescending, which is not what I intended.
  19. Toastman

    Toastman Super Moderator Staff Member Member

    Ron, I'm not sure exactly - the log is full of conntrack full - dropped "n" packets. But quite probably not only from the client to the router, but also DNS lookups to outside etc. too, as described above by Planiwa. However, I imagine that if you can stop the client's packets reaching the router, then that would take care of the others too.

    Anyway, I'm leaving it running, to see if it happens again. Unfortunately I'm not too good at reading iptables rules to see if they have taken hold. Maybe you can see if this rule does anything your end?
  20. i1135t

    i1135t Network Guru Member

    Ron, no problem. I tend to get a little defensive sometimes and jump to conclusions. We all make mistakes, no biggie.

    Toastman, I don't know what to say. According to what I know about iptables, it should work. The code you posted should block all inbound connections to the router that exceed 10 hits, including TCP, so I'm surprised you didn't get locked out of your router, haha. I tried some of the iptables suggested in this thread and tested it by flooding my router with UDP packets to port 53 and the iptables don't do anything. I see in the conntrack connections for UDP "assured" jump from 100 to 500+ in a matter of seconds, so none of the iptable entries work.

    Are you entering in these iptables through SSH or through the Administration --> Scripts --> Firewall entry? If you put it from there, you will need to do it backwards as when it's read by the router, the second rule in your entry is put on top of the first rule due to -I option. If you SSH'd into your router and did "iptables -vL INPUT", you will see what I mean. Either way, that doesn't fix it either, but I thought I should point that out. Maybe someone who is more knowledgeable about the firmware itself can chime in? It does baffle me though as to why the rules aren't doing what they should be doing...
  21. ntest7

    ntest7 Network Guru Member

    I believe the problem is that the UDP packets are accepted by other rules, so adding limit/recent rules doesn't have any effect.

    Try this (from ssh/telnet if you want to try as a temporary test)

    iptables -I INPUT -p udp -j DROP
    iptables -I INPUT -p udp -m limit --limit 10/s --limit-burst 20 -j ACCEPT

    adjust the 10/s and the 20 burst as desired...
  22. RonWessels

    RonWessels LI Guru Member

    Actually, the "recent" module keeps track of events on a per-IP basis, so unless he was attempting to get to the router from the machine creating the packet storm, it would have no effect.

    In fact, thinking about it, this would be the difference between implementing it via "-m recent" vs "-m limit" - using the limit module would lump all attempts together, so it would have blocked router access after a connection storm.

    I'm still stumped why it didn't work. I'll play around when I get home and see if I can spot something.
  23. Toastman

    Toastman Super Moderator Staff Member Member

    What's the difference between the limit and "burst" ? Why are both used/what do they do?

    Thanks!
  24. vyrticl

    vyrticl Networkin' Nut Member

    I think Ron is on the right path.

    However, reading over the iptables man page and looking over some examples, shouldn't the statements be swapped to:

    Code:
    iptables -I INPUT -i br0 -m state --state NEW -m recent --update --seconds 1 --hitcount 10 -j DROP
    iptables -I INPUT -i br0 -m state --state NEW -m recent --set -j ACCEPT
    
    If the '--set' statement is called first won't it just keep updating the timestamp on the existing entry so it will never be able to tell if it's getting the packets too fast?

    Maybe even adding '-p udp' to it as well just so you're only blocking UDP packets since they seem to be the major causes of these connection storms.

    I've never used the recent module in iptables though so this is just an observation/thought.
  25. mstombs

    mstombs Network Guru Member

    If you enter the "-I" commands in that order the second one gets inserted above the first!

    By the way, there appears to be a limit of 20 for "hitcount". ipt_recent has been rewritten for late 2.6 kernels and now warns if this limit is hit - Tomato will probably just fail to insert the rule if greater than 20. There is also a bug in ipt_recent re the use of "jiffies" which doesn't deal with wrap-around properly - not sure if this is significant.

    I use "iptables -nvL" and "iptables -nvL -t nat" to view the effective rules and watch the counts.
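
    The ordering effect of "-I" can be sketched with a pure-shell model of a chain. This does not call iptables; chain_insert and chain_append are hypothetical helpers that only mimic where "-I" (no index) and "-A" place a rule.

```shell
#!/bin/sh
NL='
'
CHAIN=""

# Model of "iptables -I": the new rule goes to the top of the chain.
chain_insert() { CHAIN="$1${CHAIN:+$NL$CHAIN}"; }
# Model of "iptables -A": the new rule goes to the bottom of the chain.
chain_append() { CHAIN="${CHAIN:+$CHAIN$NL}$1"; }

# Enter the DROP rule first, then the rate-limited ACCEPT rule.
chain_insert "-p udp -j DROP"
chain_insert "-p udp -m limit --limit 10/s --limit-burst 20 -j ACCEPT"

# The rule entered last is evaluated first:
printf '%s\n' "$CHAIN"
```

    In this model the ACCEPT rule ends up above the DROP rule, which is why a pair of "-I" commands has to be entered in the reverse of the intended evaluation order.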
  26. ntest7

    ntest7 Network Guru Member

    The limit specifies an average over the time period specified. The limit-burst parameter specifies an initial burst that's allowed. Here's a pretty good explanation:
    http://thelowedown.wordpress.com/2008/07/03/iptables-how-to-use-the-limits-module/

    It's likely my example limits of 10/s 20/burst are a bit low for real use, but I'm not yet sure what numbers would work better. This is just sort of a first try. With a 30 second UDP timeout, 10/s would allow a maximum of 300 unreplied connections, which seems kind of low.

    As for whether "limit" or "recent" is the right module to use, here's some differences:
    limit - averages traffic so the limit is not exceeded.
    recent - client is disabled until it stops talking for the specified time unit; the amount of time blocked may greatly exceed the time unit set if the client doesn't stop talking.
    --
    limit - global
    recent - per-client
    --
    limit - not much processing or memory used.
    recent - more processing and memory used.

    While the limit and recent modules are somewhat similar, I think the limit module is probably more appropriate for this purpose - it's kind of a "leaky bucket" approach that slows down connections that exceed the limit, but still allows traffic. But that's really more of a personal preference than a decree. It's unfortunate that it appears we can't have it both ways - per-client but not completely blocked.
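
    The "leaky bucket" behaviour of the limit module can be approximated with a coarse simulation. This is a simplified one-second-tick model, not the kernel's algorithm (which refills credit continuously), and the function name simulate_limit is hypothetical.

```shell
#!/bin/sh
# Coarse model of "-m limit --limit RATE/s --limit-burst BURST".
# $1 = rate per second, $2 = burst, $3 = offered packets per second, $4 = seconds
simulate_limit() {
    rate=$1 burst=$2 offered=$3 seconds=$4
    tokens=$burst                 # the bucket starts full
    t=1
    while [ "$t" -le "$seconds" ]; do
        # Accept as many packets as there are tokens, up to what is offered.
        if [ "$offered" -lt "$tokens" ]; then
            accepted=$offered
        else
            accepted=$tokens
        fi
        tokens=$(( tokens - accepted + rate ))   # spend, then refill
        if [ "$tokens" -gt "$burst" ]; then
            tokens=$burst                        # never exceed the burst size
        fi
        echo "second $t: accepted $accepted of $offered"
        t=$(( t + 1 ))
    done
}

# A 150/s storm against the 10/s, burst 20 example:
simulate_limit 10 20 150 3
# second 1: accepted 20 of 150
# second 2: accepted 10 of 150
# second 3: accepted 10 of 150
```

    The burst is spent at once and the sustained rate then holds at the configured limit, so traffic is slowed rather than blocked outright.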
  27. i1135t

    i1135t Network Guru Member

    Ntest, that worked, but it seems to severely limit my UDP connections. I have a VoIP phone at home and it drops the line completely. I will have to play with it a little more when I have time, but it appears you got it working somewhat. I will have to see how those tables will affect the overall performance of my entire LAN since I added them to the FORWARD chain as well. Anyways, thanks all.
  28. ntest7

    ntest7 Network Guru Member

    I expect for "real" use you will need to significantly increase the limits. The 10/s 20/burst I showed is just for testing. Now that we know it works, it's just a matter of tweaking the values so that legit traffic isn't affected.

    I don't think you need them in both chains, probably just the INPUT chain is all that's needed.
  29. i1135t

    i1135t Network Guru Member

    Actually I may need it for the FORWARD chain as well, since outbound UDP connections not ending at the router will traverse through that chain. I will do some further testing to confirm...
  30. Toastman

    Toastman Super Moderator Staff Member Member

    Sorry guys, been busy lately and forgot to post back. ntest7 - that script works beautifully.

    I used scripts in both chains, I think it gives better control.

    i1135t, what were your findings?

    I used 10 and 20 and those settings seem fine for general use here.


    EDIT:

    Since using these scripts with the other scripts below also enabled, I have found on a few occasions that the incoming INBOUND LIMIT does not work correctly, it is limiting incoming traffic (only roughly) at about double the speed it is set to. Something is being broken. At the moment I cannot figure this out. It may not be connected but ... I feel probably the order of the lines in these scripts is wrong.


    So I would welcome any comment about faults or duplications, unnecessary lines and the order in the following. You can see what I want to do, but I'm not there yet.


    # Access modem at 192.168.0.1 on subnet on vlan1
    iptables -I POSTROUTING -t nat -o vlan1 -d 192.168.0.0/24 -j MASQUERADE

    #Limit UDP opens from all users - UDP to Router
    iptables -I INPUT -p udp -j DROP
    iptables -I INPUT -p udp -m limit --limit 10/s --limit-burst 20 -j ACCEPT

    #Limit UDP opens from all users - UDP out to WAN
    iptables -I FORWARD -p udp -j DROP
    iptables -I FORWARD -p udp -m limit --limit 10/s --limit-burst 20 -j ACCEPT

    #Limit UDP and other connections per user
    iptables -I FORWARD -m iprange --src-range 192.168.1.10-192.168.1.250 -p ! tcp -m connlimit --connlimit-above 50 -j DROP

    #Limit TCP connections per user
    iptables -I FORWARD -p tcp --syn -m iprange --src-range 192.168.1.10-192.168.1.250 -m connlimit --connlimit-above 250 -j DROP

    #Limit outgoing SMTP simultaneous connections
    iptables -I FORWARD -p tcp --dport 25 -m connlimit --connlimit-above 10 -j DROP



    Thanks!
  31. ntest7

    ntest7 Network Guru Member

    I think the --connlimit lines are interfering with the --limit lines.
    Try putting the three --connlimit lines first.
  32. Toastman

    Toastman Super Moderator Staff Member Member

    Will try it, thanks!
