Summary: | ASTERISK-22851: Asterisk/SIP+RTP stops responding when compiled with DEBUG_THREADS | ||||||||||||||
Reporter: | Jeremy Kister (jkister) | Labels: | |||||||||||||
Date Opened: | 2013-11-12 23:15:09.000-0600 | Date Closed: | 2015-06-13 09:58:58 | ||||||||||||
Priority: | Critical | Regression? | Yes | ||||||||||||
Status: | Closed/Complete | Components: | Channels/chan_sip/General Resources/res_rtp_asterisk | ||||||||||||
Versions: | 11.6.0 11.7.0 | Frequency of Occurrence | Frequent | ||||||||||||
Related Issues: |
| ||||||||||||||
Environment: | Debian6 | Attachments: | ( 0) asterisk_cli.txt ( 1) backtrace1.txt ( 2) backtrace1-bt-full.txt ( 3) backtrace-2.txt ( 4) backtrace-2-full.txt ( 5) backtrace-3.txt ( 6) backtrace-3-full.txt ( 7) backtrace-4.txt ( 8) backtrace-4-full.txt ( 9) core-show-locks-2.txt (10) core-show-locks-3.txt (11) core-show-locks-4.txt (12) core_show_locks_gdb_bt_strace.txt (13) debugs.rar (14) iax.conf.txt (15) lsof.txt (16) qualify.txt (17) qualify-tcpdump.txt (18) rb2826_v11.patch (19) sip.conf (20) sip-show-peer.txt (21) strace-and-then-some.txt (22) valgrind.txt | ||||||||||||
Description: | I have regularly (once a week, once per few hundred calls?) been having
problems with Asterisk's SIP stack not responding to packets from any of my registered devices. In the past, I could not tolerate the outage, so i would restart asterisk to make things happy. My Asterisk server is currently in this broken state and I can leave it this way for a short while. Devices are registered to it and I can 'sip qualify peer xxx' - see [^qualify.txt]. notice the console/debug does not show the return packet, but on the network, the device has clearly replied - [^qualify-tcpdump.txt] 'sip show peer xxx' all show Status OK [^sip-show-peer.txt] but whenever one of the devices tries to make a new call, Asterisk just doesnt respond. 'sip set debug on' shows no packets. from the asterisk server (10.1.0.3), i can see one of my phones (10.1.0.111) trying to make a call but Asterisk won't talk back. {code} # tcpdump -i eth0 -s 0 -t -n host 10.1.0.111 ARP, Request who-has 10.1.0.3 tell 10.1.0.111, length 46 ARP, Reply 10.1.0.3 is-at 00:0c:29:07:39:8e, length 28 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.123 > 10.1.0.3.123: NTPv3, Client, length 48 IP 10.1.0.3.123 > 10.1.0.111.123: NTPv3, Server, length 48 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.123 > 10.1.0.3.123: NTPv3, Client, length 48 IP 10.1.0.3.123 > 10.1.0.111.123: NTPv3, Server, length 48 ARP, Request who-has 10.1.0.111 tell 10.1.0.3, length 28 ARP, Reply 10.1.0.111 is-at 00:13:c4:01:da:4a, length 46 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 IP 10.1.0.111.5060 > 10.1.0.3.5060: SIP, length: 926 {code} | ||||||||||||||
Comments: | By: Jeremy Kister (jkister) 2013-11-12 23:25:06.349-0600 here's a [^lsof.txt] of asterisk during the brokenness. By: Jeremy Kister (jkister) 2013-11-12 23:49:44.840-0600 I ran 'strace -f -p 22462' (22462 is pid of asterisk). My screen flooded with [^strace-and-then-some.txt] (ran out of buffer space) and a ton of my phones started ringing, all at once. I quickly hopped into asterisk -r to see what was going on. Everything was coming back online. not sure if it was a coincidence, or if strace -f somehow jiggled something. confused. By: Jeremy Kister (jkister) 2013-11-12 23:50:13.961-0600 and now my Asterisk is all happy again. I did not reload, restart, or change anything besides what I've shown in this ticket. By: Michael L. Young (elguero) 2013-11-13 09:02:03.459-0600 Jeremy, Have you checked your NAT settings? Your Asterisk server (10.1.0.3) thinks it needs to treat 10.1.0.111 as being behind NAT, which is what I saw in [^strace-and-then-some.txt]. [^sip-show-peer.txt] says that Force Rport and Symmetric RTP are turned on for that peer. That was the first thing that stood out to me. Hope that helps. By: Jeremy Kister (jkister) 2013-11-13 11:52:19.335-0600 hi michael, i indeed did 'upgrade' from the old "nat=yes" in sip.conf's general to "nat=force_rport,comedia" during my upgrade of 11.5 to 11.6 (before i noticed this breakage). with "nat=yes" depreciated, is that not the right thing to do? i am under the impression that we are /not/ supposed to have different nat settings for different devices since that leaks usernames. according to the samples/sip.conf: {code} ; IT IS IMPORTANT TO NOTE that if the nat setting in the general section differs from ; the nat setting in a peer definition, then the peer username will be discoverable ; by outside parties as Asterisk will respond to different ports for defined and ; undefined peers. For this reason it is recommended to ONLY DEFINE NAT SETTINGS IN THE ; GENERAL SECTION. Specifically, if nat=force_rport in one section and nat=no in the ; other, then valid peers with settings differing from those in the general section will ; be discoverable. {code} am i understanding this incorrectly ? By: Michael L. Young (elguero) 2013-11-13 12:25:53.576-0600 Jeremy, Do you have localnet and externaddr set? Those are part of the NAT settings I was referring to. Asterisk thinks that your phones on the local network are behind NAT though when they are not. You are correct that this is the recommended setting for 'nat='. By: Jeremy Kister (jkister) 2013-11-13 12:31:51.664-0600 here's my general section of [^sip.conf]. localnet and externhost are set. By: Michael L. Young (elguero) 2013-11-13 12:40:30.869-0600 Well, I am out of ideas right now... By: Walter Doekes (wdoekes) 2013-11-14 03:01:06.743-0600 Jeremy, go get that full backtrace. You can save a core dump while it's still running, if needed. (g gdb generate-core-file attach) By: Jeremy Kister (jkister) 2013-11-18 16:54:04.412-0600 [^backtrace1.txt] is the result of running {noformat} gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c core.pbx1.example.net-2013-11-18T17\:41\:40-0500 {noformat} on a core file made when i ran 'kill -ABRT <astpid>'. Asterisk stopped processing RTP and SIP while I was on a call. I still have the original core file if any different gdb info is needed. By: Matt Jordan (mjordan) 2013-11-18 17:02:22.740-0600 As Walter said, you actually don't have to kill Asterisk. You can attach {{gdb}} to a running instance of Asterisk and simply tell it to dump a backtrace of the running threads. Since you're running with DEBUG_THREADS enabled, can you do a 'core show locks'? By: Jeremy Kister (jkister) 2013-11-18 17:06:18.361-0600 I'll remember to run the 'core show locks' the next time I run into the problem. By: Rusty Newton (rnewton) 2013-12-09 18:18:24.226-0600 Jeremy, the others likely still need a backtrace captured while running, and the "core show locks" output. Has the problem occurred again? By: Walter Doekes (wdoekes) 2013-12-10 01:48:15.317-0600 Well. Running vs. sigabrt doesn't matter. For completeness sake: a "thread apply all bt *full*" would've been required (preferably split over two files, one with full, one without). But that's still a horrible quest. If there's any chance of getting a "core show locks", that'd be so much more pleasant. By: Jeremy Kister (jkister) 2013-12-10 17:42:44.310-0600 [^backtrace1-bt-full.txt] is the result of running {noformat} gdb -se "asterisk" -ex "thread apply all bt full" --batch -c core.pbx1.example.net-2013-11-18T17\:41\:40-0500 {noformat} By: Jeremy Kister (jkister) 2013-12-10 17:44:16.135-0600 the problem has not happened since my previous update (11/18/2013) but i have had to restart the asterisk server a few times in between, so i don't know if that kept it from happening or is irrelevant. the next time it happens, i will be sure to gather the 'core show locks' data as well as the appropriate backtraces. By: Jeremy Kister (jkister) 2013-12-26 14:14:53.003-0600 [^core-show-locks-2.txt] was made by CLI command {noformat} core show locks {noformat} [^backtrace-2.txt] was generated by {noformat} gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c corefile {noformat} [^backtrace-2-full.txt] was generated by {noformat} gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt full" --batch -c corefile {noformat} By: Jeremy Kister (jkister) 2013-12-29 17:05:54.159-0600 another set [^core-show-locks-3.txt] [^backtrace-3.txt] and [^backtrace-3-full.txt] By: Jeremy Kister (jkister) 2013-12-31 16:43:14.467-0600 until this is fixed up, is there a better way to kick asterisk than a little bash heart attack? {noformat} while [ 1 ] ; do if [ "`asterisk -rx 'core show locks' | grep \"Waiting for Lock #1.*MUTEX 9105 sip_pvt_lock\"`" ] ; then if [ $times -eq 1 ] ; then # asterisk is locked. echo "asterisk locked" | sendmail direct@example.com kill -9 `head -1 /var/run/asterisk/asterisk.pid` times=0 else echo "see lock. checking again in 5 sec." times=1 sleep 5 continue fi else times=0 fi sleep 30 done {noformat} By: Matt Jordan (mjordan) 2014-01-02 16:26:34.737-0600 The more I look at this, the more this looks like DEBUG_THREADS mucking up the system. This isn't the first instance this has occurred; there's a few other reports of DEBUG_THREADS randomly hanging up a system. What the backtraces/CLI reports show isn't a deadlock. In all of the backtraces you've attached, there isn't a clear deadlock. For example, in {{backtrace3}}, there are two threads waiting on a channel lock: however, that channel lock isn't in a circular waiting condition. The channel thread is simply out to lunch in the DEBUG_THREADS reentrancy lock. I'd back off of the DEBUG_THREADS. If you were always running it on this system, then it may have been the problem all along. If not, then it's certainly clouding up the problem. If you do have a thread that has gone out to lunch, you can still grab a backtrace using gdb without DEBUG_THREADS enabled and it should be useful. By: Jeremy Kister (jkister) 2014-01-02 18:07:00.293-0600 I've recompiled and reinstalled without DEBUG_THREADS. I suppose I'll keep this ticket and will report back. By: Richard Mudgett (rmudgett) 2014-01-03 11:09:26.468-0600 FYI: See patch for DEBUG_THREADS at https://reviewboard.asterisk.org/r/2826/ Also ASTERISK-22857 By: Rusty Newton (rnewton) 2014-01-21 09:14:16.789-0600 Jeremy, do you have any results from testing with the patch? By: Jeremy Kister (jkister) 2014-01-27 09:04:16.048-0600 Actually I've only been running without DEBUG_THREADS and have not had any problems since 01/02/14. I will recompile with the patch now. By: Jeremy Kister (jkister) 2014-01-27 09:14:41.780-0600 How do i get r2826 in patch format for asterisk 11.7 ? Too hard to reconcile rejects on raw diff. By: Richard Mudgett (rmudgett) 2014-01-27 12:11:16.911-0600 [^rb2826_v11.patch] - Asterisk v11 version of the reviewboard patch. The v1.8 reviewboard patch has suffered some bit rot as it doesn't apply cleanly to the v1.8 branch. I just applied the patch to v1.8 at the original revision (-r398421) and did an SVN up without a problem. The resulting updated v1.8 patch applies cleanly to v11. By: Jeremy Kister (jkister) 2014-01-27 19:45:11.222-0600 now running with DEBUG_THREADS + [^rb2826_v11.patch]. will follow up. By: Jeremy Kister (jkister) 2014-02-09 18:48:16.350-0600 Been running since 1/27 (about two weeks) without trouble. I cannot be positive that the patch has fixed it (can't prove a negative). Will follow up in another week. I can say that the patch has had no /adverse/ effect on my setup. By: Jeremy Kister (jkister) 2014-02-12 23:22:03.452-0600 The patch did not fix the problem. Asterisk has locked. see [^core-show-locks-4.txt] [^backtrace-4.txt] and [^backtrace-4-full.txt] By: Matteo (mpiazzatnetbug) 2014-06-27 04:51:21.817-0500 It's easy to me replicate issue ASTERISK-23928. is there any update patch to test about it? By: Rusty Newton (rnewton) 2014-06-27 11:20:15.993-0500 [~mpiazzatnetbug] If you can easily replicate the issue then please attach an archive file containing all of the files and configuration necessary to replicate the problem. You may also want to include data describing your hardware, operating system and other environmental factors. If you can provide configuration that allows us to replicate the issue we may be able to move forward quicker on a working patch. By: David Brillert (aragon) 2014-07-31 08:11:01.407-0500 I have to test this patch in order to debug a deadlock in Asterisk 11.10 that is heavily using iax2. Where is the version 11 patch on reviewboard? By: Richard Mudgett (rmudgett) 2014-07-31 11:08:49.354-0500 @David: If you're asking about this patch [^rb2826_v11.patch] for v11, then it is only available on *this* issue. The corresponding reviewboard patch is against v1.8 and has suffered some bit rot as I stated in an earlier comment on this issue when I attached the patch. Otherwise, there is no patch specifically for this issue. By: David Brillert (aragon) 2014-07-31 13:02:52.062-0500 @Richard I'm going to put rb2826_v11.patch into production and see what happens. What do you need from me if the patch goes pear shaped? By: David Brillert (aragon) 2014-07-31 15:34:30.882-0500 @Richard, the patch is installed onto Asterisk 11 SVN (July 30, 2014 build) and in production server. By: David Brillert (aragon) 2014-08-01 08:07:34.145-0500 I ran into a deadlock within 24 hours of applying the patch. All debug info attached as core show locks gdb bt strace.txt By: Ross Beer (rossbeer) 2014-08-01 08:12:58.272-0500 I believe this issue is also affecting Asterisk 1.8.26 to 1.8.29, we are experiencing the same issue. By: David Brillert (aragon) 2014-08-01 08:22:29.011-0500 More info from same deadlock iax2 show threads IAX2 Thread Information Idle Threads: Thread 5: state=0, update=2, actions=792, func='' Thread 4: state=0, update=2, actions=386, func='' Thread 2: state=0, update=2, actions=380, func='' Thread 7: state=0, update=2, actions=797, func='' Thread 3: state=0, update=2, actions=797, func='' Thread 1: state=0, update=2, actions=796, func='' Thread 10: state=0, update=1, actions=797, func='' Thread 9: state=0, update=1, actions=793, func='' Thread 6: state=0, update=1, actions=793, func='' Thread 8: state=0, update=1, actions=793, func='' netstat -anp Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name udp 128968 0 0.0.0.0:4569 0.0.0.0:* 6145/asterisk By: David Brillert (aragon) 2014-08-01 08:33:23.726-0500 @Matt I didn't have to do anything to reproduce. This deadlock report is from an idle development server. I have an IAXy registered and I have Hylafax with iaxmodem installed The last thing shown by the asterisk cli was a SIP brute force attacker and a reload in attachment asterisk cli.txt By: David Brillert (aragon) 2014-08-03 11:03:46.810-0500 I ran Asterisk in Valgrind until Asterisk CLI started spitting out stuff like: [2014-08-03 11:39:37] ERROR[30855]: chan_iax2.c:2556 peercnt_add: maxcallnumber limit of 2048 for 192.168.192.196 has been reached! [2014-08-03 11:40:27] ERROR[30850]: chan_iax2.c:2556 peercnt_add: maxcallnumber limit of 2048 for 192.168.192.196 has been reached! core show locks ======================================================================= === 11.11.0 === Currently Held Locks ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 0x383cb940 (iax2_process_thread started at [12609] chan_iax2.c start_network_thread()) === ---> Waiting for Lock #0 (chan_iax2.c): MUTEX 3626 __attempt_transmit &iaxsl[callno] 0x1c6b3518 (1) main/logger.c:1610 ast_bt_get_addresses() (0x510e47+1A) main/lock.c:275 __ast_pthread_mutex_lock() (0x50898c+D8) channels/chan_iax2.c:1731 send_lagrq() main/sched.c:614 ast_sched_runq() (0x587d08+163) main/sched.c:104 sched_run() main/utils.c:1169 dummy_start() libpthread.so.0 <unknown>() libc.so.6 clone() (0x3998ad5200+6D) === --- ---> Locked Here: chan_iax2.c line 10305 (socket_process_helper) === --- ---> Locked Here: chan_iax2.c line 2087 (pvt_destructor) === ------------------------------------------------------------------- === === Thread ID: 0x3c879940 (iax2_process_thread started at [12609] chan_iax2.c start_network_thread()) === ---> Lock #0 (chan_iax2.c): MUTEX 10305 socket_process_helper &iaxsl[fr->callno] 0x1c6b3518 (2) main/logger.c:1610 ast_bt_get_addresses() (0x510e47+1A) main/lock.c:275 __ast_pthread_mutex_lock() (0x50898c+D8) channels/chan_iax2.c:2089 pvt_destructor() main/astobj2.c:474 internal_ao2_ref() main/astobj2.c:543 __ao2_ref() (0x44ebb0+2B) channels/chan_iax2.c:3580 iax2_destroy() channels/chan_iax2.c:11229 socket_process_helper() channels/chan_iax2.c:12005 socket_process() channels/chan_iax2.c:12117 iax2_process_thread() main/utils.c:1169 dummy_start() libpthread.so.0 <unknown>() libc.so.6 clone() (0x3998ad5200+6D) === ------------------------------------------------------------------- === ======================================================================= *CLI> [2014-08-03 11:41:16] ERROR[30849]: chan_iax2.c:2556 peercnt_add: maxcallnumber limit of 2048 for 192.168.192.196 has been reached! [2014-08-03 11:42:05] ERROR[30851]: chan_iax2.c:2556 peercnt_add: maxcallnumber limit of 2048 for 192.168.192.196 has been reached! valgrind.txt attached. I am running another trace with verbose+debug+iax2 set debug on and will supply gdb and core show locks output once * locks again. All output is with the patch on v11 subversion. By: David Brillert (aragon) 2014-08-04 16:31:52.249-0500 Here is my iax.conf file to reproduce. To reproduce I use a Digium IAXy registered to my server and let the server soak idle (no calls processed) for approximately 24 hours until all ports are exhausted. By: David Brillert (aragon) 2014-10-23 08:52:42.942-0500 PING: This bug report looks dead and I still see deadlocks on sites using IAX2 I cannot debug any deadlocks until DEBUG_THREADS is fixed with iax2 module loaded. I have provided detailed debugs and steps to reproduce so I don't what else is needed on my part... By: David Brillert (aragon) 2014-12-16 09:38:36.360-0600 This issue appears to be fixed by commit ASTERISK-24614 patch = https://reviewboard.asterisk.org/r/4247/ But before we got Asterisk to compile we had to revert https://reviewboard.asterisk.org/r/2826/ due to conflict. I have been trying to reproduce the issue for 2 days now with r4247 and Asterisk is still stable. |