Summary: | ASTERISK-19611: SIP stack stops working (deadlock?) if a call to a snom phone is redirected by "302 Moved temporarily" to chan_local | ||||||
Reporter: | Volker Kettenbach (vsauer) | Labels: | |||||
Date Opened: | 2012-03-31 06:29:43 | Date Closed: | 2012-06-13 16:02:20 | ||||
Priority: | Major | Regression? | No | ||||
Status: | Closed/Complete | Components: | Applications/app_dial Channels/chan_local | ||||
Versions: | 1.8.10.0 1.8.11.0 10.3.0 10.3.1 10.4.0 | Frequency of Occurrence | Constant | ||||
Related Issues: |
| ||||||
Environment: | Debian Stable 2.6.32-5-686 Snom 370 FW 8.4.31 Snom 320 FW 8.4.31 Snom m9 9.5.14-a Asterisk 10.3.0 and all versions between 1.8.8.0 and 1.8.11.0 | Attachments: | ( 0) 2012-04-20_asterisk-10.3.0-coredump_crash_on_302.backtrace.txt ( 1) 2012-04-20_asterisk-10.3.0-sipdebug.txt ( 2) asterisk-10.3.0-deadlock_on_302.core_show_locks.gdb_backtrace.txt ( 3) asterisk-10.3.0-deadlock_on_302.res_timing_dahdi.txt ( 4) asterisk-10.3.0-deadlock_on_302.res_timing_dahdi.txt ( 5) asterisk-10.3.0-deadlock_on_302.res_timing_timerfd.txt ( 6) asterisk-10.3.0-deadlock_on_302.sip_debug.core_crash.txt ( 7) Asterisk-10.3.0-locked-by-302.pcap ( 8) Asterisk-10.3.1-immediate-crash-on-302.backtrace.txt ( 9) Asterisk-10.3.1-immediate-crash-on-302.CLI1.txt (10) Asterisk-10.3.1-immediate-crash-on-302.CLI2.txt (11) ASTERISK-19611.patch | ||||
Description: | This might be the same as ASTERISK-17802 If I set a call forwarding of one of the snom phones either to another phone or to an external phone number, asterisk stops processing (sip) calls and the call itself is either interupted after a few parts of a second (if res_timing_timerfd) or is processed to it's normal ending (in case of res_timing_dahdi). Here's what happens in CLI in case of res_timing_timerfd: (See full output of CLI, core show locks and dgb threads in attachments) {noformat} [....] [Mar 31 12:44:24] -- SIP/snomm9-2-00000008 answered Local/85@doLocalCallsPrivat-2130;2 [Mar 31 12:44:24] == Extension Changed 85[incoming] new state InUse for Notify User snomvs [Mar 31 12:44:24] == Extension Changed 85[incoming] new state InUse for Notify User snomhmk [Mar 31 12:44:24] -- Local/85@doLocalCallsPrivat-2130;1 answered SIP/sipgate-gmbh-e0-00000006 [Mar 31 12:44:24] == Spawn extension (macro-CallInt2Int, s, 5) exited non-zero on 'Local/85@doLocalCallsPrivat-2130;2' in macro 'CallInt2Int' [Mar 31 12:44:24] == Spawn extension (incoming, 85, 100) exited non-zero on 'Local/85@doLocalCallsPrivat-2130;2' [Mar 31 12:44:24] ERROR[7261]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument [Mar 31 12:44:24] ERROR[7261]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked! [Mar 31 12:44:24] ERROR[7261]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument [Message repeated ~ 100 times] [Asterisk stopped processing (sip) calls] {noformat} *CLI> core show locks {noformat} ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 0xaefffb70 (pbx_thread started at [ 5512] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 4852 ast_write chan 0x875a3d8 (1) /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa] /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_trylock+0x77) [0x8129fc7] /opt/asterisk/sbin/asterisk(ast_write+0x37) [0x80bd017] /opt/asterisk/sbin/asterisk() [0x80c47a2] /opt/asterisk/sbin/asterisk(ast_channel_bridge+0x6d6) [0x80c54a6] /opt/asterisk/sbin/asterisk(ast_bridge_call+0x341) [0x80fa2e1] /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xc05d) [0xb67a105d] /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xda1d) [0xb67a2a1d] /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb] /opt/asterisk/sbin/asterisk() [0x816305b] /opt/asterisk/lib/asterisk/modules/app_macro.so(+0x248b) [0xb675c48b] /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb] /opt/asterisk/sbin/asterisk() [0x816305b] /opt/asterisk/sbin/asterisk() [0x81697ed] /opt/asterisk/sbin/asterisk() [0x816b1de] /opt/asterisk/sbin/asterisk() [0x81af8ce] /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb751c955] /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769058e] === ------------------------------------------------------------------- === ======================================================================= {noformat} In case of res_timing_dahdi: {noformat} [Mar 31 13:23:06] VERBOSE[18249] app_dial.c: [Mar 31 13:23:06] -- Called SIP/snomm9-2 [Mar 31 13:23:06] VERBOSE[18249] app_dial.c: [Mar 31 13:23:06] -- SIP/snomm9-2-00000002 is ringing [Mar 31 13:23:06] VERBOSE[18246] app_dial.c: [Mar 31 13:23:06] -- Local/85@doLocalCallsPrivat-ad88;1 is ringing [Mar 31 13:23:14] VERBOSE[18249] app_dial.c: [Mar 31 13:23:14] -- SIP/snomm9-2-00000002 answered Local/85@doLocalCallsPrivat-ad88;2 [Mar 31 13:23:14] VERBOSE[18246] app_dial.c: [Mar 31 13:23:14] -- Local/85@doLocalCallsPrivat-ad88;1 answered SIP/sipgate-gmbh-e0-00000000 [Mar 31 13:23:14] VERBOSE[18249] app_macro.c: [Mar 31 13:23:14] == Spawn extension (macro-CallInt2Int, s, 5) exited non-zero on 'Local/85@doLocalCallsPrivat- ad88;2' in macro 'CallInt2Int' [Mar 31 13:23:14] VERBOSE[18249] pbx.c: [Mar 31 13:23:14] == Spawn extension (incoming, 85, 100) exited non-zero on 'Local/85@doLocalCallsPrivat-ad88;2' [Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument [Mar 31 13:23:40] ERROR[18246]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked! [Mar 31 13:23:40] ERROR[18246]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument [Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument [Mar 31 13:23:40] ERROR[18246]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked! [Mar 31 13:23:40] ERROR[18246]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument [Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 480 (ast_audiohook_update_status): Error obtaining mutex: Invalid argument [Message repeated thousands of times] [Asterisk stopped processing (sip) calls] {noformat} *CLI> core show locks {noformat} ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 0xaf3d9b70 (do_monitor started at [26703] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 26675 do_monitor &monlock 0xb68673c0 (1) /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa] /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c] /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x66832) [0xb6816832] /opt/asterisk/sbin/asterisk() [0x81af8ce] /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955] /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e] === ---> Waiting for Lock #1 (chan_sip.c): MUTEX 8175 sip_pvt_lock_full chan 0xa390928 (1) /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa] /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c] /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0xd0ca) [0xb67bd0ca] /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x89149) [0xb6839149] /opt/asterisk/sbin/asterisk(ast_sched_runq+0x14d) [0x81952bd] /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x6683f) [0xb681683f] /opt/asterisk/sbin/asterisk() [0x81af8ce] /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955] /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e] === --- ---> Locked Here: channel.c line 2832 (ast_hangup) === ------------------------------------------------------------------- === === Thread ID: 0xaf0dab70 (pbx_thread started at [ 5512] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 2832 ast_hangup chan 0xa390928 (1) /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa] /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c] /opt/asterisk/sbin/asterisk(ast_hangup+0x1de) [0x80bf66e] /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xbd1b) [0xb676dd1b] /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xda1d) [0xb676fa1d] /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb] /opt/asterisk/sbin/asterisk() [0x816305b] /opt/asterisk/lib/asterisk/modules/app_macro.so(+0x248b) [0xb672948b] /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb] /opt/asterisk/sbin/asterisk() [0x816305b] /opt/asterisk/sbin/asterisk() [0x81697ed] /opt/asterisk/sbin/asterisk() [0x816b1de] /opt/asterisk/sbin/asterisk() [0x81af8ce] /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955] /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e] === ------------------------------------------------------------------- === ======================================================================= {noformat} | ||||||
Comments: | By: Volker Kettenbach (vsauer) 2012-03-31 07:01:19.223-0500 See asterisk-10.3.0-deadlock_on_302.core_show_locks.gdb_backtrace.txt for core show locks and gdb -ex "thread apply all bt" for the lock on "302 Moved...." By: Matt Jordan (mjordan) 2012-04-02 14:08:04.471-0500 A few things: 1. The 'crash' log doesn't contain any useful information. To debug a crash, a backtrace needs to be generated from the coredump file, with DONT_OPTIMIZE turned on. Instructions for doing this can be found here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace 2. Your core show locks don't appear to have come from a version of Asterisk compiled with DONT_OPTIMIZE. Please reproduce with that. You may also want to enable BETTER_BACKTRACES. You only need to do that with a single timing source at this time. 3. Please attach a DEBUG log with sip set debug on that illustrates this problem. Thanks Matt By: Volker Kettenbach (vsauer) 2012-04-20 02:50:44.444-0500 This is the sip debug. I cannot attach "core show locks" since asterisk crashed after I enter this command By: Volker Kettenbach (vsauer) 2012-04-20 02:53:54.798-0500 The core file asterisk dumped is 100MB in size. I cannot upload it. You find it here: http://www.volker-sauer.de/asterisk/2012-04-20_asterisk-10.3.0-coredump_crash_on_302.core "core show locks" made asterisk crashed after it locked. By: Richard Mudgett (rmudgett) 2012-04-20 10:30:34.727-0500 Core files are only useful on the machine that created them. To debug a crash, a backtrace needs to be generated from the coredump file, with DONT_OPTIMIZE turned on. Instructions for doing this can be found here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace By: Volker Kettenbach (vsauer) 2012-04-20 10:36:54.247-0500 Ah, sorry for this. Here's the backtrace file. By: David Woolley (davidw) 2012-04-23 05:49:00.326-0500 This crash could be due to a problem we found in an unsupported version (1.6.1.0). I think I may have tried reporting it to the developer list, but that is silently ignoring me. I could have reported it here, but can't find it. In that version, when formatting the backtrace, Asterisk reads the number of stack frames captured twice: once before formatting them into a temporary array; and a second time, when outputting that array. There are no locks on the list of stack frames, so, if it is being created at the time, Asterisk can try to read back more formatted frames than exist. I haven't checked the code in the version being used here. (Again I've had to abuse the pre-written response feature because the need to cancel feedback mode is just not obvious to reporters.) By: Volker Kettenbach (vsauer) 2012-04-24 09:30:20.997-0500 Is there a workaround? By: Matt Jordan (mjordan) 2012-04-24 12:02:08.994-0500 Let's treat this as two separate issues. 1) When attempting to generate backtrace information, a seg fault can occur. That's one problem. 2) The original problem: when a SNOM phone is redirected by a 302 Moved Temporarily to chan_local, chan_sip becomes unresponsive for a period of time. For the first issue, I'm going to go ahead and create a new issue and attach the required core dump backtrace (thanks for getting that). For the second issue, I think we have sufficient information. If you can, attaching a pcap of the callflow that leads up to the problem is also *very* useful, as that can help when reproducing the problem. For now, I'll move this out of Triage. By: Volker Kettenbach (vsauer) 2012-04-24 13:56:21.777-0500 Here's the pcap. 192.168.0.1 is Asterisk 192.168.0.10 snom-m9 (302 Destination) 192.168.0.14 snom 370 (Forwarding Device) By: Volker Kettenbach (vsauer) 2012-04-24 14:11:06.637-0500 I tried this once again with 10.3.1 and this version immediately crashed with this backtrace. I attached the CLI output of 2 trys as well. By: Volker Kettenbach (vsauer) 2012-05-04 02:42:25.263-0500 Tested 10.4.0. The same. By: Volker Kettenbach (vsauer) 2012-05-26 06:34:30.762-0500 Is there a chance we'll get this fixed soon? Otherweise I'd need to implement a workaround (with AstDB or so). By: Mark Michelson (mmichelson) 2012-06-05 18:08:18.995-0500 I am uploading ASTERISK-19611.patch. I created it against Asterisk 1.8.11.1 but I imagine it should apply cleanly to Asterisk 10 releases as well. The problem appears to be in func_volume.c Specifically, it destroys an audiohook's data without going through the proper process to synchronize with the channel thread. We now lock the audiohook, detach it from the channel, unlock, and then destroy the audiohook's data. Before the patch, I experienced this issue every time I tried to forward a call. With the patch applied, I see no audiohook error messages, valgrind does not complain, and core show locks produces no output. Please give this a try and report if it works for you. Thanks! By: Mark Michelson (mmichelson) 2012-06-13 15:55:05.935-0500 Since I am about 99% sure the patch I have provided fixes the issue, I am going to commit the patch. If there turns out to be an issue, feel free to re-open this issue. By: Volker Kettenbach (vsauer) 2012-06-15 09:10:44.959-0500 Yes, you are correct! It works great! Thank you so much! |