[Home]

Summary:ASTERISK-19611: SIP stack stops working (deadlock?) if a call to a snom phone is redirected by "302 Moved temporarily" to chan_local
Reporter:Volker Kettenbach (vsauer)Labels:
Date Opened:2012-03-31 06:29:43Date Closed:2012-06-13 16:02:20
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_dial Channels/chan_local
Versions:1.8.10.0 1.8.11.0 10.3.0 10.3.1 10.4.0 Frequency of
Occurrence
Constant
Related
Issues:
is the original version of this clone:ASTERISK-19775 Backgrace generation in Asterisk causes a seg fault
is a clone ofASTERISK-17802 SIP stack stops working if a Dial command if forwarded by a SIP physical phone
Environment:Debian Stable 2.6.32-5-686 Snom 370 FW 8.4.31 Snom 320 FW 8.4.31 Snom m9 9.5.14-a Asterisk 10.3.0 and all versions between 1.8.8.0 and 1.8.11.0Attachments:( 0) 2012-04-20_asterisk-10.3.0-coredump_crash_on_302.backtrace.txt
( 1) 2012-04-20_asterisk-10.3.0-sipdebug.txt
( 2) asterisk-10.3.0-deadlock_on_302.core_show_locks.gdb_backtrace.txt
( 3) asterisk-10.3.0-deadlock_on_302.res_timing_dahdi.txt
( 4) asterisk-10.3.0-deadlock_on_302.res_timing_dahdi.txt
( 5) asterisk-10.3.0-deadlock_on_302.res_timing_timerfd.txt
( 6) asterisk-10.3.0-deadlock_on_302.sip_debug.core_crash.txt
( 7) Asterisk-10.3.0-locked-by-302.pcap
( 8) Asterisk-10.3.1-immediate-crash-on-302.backtrace.txt
( 9) Asterisk-10.3.1-immediate-crash-on-302.CLI1.txt
(10) Asterisk-10.3.1-immediate-crash-on-302.CLI2.txt
(11) ASTERISK-19611.patch
Description:This might be the same as ASTERISK-17802

If I set a call forwarding of one of the snom phones either to another phone or to an external phone number, asterisk stops processing (sip) calls and the call itself is either interupted after a few parts of a second (if res_timing_timerfd) or is processed to it's normal ending (in case of res_timing_dahdi).

Here's what happens in CLI in case of res_timing_timerfd:

(See full output of CLI, core show locks and dgb  threads in attachments)

{noformat}
[....]
[Mar 31 12:44:24]     -- SIP/snomm9-2-00000008 answered Local/85@doLocalCallsPrivat-2130;2
[Mar 31 12:44:24]   == Extension Changed 85[incoming] new state InUse for Notify User snomvs
[Mar 31 12:44:24]   == Extension Changed 85[incoming] new state InUse for Notify User snomhmk
[Mar 31 12:44:24]     -- Local/85@doLocalCallsPrivat-2130;1 answered SIP/sipgate-gmbh-e0-00000006
[Mar 31 12:44:24]   == Spawn extension (macro-CallInt2Int, s, 5) exited non-zero on 'Local/85@doLocalCallsPrivat-2130;2' in macro 'CallInt2Int'
[Mar 31 12:44:24]   == Spawn extension (incoming, 85, 100) exited non-zero on 'Local/85@doLocalCallsPrivat-2130;2'
[Mar 31 12:44:24] ERROR[7261]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument
[Mar 31 12:44:24] ERROR[7261]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked!
[Mar 31 12:44:24] ERROR[7261]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument


[Message repeated ~ 100 times]


[Asterisk stopped processing (sip) calls]
{noformat}

*CLI> core show locks

{noformat}
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0xaefffb70 (pbx_thread           started at [ 5512] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 4852 ast_write chan 0x875a3d8 (1)
   /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa]
   /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_trylock+0x77) [0x8129fc7]
   /opt/asterisk/sbin/asterisk(ast_write+0x37) [0x80bd017]
   /opt/asterisk/sbin/asterisk() [0x80c47a2]
   /opt/asterisk/sbin/asterisk(ast_channel_bridge+0x6d6) [0x80c54a6]
   /opt/asterisk/sbin/asterisk(ast_bridge_call+0x341) [0x80fa2e1]
   /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xc05d) [0xb67a105d]
   /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xda1d) [0xb67a2a1d]
   /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb]
   /opt/asterisk/sbin/asterisk() [0x816305b]
   /opt/asterisk/lib/asterisk/modules/app_macro.so(+0x248b) [0xb675c48b]
   /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb]
   /opt/asterisk/sbin/asterisk() [0x816305b]
   /opt/asterisk/sbin/asterisk() [0x81697ed]
   /opt/asterisk/sbin/asterisk() [0x816b1de]
   /opt/asterisk/sbin/asterisk() [0x81af8ce]
   /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb751c955]
   /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769058e]
=== -------------------------------------------------------------------
===
=======================================================================
{noformat}



In case of res_timing_dahdi:

{noformat}
[Mar 31 13:23:06] VERBOSE[18249] app_dial.c: [Mar 31 13:23:06]     -- Called SIP/snomm9-2
[Mar 31 13:23:06] VERBOSE[18249] app_dial.c: [Mar 31 13:23:06]     -- SIP/snomm9-2-00000002 is ringing
[Mar 31 13:23:06] VERBOSE[18246] app_dial.c: [Mar 31 13:23:06]     -- Local/85@doLocalCallsPrivat-ad88;1 is ringing
[Mar 31 13:23:14] VERBOSE[18249] app_dial.c: [Mar 31 13:23:14]     -- SIP/snomm9-2-00000002 answered Local/85@doLocalCallsPrivat-ad88;2
[Mar 31 13:23:14] VERBOSE[18246] app_dial.c: [Mar 31 13:23:14]     -- Local/85@doLocalCallsPrivat-ad88;1 answered SIP/sipgate-gmbh-e0-00000000
[Mar 31 13:23:14] VERBOSE[18249] app_macro.c: [Mar 31 13:23:14]   == Spawn extension (macro-CallInt2Int, s, 5) exited non-zero on 'Local/85@doLocalCallsPrivat-
ad88;2' in macro 'CallInt2Int'
[Mar 31 13:23:14] VERBOSE[18249] pbx.c: [Mar 31 13:23:14]   == Spawn extension (incoming, 85, 100) exited non-zero on 'Local/85@doLocalCallsPrivat-ad88;2'
[Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument
[Mar 31 13:23:40] ERROR[18246]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked!
[Mar 31 13:23:40] ERROR[18246]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument
[Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 842 (audio_audiohook_write_list): Error obtaining mutex: Invalid argument
[Mar 31 13:23:40] ERROR[18246]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked!
[Mar 31 13:23:40] ERROR[18246]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 861 (audio_audiohook_write_list): Error releasing mutex: Invalid argument
[Mar 31 13:23:40] ERROR[18246]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 480 (ast_audiohook_update_status): Error obtaining mutex: Invalid argument


[Message repeated thousands of times]

[Asterisk stopped processing (sip) calls]
{noformat}

*CLI> core show locks

{noformat}
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0xaf3d9b70 (do_monitor           started at [26703] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 26675 do_monitor &monlock 0xb68673c0 (1)
   /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa]
   /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c]
   /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x66832) [0xb6816832]
   /opt/asterisk/sbin/asterisk() [0x81af8ce]
   /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955]
   /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e]
=== ---> Waiting for Lock #1 (chan_sip.c): MUTEX 8175 sip_pvt_lock_full chan 0xa390928 (1)
   /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa]
   /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c]
   /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0xd0ca) [0xb67bd0ca]
   /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x89149) [0xb6839149]
   /opt/asterisk/sbin/asterisk(ast_sched_runq+0x14d) [0x81952bd]
   /opt/asterisk/lib/asterisk/modules/chan_sip.so(+0x6683f) [0xb681683f]
   /opt/asterisk/sbin/asterisk() [0x81af8ce]
   /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955]
   /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e]
=== --- ---> Locked Here: channel.c line 2832 (ast_hangup)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaf0dab70 (pbx_thread           started at [ 5512] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2832 ast_hangup chan 0xa390928 (1)
   /opt/asterisk/sbin/asterisk(ast_bt_get_addresses+0x1a) [0x812c8aa]
   /opt/asterisk/sbin/asterisk(__ast_pthread_mutex_lock+0x7c) [0x812902c]
   /opt/asterisk/sbin/asterisk(ast_hangup+0x1de) [0x80bf66e]
   /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xbd1b) [0xb676dd1b]
   /opt/asterisk/lib/asterisk/modules/app_dial.so(+0xda1d) [0xb676fa1d]
   /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb]
   /opt/asterisk/sbin/asterisk() [0x816305b]
   /opt/asterisk/lib/asterisk/modules/app_macro.so(+0x248b) [0xb672948b]
   /opt/asterisk/sbin/asterisk(pbx_exec+0x10b) [0x8154bdb]
   /opt/asterisk/sbin/asterisk() [0x816305b]
   /opt/asterisk/sbin/asterisk() [0x81697ed]
   /opt/asterisk/sbin/asterisk() [0x816b1de]
   /opt/asterisk/sbin/asterisk() [0x81af8ce]
   /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb7525955]
   /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb769958e]
=== -------------------------------------------------------------------
===
=======================================================================
{noformat}


Comments:By: Volker Kettenbach (vsauer) 2012-03-31 07:01:19.223-0500

See asterisk-10.3.0-deadlock_on_302.core_show_locks.gdb_backtrace.txt

for

core show locks

and

gdb -ex "thread apply all bt"

for the lock on "302 Moved...."

By: Matt Jordan (mjordan) 2012-04-02 14:08:04.471-0500

A few things:

1. The 'crash' log doesn't contain any useful information.  To debug a crash, a backtrace needs to be generated from the coredump file, with DONT_OPTIMIZE turned on.  Instructions for doing this can be found here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

2. Your core show locks don't appear to have come from a version of Asterisk compiled with DONT_OPTIMIZE.  Please reproduce with that.  You may also want to enable BETTER_BACKTRACES.  You only need to do that with a single timing source at this time.

3. Please attach a DEBUG log with sip set debug on that illustrates this problem.

Thanks

Matt

By: Volker Kettenbach (vsauer) 2012-04-20 02:50:44.444-0500

This is the sip debug.

I cannot attach "core show locks" since asterisk crashed after I enter this command

By: Volker Kettenbach (vsauer) 2012-04-20 02:53:54.798-0500

The core file asterisk dumped is 100MB in size.
I cannot upload it.
You find it here:

http://www.volker-sauer.de/asterisk/2012-04-20_asterisk-10.3.0-coredump_crash_on_302.core

"core show locks" made asterisk crashed after it locked.

By: Richard Mudgett (rmudgett) 2012-04-20 10:30:34.727-0500

Core files are only useful on the machine that created them.  To debug a crash, a backtrace needs to be generated from the coredump file, with DONT_OPTIMIZE turned on. Instructions for doing this can be found here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Volker Kettenbach (vsauer) 2012-04-20 10:36:54.247-0500

Ah, sorry for this.

Here's the backtrace file.

By: David Woolley (davidw) 2012-04-23 05:49:00.326-0500

This crash could be due to a problem we found in an unsupported version (1.6.1.0).  I think I may have tried reporting it to the developer list, but that is silently ignoring me.  I could have reported it here, but can't find it.

In that version, when formatting the backtrace, Asterisk reads the number of stack frames captured twice:  once before formatting them into a temporary array; and a second time, when outputting that array.  There are no locks on the list of stack frames, so, if it is being created at the time, Asterisk can try to read back more formatted frames than exist.

I haven't checked the code in the version being used here.

(Again I've had to abuse the pre-written response feature because the need to cancel feedback mode is just not obvious to reporters.)

By: Volker Kettenbach (vsauer) 2012-04-24 09:30:20.997-0500

Is there a workaround?

By: Matt Jordan (mjordan) 2012-04-24 12:02:08.994-0500

Let's treat this as two separate issues.

1) When attempting to generate backtrace information, a seg fault can occur.  That's one problem.

2) The original problem: when a SNOM phone is redirected by a 302 Moved Temporarily to chan_local, chan_sip becomes unresponsive for a period of time.

For the first issue, I'm going to go ahead and create a new issue and attach the required core dump backtrace (thanks for getting that).

For the second issue, I think we have sufficient information.  If you can, attaching a pcap of the callflow that leads up to the problem is also *very* useful, as that can help when reproducing the problem.  For now, I'll move this out of Triage.

By: Volker Kettenbach (vsauer) 2012-04-24 13:56:21.777-0500

Here's the pcap.

192.168.0.1 is Asterisk
192.168.0.10 snom-m9 (302 Destination)
192.168.0.14 snom 370 (Forwarding Device)

By: Volker Kettenbach (vsauer) 2012-04-24 14:11:06.637-0500

I tried this once again with 10.3.1 and this version immediately crashed with this backtrace.
I attached the CLI output of 2 trys as well.

By: Volker Kettenbach (vsauer) 2012-05-04 02:42:25.263-0500

Tested 10.4.0. The same.

By: Volker Kettenbach (vsauer) 2012-05-26 06:34:30.762-0500

Is there a chance we'll get this fixed soon?
Otherweise I'd need to implement a workaround (with AstDB or so).

By: Mark Michelson (mmichelson) 2012-06-05 18:08:18.995-0500

I am uploading ASTERISK-19611.patch. I created it against Asterisk 1.8.11.1 but I imagine it should apply cleanly to Asterisk 10 releases as well.

The problem appears to be in func_volume.c Specifically, it destroys an audiohook's data without going through the proper process to synchronize with the channel thread. We now lock the audiohook, detach it from the channel, unlock, and then destroy the audiohook's data.

Before the patch, I experienced this issue every time I tried to forward a call. With the patch applied, I see no audiohook error messages, valgrind does not complain, and core show locks produces no output. Please give this a try and report if it works for you. Thanks!

By: Mark Michelson (mmichelson) 2012-06-13 15:55:05.935-0500

Since I am about 99% sure the patch I have provided fixes the issue, I am going to commit the patch. If there turns out to be an issue, feel free to re-open this issue.

By: Volker Kettenbach (vsauer) 2012-06-15 09:10:44.959-0500

Yes, you are correct! It works great!  Thank you so much!