Summary: | ASTERISK-26203: res_fax: Deadlock when using FAXOPT(gateway)=yes with Local channels | ||||||
Reporter: | Etienne Lessard (hexanol) | Labels: | |||||
Date Opened: | 2016-07-15 15:55:28 | Date Closed: | 2016-08-26 14:05:03 | ||||
Priority: | Major | Regression? | |||||
Status: | Closed/Complete | Components: | Resources/res_fax | ||||
Versions: | 13.7.2 13.9.1 13.10.0-rc1 | Frequency of Occurrence | Occasional | ||||
Related Issues: |
| ||||||
Environment: | Debian 8 amd64 | Attachments: | ( 0) bt-threads.txt ( 1) core-show-locks.txt ( 2) graph-core-show-locks.svg | ||||
Description: | Asterisk sometimes deadlocks when you set "FAXOPT(gateway)=yes" on a channel and this channel dials a Local channel. Example:
Given I have two lines, SIP/alice and SIP/bob Given I have the following dialplan: {noformat} [foo] exten = 31,1,NoOp() same = n,Set(FAXOPT(gateway)=yes,10) same = n,Dial(Local/32@foo) same = n,Hangup() exten = 32,1,NoOp() same = n,Set(FAXOPT(gateway)=yes,10) same = n,Dial(SIP/bob) same = n,Hangup() {noformat} When Alice dials 31 Then a deadlock might occurs This does not happens systematically, but depending on the system and the exact setup, I got between 1 occurrences every 500 tries to 1 occurrence at almost every try. I'm attaching a backtrace and the output of core show locks. There's actually 3 independent groups of deadlocked threads in these files: * thread 24074 is deadlocked with thread 24078 * thread 28226 is deadlocked with thread 28230 * thread 32143 is deadlocked with thread 32149 From the analysis I've done, the problem is that there's a locking inversion issue occuring in situation like this one: {noformat} Thread 63 (Thread 0x7fd2cbcf8700 (LWP 24074)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007fd2ce2404d4 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x00007fd2ce240336 in __GI___pthread_mutex_lock (mutex=0x7fd2b450cc80) at ../nptl/pthread_mutex_lock.c:114 #3 0x000000000053f3aa in __ast_pthread_mutex_lock (filename=0x6229cb "bridge.c", lineno=4005, func=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", mutex_name=0x622b1b "bridge", t=0x7fd2b450cc80) at lock.c:315 #4 0x000000000045e0ce in __ao2_lock (user_data=0x7fd2b450ccd8, lock_how=AO2_LOCK_REQ_MUTEX, file=0x6229cb "bridge.c", func=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", line=4005, var=0x622b1b "bridge") at astobj2.c:167 #5 0x000000000046937d in _ast_bridge_lock (bridge=0x7fd2b450ccd8, file=0x6229cb "bridge.c", function=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", line=4005, var=0x622b1b "bridge") at /root/ast-rebuild/asterisk-13.10.0-rc3/include/asterisk/bridge.h:377 #6 0x00000000004739ed in ast_bridge_peer (bridge=0x7fd2b450ccd8, chan=0x7fd2b44df1b8) at bridge.c:4005 #7 0x00000000004ccbbf in ast_channel_bridge_peer (chan=0x7fd2b44df1b8) at channel.c:10419 #8 0x00000000004f23e2 in ast_unreal_queryoption (ast=0x7fd2b4543bb8, option=10, data=0x7fd2cbcf4090, datalen=0x7fd2cbcf4094) at core_unreal.c:194 #9 0x00000000004c4395 in ast_channel_queryoption (chan=0x7fd2b4543bb8, option=10, data=0x7fd2cbcf4090, datalen=0x7fd2cbcf4094, block=0) at channel.c:7131 #10 0x00007fd2c9150e2b in ast_channel_get_t38_state (chan=0x7fd2b4543bb8) at /root/ast-rebuild/asterisk-13.10.0-rc3/include/asterisk/channel.h:2578 #11 0x00007fd2c9166f41 in fax_gateway_framehook (chan=0x7fd2b4543bb8, f=0x1444060, event=AST_FRAMEHOOK_EVENT_WRITE, data=0x7fd2b4187a48) at res_fax.c:3329 #12 0x0000000000528554 in framehook_list_push_event (framehooks=0x7fd2b4719f30, frame=0x1444060, event=AST_FRAMEHOOK_EVENT_WRITE) at framehook.c:118 #13 0x0000000000528e20 in ast_framehook_list_write_event (framehooks=0x7fd2b4719f30, frame=0x1444060) at framehook.c:317 #14 0x00000000004bba96 in ast_indicate_data (chan=0x7fd2b4543bb8, _condition=26, data=0x0, datalen=0) at channel.c:4685 #15 0x00000000004baedd in ast_indicate (chan=0x7fd2b4543bb8, condition=26) at channel.c:4288 #16 0x0000000000487f3a in bridge_channel_internal_join (bridge_channel=0x7fd2b415ec48) at bridge_channel.c:2713 #17 0x000000000046e018 in ast_bridge_join (bridge=0x7fd2b418bba8, chan=0x7fd2b4543bb8, swap=0x0, features=0x7fd2cbcf4610, tech_args=0x0, flags=(AST_BRIDGE_JOIN_PASS_REFERENCE | AST_BRIDGE_JOIN_INHIBIT_JOIN_COLP)) at bridge.c:1700 #18 0x0000000000511223 in ast_bridge_call_with_flags (chan=0x7fd2b4543bb8, peer=0x7fd2b44c5c78, config=0x7fd2cbcf4b50, flags=0) at features.c:659 #19 0x0000000000511309 in ast_bridge_call (chan=0x7fd2b4543bb8, peer=0x7fd2b44c5c78, config=0x7fd2cbcf4b50) at features.c:698 #20 0x00007fd29ce03d14 in dial_exec_full (chan=0x7fd2b4543bb8, data=0x7fd2cbcf5af0 "SIP/machine02", peerflags=0x7fd2cbcf5580, continue_exec=0x0) at app_dial.c:3089 #21 0x00007fd29ce040d7 in dial_exec (chan=0x7fd2b4543bb8, data=0x7fd2cbcf5af0 "SIP/machine02") at app_dial.c:3141 #22 0x000000000058f144 in pbx_exec (c=0x7fd2b4543bb8, app=0x11d2680, data=0x7fd2cbcf5af0 "SIP/machine02") at pbx_app.c:485 #23 0x000000000057bb8c in pbx_extension_helper (c=0x7fd2b4543bb8, con=0x0, context=0x7fd2b4544570 "xivo-extrafeatures", exten=0x7fd2b45445c0 "32", priority=3, label=0x0, callerid=0x7fd2b42a2280 "1101", action=E_SPAWN, found=0x7fd2cbcf7bcc, combined_find_spawn=1) at pbx.c:2834 #24 0x000000000057f2fb in ast_spawn_extension (c=0x7fd2b4543bb8, context=0x7fd2b4544570 "xivo-extrafeatures", exten=0x7fd2b45445c0 "32", priority=3, callerid=0x7fd2b42a2280 "1101", found=0x7fd2cbcf7bcc, combined_find_spawn=1) at pbx.c:4060 #25 0x000000000057fff7 in __ast_pbx_run (c=0x7fd2b4543bb8, args=0x0) at pbx.c:4235 #26 0x0000000000581824 in pbx_thread (data=0x7fd2b4543bb8) at pbx.c:4555 #27 0x000000000060c2dd in dummy_start (data=0x7fd2b45e58e0) at utils.c:1235 #28 0x00007fd2ce23e0a4 in start_thread (arg=0x7fd2cbcf8700) at pthread_create.c:309 #29 0x00007fd2cd2ed87d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} i.e.: * bridge B1 is locked in bridge_channel_internal_join (#16) * channel C1 is locked in ast_indicate_data (#14) * channel C1 is unlocked and locked in fax_gateway_framehook (#11) * channel C1 is locked (+1) in ast_channel_queryoption (#9) * channel C2 is unlocked (-1) in ast_unreal_queryoption (#8) -- but it is in fact still locked at this stage * bridge B2 (a different one than the one locked in #16 -- it's actually the bridge that the other local channel is part of) is locked in ast_bridge_peer (#6) So you then have a situation where a bridge is locked while a channel lock is held, which from what I understand is not the correct locking order, even if that channel is not part of the bridge. I've done my latest tests on asterisk 13.10.0-rc3. I've also seen it occured on 13.9.1 and 13.7.2. This bug is somewhat related to ASTERISK-22732. | ||||||
Comments: | By: Asterisk Team (asteriskteam) 2016-07-15 15:55:29.510-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Etienne Lessard (hexanol) 2016-07-19 07:22:47.420-0500 I forgot to include a remark from mjordan (from IRC): {quote} Note in the bug report that holding a channel lock while calling ast_channel_get_t38_state is the problem. By holding that lock, you will end up holding the channel lock twice after ast_channel_query_option, which means the implementation of the queryoption callback in core_unreal can't safely unlock the channel in order to get the bridge peer. {quote} By: Richard Mudgett (rmudgett) 2016-08-17 17:25:09.482-0500 This issue really is the deadlock described by ASTERISK-22732. By: Richard Mudgett (rmudgett) 2016-08-25 18:59:09.393-0500 Gerrit reviews: https://gerrit.asterisk.org/#/c/3698/ through https://gerrit.asterisk.org/#/c/3723/ |