[Home]

Summary:ASTERISK-26203: res_fax: Deadlock when using FAXOPT(gateway)=yes with Local channels
Reporter:Etienne Lessard (hexanol)Labels:
Date Opened:2016-07-15 15:55:28Date Closed:2016-08-26 14:05:03
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_fax
Versions:13.7.2 13.9.1 13.10.0-rc1 Frequency of
Occurrence
Occasional
Related
Issues:
duplicatesASTERISK-22732 Deadlock potential in res_fax and CCSS with local channels.
is related toASTERISK-24822 Deadlock: Fax Gateway framehook creates locking inversion in T.38 query option with features bridging code
Environment:Debian 8 amd64Attachments:( 0) bt-threads.txt
( 1) core-show-locks.txt
( 2) graph-core-show-locks.svg
Description:Asterisk sometimes deadlocks when you set "FAXOPT(gateway)=yes" on a channel and this channel dials a Local channel. Example:

Given I have two lines, SIP/alice and SIP/bob
Given I have the following dialplan:
{noformat}
[foo]
exten = 31,1,NoOp()
same  =   n,Set(FAXOPT(gateway)=yes,10)
same  =   n,Dial(Local/32@foo)
same  =   n,Hangup()

exten = 32,1,NoOp()
same  =   n,Set(FAXOPT(gateway)=yes,10)
same  =   n,Dial(SIP/bob)
same  =   n,Hangup()
{noformat}
When Alice dials 31
Then a deadlock might occurs

This does not happens systematically, but depending on the system and the exact setup, I got between 1 occurrences every 500 tries to 1 occurrence at almost every try.

I'm attaching a backtrace and the output of core show locks. There's actually 3 independent groups of deadlocked threads in these files:
* thread 24074 is deadlocked with thread 24078
* thread 28226 is deadlocked with thread 28230
* thread 32143 is deadlocked with thread 32149

From the analysis I've done, the problem is that there's a locking inversion issue occuring in situation like this one:
{noformat}
Thread 63 (Thread 0x7fd2cbcf8700 (LWP 24074)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd2ce2404d4 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fd2ce240336 in __GI___pthread_mutex_lock (mutex=0x7fd2b450cc80) at ../nptl/pthread_mutex_lock.c:114
#3  0x000000000053f3aa in __ast_pthread_mutex_lock (filename=0x6229cb "bridge.c", lineno=4005, func=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", mutex_name=0x622b1b "bridge", t=0x7fd2b450cc80) at lock.c:315
#4  0x000000000045e0ce in __ao2_lock (user_data=0x7fd2b450ccd8, lock_how=AO2_LOCK_REQ_MUTEX, file=0x6229cb "bridge.c", func=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", line=4005, var=0x622b1b "bridge") at astobj2.c:167
#5  0x000000000046937d in _ast_bridge_lock (bridge=0x7fd2b450ccd8, file=0x6229cb "bridge.c", function=0x624e10 <__PRETTY_FUNCTION__.18063> "ast_bridge_peer", line=4005, var=0x622b1b "bridge") at /root/ast-rebuild/asterisk-13.10.0-rc3/include/asterisk/bridge.h:377
#6  0x00000000004739ed in ast_bridge_peer (bridge=0x7fd2b450ccd8, chan=0x7fd2b44df1b8) at bridge.c:4005
#7  0x00000000004ccbbf in ast_channel_bridge_peer (chan=0x7fd2b44df1b8) at channel.c:10419
#8  0x00000000004f23e2 in ast_unreal_queryoption (ast=0x7fd2b4543bb8, option=10, data=0x7fd2cbcf4090, datalen=0x7fd2cbcf4094) at core_unreal.c:194
#9  0x00000000004c4395 in ast_channel_queryoption (chan=0x7fd2b4543bb8, option=10, data=0x7fd2cbcf4090, datalen=0x7fd2cbcf4094, block=0) at channel.c:7131
#10 0x00007fd2c9150e2b in ast_channel_get_t38_state (chan=0x7fd2b4543bb8) at /root/ast-rebuild/asterisk-13.10.0-rc3/include/asterisk/channel.h:2578
#11 0x00007fd2c9166f41 in fax_gateway_framehook (chan=0x7fd2b4543bb8, f=0x1444060, event=AST_FRAMEHOOK_EVENT_WRITE, data=0x7fd2b4187a48) at res_fax.c:3329
#12 0x0000000000528554 in framehook_list_push_event (framehooks=0x7fd2b4719f30, frame=0x1444060, event=AST_FRAMEHOOK_EVENT_WRITE) at framehook.c:118
#13 0x0000000000528e20 in ast_framehook_list_write_event (framehooks=0x7fd2b4719f30, frame=0x1444060) at framehook.c:317
#14 0x00000000004bba96 in ast_indicate_data (chan=0x7fd2b4543bb8, _condition=26, data=0x0, datalen=0) at channel.c:4685
#15 0x00000000004baedd in ast_indicate (chan=0x7fd2b4543bb8, condition=26) at channel.c:4288
#16 0x0000000000487f3a in bridge_channel_internal_join (bridge_channel=0x7fd2b415ec48) at bridge_channel.c:2713
#17 0x000000000046e018 in ast_bridge_join (bridge=0x7fd2b418bba8, chan=0x7fd2b4543bb8, swap=0x0, features=0x7fd2cbcf4610, tech_args=0x0, flags=(AST_BRIDGE_JOIN_PASS_REFERENCE | AST_BRIDGE_JOIN_INHIBIT_JOIN_COLP)) at bridge.c:1700
#18 0x0000000000511223 in ast_bridge_call_with_flags (chan=0x7fd2b4543bb8, peer=0x7fd2b44c5c78, config=0x7fd2cbcf4b50, flags=0) at features.c:659
#19 0x0000000000511309 in ast_bridge_call (chan=0x7fd2b4543bb8, peer=0x7fd2b44c5c78, config=0x7fd2cbcf4b50) at features.c:698
#20 0x00007fd29ce03d14 in dial_exec_full (chan=0x7fd2b4543bb8, data=0x7fd2cbcf5af0 "SIP/machine02", peerflags=0x7fd2cbcf5580, continue_exec=0x0) at app_dial.c:3089
#21 0x00007fd29ce040d7 in dial_exec (chan=0x7fd2b4543bb8, data=0x7fd2cbcf5af0 "SIP/machine02") at app_dial.c:3141
#22 0x000000000058f144 in pbx_exec (c=0x7fd2b4543bb8, app=0x11d2680, data=0x7fd2cbcf5af0 "SIP/machine02") at pbx_app.c:485
#23 0x000000000057bb8c in pbx_extension_helper (c=0x7fd2b4543bb8, con=0x0, context=0x7fd2b4544570 "xivo-extrafeatures", exten=0x7fd2b45445c0 "32", priority=3, label=0x0, callerid=0x7fd2b42a2280 "1101", action=E_SPAWN, found=0x7fd2cbcf7bcc, combined_find_spawn=1) at pbx.c:2834
#24 0x000000000057f2fb in ast_spawn_extension (c=0x7fd2b4543bb8, context=0x7fd2b4544570 "xivo-extrafeatures", exten=0x7fd2b45445c0 "32", priority=3, callerid=0x7fd2b42a2280 "1101", found=0x7fd2cbcf7bcc, combined_find_spawn=1) at pbx.c:4060
#25 0x000000000057fff7 in __ast_pbx_run (c=0x7fd2b4543bb8, args=0x0) at pbx.c:4235
#26 0x0000000000581824 in pbx_thread (data=0x7fd2b4543bb8) at pbx.c:4555
#27 0x000000000060c2dd in dummy_start (data=0x7fd2b45e58e0) at utils.c:1235
#28 0x00007fd2ce23e0a4 in start_thread (arg=0x7fd2cbcf8700) at pthread_create.c:309
#29 0x00007fd2cd2ed87d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

i.e.:
* bridge B1 is locked in bridge_channel_internal_join (#16)
* channel C1 is locked in ast_indicate_data (#14)
* channel C1 is unlocked and locked in fax_gateway_framehook (#11)
* channel C1 is locked (+1) in ast_channel_queryoption (#9)
* channel C2 is unlocked (-1) in ast_unreal_queryoption (#8) -- but it is in fact still locked at this stage
* bridge B2 (a different one than the one locked in #16 -- it's actually the bridge that the other local channel is part of) is locked in ast_bridge_peer (#6)

So you then have a situation where a bridge is locked while a channel lock is held, which from what I understand is not the correct locking order, even if that channel is not part of the bridge.

I've done my latest tests on asterisk 13.10.0-rc3. I've also seen it occured on 13.9.1 and 13.7.2.

This bug is somewhat related to ASTERISK-22732.
Comments:By: Asterisk Team (asteriskteam) 2016-07-15 15:55:29.510-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Etienne Lessard (hexanol) 2016-07-19 07:22:47.420-0500

I forgot to include a remark from mjordan (from IRC):

{quote}
Note in the bug report that holding a channel lock while calling ast_channel_get_t38_state is the problem. By holding that lock, you will end up holding the channel lock twice after ast_channel_query_option, which means the implementation of the queryoption callback in core_unreal can't safely unlock the channel in order to get the bridge peer.
{quote}

By: Richard Mudgett (rmudgett) 2016-08-17 17:25:09.482-0500

This issue really is the deadlock described by ASTERISK-22732.

By: Richard Mudgett (rmudgett) 2016-08-25 18:59:09.393-0500

Gerrit reviews:
https://gerrit.asterisk.org/#/c/3698/ through https://gerrit.asterisk.org/#/c/3723/