[Home]

Summary:ASTERISK-27094: res_fax: Deadlock when using Local channels and fax gateway
Reporter:David Brillert (aragon)Labels:fax patch pjsip
Date Opened:2017-06-29 14:41:28Date Closed:2018-06-06 05:47:33
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_fax
Versions:13.15.1 13.17.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-27840 Deadlock bridge_channel.c line 2660 (bridge_channel_internal_join)
Environment:Attachments:( 0) bridge_lock.patch
( 1) bridge_lock2.patch
( 2) bridge_lock3.patch
( 3) core-asterisk-running-2018-05-16T10-28-36-0400.rar
( 4) coreshowlocks.txt
( 5) deadlock_2018_05_09.rar
( 6) gdbthreadapplyallbt.txt
( 7) gdbthreadapplyallbtfull1.txt
Description:All calling via SIP PSTN carrier.
progressinband = yes
directmedia =  yes
prematuremedia  =  no
Incoming call A is answered with progress
Then bridged with progress to external call B
Call is processed with audio OK
But no further SIP processing in console and all SIP further signalling dies including OPTIONS packets.
Comments:By: Asterisk Team (asteriskteam) 2017-06-29 14:41:28.915-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: David Brillert (aragon) 2017-06-29 14:42:42.513-0500

gdb trace

By: David Brillert (aragon) 2017-07-04 09:58:05.469-0500

I'm no expert on gdb traces but I think its locking somewhere in res_fax and I assume in fax_gateway_framehook

The setting which triggers the lock is in sip.conf
setvar          =  FAXOPT(gateway)=yes

Disabling the setting avoids the deadlock but disables all t38 gateway faxing over the SIP trunk.
setvar          =  FAXOPT(gateway)=no

Thread 6 (Thread 0x7fe132c32700 (LWP 2745)):
#0  0x00007fe1e4c26334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe1e4c215f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007fe1e4c214d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000053b851 in __ast_pthread_mutex_lock ()
#4  0x000000000045d200 in __ao2_lock ()
#5  0x00000000004688aa in _ast_bridge_lock ()
#6  0x0000000000472cc8 in ast_bridge_peer ()
#7  0x00000000004ca87b in ast_channel_bridge_peer ()
#8  0x00000000004efd25 in ast_unreal_queryoption ()
#9  0x00000000004c2214 in ast_channel_queryoption ()
#10 0x00007fe1c1536cc0 in ast_channel_get_t38_state (chan=0x7fe1c400e5a0)
   at /dar/build/asterisk-13.15.1/include/asterisk/channel.h:2599
#11 0x00007fe1c154d92c in fax_gateway_framehook (chan=0x7fe16c0d0260, f=0x7fe16c062510,
   event=AST_FRAMEHOOK_EVENT_WRITE, data=0x7fe16c0906b0) at res_fax.c:3395
#12 0x0000000000525a6a in framehook_list_push_event ()


By: Richard Mudgett (rmudgett) 2017-07-06 10:53:03.966-0500

I can see a potential for deadlock with thread 6 *only if* the first ast_indicate() is being invoked in bridge_channel_internal_join().  That call to ast_indicate() happens with the bridge locked.  However, since the backtrace is missing a lot of information, I don't know if that is the line invoking ast_indicate().  I also cannot identify the deadlocked threads because the backtrace doesn't include parameter values to associate the deadlocked threads.  An unoptimized backtrace and/or DEBUG_THREADS for a "core show locks" is needed to see the deadlock scenario.

By: David Brillert (aragon) 2017-07-11 10:39:45.166-0500

The locking is pretty easy for me to reproduce but I can't get a new trace with a proper debug build until sometime next week.
I'll enter feedback as soon as I can.

By: David Brillert (aragon) 2017-07-18 16:26:57.648-0500

Attaching coreshowlocks.txt and gdbthreadapplyallbtfull(1).txt from a debug compiled 13.17 build

By: David Brillert (aragon) 2017-10-04 01:03:14.903-0500

Bump...
The deadlock occurs roughly 99% of the time, it's really service affecting

By: Joshua C. Colp (jcolp) 2017-10-04 01:21:53.185-0500

The issue has been triaged and accepted, if you would like it resolved faster you can attempt to elicit interest from a community member who may be interested in fixing it or finding and hiring a consultant to take a look.

By: Rusty Newton (rnewton) 2017-10-05 05:47:31.794-0500

There is always [bug bounties|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties]!

By: David Brillert (aragon) 2017-10-05 06:13:01.820-0500

LOL, I'll take both of those suggestions into account.
I presume since chan_sip is now in extended support I shouldn't hold my breath therefore I have this question:
Would this deadlock potential also be in pjsip?
If its not then I could assume a migration to pjsip would be an easier solution?


By: Richard Mudgett (rmudgett) 2017-10-05 06:52:43.708-0500

This deadlock is in the bridging core and is not channel driver specific.

The newer backtrace and core_show_locks files still don't have a lot of source file debug information.  The core_show_locks file does point out the two threads that are deadlocked.  The {{Thread ID: 0x7fceb6fe8700 LWP:16926}} and {{Thread ID: 0x7fce703cb700 LWP:16930}} threads are deadlocked.  There is a BETTER_BACKTRACES menuselect option that would actually make the core_show_locks backtraces very helpful and should absolutely confirm that it is the first ast_indicate() in the bridge_channel_internal_join() I mentioned back in July.  However, I'm quite certain that it is the first ast_indicate() that is causing the deadlock.


By: David Brillert (aragon) 2018-05-09 15:06:37.239-0500

OK I reproduced this on *13.21
Sorry for the delay but I lost access to the environment where I could reliably reproduce this problem and just got it back.
I have to forward the incoming call and I have to enable mixmonitor recording flag in order to reproduce this on 100% of calls.
Compiled with DONT_OPTIMIZE BETTER_BACKTRACES DEBUG_THREADS
Output of ast_coredumper and core show locks and Asterisk CLI is included in attached deadlock_2018_05_09.rar file


By: David Brillert (aragon) 2018-05-15 08:59:45.195-0500

@rmudgett Are the latest backtraces sufficient?

By: Richard Mudgett (rmudgett) 2018-05-15 21:12:08.097-0500

The backtraces in [^deadlock_2018_05_09.rar] don't add any new information as they don't really have any useful symbolic information.

By: David Brillert (aragon) 2018-05-16 09:23:04.049-0500

I don't understand why the backtraces are still missing symbols
Asterisk was compiled with:
DEBUG_THREADS, DONT_OPTIMIZE and BETTER_BACKTRACES
But not MALLOC_DEBUG

What flag could be missing?
I'll try recompiling and will reproduce. Which backtrace file should I be looking at and which text would tell me if my backtrace is useful before I upload it to the bug report?



By: David Brillert (aragon) 2018-05-16 09:40:13.586-0500

New backtrace after recompiling Asterisk with DEBUG_THREADS, DONT_OPTIMIZE and BETTER_BACKTRACES

core-asterisk-running-2018-05-16T10-28-36-0400.rar

The last backtrace should have worked, I'm hoping this one will do the trick.



By: Richard Mudgett (rmudgett) 2018-05-17 15:04:30.493-0500

The backtraces in the new file [^core-asterisk-running-2018-05-16T10-28-36-0400.rar] still don't have any useful symbolic information.  You can tell if you actually look at the core-asterisk-running-2018-05-16T10-28-36-0400-thread1.txt or core-asterisk-running-2018-05-16T10-28-36-0400-full.txt file as you will see "No symbol table info available." all over those files.  The output of the CLI "core show locks" command is also missing useful symbol information.  It is only giving public function names and hex offsets.  To be useful the output needs to give function names and line numbers.

It appears that you are *selectively* installing files with debugging information.  A few loadable modules like app_dial and chan_sip have debugging information but the asterisk executable does not have debugging information.  I stated nearly a year ago that this is *not* a channel driver (i.e., chan_sip) specific deadlock and that it is in the bridging core code.  The bridging core code is in the asterisk executable itself.

FYI: The {{.rar}} archive format is *very* inconvenient to deal with.  I not only have to explicitly download the file but I had to locate a utility that understood the format.  {{.zip}} would be much better as JIRA knows about that format and I don't have to download the file to look at it.


By: David Brillert (aragon) 2018-05-17 18:10:34.585-0500

Thanks Richard that is helpful. I will verify the symbols info before uploading another trace and send the build issue back to my developers to make sure Asterisk is getting compiled properly with the debug flags. Next round will be in .zip format.

By: nappsoft (nappsoft) 2018-05-24 06:19:44.313-0500

Just wanted to inform you that we have a similar problem on one of our systems (about 870 pjsip users, about 359 registered phones, sometimes around 30 concurrent calls). We use chan_pjsip and not chan_sip. Unfortunatelly I can only tell you that we are facing the problem but have no possibility to get more informations at the moment as this is a production system and we have a local sip-client that is automatically restarting asterisk when a deadlock is detected. (So I usually can only look at the system when we already got the alert from our error-recovery tool and when asterisk has already been restarted). => I've started to make sip-traces to confirm that it was not a false alert. It seems like asterisk would still send out "new" packets (outgoing Options for example) but not respond to any incoming traffic (like REGISTER or INVITE).

By: nappsoft (nappsoft) 2018-05-24 07:52:43.217-0500

@Rusty Newton: what amount do you think would be appropriate as bounty for such an unspecific bug?

By: Richard Mudgett (rmudgett) 2018-05-24 08:09:58.049-0500

[~nappsoft] Rusty is no longer with Digium.  Please see the bug bounties wiki page below.

https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

By: nappsoft (nappsoft) 2018-05-24 08:28:58.716-0500

Well I've read that... However this issue is a bit difficult to reward with a bounty as: we have a deadlock that sounds similar to the issue described here. However there is no way to be sure that solving the issue described here will solve the issue we have. (Well... inband progress combined with MixMonitor and Local channels sounds suspicious however, as we use all of these things on this system as well...). I've spent at least two full weeks with debugging the PJSIP gethostbyname issue last autumn, so I know quite well how hard fixing one single issue can be. However debugging a deadlock, where one has "a concrete situation" that can be analyzed should be easyer, than finding the root cause of a crash that happens at a completely different place some seconds after the code has been executed that was causing the memory corruption... So I really have no idea what amount would be appropriate. 1500-2000$ maybe?



By: nappsoft (nappsoft) 2018-05-25 08:20:40.021-0500

Richard: refering to the the following backtrace: https://issues.asterisk.org/jira/secure/attachment/57446/thread-traces.1.txt

Just for clarification:

- isn't it the case that the channel locked on line 2660 of bridge_channel.c (unlocked on line 2722) is a different one than the one that is locked by res_fax in ast_bridge_peer (invoked on line 2718 by ast_indicate, so before the unlocking of this bridge happens)?

If so: isn't there "by design" a possibility that a lock of bridge b) can occur while bridge a) is still being locked by the same thread why on a different thread the same happens the other way round?

By: Richard Mudgett (rmudgett) 2018-05-25 17:27:13.658-0500

[~nappsoft] In Asterisk 13.21.0 bridge_channel.c, the bridge is locked on 2660 and unlocked on 2722.  Before unlocking the bridge the code calls ast_indicate() on 2718 which locks the channel.  Calling ast_indicate() there with the bridge locked allows the deadlock potential when dealing with local channels and the fax code.  The fax code is wanting to do a similar thing on the other end of the local channel.  I have mentioned seeing this deadlock potential in my earlier comments on this issue.  I was wanting debug confirmation that it really is this deadlock potential causing the issue and to better understand the deadlock scenario.  Locking in bridging code is easy to get wrong.

By: nappsoft (nappsoft) 2018-05-28 06:30:33.821-0500

Yes, you are right: you mentioned that. But I do not understand why more traces are needed? Somehow it seems to be the only possible root cause that makes sense (both involved locks are bridge locks) and on the other hand: even if the root cause would be a different one: it seems like ast_indicate inside a locked bridge has the potential to deadlock in some situations... so shouldn't this be fixed anyway? Or isn't it that easy (or even impossible) to invoke this kind of ast_indicate outside of the bridge lock? (I simply don't know as I do not know the concepts good enough). As David is able to reproduce the problem, he could simply try if the problem vanishes with a small patch? (Even with a "temproary trial and error debugging patch" instead of a "clean" one). So this might be a more pragmatic approach...

I am sorry, I really do not want to offend anyone. It's simply not always that easy to see the big picture as an "outsider". So I am sorry if I'm asking any stupid questions.

Btw: the deadlocks we observe do nothing have to do with the fax code. However we use lots of local channels as well (in Queues). So I'm not sure whether our problems are related with the problems described here or not as I was not yet able to get a trace.

By: nappsoft (nappsoft) 2018-05-28 07:22:06.714-0500

As there are other places in the same function invoking ast_indicate(bridge_channel->chan, AST_CONTROL_SRCCHANGE); doing this outside of the bridge lock seems to be ok... I just added a simple patch that should not harm in any way I guess

By: nappsoft (nappsoft) 2018-05-28 07:32:14.071-0500

use uint8_t instead of bool

By: nappsoft (nappsoft) 2018-05-28 07:38:26.553-0500

I've seen the "Must release any swap ref after unlocking the bridge" after uploading the patch.  Moved the ast_indicate code because of that.

By: David Brillert (aragon) 2018-05-28 10:27:15.394-0500

Thanks nappsoft I will test your patch asap. Easy for me to see if it avoids the deadlock since my lab scenario will deadlock on 100% of all test calls.  And if your patch works then at least its a great proof of concept and could see a review.

@richard we have found the reason for the missing symbols here. Asterisk is compiled correctly but our team uses rpms to manage Asterisk versions.  It is the packaging process that is removing the debugging and symbols.  Someone on my team is trying to resolve that issue so that we provide Digium with proper backtraces now and in the future.

By: David Brillert (aragon) 2018-05-29 09:56:10.669-0500

napppsoft's patch 3 remotes the deadlock in my lab.  I can no longer reproduce the deadlock or show any locks with the core show locks command.  Can we get a review of the patch so something can get committed?

By: Joshua C. Colp (jcolp) 2018-05-29 09:58:38.583-0500

[~aragon] All changes need to go through Gerrit[1] for review and inclusion.

[1] https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

By: David Brillert (aragon) 2018-05-30 16:10:07.565-0500

nappsoft has submitted his patch to gerrit https://gerrit.asterisk.org/9069
How about a review?

By: Joshua C. Colp (jcolp) 2018-05-30 16:26:40.625-0500

All changes that go onto Gerrit get reviewed. There is no need to ask for one.

By: Friendly Automation (friendly-automation) 2018-06-06 05:47:35.076-0500

Change 9069 merged by Joshua Colp:
bridge_channel.c: Fix Deadlock when using Local channels and fax gateway

[https://gerrit.asterisk.org/9069|https://gerrit.asterisk.org/9069]

By: Friendly Automation (friendly-automation) 2018-06-06 05:47:43.412-0500

Change 9095 merged by Joshua Colp:
bridge_channel.c: Fix Deadlock when using Local channels and fax gateway

[https://gerrit.asterisk.org/9095|https://gerrit.asterisk.org/9095]

By: Friendly Automation (friendly-automation) 2018-06-06 05:47:51.730-0500

Change 9096 merged by Joshua Colp:
bridge_channel.c: Fix Deadlock when using Local channels and fax gateway

[https://gerrit.asterisk.org/9096|https://gerrit.asterisk.org/9096]