[Home]

Summary:ASTERISK-28093: pbx: Deadlock from holding channel lock when it shouldn't be
Reporter:Nicolas Tanski (nicolastanski)Labels:crash deadlock
Date Opened:2018-10-05 11:42:12Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:Core/PBX
Versions:13.18.5 13.22.0 13.23.1 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Debian 8.11 64 bits Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz - 8 cores 32GB RAMAttachments:( 0) backtrace-threads.txt
( 1) messages-20181005
Description:Hi Team,

we are experiencing frequent crash issues in the Asterisk instance, we are using our own dialer solution with AMI protocol.

The Asterisk service operates on average with 250 simultaneous calls, 40 record calls, the load is between 1 and 3, in some cases the load quickly rises to 10, which causes the rejection of new calls and the instance of Asterisk does not respond.

We were only able to restart the service by killing the Asterisk service with the Kill PID command.

It was enabled the flags DONT_OPTIMIZE, BETTER_BACKTRACES and MALLOC_DEBUG to try identify the problem.

Summary
Average simultaneous calls: 347
Average record calls: 40
Load Average: > 1 and < 3
High Load Average: 25.26

After the crash in Asterisk, we execute the commands and the service returns to normal operation
\# kill -9 PID_ASTERISK
\# /etc/init.d/asterisk start
Comments:By: Asterisk Team (asteriskteam) 2018-10-05 11:42:14.206-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2018-10-05 12:44:11.160-0500

You've provided a backtrace but did not state what version of Asterisk it is against (you've got multiple affected versions). Can you also provide console output before the deadlock?

By: Richard Mudgett (rmudgett) 2018-10-05 14:33:15.015-0500

[^backtrace-threads.txt] - It's a deadlock involving the channels container.  It appears to be a deadlock between thread 64 and 796.

* Thread 64 has the channels container lock and wants the 0x7f20ce706e00 channel lock.  It is trying to check if the uniqueid is unique.
* Thread 793 Wants the 0x7f20ce706e00 channel lock to queue a frame onto the local channel.  Without the ASTERISK-27094 fix in 13.22.0 and later this thread would have deadlock potential.
* Thread 796 Likely has the 0x7f20ce706e00 channel lock and wants the channels container lock to hangup that channel.  (I've found another reason to not like RAII_VAR.  In this case the backtrace doesn't show which exit path in pbx_outgoing_attempt() RAII_VAR is cleaning up.)


By: Cirillo Ferreira (cirillor@lbv.org.br) 2018-10-08 13:33:06.904-0500

[~rmudgett],

The channel's lock in 'ast_channel_by_uniqueid_cb' may not be necessary if uniqueid never is changed.

If remove it, maybe the deadlock is eliminated.

{code}
ast_channel_lock(chan);
if ((!id_len && strcasecmp(ast_channel_uniqueid(chan), uniqueid))
|| (id_len && strncasecmp(ast_channel_uniqueid(chan), uniqueid, id_len))) {
ret = 0; /* uniqueid match failed, keep looking */
}
ast_channel_unlock(chan);
{code}

By: Nicolas Tanski (nicolastanski) 2018-10-08 13:50:04.004-0500

The backtrace-threads.txt refers to version 13.23.1.

By: Richard Mudgett (rmudgett) 2018-10-08 13:50:26.316-0500

That will not eliminate the deadlock.  All you would be doing is moving the deadlock.

The locking order is the channels container lock then the individual channel lock.  There are a lot of loops that hold the channels container lock to walk the container and lock each channel in turn.

Some exit path in thread 796's stack frame for pbx_outgoing_attempt() still has the channel lock held when it should not.

By: Nicolas Tanski (nicolastanski) 2018-10-08 13:53:36.280-0500

Log generated in version 13.23.1

Problem it occurred at 12:23 PM.

By: Cirillo Ferreira (cirillor@lbv.org.br) 2018-10-29 12:30:49.026-0500

Apparently the deadlock occurs when the maximum load limit is reached.

ps: the load reached 25.26 in a few seconds.

{code:title=backtrace-threads.txt}
Thread 796 (Thread 0x7f20945b3700 (LWP 59199)):
...
#15 0x00000000005841cf in pbx_outgoing_attempt (type=0x7f20cc780a90 "Local", cap=0x7f20cf571798, addr=0x7f20cc780a98 "0111143370692@agentes", timeout=27000, context=0x7f20cc780ab0 "agentes", exten=0x7f20cc780aba "10516", priority=1, app=0x0, appdata=0x0, reason=0x7f20945b2d6c, synchronous=1, cid_num=0x0, cid_name=0x0, vars=0x0, account=0x8e007a <__ast_string_field_empty_buffer+2> "", locked_channel=0x7f20945b2d60, early_media=0, assignedids=0x7f20945b2cf0) at pbx.c:7612
#16 0x0000000000584265 in ast_pbx_outgoing_exten (type=0x7f20cc780a90 "Local", cap=0x7f20cf571798, addr=0x7f20cc780a98 "0111143370692@agentes", timeout=27000, context=0x7f20cc780ab0 "agentes", exten=0x7f20cc780aba "10516", priority=1, reason=0x7f20945b2d6c, synchronous=1, cid_num=0x0, cid_name=0x0, vars=0x0, account=0x8e007a <__ast_string_field_empty_buffer+2> "", locked_channel=0x7f20945b2d60, early_media=0, assignedids=0x7f20945b2cf0) at pbx.c:7792
#17 0x000000000054eff6 in fast_originate (data=0x7f20cd6b6680) at manager.c:5252
#18 0x0000000000600ed3 in dummy_start (data=0x7f20cfab1cb0) at utils.c:1239
#19 0x00007f21be162064 in start_thread (arg=0x7f20945b3700) at pthread_create.c:309
#20 0x00007f21bd10562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{code}

{code:title=messages}
[2018-10-05 12:23:02.214] WARNING[59202] pbx.c: Maximum loadavg limit of 10.000000 load exceeded by 'Local/0111143370692@agentes-0002aa80;1' (currently 25.260000)!
[2018-10-05 12:23:02.214] ERROR[59202] pbx.c: Failed to start PBX on Local/0111143370692@agentes-0002aa80;1
{code}