ASTERISK-25000: Deadlock in ast_do_masquerade (specifically in ast_hangup on the zombie clone if it's hungup during the masquerade)

[Home]

Summary: ASTERISK-25000: Deadlock in ast_do_masquerade (specifically in ast_hangup on the zombie clone if it's hungup during the masquerade)

Reporter: William luke (greenlightcrm) Labels:

Date Opened: 2015-04-23 05:45:48 Date Closed: 2018-01-02 08:30:35.000-0600

Priority: Critical Regression?

Status: Closed/Complete Components: Core/Channels

Versions: 11.16.0 Frequency of
Occurrence Frequent

Related
Issues:

Environment: CentOS 6. Dual Xeon Dell server. Under relatively heavy load (250k calls/day), with lots of AMI actions. Attachments: ( 0) backtrace-threads-20150422.txt
( 1) backtrace-threads-20150427-first.txt
( 2) dialplan_snippet.txt
( 3) verboselog.rar

Description: We're seeing a deadlock where the AMI thread completely locks up. (Thread ID 19109 in backtrace attachment)

A backtrace shows that it's while doing a dual redirect.
When redirecting the second channel (from manager.c:3895), inside ast_do_masquerade, we decide the clone was a zombie, and then in channel.c line 7331 call ast_hangup on it.
This ast_hangup tries to grab a channel lock (channel.c:2885) and hangs here indefinitely.
What's peculiar is that a few lines higher up it's successfully managed to grab and then release this same channel lock.
It would seem that as the masquerade begun, this channel (clonechan) had at the same moment hungup. (see line 212288 in the verboselog attachment. The channel in question is "SIP/gl-agw-01-000f7f0e")
So something has happened to the state of this channel and something has not release it's channel_lock.
I'm unable to see which other thread is holding the lock.

The issue occured at 15:32:20 in the verboselog file. The first part of the dual redirect can be seen at line 212279.

I executed a "core restart" via the CLI, but this hung the CLI, and I had to kill the Asterisk process.

Comments: By: William luke (greenlightcrm) 2015-04-23 05:47:04.032-0500

backtrace
By: William luke (greenlightcrm) 2015-04-23 06:00:25.264-0500

verbose log of issue occuring
By: William luke (greenlightcrm) 2015-04-23 06:05:55.378-0500

snippet of the parts of the dialplan that the effected channels are being redirected to.
The first channel in the dual redirect (SIP/cri-ast-01-000ca069) gets redirected to 1622@gl-agentconf:1
The second channel, the one that causes the deadlock (SIP/gl-agw-01-000f7f0e) gets redirected to 1622@gl-customerconf:1
The purpose of these extensions is to put the channels into a conference.
This used to happen locally, but for performance reasons they are passed to another box which then conferences then together.
Once these channels had been redirected, then we'd put another call into the conference.
By: William luke (greenlightcrm) 2015-04-23 06:12:08.137-0500

I also have a full log of all AMI traffic during the issue. Let me know if any of that would be useful.

The AMI action that was sent is:

Action: Redirect
ActionID: cd9e08bf-646e-494c-a36c-f7259ff9f4d0
Channel: SIP/cri-ast-01-000ca069
ExtraChannel: SIP/gl-agw-01-000f7f0e
Exten: 1622
ExtraExten: 1622
Priority: 1
ExtraPriority: 1
Context: gl-agentconf
ExtraContext: gl-customerconf
By: Rusty Newton (rnewton) 2015-04-27 15:31:10.444-0500

William, can you get the output of "core show locks" as described [at this wiki page|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock] ?

By: William luke (greenlightcrm) 2015-04-27 16:19:41.484-0500

Hi,

Thanks for looking over this.

Enabling DEBUG_THREADS was one of the things I tried - unfortunately it kills the server very quickly due to the additional overhead.

I've got some additional backtraces of what I believe is the same issue occurring again. It appears that a <ZOMBIE> is being left locked after a masquerade. In the latest example it's the devicestate thread that's getting deadlocked when iterating over the list of channels (it's holding the channel container lock, and then trying to lock the channel and stalling there).

Why do we add the clonechan back to the container in ast_do_masquerade?:

if (!clone_was_zombie) {
ao2_link(channels, clonechan);
}

Is the intention that the null frame we queued to it will then cause it to be cleared off? If so, where would this happen, and what about other threads accessing it inbetween?

Thanks again!

By: William luke (greenlightcrm) 2015-04-28 03:51:50.359-0500

This is what looks like the same issue occurring, although presenting slightly differently as in this case it was the devicestate thread (12268) that fell over the rogue zombie locked channel (SIP/gl-agw-01-000a6da5<ZOMBIE>), whilst holding the lock to the channel container, and then everything piles up behind that.
A masquerade had recently completed, leaving that zombie thread behind, but the thread seemingly never hangs up.
The thread running this channel (20167) is also locked, which is odd, as I was half expecting it to have been stuck inside hangup perhaps, but it would seem not.
Is it remotely possible ast_channel_unlock could fail?
By: Joshua C. Colp (jcolp) 2017-12-18 09:11:18.004-0600

We have made major changes and reduced the amount that masquerading is done in Asterisk 13. Have you seen this problem there?
By: Asterisk Team (asteriskteam) 2018-01-02 08:30:35.075-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].
[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines