ASTERISK-22936: Deadlock during masquerade when a PJSIP channel attended transfers a 3+ party bridge to dialplan

Summary: ASTERISK-22936: Deadlock during masquerade when a PJSIP channel attended transfers a 3+ party bridge to dialplan

Reporter: Jonathan Rose (jrose)
Labels:
Date Opened: 2013-12-03 11:37:20.000-0600
Date Closed: 2013-12-19 11:24:01.000-0600
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Channels/chan_pjsip, Resources/res_pjsip_refer
Versions: SVN, 12.0.0-beta2, 12.0.0
Frequency of Occurrence: Constant
Related Issues:
Environment: Ubuntu 13.10 64-bit
Attachments: (0) lockdata.txt
             (1) pjsip_masquerade_fixup_blocked.diff

Description: If a PJSIP channel performs an attended transfer of a three-way call to the dialplan (as opposed to another bridge), Asterisk will attempt to swap, via masquerade, a local channel with the transferer's channel that is currently in the dialplan. I've tracked this to a sticking point in ast_do_masquerade: an attempt to lock the channels container never returns.

I tried a similar transfer scenario with chan_sip and that worked fine, so it seems likely that we are breaking some (possibly unwritten) rules when issuing ast_bridge_transfer_attended from res_pjsip_refer.c.

I've marked this issue as critical since this bug can easily be triggered by any PJSIP line with the ability to do attended transfers, and once it has happened Asterisk becomes less than usable since the channels container lock is stuck. Yay masquerades.

Steps for reproduction:

* Start a three-way call. At least one of the three channels must be a PJSIP line.
* Perform a SIP attended transfer from the PJSIP line to an extension that will linger in the dialplan. Answer followed by Wait(20) works just fine.
* Complete the transfer. The transfer will fail and the channels lock will be stuck, preventing any new channels from being made and preventing old channels from... doing lots of stuff.

Comments:

By: Mark Michelson (mmichelson) 2013-12-03 11:44:38.559-0600

How are you creating the three-way call? Are you using ARI, using the attended transfer three-way feature, or something else?

Getting output of "core show locks" could be useful here since the channel container lock may be held by some thread when we aren't expecting it to be.

By: Jonathan Rose (jrose) 2013-12-03 12:19:24.657-0600

I created the three-way call by dialing with the attended transfer feature and completing the transfer with *3.
I'll try to grab core show locks output.

By: Jonathan Rose (jrose) 2013-12-03 12:34:48.135-0600

Attached core show locks data as lockdata.txt

This reflects the state of the system after attempting to complete the transfer. Prior to that the output was empty at all points I checked.

By: Mark Michelson (mmichelson) 2013-12-03 12:45:21.640-0600

Yuck. It looks like there are two masquerades happening at the same time and they are deadlocking. This...this is just the most disgusting thing I've ever seen.

It's weird that the one in the bottom thread of lockdata.txt is not completing though. I wonder what it is that's causing it to block indefinitely. Attaching gdb during the deadlock could probably be useful so you can inspect the stacks and find out the circumstances (like what channels are involved) behind the masquerades.

By: Jonathan Rose (jrose) 2013-12-03 15:08:50.501-0600

Attached is a patch that fixes the deadlock (which occurred because a synchronous task is pushed when the task processor is also blocked). According to Josh, this might open up something of a can of worms if other tasks in the task processor perform certain functions against what will become a zombie channel once the action is queued and the masquerade continues.

By: Matt Jordan (mjordan) 2013-12-07 18:38:39.737-0600

I don't think this patch will work. You can't delay a masquerade. Failing to complete the fixup here will result in a masqueraded (zombie) channel with an active session. Anything using that session will be pointing to the zombie channel, and - assuming they don't blow up in some spectacular fashion - have a high likelihood of simply failing.

Even if we assume that the task can be put at the front of the taskprocessor queue, any operation currently in flight will be operating on an invalid channel.

I only see two possibilities here:

# Fix the locking inversion between the task processor and the entity pushing the synchronous task. That is, you can't hold a channel lock when pushing a synchronous task.
# Go with a modified version of the patch, but assume that the session object has to be protected as well.
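
The first possibility above (never hold a channel lock while pushing a synchronous task) can be sketched with plain pthreads. This is only an illustration under stated assumptions, not Asterisk code: channel_lock, fixup_task, push_synchronous and fixup_with_lock_dropped are hypothetical names, and a short-lived worker thread stands in for the task processor.

```c
#include <pthread.h>

/* Hypothetical stand-in for a channel lock (not an Asterisk API). */
static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;
static int fixup_count;

/* Task run on the worker thread. It needs the channel lock, so it
 * can never finish while the pushing thread still holds that lock. */
static void *fixup_task(void *data)
{
    (void) data;
    pthread_mutex_lock(&channel_lock);
    fixup_count++;
    pthread_mutex_unlock(&channel_lock);
    return NULL;
}

/* Hypothetical synchronous push: hand the task to a worker thread and
 * block until it has completed. Returns 0 on success, -1 on failure. */
static int push_synchronous(void *(*task)(void *), void *data)
{
    pthread_t worker;

    if (pthread_create(&worker, NULL, task, data) != 0) {
        return -1;
    }
    return pthread_join(worker, NULL) == 0 ? 0 : -1;
}

/* Called with channel_lock held; returns with channel_lock held.
 * The lock is released around the synchronous push so the worker can
 * take it. Calling push_synchronous() without the unlock would hang. */
static int fixup_with_lock_dropped(void)
{
    int res;

    pthread_mutex_unlock(&channel_lock);
    res = push_synchronous(fixup_task, NULL);
    pthread_mutex_lock(&channel_lock);
    return res;
}
```

A caller would take channel_lock, call fixup_with_lock_dropped(), and release the lock afterwards. The catch is that the channel is unprotected while the lock is dropped, so any state read before the call has to be checked again once the lock is retaken.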

By: Jonathan Rose (jrose) 2013-12-09 17:02:12.331-0600

The patch on https://reviewboard.asterisk.org/r/3042/ was committed based on current discourse, but I'm leaving the issue open for the time being. We may need to address this more properly down the road.