[Home]

Summary:ASTERISK-28125: app_queue: Revert broken queue channel reference patch
Reporter:laszlovl (lvl)Labels:
Date Opened:2018-10-22 07:46:06Date Closed:2018-12-03 15:09:17.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Applications/app_queue
Versions:16.0.0 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:
Description:Commit 6409e7b11a2310196a9978b30a6b79e2760be592 introduces references from a queue to the caller & member channels, in an attempt to fix ASTERISK-25185. The patch is incomplete however and introduces regressions, so I propose reverting it until the root cause can be properly fixed. In the mean time, safely catch the fallout from this problem and report an error instead of crashing.

This all started with ASTERISK-25185. It was reported that app_queue's handle_transfer methods are occasionally unable to retrieve a channel snapshot from stasis. This would segfault Asterisk because of referencing a NULL pointer. In an attempt to fix it, each queue call got a reference to the member_channel and caller_channel, which are supposed to get cleaned up after the queue call is done.

However, that patch doesn't work properly. I'm not sure why, but despite the added channel references it's possible for `ast_channel_snapshot_get_latest(queue_data->caller_uniqueid)` or `ast_channel_snapshot_get_latest(queue_data->member_uniqueid)` to return NULL. I've experienced this myself on multiple occasions, and there are various other reports at ASTERISK-27006.

Secondly, the patch introduces a regression where a queue call causes ghost channels that don't disappear until Asterisk is restarted. Again I'm not sure about the root cause, but apparently queue_stasis_data_alloc is not always executed for some scenarios. I suspect this is caused by the Local channel driver not always properly passing data through in case of transfers/bridges/masquerades/surrogates/etc. Other symptoms besides ghost channels are missing hangup events on local channels. This was already reported here: ASTERISK-25844

I wrote a patch that reverts 6409e7b11a2310196a9978b30a6b79e2760be592 and adds proper NULL-checking for all queue event code, so a warning is logged when the scenario occurs. Not perfect, but certainly better than the segfaults that have been present for the past two years. I recommend merging this until the root cause for these stasis timing issues has been fixed. This patch passes all of my tests, I'll submit it to Gerrit so it's checked against the Asterisk testsuite as well.
Comments:By: Asterisk Team (asteriskteam) 2018-10-22 07:46:08.045-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Friendly Automation (friendly-automation) 2018-12-03 15:09:17.815-0600

Change 10726 merged by George Joseph:
app_queue: Revert broken queue channel reference patch

[https://gerrit.asterisk.org/10726|https://gerrit.asterisk.org/10726]

By: Friendly Automation (friendly-automation) 2018-12-03 15:09:54.087-0600

Change 10523 merged by George Joseph:
app_queue: Revert broken queue channel reference patch

[https://gerrit.asterisk.org/10523|https://gerrit.asterisk.org/10523]

By: Friendly Automation (friendly-automation) 2018-12-03 15:10:10.571-0600

Change 10727 merged by George Joseph:
app_queue: Revert broken queue channel reference patch

[https://gerrit.asterisk.org/10727|https://gerrit.asterisk.org/10727]