[Home]

Summary:ASTERISK-26469: Infinite loop after a dual Redirect
Reporter:Etienne Allovon (etienne_pf)Labels:
Date Opened:2016-10-14 09:17:49Date Closed:
Priority:MajorRegression?No
Status:Open/NewComponents:Bridges/bridge_simple
Versions:13.11.2 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-26789 Audit manipulation of channel flags without locks
Environment:Debian 8.1Attachments:( 0) 20161011-asterisk-bt-threads.txt
( 1) 20161011-asterisk-full.C-00006c9d.log
( 2) asterisk-26469-reproduce.py
( 3) gstack-infinite-loop.txt
Description:*Summary :*

Given I have two channels bridged together by asterisk.
Gvien I send an AMI commend Redirect (with channel and extra_channel set)
Then an asterisk thread may end up using 100% cpu. One of the channel thread is stuck in the while loop at the end of {{bridge_channel_internal_join()}} where it waits for the flag {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} to be cleared.

*More details :*

The problem occurs when sending an ami command to redirect two bridged channels to another extension (Redirect with channel and extra_channel set). During the process, one of the channel thread is stuck in the while loop at the end of {{bridge_channel_internal_join()}} where it waits for the flag {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} to be cleared:
{code}
   /*
    * Wait for any dual redirect to complete.
    *
    * Must be done while "still in the bridge" for ast_async_goto()
    * to work right.
    */
   while (ast_test_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT)) {
       sched_yield();
   }
{code}

The other channel thread seems to complete successfully.

One possible source of the problem lies in {{bridge_channel_internal_pull()}} which clears the flag {{AST_FLAG_OUTGOING}} without locking the channel first:
{code}
   /* If we are not going to be hung up after leaving a bridge, and we were an
    * outgoing channel, clear the outgoing flag.
    */
   if (ast_test_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_OUTGOING)
       && (ast_channel_is_leaving_bridge(bridge_channel->chan)
           || bridge_channel->state == BRIDGE_CHANNEL_STATE_WAIT)) {
       ast_debug(2, "Channel %s will survive this bridge; clearing outgoing (dialed) flag\n", ast_channel_name(bridge_channel->chan));
       ast_clear_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_OUTGOING);
   }
{code}
When in the manager thread, the {{action_redirect()}} clears the {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}}:
{code}
   /* Release the bridge wait. */
   if (chan1_wait) {
       ast_channel_lock(chan);
       ast_clear_flag(ast_channel_flags(chan), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT);
       ast_channel_unlock(chan);
   }
   if (chan2_wait) {
       ast_channel_lock(chan2);
       ast_clear_flag(ast_channel_flags(chan2), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT);
       ast_channel_unlock(chan2);
   }
{code}

So maybe the {{ast_clear_flag}} done inside the {{action_redirect()}} from the manager thread is overwritten by this call to {{ast_clear_flags}} in {{bridge_channel_internal_pull()}}
Comments:By: Asterisk Team (asteriskteam) 2016-10-14 09:17:49.833-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Etienne Allovon (etienne_pf) 2016-10-14 09:49:33.800-0500

Attached is
* an extract of the full log for the call which triggered the 100% CPU
* and the backtrace that was created from the core after having killed asterisk with 6ABRT signal
Note that the threads that were stuck are : 13468 and 21447.

By: George Joseph (gjoseph) 2016-10-14 11:07:46.157-0500

This was originally reported on IRC and I looked at it briefly.   bridge_channel_internal_pull doesn't do the lock and the channel isn't locked when bridge_channel_internal_pull is called via bridge_channel_internal_join (or bridge_channel_internal_push_full for that matter).


By: Richard Mudgett (rmudgett) 2016-10-14 11:30:58.176-0500

You are quite right.  It is possible for a reentrancy problem to happen with the channel flags if the timing were just so.  I did a quick look to see where {{ast_channel_flags()}} is called and found many places where the flags are modified that don't have the channel lock held.  The fix is to lock the channel while modifying the flags.

By: Sébastien Duthil (sduthil) 2017-04-07 16:45:30.283-0500

I'll attach a script that reproduces the issue within a few minutes.

What you need on the script side:
- Python 2 with requests and ari-py
- a shell

What you need on the server side:
- be able to run multiple threads simultaneously (I used 3 virtual CPU on a VM)
- an endpoint that can autoanswer (I used chan-test, a little test channel driver https://github.com/wazo-pbx/chan-test)
- a little bit of dialplan (detailed in the script)
- Asterisk listening on the HTTP port
- an ARI account
- a manager account

Change the variables in the script accordingly and run the the script in a loop, twice in parallel (e.g. in two different shells):
{{while true ; do python asterisk-26469-reproduce.py ; done}}

After a few minutes, one of the Asterisk threads will be stuck at 100% CPU and the associated channel cannot be hung up anymore.

We tried locking the channel when the {{AST_FLAG_OUTGOING}} flag is cleared as mentioned above, but this was not enough.

We made a patch that circumvents the problem, but is not The Right Way: https://github.com/wazo-pbx/asterisk/blob/master/debian/patches/wazo_dual_redirect. I'm just mentioning it in case someone desperately needs a fix, but I would not submit this patch for review. What the patch does is replace the {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} flag with a list of channels waiting to be dual-redirected.

I could not reproduce the issue when this patch was applied.

By: Jatin Jain (jatinjain) 2017-05-24 04:09:18.948-0500

I also faced the same issue. On some digging I found that in addition to bridge_channel_internal_pull where locking is not done, there is another instance possible in the same flow. In my case its in local_ast_moh_stop which is called from some other thread originated from ast_async_goto. This is called right after the AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT flag was set in manager.c.

By: Joshua C. Colp (jcolp) 2017-05-26 10:11:27.474-0500

ASTERISK-26789 has now been merged and this was fixed on that.

By: Jatin Jain (jatinjain) 2017-05-26 11:31:36.206-0500

There seems to be a mistake while merging.
In manager.c where we clear AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT flag. We need to do it for both chan and chan2. But you have called it for chan twice.

if (chan1_wait) {
ast_channel_clear_flag(chan, AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT);
}
if (chan2_wait) {
ast_channel_clear_flag(chan, AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT); //Need to call for chan2
}

By: Joshua C. Colp (jcolp) 2017-05-26 11:44:52.845-0500

Fix is up for it on Gerrit.

By: Friendly Automation (friendly-automation) 2017-05-26 12:57:49.369-0500

Change 5725 merged by Jenkins2:
manager: Clear the flag on the other channel.

[https://gerrit.asterisk.org/5725|https://gerrit.asterisk.org/5725]

By: Friendly Automation (friendly-automation) 2017-05-26 13:00:05.888-0500

Change 5726 merged by Jenkins2:
manager: Clear the flag on the other channel.

[https://gerrit.asterisk.org/5726|https://gerrit.asterisk.org/5726]

By: Friendly Automation (friendly-automation) 2017-05-26 13:00:33.533-0500

Change 5727 merged by Jenkins2:
manager: Clear the flag on the other channel.

[https://gerrit.asterisk.org/5727|https://gerrit.asterisk.org/5727]

By: Jatin Jain (jatinjain) 2017-06-05 03:12:53.117-0500

I applied the patch but still facing the same issue.

By: Jatin Jain (jatinjain) 2017-06-05 04:07:44.375-0500

Following threads stuck in while loop

TID - 715, 27247, 6210, 24800, 4987, 17574, 28482, 27470, 9872, 24622, 9969

By: Jatin Jain (jatinjain) 2017-06-05 04:08:41.872-0500

I have attached the gtack in which the threads were still stuck in while loop.

By: Mark Michelson (mmichelson) 2017-06-05 15:45:43.648-0500

If you don't have https://gerrit.asterisk.org/#/c/5624/ applied, you might want to try with that applied too. It is a patch that modified a lot of channel flags manipulations to ensure that the channel lock was held as necessary.

By: Jatin Jain (jatinjain@drishti-soft.com) 2017-06-05 16:32:07.183-0500

I have applied the patch. Its the same one as mentioned by Joshua. But the issue remains. The attached gstack is after the patch was applied.