Summary: | ASTERISK-26469: Infinite loop after a dual Redirect | ||||
Reporter: | Etienne Allovon (etienne_pf) | Labels: | |||
Date Opened: | 2016-10-14 09:17:49 | Date Closed: | |||
Priority: | Major | Regression? | No | ||
Status: | Open/New | Components: | Bridges/bridge_simple | ||
Versions: | 13.11.2 | Frequency of Occurrence | Frequent | ||
Related Issues: |
| ||||
Environment: | Debian 8.1 | Attachments: | ( 0) 20161011-asterisk-bt-threads.txt ( 1) 20161011-asterisk-full.C-00006c9d.log ( 2) asterisk-26469-reproduce.py ( 3) gstack-infinite-loop.txt | ||
Description: | *Summary :*
Given I have two channels bridged together by asterisk. Gvien I send an AMI commend Redirect (with channel and extra_channel set) Then an asterisk thread may end up using 100% cpu. One of the channel thread is stuck in the while loop at the end of {{bridge_channel_internal_join()}} where it waits for the flag {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} to be cleared. *More details :* The problem occurs when sending an ami command to redirect two bridged channels to another extension (Redirect with channel and extra_channel set). During the process, one of the channel thread is stuck in the while loop at the end of {{bridge_channel_internal_join()}} where it waits for the flag {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} to be cleared: {code} /* * Wait for any dual redirect to complete. * * Must be done while "still in the bridge" for ast_async_goto() * to work right. */ while (ast_test_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT)) { sched_yield(); } {code} The other channel thread seems to complete successfully. One possible source of the problem lies in {{bridge_channel_internal_pull()}} which clears the flag {{AST_FLAG_OUTGOING}} without locking the channel first: {code} /* If we are not going to be hung up after leaving a bridge, and we were an * outgoing channel, clear the outgoing flag. */ if (ast_test_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_OUTGOING) && (ast_channel_is_leaving_bridge(bridge_channel->chan) || bridge_channel->state == BRIDGE_CHANNEL_STATE_WAIT)) { ast_debug(2, "Channel %s will survive this bridge; clearing outgoing (dialed) flag\n", ast_channel_name(bridge_channel->chan)); ast_clear_flag(ast_channel_flags(bridge_channel->chan), AST_FLAG_OUTGOING); } {code} When in the manager thread, the {{action_redirect()}} clears the {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}}: {code} /* Release the bridge wait. */ if (chan1_wait) { ast_channel_lock(chan); ast_clear_flag(ast_channel_flags(chan), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT); ast_channel_unlock(chan); } if (chan2_wait) { ast_channel_lock(chan2); ast_clear_flag(ast_channel_flags(chan2), AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT); ast_channel_unlock(chan2); } {code} So maybe the {{ast_clear_flag}} done inside the {{action_redirect()}} from the manager thread is overwritten by this call to {{ast_clear_flags}} in {{bridge_channel_internal_pull()}} | ||||
Comments: | By: Asterisk Team (asteriskteam) 2016-10-14 09:17:49.833-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Etienne Allovon (etienne_pf) 2016-10-14 09:49:33.800-0500 Attached is * an extract of the full log for the call which triggered the 100% CPU * and the backtrace that was created from the core after having killed asterisk with 6ABRT signal Note that the threads that were stuck are : 13468 and 21447. By: George Joseph (gjoseph) 2016-10-14 11:07:46.157-0500 This was originally reported on IRC and I looked at it briefly. bridge_channel_internal_pull doesn't do the lock and the channel isn't locked when bridge_channel_internal_pull is called via bridge_channel_internal_join (or bridge_channel_internal_push_full for that matter). By: Richard Mudgett (rmudgett) 2016-10-14 11:30:58.176-0500 You are quite right. It is possible for a reentrancy problem to happen with the channel flags if the timing were just so. I did a quick look to see where {{ast_channel_flags()}} is called and found many places where the flags are modified that don't have the channel lock held. The fix is to lock the channel while modifying the flags. By: Sébastien Duthil (sduthil) 2017-04-07 16:45:30.283-0500 I'll attach a script that reproduces the issue within a few minutes. What you need on the script side: - Python 2 with requests and ari-py - a shell What you need on the server side: - be able to run multiple threads simultaneously (I used 3 virtual CPU on a VM) - an endpoint that can autoanswer (I used chan-test, a little test channel driver https://github.com/wazo-pbx/chan-test) - a little bit of dialplan (detailed in the script) - Asterisk listening on the HTTP port - an ARI account - a manager account Change the variables in the script accordingly and run the the script in a loop, twice in parallel (e.g. in two different shells): {{while true ; do python asterisk-26469-reproduce.py ; done}} After a few minutes, one of the Asterisk threads will be stuck at 100% CPU and the associated channel cannot be hung up anymore. We tried locking the channel when the {{AST_FLAG_OUTGOING}} flag is cleared as mentioned above, but this was not enough. We made a patch that circumvents the problem, but is not The Right Way: https://github.com/wazo-pbx/asterisk/blob/master/debian/patches/wazo_dual_redirect. I'm just mentioning it in case someone desperately needs a fix, but I would not submit this patch for review. What the patch does is replace the {{AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT}} flag with a list of channels waiting to be dual-redirected. I could not reproduce the issue when this patch was applied. By: Jatin Jain (jatinjain) 2017-05-24 04:09:18.948-0500 I also faced the same issue. On some digging I found that in addition to bridge_channel_internal_pull where locking is not done, there is another instance possible in the same flow. In my case its in local_ast_moh_stop which is called from some other thread originated from ast_async_goto. This is called right after the AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT flag was set in manager.c. By: Joshua C. Colp (jcolp) 2017-05-26 10:11:27.474-0500 ASTERISK-26789 has now been merged and this was fixed on that. By: Jatin Jain (jatinjain) 2017-05-26 11:31:36.206-0500 There seems to be a mistake while merging. In manager.c where we clear AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT flag. We need to do it for both chan and chan2. But you have called it for chan twice. if (chan1_wait) { ast_channel_clear_flag(chan, AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT); } if (chan2_wait) { ast_channel_clear_flag(chan, AST_FLAG_BRIDGE_DUAL_REDIRECT_WAIT); //Need to call for chan2 } By: Joshua C. Colp (jcolp) 2017-05-26 11:44:52.845-0500 Fix is up for it on Gerrit. By: Friendly Automation (friendly-automation) 2017-05-26 12:57:49.369-0500 Change 5725 merged by Jenkins2: manager: Clear the flag on the other channel. [https://gerrit.asterisk.org/5725|https://gerrit.asterisk.org/5725] By: Friendly Automation (friendly-automation) 2017-05-26 13:00:05.888-0500 Change 5726 merged by Jenkins2: manager: Clear the flag on the other channel. [https://gerrit.asterisk.org/5726|https://gerrit.asterisk.org/5726] By: Friendly Automation (friendly-automation) 2017-05-26 13:00:33.533-0500 Change 5727 merged by Jenkins2: manager: Clear the flag on the other channel. [https://gerrit.asterisk.org/5727|https://gerrit.asterisk.org/5727] By: Jatin Jain (jatinjain) 2017-06-05 03:12:53.117-0500 I applied the patch but still facing the same issue. By: Jatin Jain (jatinjain) 2017-06-05 04:07:44.375-0500 Following threads stuck in while loop TID - 715, 27247, 6210, 24800, 4987, 17574, 28482, 27470, 9872, 24622, 9969 By: Jatin Jain (jatinjain) 2017-06-05 04:08:41.872-0500 I have attached the gtack in which the threads were still stuck in while loop. By: Mark Michelson (mmichelson) 2017-06-05 15:45:43.648-0500 If you don't have https://gerrit.asterisk.org/#/c/5624/ applied, you might want to try with that applied too. It is a patch that modified a lot of channel flags manipulations to ensure that the channel lock was held as necessary. By: Jatin Jain (jatinjain@drishti-soft.com) 2017-06-05 16:32:07.183-0500 I have applied the patch. Its the same one as mentioned by Joshua. But the issue remains. The attached gstack is after the patch was applied. |