[Home]

Summary:ASTERISK-29775: CPU spike
Reporter:Dan Cropp (daninmadison)Labels:confbridge
Date Opened:2021-11-23 16:50:20.000-0600Date Closed:
Priority:MinorRegression?
Status:Open/NewComponents:Applications/app_confbridge Channels/chan_local
Versions:16.22.0 18.8.0 Frequency of
Occurrence
Constant
Related
Issues:
is related toASTERISK-29474 core: Exceptionally long queue length with Record and ConfBridge
Environment:Ubuntu 18 and also Ubuntu 20Attachments:( 0) confbridge.conf
( 1) debug_log.txt
( 2) messages.txt
Description:We use AMI to do the following.  I don't think it's AMI causing the issue, but mentioning it.

Create 2 ConfBridges (A and B)
Create a Local channel 1234.  
Add one end of the local channel 1234;1 to ConfBridge A.
Add the other end of local channel 1234;2 to ConfBridge B.
Create another Local channel 5678.
Add one end of local channel 5678;2 to ConfBridge B.
Now on local channel 5678;1, we execute a Record.

The CPU spikes at this point.  I suspect the code is in some loop with processing the multiple calls with the ConfBridge.
Even when the Record completes, the CPU is still running 95+%.  The top PIDs are all asterisk running CPU% 95-96%, others four are 10-20%.
Basically, the box becomes so bogged down it runs into problems and can't process other things.

Even if I remove local channel 5678;2 from ConfBridge B, the CPU stays running at peak.

Once either ConfBridge has the calls kicked or either end of the local channel 1234 is removed from either ConfBridge, the CPU issue is resolved.

While local channels like this doesn't make sense.  We originally found this because we had PJSIP calls in different ConfBridges but needed the ConfBridges to be connected with a local channel.  All was fine until a customer started requiring us to have separate recording for each participant in one of the bridge.  That's when we started adding local channels as each participant joined and began recording.

By eliminating the PJSIP calls in the sample, it eliminates areas to look at.
Comments:By: Asterisk Team (asteriskteam) 2021-11-23 16:50:22.132-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Kevin Harwell (kharwell) 2021-11-23 17:19:30.587-0600

Sounds like you might have a scenario already setup. If so can you attach it to the issue?

Also, if you are able to easily replicate then please attach an Asterisk [debug log|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information] (debug set to at least 5) as it could be helpful.

As well please attach the dialplan, and _confbridge.conf_ file(s) to this issue.

Thanks!

By: N A (InterLinked) 2021-11-23 19:50:57.790-0600

Seems like this might be related to ASTERISK-29474?

By: Kevin Harwell (kharwell) 2021-11-24 09:37:19.009-0600

Looks like it might be. I've linked the issues.

By: Dan Cropp (daninmadison) 2021-11-24 11:40:17.113-0600

The confbridge.conf is a probably not too important to this.  As I mentioned, we're using AMI to control most of the call flow and confbridges.  If a setting doesn't match what is required, we change the variable on the fly through AMI actions on the channel.

In the debug_log, at 11:24:30:434, we initiate the EXEC RECORD and this is when things spike.  
debug_log file shows "res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 2 instead" happening regularly.
Even after the record completes dues to silence, the CPU remains high (ticks issue)..

I hangup the local channel that performed the recording at 11:24:48.712.
At this point, there are still the two ends of the other local channel.  Each end is in a different ConfBridge.
The ticks issue remains.

At 11:25:05.781, I perform the ConfBridgeKick to remove one end of the local channel from one of the ConfBridges.  At this point things are running much nicer.

I left a channel in the ConfBridge A, so it is continuing to process this portion even at the end of these logs.
The ConfBridge B is ended because there are no longer any channels remaining in it.

By: Kevin Harwell (kharwell) 2021-11-24 12:39:41.177-0600

Seeing lots of these in [^messages.txt]:
{noformat}
[11/24 11:24:31.227] WARNING[76060][C-00000007] channel.c: Exceptionally long queue length queuing to Local/5678@IS-00000005;1
[11/24 11:24:31.227] WARNING[76060][C-00000007] channel.c: Exceptionally long queue length queuing to Local/5678@IS-00000005;1
[11/24 11:24:31.227] WARNING[76060][C-00000007] channel.c: Exceptionally long queue length queuing to Local/5678@IS-00000005;1
{noformat}
Based on the descriptions and warnings it does indeed appear to be the same if not very similar to ASTERISK-29474. I'll open it up for now as such.