[Home]

Summary: ASTERISK-25655: core taskprocessors sorcery-control memory not cleared
Reporter: yaron nahum (yaronna)
Labels:
Date Opened: 2016-01-03 09:05:29.000-0600
Date Closed: 2018-07-10 17:13:04
Priority: Major
Regression?: Yes
Status: Closed/Complete
Components: Core/Sorcery, Core/Stasis
Versions: 13.6.0, 13.7.0
Frequency of Occurrence:
Related Issues: cannot be resolved before completion of ASTERISK-25683 res_ari: Asterisk fails to start if compiled with MALLOC_DEBUG
Environment:
Attachments: ( 0) asterisk_13.7_mem_allocation.tar.zip
( 1) asterisk_13.7_mem_sum_before_problem.txt
( 2) asterisk_13.7_mem_summary.txt
( 3) backrace-2016-02-02T15_52_09
( 4) J7_IVR_astersik_extensions_ael.txt
( 5) J7_IVR_astersik_messages-2016-1-3.txt
( 6) J7_IVR_astersik_top_log-2016-1-3.txt
( 7) J7_IVR_astersik_uptime_log-2016-1-3.txt
( 8) J7_IVR_server_props.txt
( 9) sorcery-control_mem_leak.txt
Description: The production environment is running with PJSIP.
After 2-3 days the memory on the machine rises rapidly to around 85% (of 4 GB RAM). Asterisk starts responding slowly, and the "core show taskprocessors" output shows a high In Queue count for "sorcery", sometimes "sorcery-control", and even "stasis-core-control".

In the case shown here, the sorcery-control queue is high.
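
A rough sketch of how this can be watched from the shell (the interval and grep pattern are only illustrative, not what we actually ran):

    # poll the taskprocessor queues once a minute and keep the suspicious ones
    watch -n 60 'asterisk -rx "core show taskprocessors" | grep -E "sorcery|stasis"'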

Further detail:

The issue started with 13.6 and was not solved in 13.7; before that we ran 13.2 with no issue.
I do not understand why it happens; there is no clear pattern. The issue started 15-16 days after installing the first 13.6 server. On another server (we have seven running 13.7) it started 3 hours after installing 13.7.

The Asterisk server handles around 13,000 calls per day. We use PJSIP and very few ODBC queries to a MySQL DB (the extensions.ael is attached). We have Kamailio in front of Asterisk, and in some cases it redirects the call back to Asterisk using a 302 to another extension.

The application is compiled with the DONT_OPTIMIZE and BETTER_BACKTRACES flags, and there is nothing else "special" about the installation.
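
A rough sketch of how these flags are enabled via menuselect before building (run from the Asterisk source tree; not the exact commands from our build):

    make menuselect.makeopts
    menuselect/menuselect --enable DONT_OPTIMIZE --enable BETTER_BACKTRACES menuselect.makeopts
    make && make install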

I do not see anything special in the logs other than some warnings about the AEL script (fixed :-)),
some PJSIP and Dial "unable to frame type..." warnings,
and one exactly about the taskprocessor that reached 500 tasks: "task processor queue reached 500 scheduled tasks".

The files attached are the AEL, the messages log, the server top and uptime logs, and the server VMware properties.
Comments:
By: Asterisk Team (asteriskteam) 2016-01-03 09:05:31.007-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-01-04 08:37:51.121-0600

Can you give us some snapshots or ideas of what the system is doing?

How many calls is it processing?

What kind of calls is it processing? What are the calls doing? Can you provide example dialplan? Logs?

Did this issue begin in 13.6.0? Or is this only the first installation for this particular environment?

By: yaron nahum (yaronna) 2016-01-06 09:12:59.782-0600

Hi

[Edit by Rusty - Moved this information to the issue description]

If you need any information from us just let us know.


By: Rusty Newton (rnewton) 2016-01-06 17:53:25.612-0600

Thanks! Everything you said in that comment should be in the description, so I'm going to copy and paste your response there.

By: Joshua C. Colp (jcolp) 2016-01-07 06:27:19.077-0600

There does indeed seem to be a taskprocessor with a large number of entries in queue. Can you please enable MALLOC_DEBUG and attach the output of:

"memory show summary"
"memory show allocations"

This should allow the source of the items in the taskprocessor to be determined.
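
Roughly, the requested procedure is (a sketch; file names are just placeholders):

    # enable MALLOC_DEBUG, rebuild and restart Asterisk
    menuselect/menuselect --enable MALLOC_DEBUG menuselect.makeopts
    make && make install
    # once the queue has grown, capture the CLI output for attachment
    asterisk -rx "memory show summary" > mem_summary.txt
    asterisk -rx "memory show allocations" > mem_allocations.txt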

By: yaron nahum (yaronna) 2016-01-07 07:32:49.813-0600

When I compile with MALLOC_DEBUG the server doesn't come up. I had the same issue on release 13.5 (ASTERISK-25383), and it should have been fixed.

Any suggestions?

By: Joshua C. Colp (jcolp) 2016-01-07 07:39:14.827-0600

Can you provide additional information besides "the server doesn't come up"? That doesn't help to determine why you can't use MALLOC_DEBUG.

By: yaron nahum (yaronna) 2016-01-11 00:59:58.086-0600

I have a backtrace; I will open another issue.
Regarding this issue: I have another server with release 13.7 and I am able to load Asterisk with MALLOC_DEBUG, so it seems to be configuration related. I will try to find a way to activate MALLOC_DEBUG on the original server and get back to you with the information.

By: yaron nahum (yaronna) 2016-01-11 06:46:30.054-0600

Attached "memory show summary" output before the problem occured. I will attach another output once the problem is reproduced.


By: Rusty Newton (rnewton) 2016-01-11 19:00:03.746-0600

Linking to ASTERISK-25683 that Yaron opened in regards to the MALLOC_DEBUG issue.

By: Joshua C. Colp (jcolp) 2016-01-12 19:30:15.705-0600

Assigning as waiting for memory information.

By: Daniel Heckl (DanielYK) 2016-01-19 15:01:00.120-0600

For me, a phone call was aborted because of the error (version 13.7.0).

[2016-01-19 21:51:48] WARNING[13710][C-0000003b]: taskprocessor.c:737 taskprocessor_push: The '1282acba-8062-4026-855c-559466685ca4' task processor queue reached 500 scheduled tasks.

By: yaron nahum (yaronna) 2016-01-26 03:50:58.410-0600

The memory leak occurred after a week. It only happened on servers that had the ARI module loaded.
Attached is the memory summary after the leak. I have the memory allocations output, but it is too big to upload; if you need anything from it, just specify what and I will look for it and attach it.

By: Joshua C. Colp (jcolp) 2016-01-26 07:54:35.387-0600

Anything from stasis_channels.c and json.c in it would be useful. Can you create a tarball with it so it is compressed?
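
For example, something along these lines would work (file names here are placeholders, not the actual attachment names):

    # keep only the stasis_channels.c and json.c allocations, then compress
    grep -E 'stasis_channels\.c|json\.c' mem_allocations.txt > mem_allocations_filtered.txt
    tar -czf mem_allocations_filtered.tar.gz mem_allocations_filtered.txt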

By: yaron nahum (yaronna) 2016-01-28 01:28:34.280-0600

Zipped and uploaded it. Have Fun.

By: Rusty Newton (rnewton) 2016-01-31 09:42:48.010-0600

[~yaronna] have you found a simple way to reproduce this at a small scale?

Using your dialplan, can you instruct us on a way to reproduce the issue with a few calls, where the issue will still be observable?



By: yaron nahum (yaronna) 2016-02-02 04:09:14.058-0600

Sorry, I was not able to reproduce the issue in my lab.
It only occurs on my live servers. Sometimes it takes a few hours to reproduce and sometimes several weeks. It seems to me to be related to a certain event that triggers this issue.
I can say that it only occurs when the ARI modules are loaded. I have unloaded the ARI modules on half of my servers and the problem has never reproduced on those servers, while on the other half, with the ARI modules loaded, it has reproduced several times.

By: yaron nahum (yaronna) 2016-02-03 01:43:56.897-0600

Yesterday we had a segmentation fault on one of our servers that was running with the ARI modules.
I am attaching it to this case. If you think it is not related, I will open a new issue.

By: Rusty Newton (rnewton) 2016-02-04 09:54:35.565-0600

You should attach the new backtrace as .txt so it can be easily accessed without downloading.

By: Corey Farrell (coreyfarrell) 2017-12-29 20:24:49.398-0600

Can you check if this issue still exists in a current version of Asterisk? A large number of fixes have been made since this issue was reported, so it would be good to see if the issue remains.

The crash you had is fixed by ASTERISK-27340; currently that patch is only released in 13.19.0-rc1 and 15.2.0-rc1. That patch is important to avoid crashes if you enable BETTER_BACKTRACES with MALLOC_DEBUG.

By: Richard Mudgett (rmudgett) 2017-12-30 09:49:35.947-0600

Specifically, many fixes involving blocked task processors have been made since you reported this issue.  In v13.8.0, task processors got meaningful names instead of UUIDs specifically to help with identifying deadlocked task processors as you have in [^sorcery-control_mem_leak.txt].