[Home]

Summary:ASTERISK-28229: Asterisk not responding/reloading
Reporter:Jerry Corrion (dice_telephony)Labels:
Date Opened:2019-01-02 10:44:55.000-0600Date Closed:2020-01-14 11:14:14.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/General Core/PBX PBX/General
Versions:13.22.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:CentOS 6.5 (Final), 4GB RAM, 2 core processor, 40GB HD FreePBX 12.0.33Attachments:( 0) ast_coredumper
( 1) core-asterisk-running-2019-01-10T09-05-23-0800-brief.txt
( 2) core-asterisk-running-2019-01-10T09-05-23-0800-full.txt
( 3) core-asterisk-running-2019-01-10T09-05-23-0800-thread1.txt
( 4) core-asterisk-running-2019-01-10T11-49-33-0800-brief.txt
( 5) core-asterisk-running-2019-01-10T11-49-33-0800-full.txt
( 6) core-asterisk-running-2019-01-10T11-49-33-0800-thread1.txt
( 7) core-asterisk-running-2019-01-10T11-53-34-0800-brief.txt
( 8) core-asterisk-running-2019-01-10T11-53-34-0800-full.txt
( 9) core-asterisk-running-2019-01-10T11-53-34-0800-thread1.txt
Description:Asterisk is reloading itself, sometimes several times a day. We do receive notifications from monit when this happens (failed protocol test [SIP] at [127.0.0.1]:5060 [UDP/IP] -- SIP: error receiving data -- Resource temporarily unavailable) and it will restart itself. We've been looking through logs trying to find something that triggers this. The best we can get is Asterisk stops communicating/responding to SIP until asterisk is reloaded.
Comments:By: Asterisk Team (asteriskteam) 2019-01-02 10:44:55.736-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2019-01-02 10:52:44.120-0600

We suspect that you have a deadlock occurring within Asterisk. Please follow the instructions on the wiki [1] for obtaining debug relevant to a deadlock. Once you have that information, attach it to the issue. Be sure the instructions are followed exactly as the debug may otherwise not be useful.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock



By: Joshua C. Colp (jcolp) 2019-01-02 10:54:24.638-0600

We would also need the configuration and console log you mentioned.

By: Jerry Corrion (dice_telephony) 2019-01-02 11:48:18.558-0600

Thank you for the quick response. For your convenience I've uploaded shortened versions of the logs to https://www.dropbox.com/sh/i176cvxgry9uf7a/AAAHW_1vSPIWEk1MGGoqfkw-a?dl=0
If you would like the full days' logs, they can be found at https://www.dropbox.com/sh/95izg8rd1jq6txy/AABbHr9D-mOzvYtzNo9wWHCsa?dl=0. Unfortunately we were unable to capture a debug on the 28th so I'm not sure how much help that particular file will be...

We are also working on getting the deadlock debug information, but that may take a while. Apologies for the question, but could you elaborate on what you mean by "configuration"?

By: Joshua C. Colp (jcolp) 2019-01-02 11:55:41.919-0600

Please attach logs so that they don't disappear. There is no time frame on when this will get looked into, so ensuring the logs stay here associated with the issue makes it easier for anyone who may look into it.

As for configuration information about how the system is used is useful.

As it is you use Asterisk with SIP, that's all anyone can determine from your comment - which makes it difficult to see potential problems.

By: Joshua C. Colp (jcolp) 2019-01-02 11:56:06.103-0600

Another example - are you using a database at all? are you using .conf files only? what system is this running on? is it a VM?

By: Jerry Corrion (dice_telephony) 2019-01-02 12:39:00.066-0600

Oh, that makes sense, sorry about the confusion. I've attached the shorter versions of the debug log files- unfortunately the full logs are much too large to be attached. Hopefully they are of some help. As for our setup, we're running on a VM with the following:

Hypervisor:    VMware ESXi, 5.5.0, 1623387
Model:    PowerEdge R720
Processor Type:    Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
Utilizing both mysql and conf files.

This build in particular is used in a 24/7 call center environment with roughly 20 users utilizing an autodialer service. Call flow consists of incoming calls ringing into a queue of about 14 people utilizing a ring-all strategy. Typically there are no more than a couple dozen calls during peak times (this includes calls waiting in queue). Lately we've begun seeing consistent issues around 8:30 - 9:00 a.m. log time (it also occurs at other times of the day, but most consistently here.)

Please let me know if there is any other further information I can provide that will be of assistance!
(still working on getting that debug log)

By: Joshua C. Colp (jcolp) 2019-01-02 12:47:06.328-0600

Sending back to feedback until a backtrace is available.

By: Jerry Corrion (dice_telephony) 2019-01-02 13:50:44.221-0600

Just a quick question about the backtrace... will you need data from when the issue occurs, or will it just need to be run once? Thanks in advance, my apologies for floundering so much - never dealt with an issue like this before.

By: Joshua C. Colp (jcolp) 2019-01-02 13:53:27.081-0600

Information like console log would be useful at the time, but getting the backtrace itself will show the state of the system.

By: Jerry Corrion (dice_telephony) 2019-01-02 14:15:52.548-0600

Back again with another stupid question... We've recompiled asterisk and followed the instructions in the link you gave, but when trying to run "sudo /var/lib/asterisk/scripts/ast_coredumper --running" we're receiving the following message:
"Asterisk is not running
No coredumps found"
I feel like we're missing something obvious here... Thanks again for your patience.

By: Joshua C. Colp (jcolp) 2019-01-02 14:18:44.577-0600

Was Asterisk running? Do you run it under a user? Do you have it installed in a different path?

By: Jerry Corrion (dice_telephony) 2019-01-02 14:24:39.948-0600

We have asterisk running under the user "asterisk", so I don't think that should be a problem... running ps -ef | grep asterisk shows the following:

asterisk 304 302 1 15:11 pts/2 00:00:04 /usr/sbin/asterisk -f -vvvg -c

We are attempting to run this on a dev machine (didn't want to fiddle around on a production machine until we knew what we were doing!) without any registered endpoints. Could this be causing a problem?

By: Joshua C. Colp (jcolp) 2019-01-03 05:10:18.551-0600

Please try the attached version of ast_coredumper which fixes a bug when finding the process.

By: Jerry Corrion (dice_telephony) 2019-01-07 13:34:38.547-0600

Apologies for the delay- here are the results of ast_coredumper. Hopefully this is what you were looking for. If not, please advise and we will work to get what you need.

By: Joshua C. Colp (jcolp) 2019-01-07 13:43:14.416-0600

The attached documents are not useful - they don't appear to contain any data. How have you built Asterisk? Did you follow the instructions to ensure it would produce viable information? Do you strip the binaries?

By: Jerry Corrion (dice_telephony) 2019-01-07 14:10:41.441-0600

My bad, looks like I forgot to recompile that one... Need more coffee.

Here are the files generated by running sudo /var/lib/asterisk/scripts/ast_coredumper --running. Asterisk 13.22.0 was recompiled with DEBUG_THREADS, MALLOC_DEBUG, DONT_OPTIMIZE and BETTER_BACKTRACES all selected.

By: Joshua C. Colp (jcolp) 2019-01-07 14:31:26.897-0600

Was the problem being experienced at the time? The backtrace itself shows no deadlocks or threads blocked inside of Asterisk itself.

By: Jerry Corrion (dice_telephony) 2019-01-08 07:11:00.306-0600

I don't believe the issue would have been happening during this backtrace (but we also wouldn't have a way to know until it crashed, either). The problem itself hasn't shown in a few days, so we're back to playing the waiting game. Is it safe to run Asterisk with all the debug flags on while we wait for this to happen? The article you linked earlier indicated the following... {quote}Enabling DEBUG_THREADS can seriously impact performance and should only be enabled if you suspect a deadlock may be occurring.{quote}
The article doesn't go into detail about how performance can be impacted, so we didn't think it was a good idea to keep this running on a production machine, since none of us knew what to expect. If we do keep this running, how would performance be impacted? Would we expect calls to drop/delay, or would this just be an instance of high CPU use?

It bears mentioning, but this PBX is used at a call center for public safety events and we cannot have extended downtime - for this reason, we use monit on the system to reboot asterisk once it experiences an issue. Am I correct in assuming that to get the data you need, we will need to let this system go down without an automatic reboot so we can get the deadlock information while it's in this state?

By: Joshua C. Colp (jcolp) 2019-01-08 07:15:18.821-0600

DEBUG_THREADS can incur over a 50% penalty hit, it wouldn't be needed for this.

In order to get a backtrace to show what is going on the problem has to be occurring at the time - it can't be collected afterwards.

By: Jerry Corrion (dice_telephony) 2019-01-08 07:39:00.888-0600

Guess I'm confused. You told us to follow the directions exactly as given in the wiki link, which we did, but now you're saying we shouldn't? Please let us know exactly what you want us to do to get the information you need. I'm more than willing to admit we don't have experience in this kind of debugging, which is exactly why we've come to you for help, but we can't keep taking this machine down - as I said before, this is a public safety PBX and necessity dictates that we coordinate closely with the location to schedule this sort of thing.

Currently I'm compiling with DEBUG_THREADS, MALLOC_DEBUG, DONT_OPTIMIZE and BETTER_BACKTRACES enabled. Aside from DEBUG_THREADS, are there others we should disable?

By: Joshua C. Colp (jcolp) 2019-01-08 07:44:54.360-0600

In an ideal world it would be nice if DEBUG_THREADS were present but due to the performance impact it is not necessary initially for it to be present. If you can and things operate fine, then it's okay to keep it. If it causes a problem then it can be disabled and the backtrace taken regardless. The DEBUG_THREADS functionality can just make it easier to get the information without having to dissect the output of the backtrace in complicated scenarios.

It very much depends on the particular environment and usage patterns as to the over all impact and whether it can be present or not.

The rest of the options are fine to keep on.

By: Jerry Corrion (dice_telephony) 2019-01-08 07:46:51.808-0600

Alright, we'll go ahead and disable that for now. I'll update the ticket once the pbx crashes again. Thanks for the help.

By: Jerry Corrion (dice_telephony) 2019-01-14 09:27:47.162-0600

Uploaded samples from a couple different crashes. Hope this helps. Thanks!

By: Kevin Harwell (kharwell) 2019-01-17 11:51:06.772-0600

[~dice_telephony], just want to clarify to make sure. Is Asterisk actually crashing or is it "hanging"/stuck and another process is stopping and then restarting Asterisk?

I noticed this in one of the log files (2018_12_31_crash.txt) right before a restart:
{noformat}
[2018-12-31 09:15:11] WARNING[12694] db.c: Couldn't bind key to stmt: out of memory
{noformat}
What's the memory and cpu utilization look like on the system and by Asterisk? You can use something like "top" to see how much memory is available, being used, and free.

By: Asterisk Team (asteriskteam) 2019-01-31 12:00:01.655-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines