ASTERISK-27247: Asterisk not responding for 5 to 15 seconds

Summary: ASTERISK-27247: Asterisk not responding for 5 to 15 seconds

Reporter: Daniel Journo (journo)
Labels:
Date Opened: 2017-09-03 10:59:32
Date Closed:
Priority: Major
Regression?
Status: Open/New
Components: Channels/chan_pjsip
Versions: 13.5.0
Frequency of Occurrence: Constant
Related Issues: is related to ASTERISK-27242 Asterisk stops responding to packets
Environment:
Attachments:

Description: Asterisk intermittently stops responding to inbound SIP packets. Each outage lasts about 10 to 15 seconds. During that time, Asterisk can still send OPTIONS, but doesn't respond to REGISTERs or INVITEs.
This is happening on two servers. Both had been running for over 4 months without this issue occurring, and no config changes were made around the time it started.
George Joseph will be attaching a packet capture showing the issue. Because the issue occurs at random, it's almost impossible to obtain a core dump from the running process while it is happening.

Comments:

By: Asterisk Team (asteriskteam) 2017-09-03 10:59:34.187-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Richard Mudgett (rmudgett) 2017-09-03 14:17:16.180-0500

This behavior is intentional for pjsip channels when task processors in the system get a high backlog of work to do. The pjsip channel driver intentionally stops responding to *new* work while the task processors work off the overload.

I think you meant to set the version to 13.15.0, as 13.5.0 would give you a wall of UUIDs for task processor names, whereas 13.8.0 and later give human-readable names [1].

[1] http://blogs.asterisk.org/2016/07/13/asterisk-task-processor-queue-size-warnings/
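
For readers unfamiliar with this kind of overload protection, the behaviour described above (refusing *new* work until the backlog drains) can be sketched as a gate with a high and a low water mark. This is only an illustration in Python, not Asterisk's code: the class name `BackpressureGate`, its methods, and the threshold values are all hypothetical.

```python
from collections import deque


class BackpressureGate:
    """Toy high/low water mark gate (hypothetical, not Asterisk's implementation)."""

    def __init__(self, high_water=500, low_water=450):
        # Threshold values here are illustrative assumptions only.
        if not 0 <= low_water < high_water:
            raise ValueError("low_water must be below high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.queue = deque()
        self.overloaded = False

    def offer(self, task):
        """Queue new work, or refuse it while the backlog is draining."""
        if self.overloaded:
            return False  # the new request is ignored rather than answered
        self.queue.append(task)
        if len(self.queue) >= self.high_water:
            self.overloaded = True
        return True

    def run_one(self):
        """Process one queued task; clear the overload flag once drained enough."""
        if not self.queue:
            return None
        task = self.queue.popleft()
        if self.overloaded and len(self.queue) <= self.low_water:
            self.overloaded = False
        return task
```

The gap between the two thresholds keeps the gate from flapping open and closed around a single value; from the outside, the period between tripping the high mark and draining to the low mark looks like a server that has gone silent.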
By: Charlie Smurthwaite (catphish) 2017-09-04 05:09:43.219-0500

I'd just like to +1 this bug. I am running a combination of Asterisk 13.15.rc1 and 13.17.0. On 30th September (the same day reported by Dan) all of our Asterisk installations started exhibiting this same behaviour when under load (i.e., it only occurs during the working day when call volume is high). The SIP unresponsiveness can last anywhere up to 5 minutes.

When this occurs, I am seeing some task backlog in core show taskprocessors, specifically in "subm:rtp_topic-0000025d".

Edit: When unloading the HEP module, the backlog shown in "core show taskprocessors" disappears, but the issue continues to occur.
By: Charlie Smurthwaite (catphish) 2017-09-04 10:05:49.259-0500

It appears that my issue was likely the result of a DNS name (used by a SIP provider I peer with) not responding in a timely manner, causing long hangs while Asterisk waited for a DNS response. I have confirmed via IRC that Daniel Journo peers with the same provider and I would suggest that this is possibly the underlying cause of both of our issues. If I'm correct, I would suggest the correct solution is some form of negative DNS caching in the resolver that Asterisk uses. I have not seen any further issues since blacklisting the domain name in my DNS, but I will keep an eye on things. Thanks to everyone on IRC for the assistance with this.
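
To illustrate what "negative DNS caching" could mean here, the following Python sketch remembers failed lookups for a short time so that repeated requests for the same dead name fail immediately instead of each waiting on the resolver. It is not based on the resolver Asterisk uses; the class `NegativeCachingResolver`, its parameters, and the default TTL are hypothetical.

```python
import socket
import time


class NegativeCachingResolver:
    """Wrap a lookup function and cache failures (hypothetical sketch)."""

    def __init__(self, lookup=socket.gethostbyname, negative_ttl=60.0,
                 clock=time.monotonic):
        self._lookup = lookup              # function: hostname -> address
        self._negative_ttl = negative_ttl  # seconds to remember a failure
        self._clock = clock
        self._failed_until = {}            # hostname -> expiry timestamp

    def resolve(self, hostname):
        now = self._clock()
        expires = self._failed_until.get(hostname)
        if expires is not None:
            if now < expires:
                # Fail fast without touching the network.
                raise LookupError(f"{hostname}: cached resolution failure")
            del self._failed_until[hostname]
        try:
            return self._lookup(hostname)
        except OSError:
            # socket.gaierror and socket timeouts are OSError subclasses.
            self._failed_until[hostname] = self._clock() + self._negative_ttl
            raise
```

Note that this only helps from the second lookup onward: the first failing lookup still blocks for as long as the underlying resolver takes, so a real fix would also need a bounded timeout or asynchronous resolution.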
By: Daniel Journo (journo) 2017-12-19 05:57:32.866-0600

As a side note, I just realised that my astdb has over 100k entries. I think that was causing Asterisk to stop responding as it was waiting for queries to complete on that database.
I've not had any issues since clearing out the database.

Maybe add a CLI warning when the number of records reaches a threshold that could cause problems?
By: Charlie Smurthwaite (catphish) 2017-12-19 06:14:44.505-0600

I'd be interested to know why this causes a problem. 100,000 entries should not really be a problem, unless for some reason it's fetching all the entries.
By: Daniel Journo (journo) 2017-12-19 06:24:21.296-0600

PJSIP uses SQL LIKE queries which are slow when dealing with that number of records.
By: Charlie Smurthwaite (catphish) 2017-12-19 06:27:22.375-0600

That's interesting, I don't know much about SQLite, but I wonder if these queries could be improved to make them more index-friendly. Glad you were able to easily resolve it by clearing the database though.
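
As a rough illustration of the index-friendliness point, the Python sketch below compares a prefix `LIKE` query with an equivalent range query on a SQLite table. The table layout (a `key`/`value` table named `astdb`), the key names, and the helper functions are assumptions made for the example; they are not taken from the Asterisk source, and the queries are not the ones PJSIP actually issues.

```python
import sqlite3


def prefix_range(prefix):
    """Return (low, high) so that low <= key < high matches keys with the prefix.

    Assumes a non-empty ASCII prefix and SQLite's default BINARY collation.
    """
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def query_plan(conn, sql, params):
    """Return the detail text of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: a text primary key gets an automatic index.
conn.execute("CREATE TABLE astdb (key VARCHAR(256) PRIMARY KEY, value VARCHAR(256))")
conn.executemany(
    "INSERT INTO astdb VALUES (?, ?)",
    ((f"/family{i % 100}/item{i}", str(i)) for i in range(100_000)),
)

prefix = "/family7/"
like_sql = "SELECT key, value FROM astdb WHERE key LIKE ?"
range_sql = "SELECT key, value FROM astdb WHERE key >= ? AND key < ?"

like_plan = query_plan(conn, like_sql, (prefix + "%",))
range_plan = query_plan(conn, range_sql, prefix_range(prefix))
print("LIKE :", like_plan)
print("range:", range_plan)
```

With SQLite's default settings, `LIKE` is case-insensitive while the primary-key index uses the case-sensitive BINARY collation, so the `LIKE` form generally cannot use that index and is reported as a scan, whereas the range form is reported as an index search. The two forms are only interchangeable when keys are compared case-sensitively and the prefix contains no `%` or `_` wildcards.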