Summary: ASTERISK-25411: PJSIP functionality becomes unresponsive after some time
Reporter: Rafik Djabrailov (the_tuss)
Labels:
Date Opened: 2015-09-22 03:57:32
Date Closed: 2020-01-14 11:14:06.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: 13.5.0
Frequency of Occurrence: Occasional
Related Issues:
Environment: CentOS 6.7 minimal x86 Linux 2.6.32-504.el6.i686 #1 SMP Wed Oct 15 03:02:07 UTC 2014 i686 i686 i386 GNU/Linux
Attachments:
( 0) 23-10-2015_debug.txt
( 1) acl.conf
( 2) backtrace-threads.txt
( 3) bad_pjsip_call.txt
( 4) bad_pjsip_options_request.txt
( 5) core-show-locks.txt
( 6) core-show-taskprocessors.txt
( 7) core-show-threads.txt
( 8) dialplan_excerpt.ael
( 9) good_pjsip_call.txt
(10) good_pjsip_options_request.txt
(11) modules.conf
(12) new_info.txt
(13) pjsip.conf
Description: Some time after Asterisk starts (anywhere from several hours to several days), calls to PJSIP trunks hang with no actions or errors. The 'pjsip set logger on' command executes, but no SIP messages appear in the console at all. 'pjsip show registrations' shows trunks as 'Registered', and 'pjsip show contacts' displays 'Avail' or sometimes 'Unavail' for contacts. It seems that res_pjsip hangs completely in its last successful state. 'module reload' executes but does not help. Only a full Asterisk restart returns PJSIP functionality to normal.
This problem may be linked to Internet outages, but when the connection comes back up, everything recovers automatically at the OS level (DHCP, DNS resolution, default gateway routing, etc.).
I think this problem may be related to sip_resolver.c and resolver.c not being executed by PJSIP for some reason.

As noted above, not a single SIP message appears in the console, only DEBUG messages (see bad_pjsip_call.txt and good_pjsip_call.txt).
There are also additional debug logs with OPTIONS (qualify) requests to the trunk (bad_pjsip_options_request.txt and good_pjsip_options_request.txt).
Some IPs and numbers are replaced with "XXX" for security reasons.
PJSIP configure string: "./configure --prefix=/usr/local --disable-sound --disable-resample --disable-video --disable-opencore-amr --enable-shared CFLAGS='-O2 -DNDEBUG'"
Comments:

By: Asterisk Team (asteriskteam) 2015-09-22 03:57:36.139-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rafik Djabrailov (the_tuss) 2015-09-22 04:04:55.210-0500

Successful and failed calls

By: Rafik Djabrailov (the_tuss) 2015-09-22 04:13:43.541-0500

Asterisk configuration files.

By: Rafik Djabrailov (the_tuss) 2015-09-22 04:26:16.059-0500

Successful and failed PJSIP OPTIONS requests to the trunk.

By: Rusty Newton (rnewton) 2015-09-25 12:46:35.735-0500

The next time the issue happens, please attach the same debug logs for that instance, along with the output of "core show channels", "pjsip show channels", "core show channel X", and "pjsip show channel X", where X is each of the hung channels.
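
For reference, the outputs requested above could be gathered in one pass with a small script. This is only a sketch: it assumes the `asterisk` binary is on the PATH and that `asterisk -rx` can still reach the running instance. The helper names (`build_commands`, `run_cli`, `collect`) and the output filename are hypothetical, not part of Asterisk.

```python
import subprocess

# CLI commands named in the request above. The per-channel templates are
# expanded once for every hung channel name supplied by the operator.
GLOBAL_COMMANDS = ["core show channels", "pjsip show channels"]
PER_CHANNEL_COMMANDS = ["core show channel {name}", "pjsip show channel {name}"]


def build_commands(hung_channels):
    """Return the CLI commands to run for the given hung channel names."""
    commands = list(GLOBAL_COMMANDS)
    for name in hung_channels:
        commands.extend(t.format(name=name) for t in PER_CHANNEL_COMMANDS)
    return commands


def run_cli(command, asterisk_bin="asterisk"):
    """Run one CLI command through `asterisk -rx` and return its text output."""
    try:
        result = subprocess.run(
            [asterisk_bin, "-rx", command],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        # A hung instance may never answer; record that instead of blocking.
        return "(no response within 30 seconds)\n"
    return result.stdout + result.stderr


def collect(hung_channels, out_path="hung_channels_debug.txt"):
    """Write each command and its output to one file suitable for attaching."""
    with open(out_path, "w") as out:
        for command in build_commands(hung_channels):
            out.write(f"=== {command} ===\n{run_cli(command)}\n")
    return out_path
```

Calling `collect(["PJSIP/mytrunk-00000001"])` (a made-up channel name) would write the four outputs to a single text file.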

By: Rafik Djabrailov (the_tuss) 2015-10-06 00:47:48.494-0500

Since the dialplan was slightly corrected, the issue has not reproduced.
What I've found is that whenever an inbound call involving a PJSIP channel is placed into a Queue (with IAX members) without a preceding Answer, Ringing (180) or Progress (183), it somehow "glitches" and hangs the PBX.
For example, Queue behaves as if its "timeout" option were set to 1 second (flooding the CLI), so the call never reaches the queue members, and the caller on the other side never hears ringback or MOH. Even after the original caller hangs up, Queue keeps trying to reach its members. After some time, channels (PJSIP especially) become unresponsive, which is the original issue of this ticket.
Adding a preceding Progress to the dialplan effectively resolves this behavior.

By: Rusty Newton (rnewton) 2015-10-07 18:01:47.249-0500

Can you reverse the configuration (remove the Progress) and reproduce the issue in order to provide the debug needed?

I still don't think we have enough information here to reproduce and investigate the issue.

By: Rafik Djabrailov (the_tuss) 2015-10-08 02:46:22.646-0500

I've removed Progress and any Answer/Playback/Read before Queue and made an incoming call. All the information, along with dialplan excerpts, is in 'new_info.txt'.
However, I can't reproduce the initial issue right now (Asterisk may need to stay in this state for longer), but I can reproduce the situation where the caller doesn't hear ringback/MOH and the Queue app keeps calling members indefinitely after the caller has hung up.
I think this leads to the original issue of my ticket.

By: Rafik Djabrailov (the_tuss) 2015-10-23 02:07:32.459-0500

Actually, I've reproduced this issue with "Progress" in dialplan and after relatively long uptime. Again, blank PJSIP log in CLI and no active channels on PBX (see 23-10-2015_debug.txt). PJSIP is loaded, but unresponsive - not accepting registrations, no qualifications on contacts, not accepting calls.

Server CPU is almost idle, approx 900 MB free RAM, no extensive tasks running.

Only reloading the Asterisk service restores normal functionality for the next few days or weeks.
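
One way to notice this state from outside is to send a SIP OPTIONS request to the server and check whether any response comes back at all. The following is a minimal sketch, not something used in this ticket: it assumes the PJSIP transport listens on UDP port 5060 and that no ACL silently drops the probe. Any SIP response (even a 401 or 404) is treated as "alive". The function names and the `probe` user are hypothetical.

```python
import socket
import uuid


def build_options(target_host, target_port, local_host, local_port):
    """Build a minimal SIP OPTIONS request (RFC 3261) as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 magic-cookie prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "", "",  # blank line terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def pjsip_answers(target_host, target_port=5060, timeout=5.0):
    """Return True if any SIP response arrives within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((target_host, target_port))
        local_host, local_port = sock.getsockname()
        sock.send(build_options(target_host, target_port, local_host, local_port))
        try:
            reply = sock.recv(4096)
        except OSError:
            # Covers both the timeout and an ICMP "port unreachable" error.
            return False
    return reply.startswith(b"SIP/2.0 ")
```

A monitoring job could call `pjsip_answers("192.0.2.10")` (a documentation address, replace with the PBX) and raise an alert after several consecutive `False` results, since a single lost UDP packet is not proof of a hang.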

By: JoshE (n8ideas) 2015-10-29 15:08:07.507-0500

We are also seeing the identical issue. I have not yet experimented with the dialplan and the Progress() application, but we're seeing exactly the same behavior. We have about 100 endpoints on pjsip that are all realtime to a remote database.

None of the commands respond and the pjsip stack is mostly unresponsive: endpoints, logger, etc. return nothing. The pjsip show registrations command does actually return properly.

Right now, it takes a full reboot to restore normal functionality.

By: Rusty Newton (rnewton) 2015-11-04 17:18:35.742-0600

[~n8ideas] do you want to attach a tarball with your own environment details, deadlock backtraces and debug logs? That would be helpful for comparison.

By: Rusty Newton (rnewton) 2015-11-05 09:59:22.701-0600

We suspect that you have a deadlock occurring within Asterisk. Please follow the instructions on the wiki [1] for obtaining debug relevant to a deadlock. Once you have that information, attach it to the issue. Be sure the instructions are followed exactly as the debug may otherwise not be useful.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock



By: Rafik Djabrailov (the_tuss) 2015-11-17 01:19:37.127-0600

Updated Asterisk to 13.6.0, compiled with DEBUG_THREADS, DONT_OPTIMIZE and BETTER_BACKTRACES.
The bug reproduced (same symptoms as before). I ran gdb, then immediately 'core show locks'.
Attaching backtrace-threads.txt, core-show-locks.txt, core-show-taskprocessors.txt, core-show-threads.txt to the ticket.
Thank you for your support in advance!
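
For anyone repeating these steps, the sequence can be scripted so that the backtrace and the CLI outputs are captured seconds apart. This is a rough sketch only: the gdb invocation (`thread apply all bt` in batch mode against the running PID) is an assumption about how the backtrace was taken, the `/usr/sbin/asterisk` path is a guess at a typical install location, and `deadlock_steps` and `gather` are hypothetical helper names. 'core show locks' requires a build with DEBUG_THREADS, as noted above.

```python
import subprocess


def deadlock_steps(pid, asterisk_bin="/usr/sbin/asterisk"):
    """Return (output filename, argv) pairs in the order they should run."""
    return [
        ("backtrace-threads.txt",
         ["gdb", "-ex", "thread apply all bt", "--batch", asterisk_bin, str(pid)]),
        ("core-show-locks.txt", [asterisk_bin, "-rx", "core show locks"]),
        ("core-show-taskprocessors.txt",
         [asterisk_bin, "-rx", "core show taskprocessors"]),
        ("core-show-threads.txt", [asterisk_bin, "-rx", "core show threads"]),
    ]


def gather(pid):
    """Run each step and write its combined stdout/stderr to the named file."""
    for filename, argv in deadlock_steps(pid):
        with open(filename, "w") as out:
            try:
                subprocess.run(argv, stdout=out, stderr=subprocess.STDOUT,
                               timeout=120)
            except subprocess.TimeoutExpired:
                # A deadlocked instance may never answer the CLI command.
                out.write("(command did not finish within 120 seconds)\n")
```

Running `gather(pid)` as root, with `pid` set to the Asterisk process ID, would produce the same four files that are attached to this ticket.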

By: Rusty Newton (rnewton) 2015-11-19 18:04:22.350-0600

I see that 'core show locks' output was empty? Was that a mistake when creating the .txt file or was that the case as far as you know?

By: Richard Mudgett (rmudgett) 2015-11-19 18:18:30.850-0600

This might be fixed by ASTERISK-25546

By: Rafik Djabrailov (the_tuss) 2015-11-20 00:35:51.814-0600

Yes, the output of 'core show locks' was definitely empty. I ran the command twice within a one-minute interval.

By: Rusty Newton (rnewton) 2015-11-21 10:37:26.066-0600

Rafik, please perform the same tests and gather the same debug as you previously did but with the latest pull from Asterisk 13 GIT branch via Gerrit.

https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

That way you can test to see if the fix [~rmudgett] mentioned solves the problem.

By: Asterisk Team (asteriskteam) 2015-12-05 12:00:20.000-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines