[Home]

Summary:ASTERISK-26145: pjsip: Deadlock with suspend + masquerade + indicate
Reporter:Ross Beer (rossbeer)Labels:
Date Opened:2016-06-24 05:47:47Date Closed:2016-08-24 11:02:24
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_pjsip Resources/res_pjsip
Versions:13.10.0-rc1 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-26388 PJSIP stops working, the only clue being the log message "taskprocessor.c: The 'pjsip/distributor-0000XXXX' task processor queue reached 500 scheduled tasks."
is related toASTERISK-26284 pjsip: Deadlock with suspend + masquerade (Alternate scenarios)
Environment:Fedora 23Attachments:( 0) 2016-06-28_backtrace-threads.txt
( 1) 2016-06-28_taksprocessor.txt
( 2) 2016-06-28_threads.txt
( 3) backtrace-threads_20160627_CLEAN.txt
( 4) backtrace-threads_2016-06-30_CLEAN.txt
( 5) backtrace-threads_2016-07-06_CLEAN.txt
( 6) backtrace-threads_2016-07-07_12_00_CLEAN.txt
( 7) backtrace-threads_20160711_CLEAN.txt
( 8) backtrace-threads_2016-07-12_CLEAN.txt
( 9) backtrace-threads_Deadlock_CLEAN.txt
(10) core-show-taskprocessors_2016-06-30_CLEAN.txt
(11) core-show-taskprocessors_2016-07-06_CLEAN.txt
(12) core-show-taskprocessors_2016-07-07_12_00_CLEAN.txt
(13) core-show-taskprocessors_20160711_CLEAN.txt
(14) core-show-threads_2016-06-30.txt
(15) core-show-threads_2016-07-06_CLEAN.txt
(16) core-show-threads_2016-07-07_12_00.txt
(17) core-show-threads_20160711.txt
(18) TaskProcessor_20160803.txt
(19) taskprocessor_CLEAN.txt
(20) taskprocessors_CLEAN.txt
(21) threads_CLEAN.txt
Description:Asterisk stops processing SIP packets due to the task processor exceeding 500 tasks.

From the below, it looks like the RTP streams are not closing tasks related to RTP timeouts.

[2016-06-24 11:35:39] WARNING[313]: taskprocessor.c:887 taskprocessor_push: The 'subm:rtp_topic-000000de' task processor queue reached 500 scheduled tasks.
[2016-06-24 11:36:08] NOTICE[313]: res_pjsip_sdp_rtp.c:165 rtp_check_timeout: Disconnecting channel 'PJSIP/<<Username>>-000040dd' for lack of RTP activity in 65 seconds

No core dump is created as asterisk doesn't crash.
Comments:By: Asterisk Team (asteriskteam) 2016-06-24 05:47:48.643-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-06-24 10:37:05.384-0500

When this happens, grab the following:

* A debug log.
* A backtrace.
* 'core show taskprocessors'
* 'core show threads'


By: Ross Beer (rossbeer) 2016-06-27 04:47:58.901-0500

I have attached as requested.

I have a feeling this related to STUN looking at the threads.

STUN is used to identify the public IP address and ports, therefore should this lookup only happen at startup instead of every dialog setup?

By: Joshua C. Colp (jcolp) 2016-06-27 06:30:15.736-0500

It is used to identify the public IP address and port for the specific RTP session, it can't be looked up once as each session can have a different external mapped port.

By: Ross Beer (rossbeer) 2016-06-27 06:36:04.023-0500

So it's looking up the ports and IP addresses which Asterisk has assigned?

I've disabled ICE and the STUN server in the rtp.conf file, is this enough to stop these lookups?

PJSIP is set to ice_support=no

By: Joshua C. Colp (jcolp) 2016-06-27 06:46:15.664-0500

No, it's looking up what a remote STUN server sees as the source IP address and port which will be different if going through a NAT. Disabling ICE at both of those levels will stop the usage you see. You could also remove the STUN server configured and not use STUN at all for RTP.

By: Ross Beer (rossbeer) 2016-06-27 07:22:01.919-0500

Ok, Asterisk is on public IP addresses and not behind NAT. Therefore I have disabled this, on doing so I have reloaded 'res_rtp_asterisk.so' however, I am still seeing the following in the CLI:

STUN Debugging Enabled
STUN Packet, msg Binding Request (0001), length: 8
Found STUN Attribute Change Request (0003), length 4
Ignoring STUN attribute Change Request (0003), length 4
STUN Bind Request, username: <none>
STUN Packet, msg Binding Response (0101), length: 36
Found STUN Attribute Mapped Address (0001), length 8
Ignoring STUN attribute Mapped Address (0001), length 8
Found STUN Attribute Source Address (0004), length 8
Ignoring STUN attribute Source Address (0004), length 8
Found STUN Attribute Changed Address (0005), length 8
Ignoring STUN attribute Changed Address (0005), length 8
Dunno what to do with STUN message 0101 (Binding Response)

Does this module require a full restart to stop the STUN lookup?

Hopefully this will resolve the issue with the taskprocessor.


By: Joshua C. Colp (jcolp) 2016-06-27 12:46:04.720-0500

It shouldn't require a full restart, but I haven't tested changing stuff like that.

By: Joshua C. Colp (jcolp) 2016-06-27 12:47:15.528-0500

I'll leave this assigned to you for feedback to see how not using STUN works out.

By: Ross Beer (rossbeer) 2016-06-28 08:44:35.093-0500

Ok, the STUN theory hasn't resolved the issues.

In the backtrace I see that there is a HEP RTP task, could these be getting stuck?

Also HEP isn't in use so not sure why these are being set.

By: Joshua C. Colp (jcolp) 2016-06-28 08:55:21.592-0500

I don't see it getting stuck, but it can take awhile to process depending on the system when it tries to retrieve additional information. You could disable loading of the HEP modules.

By: Ross Beer (rossbeer) 2016-07-06 08:25:26.139-0500

Attaching additional information after a deadlock

By: Ross Beer (rossbeer) 2016-07-07 04:06:30.905-0500

I'm wondering if this issue is related to ASTERISK-26177 we use ODBC to look up information in the dial plan.

By: Joshua C. Colp (jcolp) 2016-07-07 05:10:40.017-0500

Your backtraces don't show anything ODBC related, unless you've removed those parts from it.

By: Ross Beer (rossbeer) 2016-07-07 06:13:33.425-0500

Further deadlock information

By: Ross Beer (rossbeer) 2016-07-11 07:03:24.210-0500

This issue is occurring more and more.

When using commits up to ff018e28a0fd6e0671ef6c5077ef2e99e93951ac this issue didn't happen.

By: Ross Beer (rossbeer) 2016-07-12 03:45:24.862-0500

When testing asterisk commit 5ee205d8bb0dfbae5be89661aa3a787fbdf9986b and PJSIP commit 5378 asterisk locks almost imediatly causing all SIP packets to stop being processed.

Please see attached backtrace (backtrace-threads_2016-07-12_CLEAN.txt)

By: Ross Beer (rossbeer) 2016-07-14 04:07:17.859-0500

When running asterisk with debug locks asterisk is dramatically delayed in processing and therefore can not be used to identify the issue

By: Ross Beer (rossbeer) 2016-07-29 08:26:39.864-0500

Could this issue be similar to ASTERISK-26229

Currently, ODBC voicemail is in use and has a lot of peers.

Does the code up for review also work for ODBC storage? (https://gerrit.asterisk.org/#/c/3318/)

Also looking on gerrit, this may also offer a fix? https://gerrit.asterisk.org/#/c/3320/

By: Ross Beer (rossbeer) 2016-08-01 05:30:43.982-0500

Additional information for the task processor issue

By: Ross Beer (rossbeer) 2016-08-03 11:34:47.567-0500

Further issues with taskprocessors even with patches for MWI which have helped but not resolved the issue

By: Ross Beer (rossbeer) 2016-08-04 04:23:32.278-0500

The patches being committed soon on gerrit:

https://gerrit.asterisk.org/#/c/3318/
https://gerrit.asterisk.org/#/c/3320/

Have helped to mitigate this issue with MWI, however, I am still getting the following task processor fill up less frequently:

pjsip/distributor-000000be                        220910        507         20        450        500

Is there any way to identify the tasks that are backing up in the taskprocessor so that we can work out the cause?

By: Joshua C. Colp (jcolp) 2016-08-04 04:57:28.960-0500

Your backtraces already show why it is piling up, and the title of this issue includes that.

A masquerade operation is happening which is suspending the distributor from continuing, the masquerade operation deadlocks when doing an indicate, the distributor is never unsuspended. Why that's happening requires deep investigation into things with masquerades being one of the most complex operations in Asterisk.

By: Ross Beer (rossbeer) 2016-08-07 13:41:14.484-0500

There is a patch up for review:

https://gerrit.asterisk.org/#/c/3424/

By: Ross Beer (rossbeer) 2016-08-09 15:27:06.957-0500

The patch has been running for a few days and appears to fix the issue. Can it be reviewed and committed?

By: Joshua C. Colp (jcolp) 2016-08-09 15:36:02.464-0500

As the change is in Gerrit it will be reviewed, I have no time frame on when that will occur and when it will go in.

By: Richard Mudgett (rmudgett) 2016-08-24 11:02:25.132-0500

The patch was merged.

By: Marcelo Bello (mbello) 2016-09-27 13:23:27.888-0500

Could this be released in 13.11.3 (as in VERY soon)? I just saw that 13.11.2 was released with a PJSIP fix for a minor issue, while this one was left behind. Right now PJSIP is quite unstable without this fix, we are having to move all our extensions to CHAN_SIP because PJSIP is crashing way too often.





By: Carl Fortin (phonefxg) 2016-10-19 09:18:42.392-0500

This bug is a vicious one. I have upgraded 13.11.2 to 14.0.2 to fix the problem and as I can see the target release is for the 13.12 and 14.1.
Meanwhile, what version can I download to fix the issue?


By: Joshua C. Colp (jcolp) 2016-10-19 09:20:55.005-0500

There is no currently released version with the fix, there are just the release candidates for 13.12.0 and 14.1.0 which should hopefully reach release status soon.

By: Carl Fortin (phonefxg) 2016-10-19 09:23:57.319-0500

Can I install the release candidate then? Is it too risky?


By: Joshua C. Colp (jcolp) 2016-10-19 09:26:32.401-0500

That's up to you. I haven't seen any reported problems yet and continuous integration hasn't uncovered anything.

By: Carl Fortin (phonefxg) 2016-10-19 09:30:28.469-0500

Thanks Joshua! I'll give a try because it cannot be any worse right now.
It's frustrating because even after testing each release in a test environment the bug appears in production.
Test machine cannot tests everything I guess.




By: Carl Fortin (phonefxg) 2016-11-15 10:16:39.494-0600

I just got another taskprocessor.c: The 'app_voicemail' task processor queue reached 500 scheduled tasks. after running Asterisk 14.1.0-rc1 for 3 weeks.
PJSIP stopped working with nothing in the console.I had time to save the task_processor output before restarting asterisk.

By: Joshua C. Colp (jcolp) 2016-11-15 10:20:12.418-0600

[~phonefxg] Please open a new issue.

By: Friendly Automation (friendly-automation) 2016-12-02 07:06:23.355-0600

Change 4550 merged by Joshua Colp:
pjsip: Fix deadlock with suspend taskprocessor on masquerade

[https://gerrit.asterisk.org/4550|https://gerrit.asterisk.org/4550]