Summary: ASTERISK-28213: res_pjsip: Threads pile up needlessly when AOR is blocked
Reporter: Ross Beer (rossbeer)
Labels: patch pjsip
Date Opened: 2018-12-17 08:37:32.000-0600
Date Closed: 2019-01-24 05:51:18.000-0600
Priority: Minor
Regression?:
Status: Closed/Complete
Components: Channels/chan_pjsip
Versions: 13.24.0
Frequency of Occurrence: Frequent
Related Issues:
Environment: CentOS 7.6
Attachments:
( 0) ASTERISK-28213_remove_all_monitors.diff
( 1) ASTERISK-28213.diff
( 2) ASTERISK-28213-followup.diff
Description: A deadlock occurs on AOR removal when a transport has gone away and a NOTIFY is being sent to the endpoint. The AOR is blocked in mariadb and the second thread blocks on the first.

The full backtrace has been sent to George Joseph as it contains private information.
Comments:
By: Asterisk Team (asteriskteam) 2018-12-17 08:37:34.073-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2018-12-17 10:11:00.725-0600

I've marked this down to Minor, as the deadlock itself comes from the database and the threads that pile up are a result of that. One possible improvement is to make this handling a bit smarter so the threads don't pile up, but if the database itself is blocked this would still cause potential ripples.

By: Kevin Harwell (kharwell) 2019-01-11 14:07:52.564-0600

[~rossbeer], is the endpoint registering multiple contacts?

By: Ross Beer (rossbeer) 2019-01-11 15:07:22.874-0600

Asterisk is configured to remove existing contacts so there is only one contact for each endpoint.

Quite often phones remove the registration before re-registering, but not all endpoints do this.

By: Kevin Harwell (kharwell) 2019-01-11 15:47:57.135-0600

What's your _max_contacts_ set to (or can you just post the config for an endpoint/aor)?

Also would it be possible to get a log file with sip tracing enabled during an occurrence of the problem? Or at the very least a sip trace of a few registrations of a problematic endpoint?

If you are able to collect this info, and don't want to post it here feel free to email it directly to me (unless you have already emailed that stuff to George then I can get it from him later - Note, I did see the backtraces).

Thanks!

By: Kevin Harwell (kharwell) 2019-01-11 17:23:08.972-0600

In the meantime, if you want, give [^ASTERISK-28213.diff] a try and see if it helps any.

By: Ross Beer (rossbeer) 2019-01-14 04:17:15.413-0600

max_contacts is set to 1

I have applied the patch to a gateway which often has the issue and will let you know the outcome.

By: Kevin Harwell (kharwell) 2019-01-17 11:08:23.250-0600

Had a chance to check the patch out yet? Did it help to free up some of the threads being blocked in this particular scenario?

By: Ross Beer (rossbeer) 2019-01-21 10:06:46.480-0600

I believe the patch resolves the issue. Looking at change https://gerrit.asterisk.org/#/c/asterisk/+/10886/ I think this would also have helped to lessen the issue, though I have not tested the latter.

By: Friendly Automation (friendly-automation) 2019-01-24 05:51:20.149-0600

Change 10908 merged by Joshua C. Colp:
res_pjsip_registrar: mitigate blocked threads on reliable transport shutdown

[https://gerrit.asterisk.org/10908|https://gerrit.asterisk.org/10908]

By: Friendly Automation (friendly-automation) 2019-01-24 05:51:46.372-0600

Change 10906 merged by Joshua C. Colp:
res_pjsip_registrar: mitigate blocked threads on reliable transport shutdown

[https://gerrit.asterisk.org/10906|https://gerrit.asterisk.org/10906]

By: Friendly Automation (friendly-automation) 2019-01-24 05:54:21.430-0600

Change 10909 merged by Joshua C. Colp:
res_pjsip_registrar: mitigate blocked threads on reliable transport shutdown

[https://gerrit.asterisk.org/10909|https://gerrit.asterisk.org/10909]

By: Kevin Harwell (kharwell) 2019-01-31 13:52:45.921-0600

[~rossbeer] I believe some locking needed to be added to the original patch to make it more consistent. Give the [^ASTERISK-28213-followup.diff] patch a try and see if that helps.

Note, the follow up patch depends on the first patch, so the first patch needs to be applied first and then the followup one.

By: Ross Beer (rossbeer) 2019-02-07 04:57:51.539-0600

The first patch has made its way into the released code base; however, the follow-up patch has not.

Can this be added?

By: Asterisk Team (asteriskteam) 2019-02-07 04:57:52.098-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Kevin Harwell (kharwell) 2019-02-07 09:31:06.941-0600

[~rossbeer] Yep no problem. Pushed up to gerrit for review.

https://gerrit.asterisk.org/#/c/asterisk/+/10968/

By: Friendly Automation (friendly-automation) 2019-02-08 09:36:08.386-0600

Change 10968 merged by Joshua C. Colp:
res_pjsip_registrar: lock transport monitor when setting 'removing' flag

[https://gerrit.asterisk.org/10968|https://gerrit.asterisk.org/10968]
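
As an illustration only, the idea named in the change title above (setting a 'removing' flag while holding the monitor's lock) can be sketched as a check-and-set that lets exactly one thread proceed. This is not the merged C patch: it is a minimal Python sketch, and the `TransportMonitor` class, `claim_removal`, `on_transport_shutdown`, and the sample AOR and contact values are all hypothetical names chosen for the example.

```python
import threading


class TransportMonitor:
    """Hypothetical stand-in for a per-contact transport monitor."""

    def __init__(self, aor_name, contact_name):
        self.aor_name = aor_name
        self.contact_name = contact_name
        self.removing = False          # set once a removal has been claimed
        self._lock = threading.Lock()  # guards the 'removing' flag

    def claim_removal(self):
        """Return True for exactly one caller; every later caller gets False."""
        with self._lock:
            if self.removing:
                return False
            self.removing = True
            return True


def on_transport_shutdown(monitor, removal_queue):
    # Only the thread that wins the flag queues the (possibly slow) removal.
    # The others return at once instead of waiting behind it.
    if monitor.claim_removal():
        removal_queue.append((monitor.aor_name, monitor.contact_name))


def simulate(n_threads=8):
    """Fire several shutdown callbacks at one monitor and return what was queued."""
    monitor = TransportMonitor("alice", "sip:alice@192.0.2.10")
    removal_queue = []
    threads = [
        threading.Thread(target=on_transport_shutdown, args=(monitor, removal_queue))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return removal_queue
```

Without the lock around the test and the assignment, two threads could both read the flag as unset and both queue a removal, which appears to be the kind of inconsistency the follow-up patch was meant to close.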

By: Friendly Automation (friendly-automation) 2019-02-08 09:38:40.221-0600

Change 10970 merged by Friendly Automation:
res_pjsip_registrar: lock transport monitor when setting 'removing' flag

[https://gerrit.asterisk.org/10970|https://gerrit.asterisk.org/10970]

By: Friendly Automation (friendly-automation) 2019-02-08 09:50:04.133-0600

Change 10969 merged by Joshua C. Colp:
res_pjsip_registrar: lock transport monitor when setting 'removing' flag

[https://gerrit.asterisk.org/10969|https://gerrit.asterisk.org/10969]

By: Friendly Automation (friendly-automation) 2019-02-11 05:09:08.714-0600

Change 10975 merged by Joshua C. Colp:
res_pjsip_registrar: lock transport monitor when setting 'removing' flag

[https://gerrit.asterisk.org/10975|https://gerrit.asterisk.org/10975]

By: Friendly Automation (friendly-automation) 2019-02-11 05:09:22.139-0600

Change 10976 merged by Joshua C. Colp:
res_pjsip_registrar: lock transport monitor when setting 'removing' flag

[https://gerrit.asterisk.org/10976|https://gerrit.asterisk.org/10976]

By: Kevin Harwell (kharwell) 2019-02-13 15:43:26.806-0600

The previous two patches, [^ASTERISK-28213.diff] and [^ASTERISK-28213-followup.diff] have been merged and _should_ go out into the next release (13.25.0 and 16.2.0).

This patch, [^ASTERISK-28213_remove_all_monitors.diff], is yet another followup patch that intends to mitigate the problem further in cases where there can be multiple transport monitors for a single aor/contact combination.

[~rossbeer] please test [^ASTERISK-28213_remove_all_monitors.diff] when you get a chance and let me know how things go.

Thanks!
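
For readers unfamiliar with the scenario described in the comment above, here is a rough sketch of what "multiple transport monitors for a single aor/contact combination" could look like and how all of them might be dropped in one pass. It is an assumption-laden Python illustration, not the contents of [^ASTERISK-28213_remove_all_monitors.diff]; the `Monitor` record, its fields, and `remove_all_monitors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical record tying a transport to one aor/contact pair."""
    transport: str
    aor_name: str
    contact_name: str


def remove_all_monitors(monitors, aor_name, contact_name):
    """Split monitors into (kept, removed), removing every match for the pair.

    Removing only the first match would leave other transports still
    watching the same contact, so each of them could trigger its own
    removal attempt later.
    """
    kept, removed = [], []
    for monitor in monitors:
        if monitor.aor_name == aor_name and monitor.contact_name == contact_name:
            removed.append(monitor)
        else:
            kept.append(monitor)
    return kept, removed
```

A caller would pass the full monitor list together with the AOR and contact being removed, then keep only the first element of the returned pair.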

By: Friendly Automation (friendly-automation) 2019-03-05 07:08:36.292-0600

Change 11014 merged by Joshua Colp:
res_pjsip_registrar: blocked threads on reliable transport shutdown take 3

[https://gerrit.asterisk.org/c/asterisk/+/11014|https://gerrit.asterisk.org/c/asterisk/+/11014]

By: Friendly Automation (friendly-automation) 2019-03-05 07:17:19.383-0600

Change 11015 merged by Joshua Colp:
res_pjsip_registrar: blocked threads on reliable transport shutdown take 3

[https://gerrit.asterisk.org/c/asterisk/+/11015|https://gerrit.asterisk.org/c/asterisk/+/11015]

By: Friendly Automation (friendly-automation) 2019-03-05 07:17:44.248-0600

Change 11016 merged by Joshua Colp:
res_pjsip_registrar: blocked threads on reliable transport shutdown take 3

[https://gerrit.asterisk.org/c/asterisk/+/11016|https://gerrit.asterisk.org/c/asterisk/+/11016]