ASTERISK-25455: Deadlock of PJSIP realtime over res_config

[Home]

Summary: ASTERISK-25455: Deadlock of PJSIP realtime over res_config_pgsql

Reporter: mdu113 (mdu113) Labels:

Date Opened: 2015-10-08 15:09:19 Date Closed: 2015-10-15 14:43:27

Priority: Major Regression?

Status: Closed/Complete Components: Channels/chan_pjsip Resources/res_config_pgsql Resources/res_pjsip

Versions: 11.20.0 13.5.0 13.6.0 Frequency of
Occurrence Frequent

Related
Issues:

Environment: linux 64bit kernel 3.10.17 distro slackware64 14.1 Attachments: ( 0) backtrace-threads2.txt
( 1) core-show-locks.txt
( 2) core-show-locks2.txt
( 3) lock-bt-full.txt
( 4) res_config_pgsql.c-connlock.diff
( 5) res_config_pgsql.c-connlock2.diff

Description: Asterisk intermittently deadlocks during initial loading of pjsip endpoints. It seems to depend on number of endpoints loaded. With small number of endpoints (up to a 100) it loads ok most of the time. With several hundreds of endpoints it deadlocks sometimes. With several thousand of endpoints it deadlocks most of the time.
Attached are backtrace and output of "core show locks". The testing was done on asterisk 13.6.0-rc2

Comments: By: Asterisk Team (asteriskteam) 2015-10-08 15:09:21.187-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Richard Mudgett (rmudgett) 2015-10-08 15:49:40.042-0500

[^core-show-locks.txt] - The file is not showing a classic deadlock. It is showing a thread that is blocking for an unknown reason holding a lock that other threads want to get. If you had {{BETTER_BACKTRACES}} enabled using {{make menuselect}} the backtrace in the file would be able to tell us where the thread got the lock and we might be able to infer where the thread is blocked.

[^lock-bt-full.txt] - This file is not showing much. It would have been better if you had captured the {{thread apply all bt}} command output that can be mapped with the [^core-show-locks.txt] output. As it is I don't know what the blocked thread shown in [^core-show-locks.txt] is doing.

Please follow [1] to get the needed information for debugging deadlocks.
[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
By: Richard Mudgett (rmudgett) 2015-10-08 15:52:17.279-0500

We suspect that you have a deadlock occurring within Asterisk. Please follow the instructions on the wiki [1] for obtaining debug relevant to a deadlock. Once you have that information, attach it to the issue. Be sure the instructions are followed exactly as the debug may otherwise not be useful.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock

By: mdu113 (mdu113) 2015-10-08 17:21:47.087-0500

Here's new files.

By: Richard Mudgett (rmudgett) 2015-10-08 17:54:46.134-0500

According to the backtrace the thread holding the reentrancy lock {{pgsql_lock}} is inside the {{libpq.so}} library and presumably waiting for a reply from the database since it is waiting in {{poll()}}. You may have a connectivity problem with your database or there is a bug in the {{libpq.so}} library itself.

When you enable debug logging do you see the following message?
{noformat}
PostgreSQL query attempt %d failed, trying again
{noformat}

It could be that you have so many database queries on startup that they just take more time to complete.

By: mdu113 (mdu113) 2015-10-09 11:08:12.318-0500

Your comment inspired me to check my postgresql libs installation and I found some problems, likely residual from previous experiments.
I've reinstalled the libs and now I can't replicate the problem. It loads ok and doesn't deadlock.
Please close the issue. Sorry for the noise.

By: Joshua C. Colp (jcolp) 2015-10-09 11:11:17.530-0500

Closed per reporter comment.
By: mdu113 (mdu113) 2015-10-13 11:14:10.338-0500

Unfortunately I was too quick to dismiss this issue. It still happens, just less often.
I took some time to analyze the problem myself and I think I found the problem.
I believe a code path exists that attempts to use pgsql connection without locking pgsql_lock. I believe what happens during that deadlock that I see is two concurrent threads are both attempting to send query to pgsql, one of the thread is using a code path without locking pgsql_lock. If they managed to send queries at the same time, it seems postgres ignores one of the queries and replies only to the one of them. If it happens so that the thread holding the lock didn't receive the reply it will wait for it (and hold the lock) forever (or at least for very long time), thus completely blocking all access to db.

I found one such code path. In res_config_pgsql.c find_table() function is issuing querying without acquiring the lock. I've attached a simple patch that fixing it.
I'm not sure if more code paths like this exist, but after applying this patch I haven't seen the problem so far.
By: Asterisk Team (asteriskteam) 2015-10-13 11:14:11.175-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.
By: Richard Mudgett (rmudgett) 2015-10-13 11:46:36.433-0500

Thanks for the contribution! If you'd like your contribution to be included faster, you should submit your patch for code review by the Asterisk Developer Community. To do so, please follow the Code Review [1] instructions on the wiki. Be sure to:
* Verify that your patch conforms to the Coding Guidelines [2]
* Review the Code Review Checklist [3] for common items reviewers will look for
* If necessary, provide tests for the Asterisk Test Suite that verify the correctness of your patch [4]
* As this is a new feature, please read the New Feature Guidelines [5]
* Make sure your new feature applies cleanly to Asterisk trunk

When ready, submit your patch and any tests to Review Board [6] for code review.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Code+Review
[2] https://wiki.asterisk.org/wiki/display/AST/Coding+Guidelines
[3] https://wiki.asterisk.org/wiki/display/AST/Code+Review+Checklist
[4] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Test+Suite+Documentation
[5] https://wiki.asterisk.org/wiki/display/AST/New+Feature+Guidelines
[6] https://wiki.asterisk.org/wiki/display/AST/Review+Board+Usage

Since this is a bug fix patch, it should be applied to the v11, v13, and master branches.
https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage#GerritUsage-BranchCherry-Picking

With your patch there is now a deadlock potential between find_table() and unload_module() involving the locking order of the psql_tables list lock and the pgsql_lock. I think the deadlock potential can be avoided simply by destroying the cached table info after the pgsql_lock is released in unload_module().
By: mdu113 (mdu113) 2015-10-13 14:15:23.449-0500

I guess you're right. To avoid deadlock it should be enough to unlock pgsql_lock before destroying cached table.
Then there's a possibility that after unlocking pgsql_lock in unload_module(), the find_table() or another function waiting for pgsql_lock will reconnect to postgres in _pgsql_exec().
I guess the logic then should be:

int unloading=0;

_pgsql_exec() {
if (unloading) return NULL;
...
}

find_tables() {
lock pgsql_tables;
if (unloading) return NULL;
...
}

unload_module() {
lock pgsql_tables;
lock pgsql_lock;
unloading=1;
destroy cached table;
destroy pgsql connection;
unlock pgsql_lock;
unlock pgsql_tables;
}

or something like that

By: Richard Mudgett (rmudgett) 2015-10-13 16:19:56.814-0500

Don't work too hard at fixing the deadlock because that module really does not unload safely anyway. Fixing the module so it unloads safely should be done by another patch. When unload_module() returns zero there cannot be any threads executing code within the module without expecting to get a segfault because the code segments are no longer loaded.
By: mdu113 (mdu113) 2015-10-14 10:57:28.525-0500

I'm sorry, I've tried to follow all those guidelines, but that seems like complicated process and very overkill for such a simple patch. I'm attaching the 2nd version that also unlocks pgsql_lock before destroying cached table. Please review/apply at your convenience.
Thanks.
By: Richard Mudgett (rmudgett) 2015-10-14 14:21:22.898-0500

Patch up on gerrit:
https://gerrit.asterisk.org/#/c/1439/ v11
https://gerrit.asterisk.org/#/c/1440/ v13
https://gerrit.asterisk.org/#/c/1441/ master