[Home]

Summary:ASTERISK-27285: AstDB Locks taking a while to unlock
Reporter:Zach R (zrothy)Labels:pjsip
Date Opened:2017-09-21 16:09:08Date Closed:2020-01-14 11:14:00.000-0600
Priority:MinorRegression?
Status:Closed/CompleteComponents:Core/AstDB
Versions:13.10.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Centos 6.9 Final 64 bit, using chan_pjsip with realtime for AORs, Endpoints, Auths, Global. Contacts are in the ASTDBAttachments:( 0) astdb_stuff.tar.gz
( 1) locks_no_better_stack
( 2) locks_with_better_stack_trace
Description:I've noticed when I use the AstDB for chan_pjsip's contacts for a large amount of registrations per second the sorcery/contacts task processor will eventually overflow and lock up. I first traced what that task processor does in the code and saw it handles callbacks for deleting and creating the contacts.

I then tested to see if it was teh AstDB write time that was causing it to back up, as I read that asterisk and/or sqlite3 lock the file/table to avoid corruption or any race conditions for the sqlite3 database. While using realtime on res_odbc I did not have any issues even when sending 40 registrations/calls per second using sipp. This doesn't count the response to the 401 with the nonce.

After that I added some code using the clock function from C to time how long it was taking in main/db.c for db_puts. I put one set of timestamps/clock objects being set surrounding (on the outside) of the lock done and one on the inside.

I noticed that it seemed it waited quite a while for the lock to be released sometimes, while the inside of the lock took at most 0.01 seconds.

I've attached the scenario file I used for sipp, the logs I printed out in console with the timing, as well as the code I changed to do the time logs.

Some of these took as long as 0.5 seconds around the lock but 0.0 inside it.

Ex:
{noformat}
[Sep 21 17:06:39] NOTICE[11520]: db.c:345 ast_db_put: ZACH - (Outside Lock) Total ASTDB Time: 0.690000 (Inside Lock) Total ASTDB Time: 0.000000
{noformat}
An ex of the cmd I used for sipp and the scenario to send the registrations to the server:
{noformat}
sipp $ASTERISK_SERVER_IP -sf REGISTER_client.xml -inf REGISTER_client.csv -trace_msg -trace_err -r 40
{noformat}
You can modify -r to be a new rate of calls/sec if and add -m to exit after N successful calls.
Comments:By: Asterisk Team (asteriskteam) 2017-09-21 16:09:09.929-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Zach R (zrothy) 2017-09-21 16:11:14.935-0500

A tarball file containing the logs and scenario files I was using when I found this issue.

By: Kevin Harwell (kharwell) 2017-10-24 17:35:32.233-0500

I ran your scenario on a Ubuntu 14.04 machine and did not see the same results (meaning I did not see the slow slow down).

Would it be possible for you to try another OS to see if there was a difference? Are you running in a VM or a container? If so what happens if not in one?

Another thing you could try is recompiling Asterisk with DEBUG_THREADS compiler flag turned on (menuselect->Compiler Flags->DEBUG_THREADS). Enabling this flag will impact performance, but it can output additional logging from the mutex unlock code. Maybe something will turn up?

By: Asterisk Team (asteriskteam) 2017-11-08 12:00:01.200-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Zach R (zrothy) 2017-12-12 16:44:16.694-0600

I'm attaching the output of core show locks when better stack trace and debug threads are enabled. Though I found when these are enabled I am not seeing the same times from the timing code I previously used to output the lock times. Without the two flags I did see some earlier along with the task processor for contacts slowly reaching its high water limits.

For these tests teh astdb is written to a ramdisk rather than the SSDs this server is running on. The server itself is not a VM, but a physical dedicated machine, and I don't think running a different OS to test is an option at the moment.

The only change in configuration I have at the moment is adding caching for aors, endpoints and auths.

By: Asterisk Team (asteriskteam) 2017-12-12 16:44:17.054-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Zach R (zrothy) 2017-12-12 16:46:17.313-0600

I've attached both outputs of core show locks while I ran the sipp profile/test at a rate of 60 per second instead of 40 due to configuration changes mentionedi n my last comment.

The file no_better_stack has BETTER_STACKTRACE flag disabled, while the other file has it enabled.

By: Joshua C. Colp (jcolp) 2017-12-22 07:51:56.839-0600

[~seanbright] has done work to improve the performance of the astdb in relation to PJSIP contacts and I've got a patch up at ASTERISK-26806 which rewrites OPTIONS support and improves performance a lot.

If you use the latest version of Asterisk with Sean's changes and my patch does the performance improve?

By: Zach R (zrothy) 2017-12-22 11:01:56.550-0600

Is the patch applicable to the most recent head of the 13 branch? When I went to apply it failed on error: patch failed: res/res_pjsip/pjsip_options.c:1003.

The current commit I am at is ce3d56920b15facbb64b3caf0d823a3f57c0dded. In the meanwhile I'll see if I can just manually apply it rather than rely on git apply.

By: Joshua C. Colp (jcolp) 2017-12-22 11:13:30.225-0600

I've now got a review up at https://gerrit.asterisk.org/#/c/7710/ for eliciting feedback which applies to the head of 13.

By: Asterisk Team (asteriskteam) 2018-01-05 12:00:02.171-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines