Summary: | ASTERISK-27285: AstDB Locks taking a while to unlock | ||
Reporter: | Zach R (zrothy) | Labels: | pjsip |
Date Opened: | 2017-09-21 16:09:08 | Date Closed: | 2020-01-14 11:14:00.000-0600 |
Priority: | Minor | Regression? | |
Status: | Closed/Complete | Components: | Core/AstDB |
Versions: | 13.10.0 | Frequency of Occurrence | Constant |
Related Issues: | |||
Environment: | Centos 6.9 Final 64 bit, using chan_pjsip with realtime for AORs, Endpoints, Auths, Global. Contacts are in the ASTDB | Attachments: | ( 0) astdb_stuff.tar.gz ( 1) locks_no_better_stack ( 2) locks_with_better_stack_trace |
Description: | I've noticed when I use the AstDB for chan_pjsip's contacts for a large amount of registrations per second the sorcery/contacts task processor will eventually overflow and lock up. I first traced what that task processor does in the code and saw it handles callbacks for deleting and creating the contacts.
I then tested to see if it was teh AstDB write time that was causing it to back up, as I read that asterisk and/or sqlite3 lock the file/table to avoid corruption or any race conditions for the sqlite3 database. While using realtime on res_odbc I did not have any issues even when sending 40 registrations/calls per second using sipp. This doesn't count the response to the 401 with the nonce. After that I added some code using the clock function from C to time how long it was taking in main/db.c for db_puts. I put one set of timestamps/clock objects being set surrounding (on the outside) of the lock done and one on the inside. I noticed that it seemed it waited quite a while for the lock to be released sometimes, while the inside of the lock took at most 0.01 seconds. I've attached the scenario file I used for sipp, the logs I printed out in console with the timing, as well as the code I changed to do the time logs. Some of these took as long as 0.5 seconds around the lock but 0.0 inside it. Ex: {noformat} [Sep 21 17:06:39] NOTICE[11520]: db.c:345 ast_db_put: ZACH - (Outside Lock) Total ASTDB Time: 0.690000 (Inside Lock) Total ASTDB Time: 0.000000 {noformat} An ex of the cmd I used for sipp and the scenario to send the registrations to the server: {noformat} sipp $ASTERISK_SERVER_IP -sf REGISTER_client.xml -inf REGISTER_client.csv -trace_msg -trace_err -r 40 {noformat} You can modify -r to be a new rate of calls/sec if and add -m to exit after N successful calls. | ||
Comments: | By: Asterisk Team (asteriskteam) 2017-09-21 16:09:09.929-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Zach R (zrothy) 2017-09-21 16:11:14.935-0500 A tarball file containing the logs and scenario files I was using when I found this issue. By: Kevin Harwell (kharwell) 2017-10-24 17:35:32.233-0500 I ran your scenario on a Ubuntu 14.04 machine and did not see the same results (meaning I did not see the slow slow down). Would it be possible for you to try another OS to see if there was a difference? Are you running in a VM or a container? If so what happens if not in one? Another thing you could try is recompiling Asterisk with DEBUG_THREADS compiler flag turned on (menuselect->Compiler Flags->DEBUG_THREADS). Enabling this flag will impact performance, but it can output additional logging from the mutex unlock code. Maybe something will turn up? By: Asterisk Team (asteriskteam) 2017-11-08 12:00:01.200-0600 Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1]. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines By: Zach R (zrothy) 2017-12-12 16:44:16.694-0600 I'm attaching the output of core show locks when better stack trace and debug threads are enabled. Though I found when these are enabled I am not seeing the same times from the timing code I previously used to output the lock times. Without the two flags I did see some earlier along with the task processor for contacts slowly reaching its high water limits. For these tests teh astdb is written to a ramdisk rather than the SSDs this server is running on. The server itself is not a VM, but a physical dedicated machine, and I don't think running a different OS to test is an option at the moment. The only change in configuration I have at the moment is adding caching for aors, endpoints and auths. By: Asterisk Team (asteriskteam) 2017-12-12 16:44:17.054-0600 This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable. By: Zach R (zrothy) 2017-12-12 16:46:17.313-0600 I've attached both outputs of core show locks while I ran the sipp profile/test at a rate of 60 per second instead of 40 due to configuration changes mentionedi n my last comment. The file no_better_stack has BETTER_STACKTRACE flag disabled, while the other file has it enabled. By: Joshua C. Colp (jcolp) 2017-12-22 07:51:56.839-0600 [~seanbright] has done work to improve the performance of the astdb in relation to PJSIP contacts and I've got a patch up at ASTERISK-26806 which rewrites OPTIONS support and improves performance a lot. If you use the latest version of Asterisk with Sean's changes and my patch does the performance improve? By: Zach R (zrothy) 2017-12-22 11:01:56.550-0600 Is the patch applicable to the most recent head of the 13 branch? When I went to apply it failed on error: patch failed: res/res_pjsip/pjsip_options.c:1003. The current commit I am at is ce3d56920b15facbb64b3caf0d823a3f57c0dded. In the meanwhile I'll see if I can just manually apply it rather than rely on git apply. By: Joshua C. Colp (jcolp) 2017-12-22 11:13:30.225-0600 I've now got a review up at https://gerrit.asterisk.org/#/c/7710/ for eliciting feedback which applies to the head of 13. By: Asterisk Team (asteriskteam) 2018-01-05 12:00:02.171-0600 Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1]. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines |