[Home]

Summary:ASTERISK-27983: pjsip_options: rework may have left concurrency issue
Reporter:Gregory Massel (gmza)Labels:pjsip
Date Opened:2018-07-23 19:04:22Date Closed:2018-07-26 03:56:07
Priority:MinorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip
Versions:13.22.0 Frequency of
Occurrence
Related
Issues:
Environment:Ubuntu, Asterisk 13.22.0, pjsip, thousands of endpointsAttachments:
Description:Following the excellent work done in terms of ASTERISK-26806, I have been doing some performance testing on Asterisk 13.22.0 with res_pjsip.

If I get sipp to run more than 200 registrations per second, it starts logging:

res_pjsip/pjsip_distributor.c: Request 'REGISTER' from '"[redacted]" <sip:[redacted]>' failed for '[redacted]:5060' (callid: [redacted]) - No matching endpoint found after X tries in 0.000 ms

Where X is between 5 and 9 and >99% of the log entries show 0.000 ms.

At first I thought the time was logging incorrectly, but there are a few up to 2.731ms.

It is odd, however, that the endpoint cannot be found after numerous tries in such a short period of time (almost always less than 0.001ms). All the endpoints are valid and, if I reduce the registration rate to 100 reg per second, no such errors log.

Throughout this, CPU usage remains extremely modest across all cores.

This leads me to believe that there may be some sort of locking or contention across the endpoints/aors/contacts tables that is causing the registration performance to be restricted in spite of the hardware.

It seems that, despite significant gains in performance since ASTERISK-26806 was resolved, pjsip is still performing slower than chan_sip in terms of registrations (despite chan_sip using only a single core and res_pjsip threading across 8 cores).
Comments:By: Asterisk Team (asteriskteam) 2018-07-23 19:04:23.459-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2018-07-23 19:23:47.196-0500

You haven't given enough context for this, including how things are configured. Once the config is loaded there isn't really any contention on the endpoint identification process between it and the OPTIONS support. If using a config file then it's an immutable container with no lock, and endpoints themselves have no locks.

[~kharwell] has also been doing some testing and has not seen this kind of result with PJSIP. In his testing it was able to go further than chan_sip.

By: Gregory Massel (gmza) 2018-07-24 01:54:59.584-0500

My apologies!!!

Since the previous post I have moved astdb.sqlite3 onto a tmpfs file system and have more then tripled the registrations per second.
This must have been an I/O issue and the contention was either on writing the contact to astdb.sqlite3 or reading it from astdb.sqlite3.
If that is the case, then perhaps there may be a way to re-word the log entry written such that it actually makes reference to AstDB? This would make debugging a lot simpler!

I've re-benchmarked and I agree that res_pjsip is now performing registrations faster than chan_sip, but only *slightly*. Interestingly, however, in both cases, the bottleneck is sqlite3, and, as a result, the difference between the two drivers in respect of registration performance isn't huge.

Given that registrations are, by their very nature transient, enormous performance gains could be achieved if one had the option to keep the registrations in memory (i.e. not rely on a sqlite3 database). The performance gain would justify the loss of registrations on startup/shutdown/crash. Also, it may be possible to merely dump the entire memory cache to AstDB on shutdown and pull it from AstDB on startup. Similarly, it may be possible to dump the memory cache to AstDB every few minutes in a separate thread (that can run in parallel to registrations, rather then holding them up).

By: Joshua C. Colp (jcolp) 2018-07-24 04:46:18.888-0500

The log message you are referring to occurs EARLY before it is even treated as a registration, so I have no idea what exactly the interaction was like and what really happened to cause the message. It's not something that would be changed though because it fundamentally has nothing to do with a registration. It's "I got this SIP message, who is it from?".

By: Joshua C. Colp (jcolp) 2018-07-24 05:13:35.979-0500

Can you provide the information needed to reproduce this so we can understand what exactly was happening?

By: Gregory Massel (gmza) 2018-07-24 05:41:17.286-0500

I am using configurations similar to those at https://github.com/flaviogoncalves/astricon2017 (but with my own usernames/passwords) and then running:

/usr/local/bin/sipp x.x.x.x -sf register_options.xml -i x.x.x.x  -d 1000 -inf passwords.csv -oocsf ooc.xml -r 100

I have ~6800 users in a pjsip.conf file and default sorcery settings (i.e. manual config files).
This registers all the users and deals with OPTIONS packets (from qualify=yes).

I then just increase and decrease the rate in real time using + and - in sipp. Once the rate of registrations gets too high, pjsip starts logging "No matching endpoint found after X tries in 0.000 ms". Moving astdb.sqlite3 to tmpfs allows for pjsip to process the registrations much faster (as it's writing to RAM rather than hard drive), however, the same messages start logging irrespective, just at ~100 registrations per second when AstDB is on a hard drive versus at ~300 registrations per second when AstDB is on a memory file system. The higher the rate of registrations (above the achievable maximum), the faster these messages log.

I'm running htop at the same time to verify that the CPU cores are not getting too busy.

sqlite3 is not multi-threaded (although it is thread-safe) and is subject to storage / IO performance. Is it possible that any part of the endpoint identification process (in other threads) could get held up by a lock while a different thread is writing the registration information to AstDB?

By: Joshua C. Colp (jcolp) 2018-07-24 05:44:20.258-0500

Doubtful. The identification process blocks until a result is returned, it shouldn't return until everything is consulted which should return an endpoint according to your details.

By: Kevin Harwell (kharwell) 2018-07-24 09:46:29.172-0500

[~gmza] Just to be sure you stated you are using configurations similar to Flavio's. He was using 4000 endpoints and you are using ~6800 correct?

Also what version of pjsip are you using? Bundled? How are you running the test? For instance are you running it all locally (sipp and asterisk on same machine) or is sipp on a remote system? Can you also post your system specs?

Thanks!

By: Kevin Harwell (kharwell) 2018-07-24 11:20:07.984-0500

I just ran a test with 9000 configured endpoints at 2000 registers per second (started seeing a few retrans at this point) and have not seen this error yet.

Could you also attach you pjsip.conf and if possible your passwords.csv?

By: Gregory Massel (gmza) 2018-07-26 02:56:34.571-0500

Unfortunately it's not possible to share the passwords (as they're actual passwords for a live server on the internet).
Based on your feedback, however, I added the following to my pjsip.conf:
unidentified_request_count = 5000
unidentified_request_period = 1

After that, the previous error has stopped and I'm able to sustain high volumes (I only tested up to 2000 registrations per second, but it seemed like I could probably go higher).

Intermittently (every 5 to 10 seconds), however, I am now getting:
res_pjsip/pjsip_distributor.c: Request 'REGISTER' from '"xxx" <sip:xxx@xxx>' failed for 'x.x.x.x:5060' (callid: 41453-12472@x.x.x.x) - Failed to authenticate
However, given how infrequent this is relative to the test volume and that it represents less than a 0.03% failure rate, it seems statistically irrelevant.

Thank you for your assistance and apologies for my incorrect conclusion that this was a bug; it was obviously a config issue (unidentified_request_count has to be set extremely high when testing from a single IP).
Once again, thank you for the excellent work in terms of improving the PJSIP performance (ASTERISK-26806); these results prove that the performance is, as of Asterisk 13.22.0, orders of magnitude faster and more scalable than seen before.