[Home]

Summary:ASTERISK-26061: [patch] res_pjsip: improve realtime performance - remove updating all endpoints status on startup
Reporter:Alexei Gradinari (alexei gradinari)Labels:
Date Opened:2016-05-25 15:48:27Date Closed:2016-08-04 07:00:00
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip
Versions:13.9.1 Frequency of
Occurrence
Related
Issues:
is related toASTERISK-26269 res_pjsip: Wrong state for aors without registered contacts after startup
Environment:Attachments:( 0) ast13.diff
Description:The patch removes updating all Endpoints' status on startup.
Instead, only non-qualified aors with static contact
and non-qualified non-expired contacts are retrieved from the realtime to
update the endpoint status to ONLINE.
The endpoint name was added to the contact object to simply find the endpoint
that created this contact.

The status of endpoints with qualified aors will be updated by 'qualify'
functions.
Comments:By: Asterisk Team (asteriskteam) 2016-05-25 15:48:28.147-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: JoshE (n8ideas) 2016-06-25 19:19:18.874-0500

Alexei-

This is definitely a needed patch.  We've been working on something similar for a project on our end.

I think we'd propose taking this a bit further.  We have a database of roughly 20,000 endpoints and AORs.  They'reall qualified, so we are still seeing 20,000 SQL lookups on startup.

DEBUG[1106] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id

Boot to Asterisk Ready with this patch takes us around 10 minutes per server, which is far better than 13.10 boot time of 15 minutes and 13.9 or earlier of 30+ minutes.

I think we'd like to propose adding a knob that that would allow us to disable the contacts lookup, regardless of the qualify condition.  Probably not appropriate for a number of use cases, but for larger "unified" clusters, where we would like to have every endpoint able to register to any node, it seems like this may be a good solution.

Thoughts there?

By: Alexei Gradinari (alexei gradinari) 2016-06-27 10:22:13.333-0500

JoshE,

It's very strange that there are 20k SQL lookups to ps_contacts on startup.

We have about 1000 endpoints with dynamic contact and 10 endpoints with static contact.
On startup there are
1x SELECT * FROM ps_endpoints WHERE id LIKE '%' ORDER BY id
1x SELECT * FROM ps_aors WHERE contact != '' AND qualify_frequency <= 0 ORDER BY contact
1x SELECT * FROM ps_contacts WHERE expiration_time > 1467036685 AND qualify_frequency <= 0 ORDER BY expiration_time
1x SELECT * FROM ps_aors WHERE qualify_frequency > 0 ORDER BY  qualify_frequency
10x SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
1x SELECT * FROM ps_contacts WHERE qualify_frequency > 0 ORDER BY qualify_frequency
1x SELECT * FROM ps_endpoints WHERE mailboxes != '' ORDER BY mailboxes

Are your 20k endpoints with static or dynamic contact?

Be sure that
- the contact value is '' (blank) instead of NULL in ps_aors for dynamic contacts.
- the field expiration_time on ps_contacts was changed to BigInteger;
- the field endpoint was added to ps_contacts;

Do you use Sorcery Memory Caching?
We don't.



By: JoshE (n8ideas) 2016-07-06 22:17:29.266-0500

Thanks for the response Alexei!

So on my 20,000, with your configuration, I am still seeing the 20,000 lookups at boot:

[2016-07-06 20:57:58] DEBUG[23520] res_odbc.c: Reusing ODBC handle 0x26dff78 from class 'config'
[2016-07-06 20:57:58] DEBUG[23520] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-06 20:57:58] DEBUG[23520] res_config_odbc.c: Parameter 1 ('id LIKE') = '130;@%'

Prior to the patch, having contact field in ps_aors as NULL would produce the "Endpoint ... is unavailable" at boot.  With your patch, those console logs are gone, but my 8 minute boot persists.  No real effect on startup time.

With sorcery cacheing on or off, it actually doesn't really affect the bootup time.  Pretty much stuck on the 8 minute mark regardless of whether it's on or off.  Seems to reduce DB load after that point however, but may not be worth the hassle operationally.

All of my aors (minus a small handful) are qualify_frequency > 0.

I'm going to take another look at where that's being generated and see if I don't have something else on my end that needs fixing, or I'll dig into the source.

Let me know if you have any other inspiration on this.

By: Alexei Gradinari (alexei gradinari) 2016-07-07 12:40:07.226-0500

JoshE,

What's SQL statements before 20k lookups of ps_contacts?


By: JoshE (n8ideas) 2016-07-07 12:55:03.271-0500

So I cleaned up the DB to around 7400 rows in ps_endpoints and ps_aors.

Immediately at startup, here's what I have:
{noformat}
[2016-07-07 11:42:59] DEBUG[30851] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoint_id_ips WHERE id LIKE ? ORDER BY id
[2016-07-07 11:42:59] DEBUG[30851] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE id = ?
[2016-07-07 11:43:00] DEBUG[28579] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE aors LIKE ? ORDER BY aors
[2016-07-07 11:43:02] DEBUG[28579] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE aors LIKE ? ORDER BY aors
[2016-07-07 11:43:06] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE contact != ? AND qualify_frequency <= ? ORDER BY contact
[2016-07-07 11:43:06] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE expiration_time > ? AND qualify_frequency <= ? ORDER BY expiration_time
[2016-07-07 11:43:06] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE qualify_frequency > ? ORDER BY qualify_frequency
[2016-07-07 11:43:07] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
{noformat}
Beyond that, here's a rough pattern what I get
{noformat}
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:23] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE id = ?
[2016-07-07 11:46:24] DEBUG[32233] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 11:46:24] DEBUG[32267] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE aors LIKE ? ORDER BY aors
{noformat}

After Asterisk Ready, here are the total counts:
{noformat}
[root@media-3:/var/log/asterisk#]cat debug | grep "SQL: SELECT \* FROM ps_aors"  | wc -l
5887
[root@media-3:/var/log/asterisk#]cat debug | grep "SQL: SELECT \* FROM ps_contacts"  | wc -l
5915
{noformat}


By: Alexei Gradinari (alexei gradinari) 2016-07-07 13:59:38.004-0500

JoshE,

It seems you use old version without my 1st patch https://gerrit.asterisk.org/#/c/2622/
Should be SELECT * FROM ps_contacts WHERE qualify_frequency > 0 ORDER BY qualify_frequency
but there isn't

Could you please, apply the patch https://gerrit.asterisk.org/#/c/2906/ to 13.10.0-rc2
or git clone 13 branch which included both my patches.



By: JoshE (n8ideas) 2016-07-07 16:08:04.463-0500

So I just cloned the 13 branch, with this as the last commit:

commit e26bd15e7a79d21ba0963d8ce38ec43e4550e225
Author: Scott Griepentrog <scott@griepentrog.com>
Date:   Thu Jul 7 10:55:42 2016 -0500

Same behavior of the 10k+ lookups between the ps_aors and ps_contacts tables.  First SQL queries are these:

{noformat}
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE contact != ? AND qualify_frequency <= ? ORDER BY contact
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE expiration_time > ? AND qualify_frequency <= ? ORDER BY expiration_time
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_aors WHERE qualify_frequency > ? ORDER BY qualify_frequency
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
[2016-07-07 15:06:10] DEBUG[349] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE id LIKE ? ORDER BY id
{noformat}

By: Alexei Gradinari (alexei gradinari) 2016-07-07 16:45:49.719-0500

JoshE,

You should check that length of contact's value in ps_aors is 0 for dynamic endpoints.
It seems that your aors are treated as with permanent contact.
You should add additional debug into location.c to find out the code which produce huge lookups.


By: JoshE (n8ideas) 2016-07-14 13:39:33.431-0500

Hi Alexei-

The length is definitely 0 and they are showing up as dynamic contacts.

A couple more interesting things.  It appears that each ps_contacts lookup is happening 3 times per id, which was obvious when I started printing out the regex.  Eliminating this line in location.c reduces 2/3 of those lookups:

       ast_sorcery_object_field_register_custom(sorcery, "aor", "contact", "", permanent_uri_handler, contacts_to_str, contacts_to_var_list, 0, 0);

contacts_to_str is the method being called twice, which generates two additional lookups, ultimately via ast_sip_location_retrieve_aor_contacts_nolock.

With that line commented out, Asterisk now starts in around 3 minutes.  

The third lookup is generated by create_mwi_subscriptions_for_endpoint, where unsolicited MWI notifications are allowed, as it is on most of our endpoints.

Still tracking a real solution to the issue, but just wanted to provide a quick update.

By: Alexei Gradinari (alexei gradinari) 2016-07-14 15:26:34.956-0500

JoshE,

Got it. It's unsolicited MWI.
All our ENDPOINTs are without 'mailboxes' set (no unsolicited MWI).
The 'mailboxes' is set for AORs (SIP devices subscribe to MWI).
Do you need unsolicited MWI?
Can you test without unsolicited MWI and without removing ast_sorcery_object_field_register_custom?




By: JoshE (n8ideas) 2016-07-14 16:04:47.818-0500

Alexei-

Yep.  That's it.  Removed unsolicited MWI (zeroed out the "mailboxes" field in ps_endpoints) and putting back the ast_sorcery_object_field_register_custom  and SIP stack responsiveness now happens in < 5 seconds.

We technically do probably need unsolicited MWI for some endpoints, though it architecturally could be worked around.

How complicated is a fix for unsolicited MWI?

Josh

By: nappsoft (nappsoft) 2016-08-04 06:00:37.368-0500

This reverts some of the changes

By: nappsoft (nappsoft) 2016-08-04 06:09:49.026-0500

Just discovered a problem with asterisk-13.11.0-rc1 related to this change:

After startup the State for all hints that have no registered contact is "Idle" instead of "Unavailable" and will stay like this until a contact registers.

Applying the patch (which reverts some of the changes) fixes the issue (what might not be the idea behind the original patch indeed, but I only wanted to check where the issue comes from). The important change is the second changeset I guess (I only made the change of the first changeset as I'm not sure why the "if (status)" had been removed...).

Please note that I've observed another strange behavior: two of the aors with no registered contact changed to the right status after a while (the other nine didn't)! The only difference between these aors and the others is that these two aors have a name that is similar to the one of a aor with a registered contact: the one with the registered contact is called "snom", the others are called "snomprov" and "snomd375". So as you can see the name of the aor with a registered contact is a substring of the name of the other two aors... Is there a problem with some contact/aor matching functions somewhere in the code?

By: Joshua C. Colp (jcolp) 2016-08-04 06:59:54.327-0500

[~nappsoft] I've been working on realtime tests to cover this stuff and have also uncovered the same. The behavior between realtime and config based is different for that now, when it probably shouldn't be. Can you open a new issue with your own information and the patch? This specific issue should have actually been closed, and I will be doing so now.