[Home]

Summary:ASTERISK-24540: chan_sip: inefficient database lookups when looking up a peer causes system slowdown with realtime
Reporter:Ross Beer (rossbeer)Labels:
Date Opened:2014-11-20 09:36:06.000-0600Date Closed:2017-09-12 09:33:45
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/DatabaseSupport
Versions:1.8.32.0 11.14.0 11.14.1 13.2.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:CentOS 6.4, CentOS 6.5, CentOS 7Attachments:( 0) backtrace-threads_21-11-2014-clean.txt
( 1) backtrace-threads-clean_server2.txt
( 2) backtrace-threads-clean.txt
( 3) lsof_24-11-2014_1158.txt
( 4) server2_additional_debug.txt
( 5) threads.txt
Description:We are experiencing an issue where the SIP channel driver pauses and does not allow further registrations or calls to progress.

Looking at core dumps shows lock waits for app_dial and presence.

Unfortunately asterisk had not been compiled with the correct settings so I will need to provide further back traces when the issue happens again, however have attached the files I have now.
Comments:By: Ross Beer (rossbeer) 2014-11-20 18:19:12.226-0600

When the sip channel pauses, here are the locks. The server was not making any calls at the time and only processing registrations.

{noformat}
=======================================================================
=== 11.14.0
=== Currently Held Locks
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x2adf43eda700 (do_monitor           started at [29097] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 28505 handle_request_do &netlock 0x2adf355b47c0 (1)
       main/logger.c:1643 ast_bt_get_addresses() (0x50affe+1D)
       main/lock.c:218 __ast_pthread_mutex_lock() (0x503108+C9)
       channels/chan_sip.c:28508 handle_request_do()
       channels/chan_sip.c:28467 sipsock_read()
       main/io.c:292 ast_io_wait() (0x4fd017+19C)
       channels/chan_sip.c:29064 do_monitor()
       main/utils.c:1192 dummy_start()
       pthread_create.c:0 start_thread()
       :0 __clone() (0x2adec9481970+6D)
=== ---> Lock #1 (chan_sip.c): MUTEX 16986 register_verify p 0x2adf454f6140 (1)
       main/logger.c:1643 ast_bt_get_addresses() (0x50affe+1D)
       main/lock.c:218 __ast_pthread_mutex_lock() (0x503108+C9)
       main/astobj2.c:195 __ao2_lock() (0x44ca86+96)
       channels/chan_sip.c:16986 register_verify()
       channels/chan_sip.c:28059 handle_request_register()
       channels/chan_sip.c:28367 handle_incoming()
       channels/chan_sip.c:28535 handle_request_do()
       channels/chan_sip.c:28467 sipsock_read()
       main/io.c:292 ast_io_wait() (0x4fd017+19C)
       channels/chan_sip.c:29064 do_monitor()
       main/utils.c:1192 dummy_start()
       pthread_create.c:0 start_thread()
       :0 __clone() (0x2adec9481970+6D)
=== -------------------------------------------------------------------
===
{noformat}

By: Ross Beer (rossbeer) 2014-11-20 18:34:28.908-0600

And again:

{noformat}
=======================================================================
=== 11.14.0
=== Currently Held Locks
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x2b6c54c43700 (run_devstate_collector started at [  812] devicestate.c ast_enable_distributed_devstate())
=== ---> Lock #0 (astobj2.c): MUTEX 1093 internal_ao2_callback c 0x226fa00 (1)
       main/logger.c:1643 ast_bt_get_addresses() (0x50affe+1D)
       main/lock.c:218 __ast_pthread_mutex_lock() (0x503108+C9)
       main/astobj2.c:195 __ao2_lock() (0x44ca86+96)
       main/event.c:1466 event_update_cache()
       main/event.c:1505 handle_event()
       main/taskprocessor.c:332 tps_processing_function()
       main/utils.c:1192 dummy_start()
       pthread_create.c:0 start_thread()
       :0 __clone() (0x2b6be6672970+6D)
       asterisk <unknown>()
       asterisk <unknown>()
=== -------------------------------------------------------------------
===
=== Thread ID: 0x2b6c61393700 (do_monitor           started at [29097] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 28505 handle_request_do &netlock 0x2b6c5293b7c0 (1)
       main/logger.c:1643 ast_bt_get_addresses() (0x50affe+1D)
       main/lock.c:218 __ast_pthread_mutex_lock() (0x503108+C9)
       channels/chan_sip.c:28508 handle_request_do()
       channels/chan_sip.c:28467 sipsock_read()
       main/io.c:292 ast_io_wait() (0x4fd017+19C)
       channels/chan_sip.c:29064 do_monitor()
       main/utils.c:1192 dummy_start()
       pthread_create.c:0 start_thread()
       :0 __clone() (0x2b6be6672970+6D)
=== ---> Lock #1 (chan_sip.c): MUTEX 16986 register_verify p 0x2b6c8c41e7a0 (1)
       main/logger.c:1643 ast_bt_get_addresses() (0x50affe+1D)
       main/lock.c:218 __ast_pthread_mutex_lock() (0x503108+C9)
       main/astobj2.c:195 __ao2_lock() (0x44ca86+96)
       channels/chan_sip.c:16986 register_verify()
       channels/chan_sip.c:28059 handle_request_register()
       channels/chan_sip.c:28367 handle_incoming()
       channels/chan_sip.c:28535 handle_request_do()
       channels/chan_sip.c:28467 sipsock_read()
       main/io.c:292 ast_io_wait() (0x4fd017+19C)
       channels/chan_sip.c:29064 do_monitor()
       main/utils.c:1192 dummy_start()
       pthread_create.c:0 start_thread()
       :0 __clone() (0x2b6be6672970+6D)
=== -------------------------------------------------------------------
===
=======================================================================
{noformat}

By: Matt Jordan (mjordan) 2014-11-20 19:23:21.989-0600

In the future, please attach output as a text file to the issue.

By: Matt Jordan (mjordan) 2014-11-20 19:25:05.633-0600

Which Asterisk version are you actually running with? The line numbers in the backtrace don't appear to be lining up with current Asterisk 11 {{chan_sip}}.

By: Ross Beer (rossbeer) 2014-11-20 19:30:58.283-0600

This was taken from a box running 11.14

By: Corey Farrell (coreyfarrell) 2014-11-20 20:40:48.004-0600

What patches have you added?

By: Ross Beer (rossbeer) 2014-11-21 03:53:52.476-0600

None, it was a fresh install from source.

By: Ross Beer (rossbeer) 2014-11-21 06:03:12.141-0600

Another pause

By: Ross Beer (rossbeer) 2014-11-21 06:10:58.745-0600

Attaching backtrace from second server

By: Ross Beer (rossbeer) 2014-11-21 06:12:55.317-0600

Additional debug for second server

By: Rusty Newton (rnewton) 2014-11-21 15:56:38.865-0600

Ross is it possible you can get an Asterisk log with DEBUG messages and a sip trace (together) - showing the last few minutes before the deadlock?

By: Ross Beer (rossbeer) 2014-11-24 04:51:43.519-0600

I am running a sip trace now, hopefully this will catch something.

There was a further pause today which gave the following backtrace:

0x00007f7a5b000a4d in ?? () Thread 1 (process 33217): #0  0x00007f7a5b000a4d in ?? () #1  0x0000000000000000 in ?? ()

There was only a single line shown which is a little strange in it's self.



By: Ross Beer (rossbeer) 2014-11-24 06:24:24.890-0600

I've been delving into this deeper and looking at the open RTP ports there appears to be a leak.

For 18 active calls there are 477 open ports, would the SIP channel diver wait for ports to become available?

The issue only happens when the servers are under load so this would correlate.

I have attached an 'lsof' showing the open ports.

By: Rusty Newton (rnewton) 2014-12-03 17:24:58.733-0600

Thanks, I'm going to throw this back into Waiting on Feedback until we get the DEBUG log with SIP trace integrated.

The lsof does look fishy to me.

By: Ross Beer (rossbeer) 2014-12-17 04:04:09.290-0600

Rolling back to Asterisk 1.8 resolved the issue. I am going to try installing 13 on a test bed to see if this resolves the issue.

It appears that Asterisk 11 performs a lookup on the realtime database for every call when the peer is defined in realtime. It also looks like this is the case when cache is enabled also.

By: Rusty Newton (rnewton) 2015-01-08 09:46:15.863-0600

{quote}
I'm going to throw this back into Waiting on Feedback until we get the DEBUG log with SIP trace integrated.
{quote}
We still need the DEBUG log with SIP trace integrated captured up-to and during the time of the deadlock.

By: Rusty Newton (rnewton) 2015-01-29 18:40:36.097-0600

Ross, in addition to the requested debug - can you provide a method and configs to reproduce this issue?

I'm not sure we can investigate this one easily without some steps and configs to reproduce the issue.

By: Ross Beer (rossbeer) 2015-02-13 02:45:35.501-0600

I've finally got around to installing Asterisk 13 and it appears that the same issue occurs in this version too. The SIP channel driver performance is impeded due to every call creating a real-time DB look-up even if the peer is statically defined.

The driver should look to it statically defined peers, check the real-time cache and then if no devices found look-up the peer from the database.

Can this order be changed in the chan_sip driver? I am planning to move to PJ sip but want to move to Asterisk 13 with the old chan_sip before moving. This would greatly ease the pain of moving.



By: Matt Jordan (mjordan) 2015-02-16 11:59:56.126-0600

{quote}
It appears that Asterisk 11 performs a lookup on the realtime database for every call when the peer is defined in realtime. It also looks like this is the case when cache is enabled also.
{quote}

It depends on how the peers are configured.

If the inbound request can be matched based on the user name in the {{From}} field, then the peer should be found. However, if this misses, then the peer will be looked for in realtime, which will typically result in two DB queries: one to get the peer by the user name; if that fails, another to get it by IP address. If both of those fail, *then* we look to see if we can match the peer statically by IP address. If that fails, we try realtime again (even though it most likely will fail).

This is sub-optimal; at the same time, there are a lot of nasty traps in that code. I would suspect that changing it and not breaking a lot of things would be non-trivial.

I'm fine accepting this issue on that notion, but without a well written and well tested patch, I doubt this issue will get much traction - particularly when {{chan_pjsip}} generally handles this case better already.

By: Matt Jordan (mjordan) 2015-02-16 12:00:51.790-0600

That being said: was there an actual locking issue? Or was this merely the database lookups impeding progress?

(Also note that {{chan_pjsip}}, being multi-threaded in the stack, won't block subsequent requests well a DB transaction is taking place.)

By: Rusty Newton (rnewton) 2015-03-06 15:18:22.173-0600

{quote}That being said: was there an actual locking issue? Or was this merely the database lookups impeding progress?{quote}

Setting the issue back to Waiting on Feedback

By: Ross Beer (rossbeer) 2015-03-09 11:07:35.730-0500

I think the issue is that the issue is that the cache should be hit before looking up details in the database. For example an outgoing call when using 'type=friend' appears to do a number of database lookups for each call and registration.

When there is heavy load, this can cause a strain on the database. This should be avoided when using real time cache, however in version 11 and 13  the cache is the last thing checked where as it should the the first.

Asterisk 1.8 does not have the same issue and therefore it would be great if we could resolve this in 11 and 13 to aid our migration to PJSIP which is the plan, however we would like to move to 13 using the existing SIP driver in the first instance.

13 should resolve a few issues experienced with 1.8 so the first hurdle would be to upgrade to 13 before moving to the very new PJSIP driver.

By: Matt Jordan (mjordan) 2015-03-10 16:15:03.488-0500

I'm going to go ahead and ack the issue, but remember my quote from earlier:

{quote}
I'm fine accepting this issue on that notion, but without a well written and well tested patch, I doubt this issue will get much traction - particularly when chan_pjsip generally handles this case better already.
{quote}

By: Ross Beer (rossbeer) 2017-09-12 09:33:45.963-0500

PJSIP does not have the same issue