
Summary: ASTERISK-25638: pjsip: Deadlock between monitor thread and worker threads
Reporter: Jani Aho (jani@callmaker)
Labels:
Date Opened: 2015-12-18 09:46:53.000-0600
Date Closed: 2017-12-20 06:05:37.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: 13.6.0 13.7.0
Frequency of Occurrence: Frequent
Related Issues: is duplicated by ASTERISK-25439 Segfault in find_entry () from /usr/lib/libpj.so.2 (dns_resolver, qualify_contact)
Environment: Centos 7.1 Pjproject 2.4.5
Attachments:
( 0) 2016_01_22__17_33_01.backtrace-threads.txt
( 1) 2016_01_22__17_33_01.full.tail.txt
( 2) 2016_01_22__17_33_01.locks.txt
( 3) 2016_01_22__17_33_01.netstat.txt
( 4) 2016_01_22__17_33_01.top.txt
( 5) 2016_01_22__17_34_01.backtrace-threads.txt
( 6) 2016_01_22__17_34_01.full.tail.txt
( 7) 2016_01_22__17_34_01.locks.txt
( 8) 2016_01_22__17_34_01.netstat.txt
( 9) 2016_01_22__17_34_01.top.txt
(10) backtrace-threads.txt
(11) core-show-locks.txt
Description: On some of our Asterisk servers, which serve over 50k calls per day, we experience deadlocks in pjsip. These deadlocks can occur anywhere from multiple times per day to maybe once a week.

The behaviour we see is that no SIP dialogs are created. The only messages that are sent are BYE messages, but the replies to these messages aren't handled.

After one of those occasions we enabled debugging and got backtraces of locks and threads. These are from 13.6.0, but we have had the same kinds of deadlocks on 13.7.0-rc1.

I've tried to make sense of the backtraces and I've made some observations.
1. In backtrace-threads.txt, thread 644 seems to be stuck in find_entry, defined on line 133 of pjlib/src/pj/hash.c. That thread has acquired the mod_ua.mutex in pjsip_ua_find_dialog (see the sketch after this list).

2. The threads waiting in _lll_lock_wait, like thread 632, are trying to acquire that same mutex in pjsip_ua_register_dlg.

3. A number of other calls are being attempted, but they don't seem to get processed by the taskprocessor.
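
A minimal sketch of the pattern the backtraces point at, assuming the simplified names below (this is illustrative C, not the actual pjproject source): the dialog lookup takes mod_ua.mutex and then walks a singly linked hash-bucket chain; if that chain ever becomes circular, the walk never ends, the mutex is never released, and every other worker parks in _lll_lock_wait, as observed.

{code:c}
/* Simplified stand-in for pjlib's hash-bucket walk; the struct and function
 * names are illustrative, not the real pjproject structures. */
#include <pthread.h>
#include <string.h>

struct entry {
    struct entry *next;       /* chain within one hash bucket */
    unsigned      hash;
    const char   *key;
    void         *value;
};

static pthread_mutex_t mod_ua_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Walks one bucket chain looking for a key.  If the chain has been corrupted
 * into a cycle, this loop never reaches NULL and the caller holds
 * mod_ua_mutex forever -- every other thread then blocks on the mutex. */
static void *find_in_bucket(struct entry *bucket, unsigned hash, const char *key)
{
    for (struct entry *e = bucket; e != NULL; e = e->next) {
        if (e->hash == hash && strcmp(e->key, key) == 0) {
            return e->value;
        }
    }
    return NULL;
}

void *lookup_dialog(struct entry *bucket, unsigned hash, const char *key)
{
    void *value;

    pthread_mutex_lock(&mod_ua_mutex);    /* as in the dialog lookup path */
    value = find_in_bucket(bucket, hash, key);
    pthread_mutex_unlock(&mod_ua_mutex);  /* never reached if the walk spins */

    return value;
}
{code}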
Comments:
By: Asterisk Team (asteriskteam) 2015-12-18 09:46:54.514-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Jani Aho (jani@callmaker) 2015-12-18 09:48:16.951-0600

Backtrace from gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk `pidof asterisk` > /tmp/backtrace-threads.txt, with scrubbed phone numbers.

By: Jani Aho (jani@callmaker) 2015-12-18 09:48:47.227-0600

Output of asterisk -rx "core show locks" > /tmp/core-show-locks.txt

By: Richard Mudgett (rmudgett) 2016-01-07 14:01:04.494-0600

The backtrace is not showing a classical deadlock.  The problem might not be a deadlock at all: your system may be running so low on CPU cycles or physical memory that the machine is so far behind in processing SIP messages as to appear deadlocked.

# Thread 644 would be the key thread if it somehow got blocked.  The find_entry() function currently being executed by the thread does not block.  The find_entry() function is likely walking the linked list of dialogs.  Unless the dialog table linked list is exceedingly long, or is corrupted and turned into an unending circular chain of entries (a hypothetical check for that is sketched after this list), find_entry() won't take long.  I checked that the dialog table linked list, which find_entry() is searching, is protected from reentrancy by obtaining the mod_ua.mutex lock.  I did not see any code path that failed to protect the list.
# The PJSIP thread pool threads like 632 that are waiting for the mod_ua.mutex lock shouldn't have long to wait as find_entry() shouldn't be taking very long.
# There are 613 threads that are waiting on creating an outbound dialed channel for the PJSIP thread pool to process.  These threads are waiting for a PJSIP thread pool thread to process the request.
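
To rule the circular-chain theory in or out, a Floyd tortoise-and-hare walk over the suspect bucket chain would show whether it ever terminates. This is a hypothetical diagnostic sketch, not code that exists in Asterisk or pjproject; the struct entry layout is a simplified stand-in for the real hash entries.

{code:c}
/* Hypothetical diagnostic: detect a cycle in a singly linked bucket chain
 * using Floyd's tortoise-and-hare.  Returns 1 if the chain is circular. */
#include <stddef.h>

struct entry {
    struct entry *next;   /* chain link, as in the sketch above */
};

int chain_has_cycle(const struct entry *head)
{
    const struct entry *slow = head;
    const struct entry *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;        /* one link per step  */
        fast = fast->next->next;  /* two links per step */
        if (slow == fast) {
            return 1;             /* pointers met: the chain is circular */
        }
    }
    return 0;                     /* reached NULL: the chain terminates */
}
{code}

Run from a debug build, or called from gdb on the hung process against the suspect bucket, a return value of 1 would point to corruption rather than slow processing.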

The backtrace has told me as much as it can.  I'm going to need more information on how your system is set up and configured to try to reproduce this issue.  Please attach any needed configuration files such as pjsip.conf.  Please review
https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

When the system is in the "deadlocked" state:

What does CLI "core show taskprocessors" show about the size of the in queues?  Please attach the command output as a {{.txt}} file to the issue.  The "Processed" column tells how many tasks are completed.  The "In Queue" column tells how many tasks are waiting to be processed.  If zero then nothing is pending.  The "Max Depth" column gives the longest pending task queue seen since the taskprocessor was created.
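
As a rough mental model of those three columns (a sketch with made-up names, not Asterisk's actual taskprocessor implementation): "Processed" counts completed tasks, "In Queue" is the current backlog, and "Max Depth" is a high-water mark of that backlog.

{code:c}
/* Conceptual model only of how the three CLI columns relate to a task
 * queue's counters; field and function names are illustrative. */
struct taskprocessor_stats {
    unsigned long processed;   /* "Processed": tasks completed so far     */
    unsigned long in_queue;    /* "In Queue":  tasks currently waiting    */
    unsigned long max_depth;   /* "Max Depth": largest in_queue ever seen */
};

static void on_task_queued(struct taskprocessor_stats *s)
{
    s->in_queue++;
    if (s->in_queue > s->max_depth) {
        s->max_depth = s->in_queue;   /* high-water mark of pending work */
    }
}

static void on_task_completed(struct taskprocessor_stats *s)
{
    s->in_queue--;
    s->processed++;
}
{code}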

Do repeated CLI "core show taskprocessors" requests show that the SIP-control taskprocessor is processing tasks?  Repeated backtraces can also show that things are being processed.

Does CLI "core ping taskprocessor SIP-control" respond with a ping time?  Note that the CLI will be blocked until you get a ping time response.


By: Dmitriy Serov (Demon) 2016-01-22 12:38:58.573-0600

It's a duplicate of ASTERISK-25439.
Every day my production server hangs with this lock. I have a lot of logs: lock output, backtraces, full logs, "top", "ping", and "netstat", with pjsip debug enabled as well.
There are two main errors: this deadlock and a segfault in find_entry (I think they share a common cause).
I can provide archives of all these logs to help get rid of this very disturbing problem.

To show the nature of the deadlock, I've attached the logs of today's hangs.
Two sets of logs, taken one minute apart. The program is still in the same position, still using CPU, and all endpoints are disconnected.

By: Joshua C. Colp (jcolp) 2017-12-20 06:05:37.588-0600

Tracking under ASTERISK-25439.