[Home]

Summary:ASTERISK-26969: Deadlock - chan_pjsip
Reporter:Ross Beer (rossbeer)Labels:
Date Opened:2017-04-27 04:54:58Date Closed:2017-05-18 13:11:22
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_pjsip
Versions:GIT Frequency of
Occurrence
Frequent
Related
Issues:
duplicatesASTERISK-26978 rtp: Crash in ast_rtp_codecs_payload_code()
Environment:Fedora 23Attachments:( 0) 05_backtrace-threads_2017-05-25_clean.txt
( 1) 07_backtrace-threads_2017-04-27_clean.txt
( 2) backtrace_20170508_clean.txt
Description:A deadlock has been introduced between the versions:

Asterisk GIT-13-13.15.0-rc1-64-ge851412M
Asterisk GIT-13-13.15.0-rc1-80-g299376cM

This is happening over multiple instances.

I am currently unable to run DEBUG_THREADS to identify, however I have backtraces when the system locks.

This is running with the revert commit: 3e7c396a51b240088c475dd53e7bac9869376129 (bridging:  Ensure successful T.38 negotation)
Comments:By: Asterisk Team (asteriskteam) 2017-04-27 04:54:59.779-0500

The severity of this issue has been automatically downgraded from "Blocker" to "Major". The "Blocker" severity is reserved for issues which have been determined to block the next release of Asterisk. This severity can only be set by privileged users. If this issue is deemed to block the next release it will be updated accordingly during the triage process.

By: Asterisk Team (asteriskteam) 2017-04-27 04:55:00.150-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2017-04-27 11:44:40.408-0500

05_backtrace-threads_2017-05-25_clean.txt is the same deadlock you've previously reported with the RTP payload code.

By: Ross Beer (rossbeer) 2017-04-27 12:29:55.509-0500

I'll move the 05_backtrace-threads_2017-05-25_clean.txt to ticket ASTERISK-26445

Is the second deadlock a new one we haven't seen before?

By: Richard Mudgett (rmudgett) 2017-04-27 13:00:01.781-0500

I cannot see a deadlock in the 04 backtrace.  Most of the threads are waiting in poll().  The other threads seems to be in an idle condition.

By: Ross Beer (rossbeer) 2017-04-27 13:15:02.714-0500

Each time this happens, SIP messages stop getting processed. It doesn't matter if its UDP, TCP or TLS.

The 05 backtrace did recover just after I did the trace on the running instance.

By: Richard Mudgett (rmudgett) 2017-04-27 13:19:06.477-0500

In the 04 backtrace, this is about the only thing interesting in it:
{noformat}
Thread 258 (Thread 0x7f85ce5cc700 (LWP 35497)):
#0  0x00007f86d6f32339 in pj_lock_acquire (lock=0x0) at ../src/pj/lock.c:178
#1  0x00007f86d6f324cf in grp_lock_acquire (p=0x7f850c749428) at ../src/pj/lock.c:290
#2  0x00007f86d6f32a0c in pj_grp_lock_acquire (grp_lock=0x7f850c749428) at ../src/pj/lock.c:478
#3  0x00007f86d6eb07b3 in tsx_timer_callback (theap=0x1ff8fb0, entry=0x7f8598090a70) at ../src/pjsip/sip_transaction.c:1187
#4  0x00007f86d6f4188a in pj_timer_heap_poll (ht=0x1ff8fb0, next_delay=0x7f85ce5cbd30) at ../src/pj/timer.c:643
#5  0x00007f86d6e967eb in pjsip_endpt_handle_events2 (endpt=0x1ff8cc8, max_timeout=0x7f85ce5cbd80, p_count=0x0) at ../src/pjsip/sip_endpoint.c:713
#6  0x00007f86d6e9691d in pjsip_endpt_handle_events (endpt=0x1ff8cc8, max_timeout=0x7f85ce5cbd80) at ../src/pjsip/sip_endpoint.c:770
#7  0x00007f86000bbff8 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4174
#8  0x00007f86d6f2a43d in thread_main (param=0x205ade8) at ../src/pj/os_core_unix.c:541
#9  0x00007f86d4d3861a in start_thread () at /usr/lib64/libpthread.so.0
#10 0x00007f86d40745fd in clone () at /usr/lib64/libc.so.6
{noformat}

The monitor thread is trying to acquire a lock using a NULL pointer.  How that happens is not in the backtrace.

By: Rusty Newton (rnewton) 2017-04-28 16:01:09.828-0500

We will probably need a debug log leading up to the freeze.

By: Ross Beer (rossbeer) 2017-05-08 09:08:35.644-0500

Further backtrace attached:

{noformat}
#0  0x00007f8cc531e353 in pj_lock_acquire (lock=0x626f3b303230) at ../src/pj/lock.c:180
180    return (*lock->acquire)(lock->lock_object);
{noformat}

Could the patch here be related https://trac.pjsip.org/repos/changeset/5576

By: Richard Mudgett (rmudgett) 2017-05-18 11:42:14.938-0500

I suspect the root cause of this issue is the same as ASTERISK-26978 since other manifestations of that issue are deadlocks using a likely destroyed RTP instance.

By: Rusty Newton (rnewton) 2017-05-18 13:11:22.299-0500

Closing this out. We can continue tracking on ASTERISK-26978