[Home]

Summary:ASTERISK-27966: pjsip: Race condition in 183 re transmission can result in a deadlock
Reporter:Torrey Searle (tsearle)Labels:pjsip
Date Opened:2018-07-17 07:04:38Date Closed:2018-08-02 06:55:18
Priority:MajorRegression?
Status:Closed/CompleteComponents:Third-Party/pjproject
Versions:13.20.0 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:
Description:Deadlock between two threads.

The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header

{noformat}
Thread 4 (Thread 0x7f34e6391700 (LWP 14984)):
#0 pjsip_get_uri (uri=0x7f3205e5c2c0) at ../src/pjsip/sip_uri.c:221
#1 0x00007f34e7bc4a82 in pjsip_uri_get_uri (uri=<optimized out>) at /usr/include/pjsip/sip_uri.h:272
#2 sanitize_tdata (tdata=<optimized out>) at res_pjsip/pjsip_message_filter.c:212
#3 filter_on_tx_message (tdata=0x7f3206374e38) at res_pjsip/pjsip_message_filter.c:233
#4 0x00007f352ec3ab0e in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x7f3206374e38) at ../src/pjsip/sip_endpoint.c:1088
#5 0x00007f352ec415c0 in pjsip_transport_send (tr=0x1aa67b8, tdata=tdata@entry=0x7f3206374e38, addr=addr@entry=0x7f321858ff34, addr_len=16, token=token@entry=0x7f321858fe58, cb=cb@entry=0x7f352ec4e1b0 <transport_callback>) at ../src/pjsip/sip_transport.c:833
#6 0x00007f352ec4edb0 in tsx_send_msg (tsx=0x7f321858fe58, tdata=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:2122
#7 0x00007f352ec4f8da in tsx_on_state_proceeding_uas (tsx=0x7f321858fe58, event=<optimized out>) at ../src/pjsip/sip_transaction.c:2882
#8 0x00007f352ec4eb86 in tsx_timer_callback (theap=<optimized out>, entry=0x7f321858ffd8) at ../src/pjsip/sip_transaction.c:1189
#9 0x00007f352dd6b757 in pj_timer_heap_poll (ht=0x1814600, next_delay=next_delay@entry=0x7f34e6390e50) at ../src/pj/timer.c:644
#10 0x00007f352ec3b5c4 in pjsip_endpt_handle_events2 (endpt=0x1814318, max_timeout=max_timeout@entry=0x7f34e6390ea0, p_count=p_count@entry=0x0) at ../src/pjsip/sip_endpoint.c:715
#11 0x00007f352ec3b6d7 in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout@entry=0x7f34e6390ea0) at ../src/pjsip/sip_endpoint.c:776
#12 0x00007f34e7bb8118 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4345
#13 0x00007f352dd59ca0 in thread_main (param=0x18139b8) at ../src/pj/os_core_unix.c:541
#14 0x00007f354837a064 in start_thread (arg=0x7f34e6391700) at pthread_create.c:309
#15 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

The second thread is stuck trying to get the tsx group lock heald by the first thread
{noformat}
Thread 16 (Thread 0x7f32ab2eb700 (LWP 6953)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f321a996058) at ../nptl/pthread_mutex_lock.c:114
#3 0x00007f352dd599a5 in pj_mutex_lock (mutex=0x7f321a996058) at ../src/pj/os_core_unix.c:1265
#4 0x00007f352dd5fce1 in grp_lock_acquire (p=0x7f321a995f48) at ../src/pj/lock.c:290
#5 0x00007f352dd5fd22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478
#6 0x00007f352ec514ac in pjsip_tsx_send_msg (tsx=tsx@entry=0x7f321858fe58, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:1783
#7 0x00007f352ec53952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7f321858fe58, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip/sip_dialog.c:1531
#8 0x00007f352f09109a in pjsip_inv_send_msg (inv=0x7f321b7f68b8, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip-ua/sip_inv.c:3231
#9 0x00007f35257fcb4c in ast_sip_session_send_response (session=session@entry=0x7f3219a9c7c0, tdata=0x7f3206374e38) at res_pjsip_session.c:983
#10 0x00007f347ff2c1f5 in indicate (data=0x7f321b5ed260) at chan_pjsip.c:1246
#11 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps@entry=0x1aaceb0) at taskprocessor.c:971
#12 0x00000000005dcc80 in execute_tasks (data=0x1aaceb0) at threadpool.c:1322
#13 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x180fdf0) at taskprocessor.c:971
#14 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351
#15 worker_active (worker=<optimized out>) at threadpool.c:1105
#16 worker_start (arg=0x7f34d4001ba0) at threadpool.c:1024
#17 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#18 0x00007f354837a064 in start_thread (arg=0x7f32ab2eb700) at pthread_create.c:309
#19 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

The first thread is an infinite loop because it's iterating a circular list of size one of type contact header

Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage

$85 = (pjsip_hdr *) 0x7f3205e5c218
(gdb) print hdr->next
$86 = (struct pjsip_hdr *) 0x7f3205e5c218

From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread).  This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has).  Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop.  

As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the  pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object.
Comments:By: Asterisk Team (asteriskteam) 2018-07-17 07:04:39.359-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Torrey Searle (tsearle) 2018-07-19 10:40:40.630-0500

callflow is basically this

{noformat}
A --> INV --> Asterisk --> INV --> B
A <-- 183 <-- Asterisk <-- 183 <-- B
       … 60 seconds later …
A <-- 183 <-- Asterisk <-- 183 <-- B
** Deadlock on the A side **
{noformat}

after 60 seconds pjsip is trying to internally retransmit the 183 using the tdata reference {{tsx->last_tx}}
while asterisk is also trying to send a 183, which is first trying to remove the contact header in {{inv->last_answer}}

and {{tsx->last_tx}} points to the same tdata object as {{inv->last_answer}}

Modifying the data object while the heap timer is trying to retransmit the 183 results in an infinite loop while holding the tsx group lock.


By: Friendly Automation (friendly-automation) 2018-08-02 06:55:18.959-0500

Change 9477 merged by Joshua Colp:
thirdparty/pjproject: fix deadlock in response retransmissions

[https://gerrit.asterisk.org/9477|https://gerrit.asterisk.org/9477]

By: Friendly Automation (friendly-automation) 2018-08-02 07:14:04.289-0500

Change 9801 merged by Joshua Colp:
thirdparty/pjproject: fix deadlock in response retransmissions

[https://gerrit.asterisk.org/9801|https://gerrit.asterisk.org/9801]

By: Friendly Automation (friendly-automation) 2018-08-02 07:15:13.538-0500

Change 9800 merged by Joshua Colp:
thirdparty/pjproject: fix deadlock in response retransmissions

[https://gerrit.asterisk.org/9800|https://gerrit.asterisk.org/9800]

By: Friendly Automation (friendly-automation) 2018-08-02 07:15:16.583-0500

Change 9799 merged by Jenkins2:
thirdparty/pjproject: fix deadlock in response retransmissions

[https://gerrit.asterisk.org/9799|https://gerrit.asterisk.org/9799]