Summary: | ASTERISK-27966: pjsip: Race condition in 183 re transmission can result in a deadlock | ||
Reporter: | Torrey Searle (tsearle) | Labels: | pjsip |
Date Opened: | 2018-07-17 07:04:38 | Date Closed: | 2018-08-02 06:55:18 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Third-Party/pjproject |
Versions: | 13.20.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ||
Description: | Deadlock between two threads.
The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header {noformat} Thread 4 (Thread 0x7f34e6391700 (LWP 14984)): #0 pjsip_get_uri (uri=0x7f3205e5c2c0) at ../src/pjsip/sip_uri.c:221 #1 0x00007f34e7bc4a82 in pjsip_uri_get_uri (uri=<optimized out>) at /usr/include/pjsip/sip_uri.h:272 #2 sanitize_tdata (tdata=<optimized out>) at res_pjsip/pjsip_message_filter.c:212 #3 filter_on_tx_message (tdata=0x7f3206374e38) at res_pjsip/pjsip_message_filter.c:233 #4 0x00007f352ec3ab0e in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x7f3206374e38) at ../src/pjsip/sip_endpoint.c:1088 #5 0x00007f352ec415c0 in pjsip_transport_send (tr=0x1aa67b8, tdata=tdata@entry=0x7f3206374e38, addr=addr@entry=0x7f321858ff34, addr_len=16, token=token@entry=0x7f321858fe58, cb=cb@entry=0x7f352ec4e1b0 <transport_callback>) at ../src/pjsip/sip_transport.c:833 #6 0x00007f352ec4edb0 in tsx_send_msg (tsx=0x7f321858fe58, tdata=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:2122 #7 0x00007f352ec4f8da in tsx_on_state_proceeding_uas (tsx=0x7f321858fe58, event=<optimized out>) at ../src/pjsip/sip_transaction.c:2882 #8 0x00007f352ec4eb86 in tsx_timer_callback (theap=<optimized out>, entry=0x7f321858ffd8) at ../src/pjsip/sip_transaction.c:1189 #9 0x00007f352dd6b757 in pj_timer_heap_poll (ht=0x1814600, next_delay=next_delay@entry=0x7f34e6390e50) at ../src/pj/timer.c:644 #10 0x00007f352ec3b5c4 in pjsip_endpt_handle_events2 (endpt=0x1814318, max_timeout=max_timeout@entry=0x7f34e6390ea0, p_count=p_count@entry=0x0) at ../src/pjsip/sip_endpoint.c:715 #11 0x00007f352ec3b6d7 in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout@entry=0x7f34e6390ea0) at ../src/pjsip/sip_endpoint.c:776 #12 0x00007f34e7bb8118 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4345 #13 0x00007f352dd59ca0 in thread_main (param=0x18139b8) at ../src/pj/os_core_unix.c:541 #14 0x00007f354837a064 in start_thread (arg=0x7f34e6391700) at pthread_create.c:309 #15 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} The second thread is stuck trying to get the tsx group lock heald by the first thread {noformat} Thread 16 (Thread 0x7f32ab2eb700 (LWP 6953)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f321a996058) at ../nptl/pthread_mutex_lock.c:114 #3 0x00007f352dd599a5 in pj_mutex_lock (mutex=0x7f321a996058) at ../src/pj/os_core_unix.c:1265 #4 0x00007f352dd5fce1 in grp_lock_acquire (p=0x7f321a995f48) at ../src/pj/lock.c:290 #5 0x00007f352dd5fd22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478 #6 0x00007f352ec514ac in pjsip_tsx_send_msg (tsx=tsx@entry=0x7f321858fe58, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:1783 #7 0x00007f352ec53952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7f321858fe58, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip/sip_dialog.c:1531 #8 0x00007f352f09109a in pjsip_inv_send_msg (inv=0x7f321b7f68b8, tdata=tdata@entry=0x7f3206374e38) at ../src/pjsip-ua/sip_inv.c:3231 #9 0x00007f35257fcb4c in ast_sip_session_send_response (session=session@entry=0x7f3219a9c7c0, tdata=0x7f3206374e38) at res_pjsip_session.c:983 #10 0x00007f347ff2c1f5 in indicate (data=0x7f321b5ed260) at chan_pjsip.c:1246 #11 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps@entry=0x1aaceb0) at taskprocessor.c:971 #12 0x00000000005dcc80 in execute_tasks (data=0x1aaceb0) at threadpool.c:1322 #13 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x180fdf0) at taskprocessor.c:971 #14 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351 #15 worker_active (worker=<optimized out>) at threadpool.c:1105 #16 worker_start (arg=0x7f34d4001ba0) at threadpool.c:1024 #17 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238 #18 0x00007f354837a064 in start_thread (arg=0x7f32ab2eb700) at pthread_create.c:309 #19 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} The first thread is an infinite loop because it's iterating a circular list of size one of type contact header Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage $85 = (pjsip_hdr *) 0x7f3205e5c218 (gdb) print hdr->next $86 = (struct pjsip_hdr *) 0x7f3205e5c218 From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread). This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has). Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop. As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object. | ||
Comments: | By: Asterisk Team (asteriskteam) 2018-07-17 07:04:39.359-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Torrey Searle (tsearle) 2018-07-19 10:40:40.630-0500 callflow is basically this {noformat} A --> INV --> Asterisk --> INV --> B A <-- 183 <-- Asterisk <-- 183 <-- B … 60 seconds later … A <-- 183 <-- Asterisk <-- 183 <-- B ** Deadlock on the A side ** {noformat} after 60 seconds pjsip is trying to internally retransmit the 183 using the tdata reference {{tsx->last_tx}} while asterisk is also trying to send a 183, which is first trying to remove the contact header in {{inv->last_answer}} and {{tsx->last_tx}} points to the same tdata object as {{inv->last_answer}} Modifying the data object while the heap timer is trying to retransmit the 183 results in an infinite loop while holding the tsx group lock. By: Friendly Automation (friendly-automation) 2018-08-02 06:55:18.959-0500 Change 9477 merged by Joshua Colp: thirdparty/pjproject: fix deadlock in response retransmissions [https://gerrit.asterisk.org/9477|https://gerrit.asterisk.org/9477] By: Friendly Automation (friendly-automation) 2018-08-02 07:14:04.289-0500 Change 9801 merged by Joshua Colp: thirdparty/pjproject: fix deadlock in response retransmissions [https://gerrit.asterisk.org/9801|https://gerrit.asterisk.org/9801] By: Friendly Automation (friendly-automation) 2018-08-02 07:15:13.538-0500 Change 9800 merged by Joshua Colp: thirdparty/pjproject: fix deadlock in response retransmissions [https://gerrit.asterisk.org/9800|https://gerrit.asterisk.org/9800] By: Friendly Automation (friendly-automation) 2018-08-02 07:15:16.583-0500 Change 9799 merged by Jenkins2: thirdparty/pjproject: fix deadlock in response retransmissions [https://gerrit.asterisk.org/9799|https://gerrit.asterisk.org/9799] |