[Home]

Summary:ASTERISK-29286: chan_pjsip: Deadlock between sending response and transaction layer
Reporter:Mikhail (j3st3r)Labels:patch
Date Opened:2021-02-09 05:15:04.000-0600Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:Channels/chan_pjsip
Versions:16.16.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-29473 18.4 Crash on pjsip_inv_send_msg
is duplicated byASTERISK-29692 Asterisk stops responding to SIP OPTIONS requests from other SIP servers or asterisk servers via LAN or WAN with PJSIP module
Environment:ubuntu 18.04Attachments:( 0) 0012-Fixed-deadlock-between-SIP-transaction-and-dialog.patch
( 1) 0080-fix-transaction-layer-deadlock.patch
( 2) core.sead.leadzvon.ru-2021-02-17T13-14-55+0000-brief.txt
( 3) core.sead.leadzvon.ru-2021-02-17T13-14-55+0000-full.txt
( 4) core.sead.leadzvon.ru-2021-02-17T13-14-55+0000-info.txt
( 5) core.sead.leadzvon.ru-2021-02-17T13-14-55+0000-locks.txt
( 6) core.sead.leadzvon.ru-2021-02-17T13-14-55+0000-thread1.txt
( 7) debug_log_123456
( 8) gdb_debug_prints.txt
( 9) t.18881-brief.txt
(10) t.18881-full.txt
(11) t.18881-info.txt
(12) t.18881-locks.txt
(13) t.18881-thread1.txt
Description:In some cases asterisk stops answering options/register, but answers invite, server is not under high load
Comments:By: Asterisk Team (asteriskteam) 2021-02-09 05:15:05.968-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Mikhail (j3st3r) 2021-02-09 05:17:15.871-0600

core dump

By: Joshua C. Colp (jcolp) 2021-02-09 05:27:50.933-0600

We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

As well, what type of transports are you using? UDP? TCP? TLS? Websocket?

By: Mikhail (j3st3r) 2021-02-09 05:32:38.908-0600

Transport:  transport-udp             udp      0      0  0.0.0.0:5060
Transport:  transport-wss             wss      0      0  0.0.0.0:5060


unfortunatelly I didn't save debug (was using sngrep), but it just shows options/register asterisk doesn't answers and invite's asterisk answers

By: Joshua C. Colp (jcolp) 2021-02-09 05:35:51.922-0600

The SIP trace is not what I was referring to, but Asterisk core debug. The issue is between the PJSIP transaction layer and the code that actually starts to send the SIP response - with the core debug it would give better insight into when stuff is occurring.

By: Mikhail (j3st3r) 2021-02-09 05:41:19.963-0600

please take a look at the attached files, these are core debug

By: Joshua C. Colp (jcolp) 2021-02-09 05:47:10.695-0600

The attached files are backtraces which show a deadlock, but the core debug logging[1] would better show how it occurred.

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Mikhail (j3st3r) 2021-02-09 05:50:54.671-0600

oh, sorry, you just asking for 'core set debug' output. ok, next time I'll face a problem - I'll paste it here

By: Joshua C. Colp (jcolp) 2021-02-09 05:53:51.860-0600

Please attach as a file, and create them according to the linked wiki page.

By: Mikhail (j3st3r) 2021-02-11 22:05:43.554-0600

finally faced the issue

By: Kevin Harwell (kharwell) 2021-02-12 11:34:52.975-0600

Definitely a dead lock:
{noformat}
Thread 71 (Thread 0x7f26c8a9e700 (LWP 3053)):
#0  0x00007f278292c1fd in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f27829250f4 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55eb9e3942f8) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007f278491e6d4 in pj_mutex_lock (mutex=0x55eb9e3942f8) at ../src/pj/os_core_unix.c:1281
#3  0x00007f2784926301 in grp_lock_acquire (p=0x55eb9e3941e8) at ../src/pj/lock.c:290
#4  0x00007f2784926301 in pj_grp_lock_acquire (grp_lock=0x55eb9e3941e8) at ../src/pj/lock.c:478
#5  0x00007f27848acdf4 in pjsip_tsx_send_msg (tsx=tsx@entry=0x7f2778a7f038, tdata=tdata@entry=0x7f27784d9028) at ../src/pjsip/sip_transaction.c:1788
#6  0x00007f27848af4d2 in pjsip_dlg_send_response (dlg=0x7f26d8733e58, tsx=0x7f2778a7f038, tdata=tdata@entry=0x7f27784d9028) at ../src/pjsip/sip_dialog.c:1594
#7  0x00007f2784870bfc in pjsip_inv_send_msg (inv=0x7f2740160cc8, tdata=tdata@entry=0x7f27784d9028) at ../src/pjsip-ua/sip_inv.c:3284
#8  0x00007f2709cac93c in ast_sip_session_send_response (session=session@entry=0x7f2740165f30, tdata=0x7f27784d9028) at res_pjsip_session.c:2566
#9  0x00007f27045732c4 in indicate (data=0x7f27781afa70) at chan_pjsip.c:1327
#10 0x000055eb99ecdee7 in ast_taskprocessor_execute (tps=tps@entry=0x55eb9d06b680) at taskprocessor.c:1237
#11 0x000055eb99ed44e0 in execute_tasks (data=0x55eb9d06b680) at threadpool.c:1354
#12 0x000055eb99ecdee7 in ast_taskprocessor_execute (tps=0x55eb9b2d0e40) at taskprocessor.c:1237
#13 0x000055eb99ed4fd4 in threadpool_execute (pool=0x55eb9b2d0c70) at threadpool.c:367
#14 0x000055eb99ed4fd4 in worker_active (worker=0x7f26d8737ac0) at threadpool.c:1137
#15 0x000055eb99ed4fd4 in worker_start (arg=arg@entry=0x7f26d8737ac0) at threadpool.c:1056
#16 0x000055eb99edd2ef in dummy_start (data=<optimized out>) at utils.c:1299
#17 0x00007f27829226db in start_thread (arg=0x7f26c8a9e700) at pthread_create.c:463
#18 0x00007f2781e5c71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}
{{pjsip_dlg_send_response}} acquires a lock on the dialog and then tries to acquire the transaction's group lock in {{pjsip_tsx_send_msg}}.
{noformat}
Thread 27 (Thread 0x7f2700ea8700 (LWP 18912)):
#0  0x00007f278292c1fd in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f27829250f4 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55eb9ebea3b8) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007f278491e6d4 in pj_mutex_lock (mutex=0x55eb9ebea3b8) at ../src/pj/os_core_unix.c:1281
#3  0x00007f2784926301 in grp_lock_acquire (p=0x55eb9ebea2a8) at ../src/pj/lock.c:290
#4  0x00007f2784926301 in pj_grp_lock_acquire (grp_lock=0x55eb9ebea2a8) at ../src/pj/lock.c:478
#5  0x00007f27848ae45f in pjsip_dlg_inc_lock (dlg=0x7f26d8733e58) at ../src/pjsip/sip_dialog.c:925
#6  0x00007f27848afb23 in pjsip_dlg_on_tsx_state (dlg=0x7f26d8733e58, tsx=0x7f2778a7f038, e=0x7f2700ea4170) at ../src/pjsip/sip_dialog.c:2121
#7  0x00007f27848a7953 in tsx_set_state (flag=<optimized out>, event_src=<optimized out>, event_src_type=<optimized out>, state=<optimized out>, tsx=<optimized out>) at ../src/pjsip/sip_transaction.c:1272
#8  0x00007f27848a7953 in tsx_on_state_completed_uas (tsx=<optimized out>, event=<optimized out>) at ../src/pjsip/sip_transaction.c:3226
#9  0x00007f27848aa161 in send_msg_callback (send_state=0x7f277866ace0, sent=-171060, cont=0x7f2700ea42a4) at ../src/pjsip/sip_transaction.c:1975
#10 0x00007f278489091e in send_response_resolver_cb (status=status@entry=0, token=token@entry=0x7f277866ace0, addr=addr@entry=0x7f2700ea42f0) at ../src/pjsip/sip_util.c:1741
#11 0x00007f271bf99616 in sip_resolve (resolver=<optimized out>, pool=<optimized out>, target=0x7f2778a7f160, token=0x7f277866ace0, cb=0x7f2784890860 <send_response_resolver_cb>) at res_pjsip/pjsip_resolver.c:529
#12 0x00007f27848946db in pjsip_resolve (resolver=0x55eb9ba6b188, pool=<optimized out>, target=target@entry=0x7f2778a7f160, token=token@entry=0x7f277866ace0, cb=cb@entry=0x7f2784890860 <send_response_resolver_cb>) at ../src/pjsip/sip_resolve.c:207
#13 0x00007f278488fc91 in pjsip_endpt_resolve (endpt=endpt@entry=0x55eb9b4f2e18, pool=<optimized out>, target=target@entry=0x7f2778a7f160, token=token@entry=0x7f277866ace0, cb=cb@entry=0x7f2784890860 <send_response_resolver_cb>) at ../src/pjsip/sip_endpoint.c:1201
#14 0x00007f2784892e4b in pjsip_endpt_send_response (endpt=0x55eb9b4f2e18, res_addr=res_addr@entry=0x7f2778a7f138, tdata=tdata@entry=0x7f277845ac18, token=token@entry=0x7f2778a7f038, cb=cb@entry=0x7f27848a9dd8 <send_msg_callback>) at ../src/pjsip/sip_util.c:1816
#15 0x00007f27848a8564 in tsx_send_msg (tsx=tsx@entry=0x7f2778a7f038, tdata=tdata@entry=0x7f277845ac18) at ../src/pjsip/sip_transaction.c:2240
#16 0x00007f27848a86dd in tsx_send_msg (tdata=0x7f277845ac18, tsx=0x7f2778a7f038) at ../src/pjsip/sip_transaction.c:2118
#17 0x00007f27848a86dd in tsx_retransmit (tsx=0x7f2778a7f038, resched=1) at ../src/pjsip/sip_transaction.c:2375
#18 0x00007f27848ab345 in tsx_on_state_proceeding_uas (tsx=0x7f2778a7f038, event=<optimized out>) at ../src/pjsip/sip_transaction.c:2897
#19 0x00007f27848a7c81 in tsx_timer_callback (theap=<optimized out>, entry=0x7f2778a7f1b8) at ../src/pjsip/sip_transaction.c:1194
#20 0x00007f2784936690 in pj_timer_heap_poll (ht=0x55eb9b4f3100, next_delay=next_delay@entry=0x7f2700ea7e20) at ../src/pj/timer.c:913
#21 0x00007f278488f534 in pjsip_endpt_handle_events2 (endpt=0x55eb9b4f2e18, max_timeout=max_timeout@entry=0x7f2700ea7e70, p_count=p_count@entry=0x0) at ../src/pjsip/sip_endpoint.c:716
#22 0x00007f278488f62f in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout@entry=0x7f2700ea7e70) at ../src/pjsip/sip_endpoint.c:777
#23 0x00007f271bf847f4 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4803
#24 0x00007f278491e9b0 in thread_main (param=0x55eb9b5bcea8) at ../src/pj/os_core_unix.c:541
#25 0x00007f27829226db in start_thread (arg=0x7f2700ea8700) at pthread_create.c:463
#26 0x00007f2781e5c71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}
Here {{pjsip_dlg_on_tsx_state}} attempts to lock the dialog, but must wait for it to be released by another thread (in this case Thread 71). However, before this {{send_msg_callback}} obtained and now holds the transaction's group lock. As well it appears the transaction's group lock was even acquired higher up prior to this in {{tsx_timer_callback}}

By: Kevin Harwell (kharwell) 2021-02-12 11:56:54.069-0600

[~j3st3r] this appears to be an issue within the pjproject/pjsip code itself. I'd recommend opening a bug report on their [issue tracker|https://github.com/pjsip/pjproject/issues] as well. If you do please comment back here with a link to the issue.

That said we're going to open this as an issue here too for further investigation from our side of things.

By: Mikhail (j3st3r) 2021-02-13 08:32:32.224-0600

Hello, Kevin. I've created an issue - https://github.com/pjsip/pjproject/issues/2641

By: Mikhail (j3st3r) 2021-02-15 04:43:18.271-0600

Hello, Kevin, pjsip developers ask to check if this patch can help https://github.com/pjsip/pjproject/issues/2641#issuecomment-779119428
could you please provide instructions how I can correctly apply this patch to asterisk?

By: Kevin Harwell (kharwell) 2021-02-15 13:11:34.307-0600

[~j3st3r]

You can test it out by copying the attached patch [^0080-fix-transaction-layer-deadlock.patch] to the following directory beneath the Asterisk 16 source: {{third-party/pjproject/patches/}}

Then ensure bundled-pjroject will be rebuilt and rebuild Asterisk (from top level Asterisk directory):
{noformat}
$ make -C third-party/pjproject/ distclean && make
{noformat}


By: Mikhail (j3st3r) 2021-02-15 14:18:01.313-0600

Applied the patch, will inform you about results

By: Mikhail (j3st3r) 2021-02-18 02:52:58.197-0600

new core backtraces

By: Mikhail (j3st3r) 2021-02-18 02:53:04.853-0600

Hello, Kevin. After applying the patch the behavior has changed. Now asterisk randomly crashes about once a day. Attaching backtraces. Also This I can see in logs:

2021-02-17 13:14:29] WARNING[30614] pjproject:              tsx0x7f90ad3e5588 .Error sending Response msg 183/INVITE/cseq=9725 (tdta0x7f90a80b31d8): Unknown Error (PJ_EUNKNOWN)
[2021-02-17 13:14:29] VERBOSE[30614] res_http_websocket.c: WebSocket connection from '180.178.129.90:50426' closed
[2021-02-17 13:15:00] Asterisk 16.16.0 built by root @ localhost on a x86_64 running Linux on 2021-01-27 09:16:44 UTC

By: Kevin Harwell (kharwell) 2021-02-18 17:48:37.993-0600

Are you comfortable with using {{gdb}}? If so could you load asterisk and the core file:
{noformat}
$ gdb asterisk /path/to/core
{noformat}
And then within {{gdb}} execute the following commands and comment back with the output:
{noformat}
(gdb) p *tdata->msg
(gdb) p *inv
(gdb) p *inv->invite_tsx
(gdb) p *inv->invite_tsx->cseq
(gdb) p *cseq
(gdb) p *cseq->cseq
{noformat}
(hopefully typed those correctly)

By: Mikhail (j3st3r) 2021-02-19 00:15:04.736-0600

Attached output [here|^gdb_debug_prints.txt].

By: Sean Bright (seanbright) 2021-06-09 10:00:17.662-0500

Attached pastebin content as a file in case it were to go away.

By: Joshua Elson (joshelson) 2021-06-09 16:11:26.868-0500

Just as a further heads up on this, from ASTERISK-29473. We are seeing this same issue on 18.4. This is happening to us with developer mode disabled.

By: Stanislav Abramenkov (silentindark) 2021-07-20 03:12:28.206-0500

Not sure, but I think that this report maybe be related to this problem.
https://github.com/pjsip/pjproject/issues/2761

By: Stanislav Abramenkov (silentindark) 2021-08-04 05:04:40.267-0500

Interesting commit -  Add checks and assertion if PJSIP string length is negative #2783
https://github.com/pjsip/pjproject/pull/2783
Not yet applied to master branch.

By: Stanislav Abramenkov (silentindark) 2023-04-13 23:58:48.891-0500

This fix merged into PJSIP