[Home]

Summary:ASTERISK-29774: Deadlock where hints lock blocks contexts lock
Reporter:Steve Sether (stevesether)Labels:
Date Opened:2021-11-23 16:10:52.000-0600Date Closed:
Priority:MinorRegression?No
Status:Open/NewComponents:PBX/General PBX/pbx_config
Versions:16.20.0 16.22.0 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-25736 pbx core: Deadlock during a reload
Environment:Centos 6, 2.6.32-754.el6.x86_64 #1 SMPAttachments:( 0) core.usi-vf01-mpls.ravon.net-2021-11-15T11-07-02-0600-brief.txt
( 1) core.usi-vf01-mpls.ravon.net-2021-11-15T11-07-02-0600-full.txt
( 2) core.usi-vf01-mpls.ravon.net-2021-11-15T11-07-02-0600-info.txt
( 3) core.usi-vf01-mpls.ravon.net-2021-11-15T11-07-02-0600-locks.txt
( 4) core.usi-vf01-mpls.ravon.net-2021-11-15T11-07-02-0600-thread1.txt
( 5) core.usi-vf01-mpls.ravon.net-2021-12-01T09-33-26-0600-brief.txt
( 6) core.usi-vf01-mpls.ravon.net-2021-12-01T09-33-26-0600-full.txt
( 7) core.usi-vf01-mpls.ravon.net-2021-12-01T09-33-26-0600-info.txt
( 8) core.usi-vf01-mpls.ravon.net-2021-12-01T09-33-26-0600-thread1.txt
Description:We've had a couple instances of a deadlocks recently.  This happened while we were trying to move phones from one Asterisk server to another.  It's too much to describe here what goes in in this process, but from an  Asterisk perspective it's a lot of reloading the dialplan regenerating pjsip.conf, reloading pjsip, and phones re-registering to the new Asterisk server through a proxy.

The sip freeze has happened twice now when we move upwards of 80 phones at a time.

The second time it happened we were able to get a core dump, and run it through the scripts built in to Asterisk.  The output is...  voluminous.  I tried running the core file through a scripting tool that for gdb that  claims to find deadlocks (and worked before), but it crashes when it's expecting (I think?) a certain kind of locking that it's not seeing.

Mainly, I need some kind of guidance on how to find out what's locking up Asterisk.  I can't seem to find anything that references the locks being held.  i.e. I don't really know what threads are locking the others, and  other guides in using gdb don't seem to be much help.  I'm not a C programmer, so I don't have much experience in semaphores, low level locks, etc.


I've tried to edit this file down to more unique entries, and removed anything that's doesn't look like it's locked so it's at least possibly semi-readable. There's quite a few threads that are largely duplicates

> Any help is appreciated.  Thanks.
{noformat}
> Thread 113 (Thread 0x7faf4deac700 (LWP 17772)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fb093276c1b in pbx_load_module () at pbx_config.c:2091
> #4  0x00007fb093276e1e in handle_cli_dialplan_reload (e=<value optimized
> out>, cmd=<value optimized out>, a=<value optimized out>) at
> pbx_config.c:1597
> #5  0x00000000004ba668 in ast_cli_command_full (uid=-1, gid=-1, fd=112,
> s=0x7fafe800d839 "dialplan reload") at cli.c:2995
> #6  0x000000000060213b in action_command (s=0x7faf4deabcd0,
> m=0x7faf4deab3e0) at manager.c:5318
> #7  0x0000000000606af0 in process_message (s=0x7faf4deabcd0,
> m=0x7faf4deab3e0) at manager.c:6689
> #8  0x0000000000609dcb in do_message (data=0x7fb034002930) at
> manager.c:6902
> #9  session_do (data=0x7fb034002930) at manager.c:7007
> #10 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
>
{noformat}
This thread is attempting to grab the 'reload_lock' lock, but is waiting on
...

I'm going to call the 'unknown thread id'. Note, this 'unknown thread id'
is missing some of the stack trace, but will still try to make an educated
guess.
{noformat}
> --
> #7  0x00000000005137b0 in execute_state_callback (cb=0x5fa500
> <manager_state_cb>, context=0x7faf703f3f88 "blf-danielBarnes126266",
> exten=0x7faf703f4118 "danielBarnes126266-user1", data=0x0, reason=<value
> optimized out>, hint=0x7faf
> 703f4600, device_state_info=0x0) at pbx.c:3284
> #8  0x00000000005169d6 in ast_add_hint (e=0x7faf703f4080) at pbx.c:4057
> #9  0x0000000000521c53 in ast_add_extension2_lockopt (con=0x7faf703f3e90,
> replace=0, extension=<value optimized out>, priority=-1, label=<value
> optimized out>, callerid=<value optimized out>, application=<value
> optimized out>, data
> =0x7faf703f4060, datad=0x45bdd0 <ast_free_ptr>, registrar=0x7fb09327b9b8
> "pbx_config", registrar_file=0x7faf70091221
> "exec.1636994706829225.140397385758464", registrar_line=1218,
> lock_context=1) at pbx.c:7572
> #10 0x00000000005224da in ast_add_extension2 (con=<value optimized out>,
> replace=<value optimized out>, extension=<value optimized out>,
> priority=<value optimized out>, label=<value optimized out>,
> callerid=<value optimized out>, a
> pplication=0x7faf703f3fee
> "PJSIP/xxxxxxx&Custom:unavailable&Custom:call_state_xxxxxxx",
> data=0x7faf703f4060, datad=0x45bdd0 <ast_free_ptr>,
> registrar=0x7fb09327b9b8 "pbx_config", registrar_f
> ile=0x7faf70091221 "exec.1636994706829225.140397385758464",
> registrar_line=1218) at pbx.c:7304
> #11 0x00007fb09327641b in pbx_load_config (config_file=0x7fb09327b9e0
> "extensions.conf") at pbx_config.c:1890
> #12 0x00007fb093276c2e in pbx_load_module () at pbx_config.c:2101
> #13 0x00007fb093276e1e in handle_cli_dialplan_reload (e=<value optimized
> out>, cmd=<value optimized out>, a=<value optimized out>) at
> pbx_config.c:1597
> #14 0x00000000004ba668 in ast_cli_command_full (uid=-1, gid=-1, fd=106,
> s=0x7faf70000f39 "dialplan reload") at cli.c:2995
> #15 0x000000000060213b in action_command (s=0x7fb0d04e2cd0,
> m=0x7fb0d04e23e0) at manager.c:5318
> #16 0x0000000000606af0 in process_message (s=0x7fb0d04e2cd0,
> m=0x7fb0d04e23e0) at manager.c:6689
> #17 0x0000000000609dcb in do_message (data=0x7fb034005820) at
> manager.c:6902
> #18 session_do (data=0x7fb034005820) at manager.c:7007
> #19 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> --
>
{noformat}
This thread is holding the 'reload_lock', which is blocking Thread 113
above. It then grabs the hints container lock, in ast_add_hint. While
holding the hints lock, ast_add_hint then calls the manager_state_cb
function. Looking there you can see the manager_state_cb function attempts
to get ast_rdlock_contexts.

So, this 'unknown thread id' holds the hints container lock, and is blocked
getting the ast_rdlock_contexts, i.e. 'conlock'.

Now let's jump to Thread 61 below.
{noformat}
We have many of these threads running dialplan reload.  I suspect something
> else is locking them?
>
> Thread 110 (Thread 0x7fb0d0467700 (LWP 20543)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fb093276c1b in pbx_load_module () at pbx_config.c:2091
> #4  0x00007fb093276e1e in handle_cli_dialplan_reload (e=<value optimized
> out>, cmd=<value optimized out>, a=<value optimized out>) at
> pbx_config.c:1597
> #5  0x00000000004ba668 in ast_cli_command_full (uid=-1, gid=-1, fd=118,
> s=0x7fafbc002009 "dialplan reload") at cli.c:2995
> #6  0x000000000060213b in action_command (s=0x7fb0d0466cd0,
> m=0x7fb0d04663e0) at manager.c:5318
> #7  0x0000000000606af0 in process_message (s=0x7fb0d0466cd0,
> m=0x7fb0d04663e0) at manager.c:6689
> #8  0x0000000000609dcb in do_message (data=0x7fb03400b860) at
> manager.c:6902
> #9  session_do (data=0x7fb03400b860) at manager.c:7007
> #10 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
>
>
> Thread 120 (Thread 0x7faf4d860700 (LWP 11287)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x7fb0bc238720, con=0x0,
> context=0x7fb0bc2390e0 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7faff4e2ff60 "6121111111", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x7fb0bc238720, con=0x0,
> context=0x7fb0bc2390e0 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7faff4e2ff60 "6126166579", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:2886
> #5  0x0000000000527228 in ast_exists_extension (c=0x7fb0bc238720,
> args=0x0) at pbx.c:4181
> #6  __ast_pbx_run (c=0x7fb0bc238720, args=0x0) at pbx.c:4595
> #7  0x000000000052752b in pbx_thread (data=<value optimized out>) at
> pbx.c:4704
> #8  0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #9  0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
>
>
> There's also many of these threads locked on this line in
> taskprocessor.c:202
>
> Thread 119 (Thread 0x7fb0443b8700 (LWP 11099)):
> #0  0x00007fb0d292ca00 in sem_wait () from /lib64/libpthread.so.0
> #1  0x000000000058eab0 in ast_sem_wait (data=0x4ce9c60) at
> /usr/src/debug/asterisk-16.20.0/include/asterisk/sem.h:61
> #2  default_tps_processing_function (data=0x4ce9c60) at taskprocessor.c:202
> #3  0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #4  0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #5  0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
> Thread 118 (Thread 0x7fb0507ca700 (LWP 27278)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x0, con=0x0,
> context=0x7fb02a3f405e "from-sip-external", exten=0x7fb0016480e0
> "16121111111", priority=1, label=0x0, callerid=0x0, action=E_MATCH,
> found=0x0, combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x0, con=0x0, context=0x7fb02a3f405e
> "from-sip-external", exten=0x7fb0016480e0 "16121111111", priority=1,
> label=0x0, callerid=0x0, action=E_MATCH, found=0x0, combined_find_spawn=0)
> at pbx.c:2886
> #5  0x000000000051d504 in ast_exists_extension (c=<value optimized out>,
> context=<value optimized out>, exten=<value optimized out>, priority=<value
> optimized out>, callerid=<value optimized out>) at pbx.c:4181
> #6  0x00007fb05ea8766e in get_destination (session=0x7fb0016480e0,
> rdata=<value optimized out>) at res_pjsip_session.c:3703
> #7  0x00007fb05ea9a1ec in new_invite (rdata=0x7faffc086e78) at
> res_pjsip_session.c:3939
> #8  handle_new_invite_request (rdata=0x7faffc086e78) at
> res_pjsip_session.c:4114
> #9  session_on_rx_request (rdata=0x7faffc086e78) at
> res_pjsip_session.c:4189
> #10 0x00007fb0d471184e in pjsip_endpt_process_rx_data (endpt=<value
> optimized out>, rdata=0x7faffc086e78, p=0x7fb0905fc500,
> p_handled=0x7fb0507c9c7c) at ../src/pjsip/sip_endpoint.c:930
> #11 0x00007fb0903d1dcd in distribute (data=0x7faffc086e78) at
> res_pjsip/pjsip_distributor.c:955
> #12 0x000000000058e046 in ast_taskprocessor_execute (tps=0x4388e10) at
> taskprocessor.c:1237
> #13 0x00000000005963e0 in execute_tasks (data=0x4388e10) at
> threadpool.c:1354
> #14 0x000000000058e046 in ast_taskprocessor_execute (tps=0x1274b70) at
> taskprocessor.c:1237
> #15 0x00000000005980a1 in threadpool_execute (arg=0x7fb01c001af0) at
> threadpool.c:367
> #16 worker_active (arg=0x7fb01c001af0) at threadpool.c:1137
> #17 worker_start (arg=0x7fb01c001af0) at threadpool.c:1056
> #18 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #19 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #20 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
>
> Thread 107 (Thread 0x7faf4dd38700 (LWP 22760)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x7fb008118cb0, con=0x0,
> context=0x7fb008119670 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7fafd866b440 "612xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x7fb008118cb0, con=0x0,
> context=0x7fb008119670 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7fafd866b440 "612xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:2886
> #5  0x0000000000527228 in ast_exists_extension (c=0x7fb008118cb0,
> args=0x0) at pbx.c:4181
> #6  __ast_pbx_run (c=0x7fb008118cb0, args=0x0) at pbx.c:4595
> #7  0x000000000052752b in pbx_thread (data=<value optimized out>) at
> pbx.c:4704
> #8  0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #9  0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
> Thread 105 (Thread 0x7faf4e09c700 (LWP 663)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x7fb0bfeadc50, con=0x0,
> context=0x7fb0bfeae610 "proxy-dial", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7faff82d65d0 "612xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x7fb0bfeadc50, con=0x0,
> context=0x7fb0bfeae610 "proxy-dial", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7faff82d65d0 "612xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:2886
> #5  0x0000000000527228 in ast_exists_extension (c=0x7fb0bfeadc50,
> args=0x0) at pbx.c:4181
> #6  __ast_pbx_run (c=0x7fb0bfeadc50, args=0x0) at pbx.c:4595
> #7  0x000000000052752b in pbx_thread (data=<value optimized out>) at
> pbx.c:4704
> #8  0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #9  0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
> Thread 103 (Thread 0x7faf4e020700 (LWP 25155)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x0, con=0x0,
> context=0x7faf0743c47e "from-internal-phone", exten=0x7fb01800c8c0
> "8744405", priority=1, label=0x0, callerid=0x0, action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x0, con=0x0, context=0x7faf0743c47e
> "from-internal-phone", exten=0x7fb01800c8c0 "8744405", priority=1,
> label=0x0, callerid=0x0, action=E_MATCH, found=0x0, combined_find_spawn=0)
> at pbx.c:2886
> #5  0x000000000051d504 in ast_exists_extension (c=<value optimized out>,
> context=<value optimized out>, exten=<value optimized out>, priority=<value
> optimized out>, callerid=<value optimized out>) at pbx.c:4181
> #6  0x00007fb05ea8766e in get_destination (session=0x7fb01800c8c0,
> rdata=<value optimized out>) at res_pjsip_session.c:3703
> #7  0x00007fb05ea9a1ec in new_invite (rdata=0x7faffc135f28) at
> res_pjsip_session.c:3939
> #8  handle_new_invite_request (rdata=0x7faffc135f28) at
> res_pjsip_session.c:4114
> #9  session_on_rx_request (rdata=0x7faffc135f28) at
> res_pjsip_session.c:4189
> #10 0x00007fb0d471184e in pjsip_endpt_process_rx_data (endpt=<value
> optimized out>, rdata=0x7faffc135f28, p=0x7fb0905fc500,
> p_handled=0x7faf4e01fc7c) at ../src/pjsip/sip_endpoint.c:930
> #11 0x00007fb0903d1dcd in distribute (data=0x7faffc135f28) at
> res_pjsip/pjsip_distributor.c:955
> #12 0x000000000058e046 in ast_taskprocessor_execute (tps=0x37b1530) at
> taskprocessor.c:1237
> #13 0x00000000005963e0 in execute_tasks (data=0x37b1530) at
> threadpool.c:1354
> #14 0x000000000058e046 in ast_taskprocessor_execute (tps=0x1274b70) at
> taskprocessor.c:1237
> #15 0x00000000005980a1 in threadpool_execute (arg=0x7fb01c050b60) at
> threadpool.c:367
> #16 worker_active (arg=0x7fb01c050b60) at threadpool.c:1137
> #17 worker_start (arg=0x7fb01c050b60) at threadpool.c:1056
> #18 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #19 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #20 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
> Thread 101 (Thread 0x7fb05074e700 (LWP 26565)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x7fb00813ece0, con=0x0,
> context=0x7fb00813f6a0 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7fb0ac16dd10 "952xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x7fb00813ece0, con=0x0,
> context=0x7fb00813f6a0 "call_to_user_core", exten=0x652011 "h", priority=1,
> label=0x0, callerid=0x7fb0ac16dd10 "952xxxxxxx", action=E_MATCH, found=0x0,
> combined_find_spawn=0) at pbx.c:2886
> #5  0x0000000000527228 in ast_exists_extension (c=0x7fb00813ece0,
> args=0x0) at pbx.c:4181
> #6  __ast_pbx_run (c=0x7fb00813ece0, args=0x0) at pbx.c:4595
> #7  0x000000000052752b in pbx_thread (data=<value optimized out>) at
> pbx.c:4704
> #8  0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #9  0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
>
> Thread 95 (Thread 0x7faf4edb0700 (LWP 24067)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00000000005242bd in ast_rdlock_contexts (c=<value optimized out>,
> context=<value optimized out>, exten=<value optimized out>,
> device_state_info=0x0) at pbx.c:8509
> #4  ast_hint_extension (c=<value optimized out>, context=<value optimized
> out>, exten=<value optimized out>, device_state_info=0x0) at pbx.c:3000
> #5  internal_extension_state_extended (c=<value optimized out>,
> context=<value optimized out>, exten=<value optimized out>,
> device_state_info=0x0) at pbx.c:3146
> #6  0x000000000060486b in action_extensionstate (s=0x7faf4edafcd0,
> m=0x7faf4edaf3e0) at manager.c:5988
> #7  0x0000000000606af0 in process_message (s=0x7faf4edafcd0,
> m=0x7faf4edaf3e0) at manager.c:6689
> #8  0x0000000000609dcb in do_message (data=0x7fb034011680) at
> manager.c:6902
> #9  session_do (data=0x7fb034011680) at manager.c:7007
> #10 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #11 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
> Thread 90 (Thread 0x7fb0d0657700 (LWP 27240)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051cbb5 in ast_rdlock_contexts (c=0x0, con=0x0,
> context=0x7fb02a3f153e "from-sip-external", exten=0x7fb00814e570
> "1612xxxxxxx", priority=1, label=0x0, callerid=0x0, action=E_MATCH,
> found=0x0, combined_find_spawn=0) at pbx.c:8509
> #4  pbx_extension_helper (c=0x0, con=0x0, context=0x7fb02a3f153e
> "from-sip-external", exten=0x7fb00814e570 "1612xxxxxxx", priority=1,
> label=0x0, callerid=0x0, action=E_MATCH, found=0x0, combined_find_spawn=0)
> at pbx.c:2886
> #5  0x000000000051d504 in ast_exists_extension (c=<value optimized out>,
> context=<value optimized out>, exten=<value optimized out>, priority=<value
> optimized out>, callerid=<value optimized out>) at pbx.c:4181
> #6  0x00007fb05ea8766e in get_destination (session=0x7fb00814e570,
> rdata=<value optimized out>) at res_pjsip_session.c:3703
> #7  0x00007fb05ea9a1ec in new_invite (rdata=0x7faffc023458) at
> res_pjsip_session.c:3939
> #8  handle_new_invite_request (rdata=0x7faffc023458) at
> res_pjsip_session.c:4114
> #9  session_on_rx_request (rdata=0x7faffc023458) at
> res_pjsip_session.c:4189
> #10 0x00007fb0d471184e in pjsip_endpt_process_rx_data (endpt=<value
> optimized out>, rdata=0x7faffc023458, p=0x7fb0905fc500,
> p_handled=0x7fb0d0656c7c) at ../src/pjsip/sip_endpoint.c:930
> #11 0x00007fb0903d1dcd in distribute (data=0x7faffc023458) at
> res_pjsip/pjsip_distributor.c:955
> #12 0x000000000058e046 in ast_taskprocessor_execute (tps=0x5203c50) at
> taskprocessor.c:1237
> #13 0x00000000005963e0 in execute_tasks (data=0x5203c50) at
> threadpool.c:1354
> #14 0x000000000058e046 in ast_taskprocessor_execute (tps=0x1274b70) at
> taskprocessor.c:1237
> #15 0x00000000005980a1 in threadpool_execute (arg=0x7fb01c0073c0) at
> threadpool.c:367
> #16 worker_active (arg=0x7fb01c0073c0) at threadpool.c:1137
> #17 worker_start (arg=0x7fb01c0073c0) at threadpool.c:1056
> #18 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #19 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #20 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
>
> Thread 61 (Thread 0x7faf4dcbc700 (LWP 13778)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000045cb16 in __ao2_lock (user_data=<value optimized out>,
> lock_how=<value optimized out>, file=0x647703 "pbx.c", func=0x64b031
> "ast_add_hint", line=4026, var=<value optimized out>) at astobj2.c:241
> #4  0x00000000005167b9 in ast_add_hint (e=0x7faf6401dff0) at pbx.c:4026
> #5  0x0000000000521c53 in ast_add_extension2_lockopt (con=0x7faf6401dee0,
> replace=1, extension=<value optimized out>, priority=-1, label=<value
> optimized out>, callerid=<value optimized out>, application=<value
> optimized out>, data=0x7faf640017b0, datad=0x45bdd0 <ast_free_ptr>,
> registrar=0x7fb09327b9b8 "pbx_config", registrar_file=0x0,
> registrar_line=0, lock_context=1) at pbx.c:7572
> #6  0x00000000005235e5 in ast_add_extension2 (context=<value optimized
> out>, replace=1, extension=0x7faf4dcbab20 "zzzzzzz-user1", priority=-1,
> label=0x0, callerid=0x0, application=0x7faf4dcbab36
> "PJSIP/zzzzzzz&Custom:unavailable", data=0x7faf640017b0, datad=0x45bdd0
> <ast_free_ptr>, registrar=0x7fb09327b9b8 "pbx_config") at pbx.c:7304
> #7  ast_add_extension (context=<value optimized out>, replace=1,
> extension=0x7faf4dcbab20 "zzzzzzz-user1", priority=-1, label=0x0,
> callerid=0x0, application=0x7faf4dcbab36
> "PJSIP/zzzzzzz&Custom:unavailable", data=0x7faf640017b0, datad=0x45bdd0
> <ast_free_ptr>, registrar=0x7fb09327b9b8 "pbx_config") at pbx.c:6979
> #8  0x00007fb093277204 in handle_cli_dialplan_add_extension (e=<value
> optimized out>, cmd=<value optimized out>, a=0x7faf4dcbae90) at
> pbx_config.c:1156
> #9  0x00000000004ba668 in ast_cli_command_full (uid=-1, gid=-1, fd=107,
> s=0x7faf6401dca9 "dialplan add extension
> zzzzzzz-user1,hint,PJSIP/zzzzzzz&Custom:unavailable into blf-zzzzzzz
> replace") at cli.c:2995
> #10 0x000000000060213b in action_command (s=0x7faf4dcbbcd0,
> m=0x7faf4dcbb3e0) at manager.c:5318
> #11 0x0000000000606af0 in process_message (s=0x7faf4dcbbcd0,
> m=0x7faf4dcbb3e0) at manager.c:6689
> #12 0x0000000000609dcb in do_message (data=0x7fb034007790) at
> manager.c:6902
> #13 session_do (data=0x7fb034007790) at manager.c:7007
> #14 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #15 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #16 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
{noformat}
> Here, ast_add_extension calls find_context_locked, which locks 'conlock'
(in ast_rdlock_contexts). ast_add_extension then calls ast_add_extension2,
which calls ast_add_extension2_lockopt, which calls ast_add_hint.
ast_add_hint then tries to lock the hints container.

So thread 61 holds a read lock on the contexts list, i.e. 'conlock', which
is blocking most of the other threads (any requesting ast_rdlock_context).
It also is requesting a lock on the hints container, so it too is being
blocked by the 'unknown thread id'.

More below at the end of the reply ...
{noformat}
> Thread 12 (Thread 0x7fb093fff700 (LWP 7836)):
> #0  0x00007fb0d292d334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fb0d29285f3 in _L_lock_892 () from /lib64/libpthread.so.0
> #2  0x00007fb0d29284d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x000000000051dd53 in ast_rdlock_contexts (hint=0x7fb093ffe7b0 "",
> hintsize=512, name=0x0, namesize=0, c=0x0, context=0x7fb093ffeb30 "blf",
> exten=0x7fb093ffeae0 "yyyyyyy-user1") at pbx.c:8509
> #4  ast_hint_extension (hint=0x7fb093ffe7b0 "", hintsize=512, name=0x0,
> namesize=0, c=0x0, context=0x7fb093ffeb30 "blf", exten=0x7fb093ffeae0
> "yyyyyyy-user1") at pbx.c:3000
> #5  ast_get_hint (hint=0x7fb093ffe7b0 "", hintsize=512, name=0x0,
> namesize=0, c=0x0, context=0x7fb093ffeb30 "blf", exten=0x7fb093ffeae0
> "yyyyyyy-user1") at pbx.c:4143
> #6  0x00000000005fa566 in manager_state_cb (context=0x7fb093ffeb30 "blf",
> exten=0x7fb093ffeae0 "yyyyyyy-user1", info=0x7fb093ffea10, data=<value
> optimized out>) at manager.c:7299
> #7  0x00000000005137b0 in execute_state_callback (cb=0x5fa500
> <manager_state_cb>, context=0x7fb093ffeb30 "blf", exten=0x7fb093ffeae0
> "yyyyyyy-user1", data=0x0, reason=<value optimized out>,
> hint=0x7faf301e6670, device_state_info=0x0) at pbx.c:3284
> #8  0x0000000000515834 in device_state_notify_callbacks
> (hint=0x7faf301e6670, hint_app=<value optimized out>) at pbx.c:3427
> #9  0x0000000000523801 in device_state_cb (unused=<value optimized out>,
> sub=<value optimized out>, msg=<value optimized out>) at pbx.c:3659
> #10 0x0000000000573250 in dispatch_exec_async (local=<value optimized
> out>) at stasis.c:1266
> #11 0x000000000058e0ee in ast_taskprocessor_execute (tps=0x1776c10) at
> taskprocessor.c:1235
> #12 0x000000000058eac8 in default_tps_processing_function (data=0x1757ce0)
> at taskprocessor.c:209
> #13 0x00000000005a24b7 in dummy_start (data=<value optimized out>) at
> utils.c:1428
> #14 0x00007fb0d2926aa1 in start_thread () from /lib64/libpthread.so.0
> #15 0x00007fb0d1d02c4d in clone () from /lib64/libc.so.6
>
{noformat}
This appears to be a "classic" deadlock. The 'unknown thread id' is holding
the hints lock and is being blocked by the contexts lock ('conlock').
Thread 61 is holding the contexts lock ('conlock') and is being blocked by
the hints lock.

You did not explicitly state what version of Asterisk 16, but it appears
you are running Asterisk 16.20. While not the latest release I have tracked
this in the latest code, so believe it is still an existing problem. Please
file an issue for this on the issue tracker: https://issues.asterisk.org/.
Please include your description here, and attach any files you have from
running the ast_coredumper script.
Comments:By: Asterisk Team (asteriskteam) 2021-11-23 16:10:54.019-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Kevin Harwell (kharwell) 2021-11-23 16:29:09.744-0600

For reference: [original thread|http://lists.digium.com/pipermail/asterisk-dev/2021-November/078479.html] and [reply|http://lists.digium.com/pipermail/asterisk-dev/2021-November/078480.html]

By: Kevin Harwell (kharwell) 2021-11-23 16:30:19.787-0600

Slightly different backtraces, but may be related to ASTERISK-25736. It's possible fixing one may fix both issues.

By: Steve Sether (stevesether) 2021-12-01 10:57:48.636-0600

This is another set of log files from the same problem.  I don't know if it helps or not, but I'm including them for the sake of completeness.

By: Steve Sether (stevesether) 2021-12-01 13:18:53.663-0600

I tried some mitigation by doing dialplan reload MUCH less often, but this doesn't appear to have any effect.

At this point, I'm unsure how to mitigate the issue on my end.  One thought we had is offering a bounty.  I'm not too familiar with Asterisk locks, so I'm not too sure how hard this is to fix.  In my mind this could range from absurdly easy, to absurdly hard, depending on how Asterisk deals with locks and any underlying design issues.

Can anyone offer guidance on the difficulty of fixing Asterisk deadlocks, especially when some of the legwork has been done by Kevin?  That would help us in setting some appropriate level of bounty, and maybe even help whomever might be interested in fixing this for us.

I've been able to reproduce this fairly easily in production, with scores of live phones, but not in test, with a handful.

Thanks.

By: Joshua C. Colp (jcolp) 2021-12-01 13:37:34.534-0600

There is no standard for "Asterisk deadlocks". It's ultimately dependent on how locking is done and what's going on, so there is no real answer as to difficulty.

By: Mark Murawski (kobaz) 2021-12-07 08:57:30.977-0600

I'm seeing something very very very similar in my environment in terms of deadlocks (Asterisk 16 and 18).  I haven't been able to regularly reproduce it though.

Do you have steps to reproduce?

By: Steve Sether (stevesether) 2021-12-07 10:49:10.143-0600

I can reproduce it in production only, so I haven't been able to nail down exactly the steps since doing so freezes Asterisk.  Obviously we can't intentionally do this in production.

But, I can tell you that our script that moves phones does the following directly to Asterisk.

Re-generate the pjsip_peers.conf file, and reload it via "asterisk -rx "pjsip reload".  (This file is static in the filesystem, but generated via a script).
Reload the dialplan via the AMI CommandAction "dialplan reload".
Reload chan_sip via "sip reload" (which will do nothing on this server since we don't have chan sip installed, but include it for legacy reasons on other servers).

The dialplan reload command will re-generate hints via a PHP script.  Adding the hint is one of the locks being generated that's creating the deadlock, so this is almost certainly a critical piece.  When we move phones, we always are adding hints for the phones moved since the hints are generated on a per-server basis.  I tried a mitigation strategy where we load all the hints for all customers on every server.  This had a side affect of taking 3-4 minutes to reload the dialplan which would almost certainly cause some severe issues for us.  I didn't try this because of the severe side-effects.




By: Mark Murawski (kobaz) 2021-12-07 10:55:22.897-0600

This matches my workflow for the most part.  We db-generate pjsip_wizard and then issue a module reload res_pjsip and then dialplan reload.

Thanks!



By: Sean Bright (seanbright) 2021-12-07 11:16:03.884-0600

Unrelated to the problem but would cleanup the backtrace a ton - if you aren't using the following modules you should noload them in modules.conf:

chan_iax2
chan_mgcp
chan_phone
chan_skinny
chan_unistim
res_calendar
res_pktccops
pbx_dundi

Additionally, if you are on Linux you should be using res_timing_timerfd instead of res_timing_pthread.

None of this will resolve the deadlock you are experiencing.

By: Sean Bright (seanbright) 2021-12-07 11:21:29.369-0600

If you can [recompile with {{DEBUG_THREADS}} enabled|https://wiki.asterisk.org/wiki/display/AST/Getting+A+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc.], that would track the locks involved and make debugging this easier.

By: Steve Sether (stevesether) 2021-12-07 12:12:25.674-0600

We can't use DEBUG_THREADS in production, it reduces performance by such a large degree that Asterisk is essentially un-usable.

I'll see about un-loading those modules.  IIRC I had some trouble with module dependency when getting Asterisk 16 up, so it loaded quite a few of them.

By: Steve Sether (stevesether) 2021-12-07 12:16:22.592-0600

Oh, and what's the advantage/reason for using res_timing_timerfd over res_timing_pthread?  I'm not a C programmer, so I have a very limited understanding of threading in Linux.

By: Sean Bright (seanbright) 2021-12-07 12:32:27.453-0600

bq. res_timing_timerfd over res_timing_pthread?

res_timing_timerfd uses a timer interface exposed by the Linux kernel. res_timing_pthread abuses the pthreads API and provides a last-resort timer implementation.

bq. We can't use DEBUG_THREADS in production

Well if you change your mind we're here to help.

By: Sean Bright (seanbright) 2021-12-07 13:41:05.847-0600

So it looks like you might be able to workaround this by just serializing the dialplan related AMI commands you are sending to Asterisk. So just have one manager connection doing all of your 'dialplan reload' and 'dialplan add extension' calls instead of multiple.

By: Steve Sether (stevesether) 2021-12-07 14:07:26.387-0600

Well...  it's possible it's related to the dialplan reload commands.  This is what I thought initially as well, and made modifications to our script to only do the dialplan reload command a maximum of once every two minutes instead of every time we move a phone.

I suspect the reason there's so many dialplan reload commands is that the first one gets blocked, and that holds up all the others.  So I don't know it's that one reload is really causing the lock on the others.  It might be worth it on my end to clean this up though, so the script blocks when the dialplan reload never returns (I think there's a 30 second timeout currently in this script), which creates a lot of noise, and moves phones to a broken server.


By: Sean Bright (seanbright) 2021-12-07 14:12:17.108-0600

bq. Well... it's possible it's related to the dialplan reload commands.

More than possible. That is precisely what is occurring.

One thread is executing {{dialplan add extension}} and one is executing {{dialplan reload}} (both via separate manager connections) and they are deadlocking on one another. Just stop doing that and the deadlocks will stop. Obviously there is still a bug that needs fixing (we shouldn't deadlock), but there is no reason you can't workaround it in the meantime.

By: Steve Sether (stevesether) 2021-12-07 14:29:53.677-0600

We don't explicitly do any dialplan add extension, the dialplan reload command does this when it re-generates hints.  Adding extensions is much guaranteed to happen when we move phones from one server to another, since the we generate hints on a per-server basis. I guess it's just surprising to me that the dialplan reload would take over two minutes to complete without already being blocked by something else.  Normally this command only takes milli-seconds to run.

Still, what you're saying is definitely worth trying, and stopping two dialplan reloads happening at the same time is my next step.  There's obviously no reason to do two of these at the same time, so no matter what it's not "correct behavior".


By: Sean Bright (seanbright) 2021-12-07 14:34:41.193-0600

bq. We don't explicitly do any dialplan add extension

Yes you do. From [your more recent backtrace|^core.usi-vf01-mpls.ravon.net-2021-12-01T09-33-26-0600-full.txt]:

{noformat}
#9  0x00000000004ba668 in ast_cli_command_full (uid=-1, gid=-1, fd=44, s=0x7f2288039129 "dialplan add extension user1,hint,PJSIP/johnBrecount-CCD539F3A7B4&Custom:unavailable into blf-johnBrecount replace") at cli.c:2995
#10 0x000000000060213b in action_command (s=0x7f2345395cd0, m=0x7f23453953e0) at manager.c:5318
#11 0x0000000000606af0 in process_message (s=0x7f2345395cd0, m=0x7f23453953e0) at manager.c:6689
{noformat}

This portion of the backtrace means that someone sent something like:

{noformat}
Action: Command
Command: dialplan add extension user1,hint,PJSIP/johnBrecount-CCD539F3A7B4&Custom:unavailable into blf-johnBrecount replace
...
{noformat}

via a manager connection. It is not the result of some other command. Anyway, this is bordering on support but you have what you need to work around the problem.



By: Steve Sether (stevesether) 2021-12-07 14:53:31.199-0600

Hmmm..  You're right.  Thanks.  We have a cron job that's run every 5 minutes that's adding the hint.  I hadn't noticed that until you brought it up.  I think you're right that doing one, or the other, but not both will resolve this.

Thank you!