[Home]

Summary:ASTERISK-26019: Deadlock after 'core reload'
Reporter:Andrew Nagy (tm1000)Labels:
Date Opened:2016-05-12 11:08:49Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:
Versions:13.8.0 13.8.1 13.8.2 13.9.0 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) backtrace-threads-lgaetzdev2-2016-05-12.txt
Description:This is a sporadic issue. Which makes it all the more fun.

Occasionally while running "core reload" Asterisk will just up and stop all transactions. Extensions will unregister and calls will be dropped.

When trying to run "core reload" again sometimes Asterisk will say "previous reload not finished" but if you try to run "core reload" again after that it just wont say anything.

The CLI is still active at this point and there's been much debate about how to track down the bug.

It's only an issue in Asterisk 13. I have attached GDB logs to this ticket for further analysis

{code}
lgaetzdev2*CLI> core reload
   -- The previous reload command didn't finish yet
{code}
Comments:By: Asterisk Team (asteriskteam) 2016-05-12 11:08:49.556-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-05-12 11:20:02.326-0500

Please attach further backtraces as you get them. As well as "core show locks" output if you can get that.

By: Rusty Newton (rnewton) 2016-05-12 11:20:36.275-0500

Oh and DEBUG logs for the minutes before and during the reload would be great.

By: Andrew Nagy (tm1000) 2016-05-12 11:30:50.777-0500

At this time I can't get core show locks. Will get you the other information.

By: Andrew Nagy (tm1000) 2016-05-12 13:05:54.715-0500

Debug prior to not responding.
{code}
[2016-05-12 14:14:46] WARNING[2017]: pjproject:0 <?>:   tsx0x7feecc06b .Error sending Request msg OPTIONS/cseq=21777 (tdta0x3120a20): Transport not available for use (PJSIP_ETPNOTAVAIL)
[2016-05-12 14:14:46] WARNING[2017]: pjproject:0 <?>:   tsx0x7feecc06b .Transport error, terminating transaction. Err=171065 (Transport not available for use (PJSIP_ETPNOTAVAIL))
{code}

By: Kevin Harwell (kharwell) 2016-05-12 13:30:46.404-0500

Looking at the backtrace this appears to be a deadlock down in the pbx core.

Taking a look at Thread 13 we can see that it is waiting on the "hints" lock:
{noformat}
Thread 13 (Thread 0x7f0c7c526700 (LWP 8149)):
#0  0x00007f0d34646334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0d346415f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007f0d346414d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000538379 in __ast_pthread_mutex_lock (filename=0x6524cb "pbx.c", lineno=3879, func=0x655f20 "ast_add_hint", mutex_name=0x653135 "hints", t=0x1d9add0) at lock.c:315
#4  0x000000000045c5d8 in __ao2_lock (user_data=0x1d9ae28, lock_how=AO2_LOCK_REQ_MUTEX, file=0x6524cb "pbx.c", func=0x655f20 "ast_add_hint", line=3879, var=0x653135 "hints") at astobj2.c:167
#5  0x00000000005742c4 in ast_add_hint (e=0x7f0d148da190) at pbx.c:3879
#6  0x000000000057e87e in ast_add_extension2_lockopt (con=0x7f0d141e2780, replace=0, extension=0x7f0d15ac8a07 "*992*10*2005", priority=-1, label=0x0, callerid=0x0, application=0x7f0c7c523b10 "Queue:700", data=0x7f0d1551dac0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config", lock_context=1) at pbx.c:7268
#7  0x000000000057df66 in ast_add_extension2 (con=0x7f0d141e2780, replace=0, extension=0x7f0d15ac8a07 "*992*10*2005", priority=-1, label=0x0, callerid=0x0, application=0x7f0d150cecd5 "${DB(restapps/hints/queue/${EXTEN:8})}", data=0x7f0d1551dac0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config") at pbx.c:7045
#8  0x000000000057d29e in ast_add_extension (context=0x7f0d141e2850 "restapps-device-hints", replace=0, extension=0x7f0d15ac8a07 "*992*10*2005", priority=-1, label=0x0, callerid=0x0, application=0x7f0d150cecd5 "${DB(restapps/hints/queue/${EXTEN:8})}", data=0x7f0d1551dac0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config") at pbx.c:6742
#9  0x000000000057202b in internal_extension_state_extended (c=0x0, context=0x7f0d166fd5a9 "restapps-device-hints", exten=0x7f0d15ac8a07 "*992*10*2005", device_state_info=0x0) at pbx.c:3039
#10 0x00000000005720a1 in ast_extension_state (c=0x0, context=0x7f0d166fd5a9 "restapps-device-hints", exten=0x7f0d15ac8a07 "*992*10*2005") at pbx.c:3054
#11 0x000000000054e0ae in action_extensionstate (s=0x7f0c7c525a60, m=0x7f0c7c525550) at manager.c:5509
#12 0x000000000054ffdd in process_message (s=0x7f0c7c525a60, m=0x7f0c7c525550) at manager.c:6174
#13 0x0000000000550913 in do_message (s=0x7f0c7c525a60) at manager.c:6387
#14 0x0000000000550e0f in session_do (data=0x7f0cd4003058) at manager.c:6508
#15 0x00000000005e8cfd in handle_tcptls_connection (data=0x7f0cd4003058) at tcptls.c:695
#16 0x00000000005fb855 in dummy_start (data=0x7f0cd4001950) at utils.c:1235
#17 0x00007f0d3463faa1 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f0d339c793d in clone () from /lib64/libc.so.6
{noformat}
Further at frame #6 we can see that prior to attempting to get the hints lock this thread has already grabbed the context lock:
{noformat}
#6  0x000000000057e87e in ast_add_extension2_lockopt (con=0x7f0d141e2780, replace=0, extension=0x7f0d15ac8a07 "*992*10*2005", priority=-1, label=0x0, callerid=0x0, application=0x7f0c7c523b10 "Queue:700", data=0x7f0d1551dac0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config", lock_context=1) at pbx.c:7268
{noformat}
Now check out Thread 11. It is attempting to acquire the contexts lock:
{noformat}
Thread 11 (Thread 0x7f0cbf181700 (LWP 8199)):
#0  0x00007f0d34646334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0d346415f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007f0d346414d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000538379 in __ast_pthread_mutex_lock (filename=0x6524cb "pbx.c", lineno=8204, func=0x6559d0 "ast_rdlock_contexts", mutex_name=0x655828 "&conlock", t=0x8c6d00) at lock.c:315
#4  0x0000000000581454 in ast_rdlock_contexts () at pbx.c:8204
#5  0x0000000000571af9 in ast_hint_extension (c=0x0, context=0x7f0d14131960 "ext-meetme", exten=0x7f0d159cd488 "82001") at pbx.c:2887
#6  0x000000000057493b in ast_get_hint (hint=0x7f0cbf17c4b0 "\260\067\024\027\r\177", hintsize=512, name=0x0, namesize=0, c=0x0, context=0x7f0d14131960 "ext-meetme", exten=0x7f0d159cd488 "82001") at pbx.c:3997
#7  0x0000000000551bc5 in manager_state_cb (context=0x7f0d14131960 "ext-meetme", exten=0x7f0d159cd488 "82001", info=0x7f0cbf17c700, data=0x0) at manager.c:6747
#8  0x00000000005725e7 in execute_state_callback (cb=0x551b62 <manager_state_cb>, context=0x7f0d14131960 "ext-meetme", exten=0x7f0d159cd488 "82001", data=0x0, reason=AST_HINT_UPDATE_DEVICE, hint=0x7f0d16833e28, device_state_info=0x0) at pbx.c:3165
#9  0x00000000005745ba in ast_add_hint (e=0x7f0d159cd410) at pbx.c:3910
#10 0x000000000057dee7 in add_priority (con=0x7f0d14131890, tmp=0x7f0d159cd410, el=0x0, e=0x7f0d15b30e10, replace=0) at pbx.c:7009
#11 0x000000000057e598 in ast_add_extension2_lockopt (con=0x7f0d14131890, replace=0, extension=0x7f0cbf17df20 "82001", priority=-1, label=0x0, callerid=0x0, application=0x7f0d1565e6fb "confbridge:82001", data=0x7f0d15605ba0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config", lock_context=1) at pbx.c:7184
#12 0x000000000057df66 in ast_add_extension2 (con=0x7f0d14131890, replace=0, extension=0x7f0cbf17df20 "82001", priority=-1, label=0x0, callerid=0x0, application=0x7f0d1565e6fb "confbridge:82001", data=0x7f0d15605ba0, datad=0x431360 <free@plt>, registrar=0x7f0d31c55470 "pbx_config") at pbx.c:7045
#13 0x00007f0d31c54288 in pbx_load_config (config_file=0x7f0d31c55460 "extensions.conf") at pbx_config.c:1841
#14 0x00007f0d31c551e4 in pbx_load_module () at pbx_config.c:2045
#15 0x00007f0d31c53254 in handle_cli_dialplan_reload (e=0x7f0d31e58c18, cmd=-4, a=0x7f0cbf1802f0) at pbx_config.c:1565
#16 0x00000000004d8835 in ast_cli_command_full (uid=499, gid=498, fd=39, s=0x7f0cbf1805e0 "dialplan reload") at cli.c:2757
#17 0x00000000004d8999 in ast_cli_command_multiple_full (uid=499, gid=498, fd=39, size=16, s=0x7f0cbf180a4f "dialplan reload") at cli.c:2783
#18 0x0000000000453134 in netconsole (vconsole=0x8e5ea0) at asterisk.c:1532
#19 0x00000000005fb855 in dummy_start (data=0x7f0d20003dc0) at utils.c:1235
#20 0x00007f0d3463faa1 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f0d339c793d in clone () from /lib64/libc.so.6
{noformat}
However, at frame #9 it grabbed the "hints" lock (it gets the "hints" lock in ast_add_hint).

So Thread 11 holds the "hints" lock and is waiting on the "context" lock, while Thread 13 holds the "context" lock and is waiting on the "hints" lock.

By: Andrew Nagy (tm1000) 2016-05-12 19:42:47.074-0500

This is because of ASTERISK-25996. Since I can't get AMI reload to generate the hints dynamically we were running an AMI reload then immediately after a dialplan reload. However there is an daemon also running in the background that detects reloads() and then tries to get the state of dyanmic hints through extensonState, because it's blocked by ASTERISK-25996 it has to use AMI "command dialplan" to force Asterisk to generate the hints.

This because sort of a snowball for Asterisk and eventually it just gives up on life. For now we've just taken out the dialplan reload. We've turned AMI reload into the CLI: asterisk -rx 'core reload' and it's solved all of our problems. Temporarily of course. I hope that soon/someday ASTERISK-25996 will be fixed :-)

With regards to this ticket it can probably be closed in favor of ASTERISK-25996, though strangely Asterisk 11 never locked up (and it was getting the same "hits" as 13)