ASTERISK-26759: Double free, ast_taskprocessor_execute->tps_task

[Home]

Summary: ASTERISK-26759: Double free, ast_taskprocessor_execute->tps_task_free

Reporter: Stuart Henderson (sthen) Labels:

Date Opened: 2017-01-30 04:34:43.000-0600 Date Closed: 2018-01-01 13:21:37.000-0600

Priority: Major Regression? No

Status: Closed/Complete Components: Core/General

Versions: 13.13.1 Frequency of
Occurrence One Time

Related
Issues:

Environment: OpenBSD 6.0, amd64. Using chan_sip, odbc realtime, http manager. Attachments: ( 0) 0001-http-getprotobyname-is-not-thread-safe.patch
( 1) ast_26759.debug.txt

Description: Asterisk had been up and operating normally for a week then exited with a malloc(3) detected "chunk is already free" error (OpenBSD malloc has double-free detection enabled all the time). I get the following backtrace - short bt inline, output from "thread apply all bt full" is at https://junkpile.org/ast13.13.1-20170130.txt.

{noformat}
(gdb) bt
#0 0x00001f1890b581ca in thrkill () at <stdin>:2
#1 0x00001f1890b3e499 in *_libc_abort () at /usr/src/lib/libc/stdlib/abort.c:52
#2 0x00001f1890b82969 in wrterror (d=0x1f18437474d0,
msg=0x1f1890c9980d "chunk is already free", p=0x1f181121f990)
at /usr/src/lib/libc/stdlib/malloc.c:295
#3 0x00001f1890b82b3e in find_chunknum (d=0x1f18437474d0, r=<optimized out>,
ptr=0x1f181121f990) at /usr/src/lib/libc/stdlib/malloc.c:1054
#4 0x00001f1890b839b8 in free_bytes (ptr=<optimized out>, r=<optimized out>,
d=<optimized out>) at /usr/src/lib/libc/stdlib/malloc.c:1072
#5 ofree (pool=0x1f18437474d0, p=0x1f181121f990)
at /usr/src/lib/libc/stdlib/malloc.c:1337
#6 0x00001f1890b83e3b in free (ptr=0x1f1862aae680)
at /usr/src/lib/libc/stdlib/malloc.c:1364
#7 0x00001f15964a7209 in tps_task_free ()
#8 0x00001f15964a72eb in ast_taskprocessor_execute ()
#9 0x00001f15964af641 in worker_start ()
#10 0x00001f15964b97c5 in dummy_start ()
#11 0x00001f180e05e31e in _rthread_start (v=0x0)
at /usr/src/lib/librthread/rthread.c:115
#12 0x00001f1890b0b75b in __tfork_thread ()
at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:75
#13 0x0000000000000000 in ?? ()
{noformat}

The following lines were logged 2 seconds before the timestamp of the core, however they have been seen on one previous occasion not associated with a crash.

{noformat}
Jan 30 09:41:50 pbx7 asterisk[39270]: WARNING[-1]: http.c:1948 in httpd_helper_thread: Failed to set TCP_NODELAY on HTTP connection, getprotobyname("tcp") failed
Jan 30 09:41:50 pbx7 asterisk[39270]: WARNING[-1]: http.c:1949 in httpd_helper_thread: Some HTTP requests may be slow to respond.
{noformat}

What other information would be useful? Is there anything I should capture in case it happens again? I have the core in case there's something you'd like me to poke at with gdb. Thanks.

Comments: By: Asterisk Team (asteriskteam) 2017-01-30 04:34:44.315-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Stuart Henderson (sthen) 2017-02-02 06:00:37.288-0600

Hit another double free, same logs prior to the crash, different backtrace.

{noformat}
[Feb 2 10:46:53] WARNING[-1]: http.c:1948 httpd_helper_thread: Failed to set TCP_NODELAY on HTTP connection, getprotobyname("tcp") failed
[Feb 2 10:46:53] WARNING[-1]: http.c:1949 httpd_helper_thread: Some HTTP requests may be slow to respond.
[Feb 2 10:46:53] WARNING[-1]: http.c:1944 httpd_helper_thread: Failed to set TCP_NODELAY on HTTP connection: Invalid argument
pbx7*CLI>
Disconnected from Asterisk server
{noformat}

{noformat}
#0 0x00001aef15cb51ca in thrkill () at <stdin>:2
#1 0x00001aef15c9b499 in *_libc_abort () at /usr/src/lib/libc/stdlib/abort.c:52
#2 0x00001aef15cdf969 in wrterror (d=0x1aef4f1d43d0,
msg=0x1aef15df680d "chunk is already free", p=0x1aeefa33bb00)
at /usr/src/lib/libc/stdlib/malloc.c:295
#3 0x00001aef15cdfb3e in find_chunknum (d=0x1aef4f1d43d0, r=<optimized out>,
ptr=0x1aeefa33bb00) at /usr/src/lib/libc/stdlib/malloc.c:1054
#4 0x00001aef15ce09b8 in free_bytes (ptr=<optimized out>, r=<optimized out>,
d=<optimized out>) at /usr/src/lib/libc/stdlib/malloc.c:1072
#5 ofree (pool=0x1aef4f1d43d0, p=0x1aeefa33bb00)
at /usr/src/lib/libc/stdlib/malloc.c:1337
#6 0x00001aef15ce0e3b in free (ptr=0x1aeece3e0b80)
at /usr/src/lib/libc/stdlib/malloc.c:1364
#7 0x00001aec6a5a49f7 in __ast_string_field_free_memory ()
#8 0x00001aec6a519e56 in logmsg_free ()
#9 0x00001aec6a51f3b8 in logger_thread ()
#10 0x00001aec6a5b97c5 in dummy_start ()
#11 0x00001aef6194131e in _rthread_start (v=0x0)
at /usr/src/lib/librthread/rthread.c:115
#12 0x00001aef15c6875b in __tfork_thread ()
at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:75
#13 0x0000000000000000 in ?? ()
{noformat}

"thread apply all bt full" at https://junkpile.org/ast13.13.1-20170202.txt
By: Stuart Henderson (sthen) 2017-02-02 06:13:15.431-0600

...though getprotobyname() failing doesn't make a huge amount of sense... the only reason I can see would be hitting FD limits (currently allowing 2048 and Asterisk is using around 150 or so in normal operation). I'll start logging FD use to look for hints of a leak.
By: Rusty Newton (rnewton) 2017-02-02 07:13:49.034-0600

We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.

By: Rusty Newton (rnewton) 2017-02-02 07:16:33.439-0600

Along with the valgrind or malloc_debug output, get a debug log: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information , make sure it has verbose and debug log levels turned up to 5 each, and of course include warning,error and notice levels.

In addition, per the issue tracker guidelines, please attach traces and debug directly to the issue with a .txt extension.

Thanks!
By: Stuart Henderson (sthen) 2017-02-02 07:46:38.459-0600

Thanks Rusty. I'll do a build with MALLOC_DEBUG set (and adjust audiohook_read_frame_both debug so I can actually get debug logs without flattening the box..) and add feedback when I have it.

Noted about attachments, I didn't find the option under "more" before.

I've also just spotted that my FD limit setting was only partially working and would have been 512 not 2048, so it's probable that this is the root cause, still I'll try to get more information to help track down the double-free.
By: Rusty Newton (rnewton) 2017-02-02 07:49:45.796-0600

Thanks!
By: Stuart Henderson (sthen) 2017-02-07 10:37:34.106-0600

Another occurrence, it seems we have a problem with some of our AMI/HTTP code which was causing multiple Action=hangup requests for the same channel to be sent; this had occurred at the time of this crash (and previous crashes). I've attached debug logs and a better backtrace with numbers etc masked out (consistently, done by search+replace). All I'm getting in mmlog from MALLOC_DEBUG here is "<timestamp> - New session".
By: Rusty Newton (rnewton) 2017-02-07 19:00:42.618-0600

Alrighty. I'm opening this up as this may be as much info as is useful. Remember that OpenBSD is not officially supported, so the time until resolution will reflect that. If you come up with any further data, please attach it to the issue. If you can narrow down the issue and build out reproduction instructions for a test environment that would be the most helpful thing for any community developer that has interest in taking this issue on.
By: Sean Bright (seanbright) 2017-03-17 20:46:30.117-0500

[~sthen], I've attached a patch against 13.13.1. Could you give it a whirl and let us know what happens?
By: Stuart Henderson (sthen) 2017-03-18 10:44:21.828-0500

Thanks Sean, that looks like an excellent candidate (and the right thing to do whether or not it fixes things for me) - I'll test over the next few days. There's another one in Skinny.
By: Corey Farrell (coreyfarrell) 2018-01-01 13:21:38.060-0600

It looks like this was fixed by https://gerrit.asterisk.org/5249. If I'm incorrect you can reopen this issue by commenting.