[Home]

Summary:ASTERISK-27862: ARI: Crashing on json_mem_free (after sending event)
Reporter:Stefan Repke (stefffan@gmx.de)Labels:
Date Opened:2018-05-18 08:27:45Date Closed:2018-06-14 09:00:58
Priority:MajorRegression?
Status:Closed/CompleteComponents:Applications/app_stasis
Versions:13.21.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Attachments:( 0) backtrace_1.txt
( 1) backtrace_2.txt
( 2) backtrace_3.txt
( 3) cli.txt
Description:ARI is activated and there are several apps connected with "subscribeAll=true".
Then, Asterisk crashes occasionally when delivering ARI events!

With normal workload, this happens once or twice a day. It can be provoked by adding many (eg. 10) ARI subscribers and producing lots of events (eg. PeerStatusChange by registering/unregistering a peer).

I coredumped three crashes, all having a memory corruption in {{json_mem_free}} at {{json.c:84}}; cf. attached backtraces.
* Backtrace 2 and 3 have similar problems: {{double free or corruption (fasttop)}},
* Backtrace 1 looks like wrong memory access.

I've also attached the last logger messages from the CLI (debug=5 / verbose=5).
Comments:By: Asterisk Team (asteriskteam) 2018-05-18 08:27:47.038-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Richard Mudgett (rmudgett) 2018-05-18 16:35:29.299-0500

Your backtrace appears to contain a memory corruption. We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.



By: Stefan Repke (stefffan@gmx.de) 2018-05-22 09:00:23.957-0500

It's a memory corruption for sure! I did some extensive testing with the two options of memory debugging:

1. With valgrind, I wasn't able to reproduce the crash!

2. With {{MALLOC_DEBUG}} (and {{memory backtrace off}}), it crashed again (after a few seconds of stress testing):
{noformat:title=mmlog}
WARNING: Memory corrupted after free of 0x7f92c0020480 allocated at json.c ast_json_malloc() line 155
WARNING: Memory corrupted after free of 0x7f92ec0066c0 allocated at json.c ast_json_malloc() line 155
WARNING: Memory corrupted after free of 0x7f92c00381f0 allocated at json.c ast_json_malloc() line 155
WARNING: Freeing unregistered memory 0x7f92ac041cf0 by json.c json_mem_free() line 84
{noformat}
{noformat:title=core-thread1.txt}
!@!@!@! thread1.txt !@!@!@!

$1 = {si_signo = 11, si_errno = 0, si_code = 128, _sifields = {_pad = {0 <repeats 28 times>}, _kill = {si_pid = 0, si_uid = 0}, _timer = {si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x0}, _sigpoll = {si_band = 0, si_fd = 0}}}
Signal        Stop Print Pass to program Description
SIGSEGV       Yes Yes Yes Segmentation fault

Thread 1 (Thread 0x7f92b3d17700 (LWP 9121)):
#0  0x0000000000536992 in ast_json_unref (json=0x7f92ac04adc8) at json.c:253
       __cur = 0xdeaddeaddeaddead
       free_list = 0x7f92ac010150
       mem = 0x7f92ac041cf0
#1  0x00007f92f37cc6c5 in sub_default_handler (data=0x7f935000cf70, sub=0x7f935000f650, message=0x7f92ac045d80) at stasis/app.c:357
       app = 0x7f935000cf70
       json = 0x7f92ac04adc8
#2  0x00000000005deb0e in router_dispatch (data=0x7f935000f510, sub=0x7f935000f650, message=0x7f92ac045d80) at stasis_message_router.c:201
       router = 0x7f935000f510
       route = {message_type = 0x0, callback = 0x7f92f37cc616 <sub_default_handler>, data = 0x7f935000cf70}
#3  0x00000000005cd840 in subscription_invoke (sub=0x7f935000f650, message=0x7f92ac045d80) at stasis.c:434
       __PRETTY_FUNCTION__ = "subscription_invoke"
#4  0x00000000005ce319 in dispatch_exec_async (local=0x7f92b3d16da0) at stasis.c:733
       sub = 0x7f935000f650
       message = 0x7f92ac045d80
#5  0x00000000005eb2e9 in ast_taskprocessor_execute (tps=0x7f935000f7c0) at taskprocessor.c:970
       local = {local_data = 0x7f935000f650, data = 0x7f92ac045d80}
       t = 0x7f92b80131e0
       size = 0
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#6  0x00000000005e957c in default_tps_processing_function (data=0x7f9350002100) at taskprocessor.c:191
       listener = 0x7f9350002100
       tps = 0x7f935000f7c0
       pvt = 0x7f9350001fe0
       sem_value = 6701170
       res = 0
       __PRETTY_FUNCTION__ = "default_tps_processing_function"
#7  0x0000000000600b19 in dummy_start (data=0x7f935000fb20) at utils.c:1239
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {0, -1581999541431752663, 0, 140270822379616, 0, 140268058801920, -1581999541406586839, 1526435149223897129}, __mask_was_saved = 0}}, __pad = {0x7f92b3d16ef0, 0x0, 0x0, 0x0}}
       __cancel_routine = 0x451d2a <ast_unregister_thread>
       __cancel_arg = 0x7f92b3d17700
       __not_first_call = 0
       ret = 0x0
       a = {start_routine = 0x5e94e6 <default_tps_processing_function>, data = 0x7f9350002100, name = 0x7f935000fbe0 "default_tps_processing_function started at [  208] taskprocessor.c default_listener_start()"}
       __PRETTY_FUNCTION__ = "dummy_start"
#8  0x00007f9356986064 in start_thread (arg=0x7f92b3d17700) at pthread_create.c:309
       __res = <optimized out>
       pd = 0x7f92b3d17700
       now = <optimized out>
       unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140268058801920, 1526434874704078889, 0, 140270822379616, 0, 140268058801920, -1581999541429655511, -1582925802170013655}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
       not_first_call = <optimized out>
       pagesize_m1 = <optimized out>
       sp = <optimized out>
       freesize = <optimized out>
       __PRETTY_FUNCTION__ = "start_thread"
#9  0x00007f935592962d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
{noformat}

3. With {{MALLOC_DEBUG}} and {{memory backtrace on}}, I wasn't able to reproduce the crash either (even after hours of stress testing)!
But this gives interessting insights into the other memory corruptions "at json.c ast_json_malloc() line 155":
{noformat:title=mmlog}
WARNING: Memory corrupted after free of 0x7f894c088400 allocated at json.c ast_json_malloc() line 155
Memory allocation backtrace:
#0: [0x45cc17] main/astmm.c:523 __ast_malloc() (0x45cbdb+3C)
#1: [0x53674f] main/json.c:155 ast_json_malloc() (0x536724+2B)
#2: [0x7f89a1cd72ee] libjansson.so.4 json_integer() (0x7f89a1cd72e0+E)
#3: [0x7f89a1cd5c76] libjansson.so.4 <unknown>()
#4: [0x7f89a1cd5afb] libjansson.so.4 <unknown>()
#5: [0x7f89a1cd5f5f] libjansson.so.4 json_vpack_ex() (0x7f89a1cd5ed0+8F)
#6: [0x537b6d] main/json.c:859 ast_json_vpack() (0x537b1c+51)
#7: [0x537b0c] main/json.c:850 ast_json_pack() (0x537a65+A7)
#8: [0x7f890b49ebb7] channels/chan_iax2.c:9082 update_registry()
#9: [0x7f890b4ae9ed] channels/chan_iax2.c:11555 socket_process_helper()
#10: [0x7f890b4b26e0] channels/chan_iax2.c:11974 socket_process()
#11: [0x7f890b4b2dba] channels/chan_iax2.c:12086 iax2_process_thread()
#12: [0x600b19] main/utils.c:1239 dummy_start()
{noformat}

4. Further, I've observed some occurances of the following line during stress testing:
{noformat}
[2018-05-22 13:30:17] ERROR[21499]: ari/ari_websockets.c:175 ast_ari_websocket_session_write: Failed to encode JSON object
{noformat}



All this looks like a fatal race condition on the json objects ...


By: Richard Mudgett (rmudgett) 2018-05-22 18:03:03.444-0500

Which version of libjansson are you using?  A change was made https://gerrit.asterisk.org/#/c/8186/ that removed some locking to take advantage of some thread safe changes done in libjansson version 2.11.

By: Stefan Repke (stefffan@gmx.de) 2018-05-23 04:05:39.569-0500

We use the Debian package libjansson of version 2.7-1+deb8u1.

Further, I was curious about the advantages you mentioned and moved to Debian package 2.11-1. Then, all of the above problems are gone:
* no crashes!
* no "Memory corrupted after free of 0x... allocated at json.c ast_json_malloc() line 155"!
* no "ERROR\[...\]: ari/ari_websockets.c:175 ast_ari_websocket_session_write: Failed to encode JSON object"!

Hence, Asterisk has threadding issues when using an older version of libjansson (like Debian stable) ...