Summary: ASTERISK-19837: Asterisk crashing regularly in 1.8.11.1 due to memory corruption
Reporter: David Cunningham (dcunningham)
Labels:
Date Opened: 2012-05-03 18:42:33
Date Closed: 2012-10-02 17:18:50
Priority: Major
Regression?:
Status: Closed/Complete
Components: Core/General
Versions: 1.8.11.1
Frequency of Occurrence: Frequent
Related Issues: is related to ASTERISK-19923 Asterisk crashing due to memory corruptions in chan_sip/voicemail
Environment: CentOS release 5.6 (Final), i386
Attachments: ( 0) full-19837-3.log.gz
Description: Since we upgraded a system to Asterisk 1.8.11.1 we are getting regular crashes. The backtrace is below and the core file will be attached. Thanks.

(gdb) bt full
#0  0x007f5175 in _int_free () from /lib/libc.so.6
No symbol table info available.
#1  0x007f5b09 in free () from /lib/libc.so.6
No symbol table info available.
#2  0x0818cc5c in destroy (pvt=0xa055a08) at translate.c:145
       t = 0x4071f520
#3  0x0818cf40 in ast_translator_free_path (p=0xa055a08) at translate.c:233
       pn = 0x0
#4  0x080ada79 in free_translation (clonechan=0x44b958a8) at channel.c:2705
No locals.
#5  0x080ae11e in ast_hangup (chan=0x44b958a8) at channel.c:2794
       extra_str = "xX\271DxX\271D\000\000\000\000\000\000\000\000\024\\\271D\000\000\000\000\230o\216B\350c\b\b\324\213\034\b\364\017\216\000(GT\n@!\216\000\270o\216B\t[\177\000\000\000\000\000\000\002\002"
       was_zombie = 0
       __PRETTY_FUNCTION__ = "ast_hangup"
#6  0x08144811 in __ast_pbx_run (c=0x44b958a8, args=0x0) at pbx.c:5235
       found = 1
       res = -1
       autoloopflag = 0
       error = 1
       __PRETTY_FUNCTION__ = "__ast_pbx_run"
#7  0x08144b3d in pbx_thread (data=0x44b958a8) at pbx.c:5327
       c = 0x44b958a8
#8  0x08196782 in dummy_start (data=0x44a0b2f8) at utils.c:1004
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {9715700, 0, 1116634000, 1116631992, -422965372, -1403184250}, __mask_was_saved = 0}},
         __pad = {0x428e7470, 0x0, 0x0, 0x0}}
       __cancel_routine = 0x807ae77 <ast_unregister_thread>
       __cancel_arg = 0x428e7b90
       not_first_call = 0
       ret = 0x89e46e
       a = {start_routine = 0x8144b1e <pbx_thread>, data = 0x44b958a8,
         name = 0x44a79568 "pbx_thread", ' ' <repeats 11 times>, "started at [ 5353] pbx.c ast_pbx_start()"}
#9  0x00933832 in start_thread () from /lib/libpthread.so.0
No symbol table info available.
#10 0x0085e45e in clone () from /lib/libc.so.6
No symbol table info available.
Comments:

By: David Cunningham (dcunningham) 2012-05-03 18:45:00.705-0500

The core file is 104 MB in size. How can we provide this to you please?


By: Richard Mudgett (rmudgett) 2012-05-03 19:27:02.753-0500

Core files are not wanted because:
1) They are huge.
2) They are only useful on the machine that generated them.

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

What do you need to do to cause this crash?

By: David Cunningham (dcunningham) 2012-05-03 19:32:09.296-0500

From our view it appears to be random - in other words, we have not identified any particular action that causes the crash. Thanks.

By: Matt Jordan (mjordan) 2012-05-04 08:13:05.303-0500

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Please collect a DEBUG log illustrating the actions Asterisk takes leading up to the crash.

By: David Cunningham (dcunningham) 2012-05-04 18:24:46.024-0500

Asterisk full log is attached. The crash happened on line 18993 at 16:20:14. Thanks.

By: Russell Bryant (russell) 2012-06-01 06:50:42.825-0500

Does this happen in a test environment?  I was wondering if it would be possible to test under valgrind:

https://wiki.asterisk.org/wiki/display/AST/Valgrind

By: David Cunningham (dcunningham) 2012-06-04 04:24:59.002-0500

This is happening on a production environment. I think the customer would be unhappy to run it under valgrind unless we were very certain it wouldn't cause any issues noticeable by their customers. We could upgrade a test system to this version of Asterisk, but since we don't know how to reproduce the problem it may not happen.
Can you advise if a valgrind trace is absolutely required, and if there's any risk associated with it?

By: Russell Bryant (russell) 2012-06-04 04:45:09.666-0500

There is certainly a risk.  I wouldn't try to use it in production.

By: David Cunningham (dcunningham) 2012-06-04 04:52:24.973-0500

OK thanks, please advise if we can help with anything else.

By: Russell Bryant (russell) 2012-06-04 19:34:55.892-0500

After looking into this some more, I'm not sure how to proceed unless we're able to reproduce this in a test environment.  If the customer has any sort of downtime where the system isn't used when valgrind could be used, that would be great.  Otherwise, I think trying as similar of a setup as possible on a test system would be the best next step.  Let me know if I can help.

By: David Cunningham (dcunningham) 2012-06-05 11:17:17.150-0500

Russell - thanks, will check with the customer and get back to you.

By: Igor Goncharovsky (igorg) 2012-06-07 01:15:37.249-0500

I have access to the customer's server, so I can look at the core files. Almost all of the core files have different backtraces, failing both when allocating and when freeing memory. The core files are created mostly during working hours (roughly one crash every 3 days), and I see no specific way to reproduce the issue.

#0  0x40000410 in __kernel_vsyscall ()
#1  0x007b4df0 in raise () from /lib/libc.so.6
#2  0x007b6701 in abort () from /lib/libc.so.6
#3  0x007ed3ab in __libc_message () from /lib/libc.so.6
#4  0x007f6d96 in _int_malloc () from /lib/libc.so.6
#5  0x007f7cca in calloc () from /lib/libc.so.6
#6  0x081954c8 in _ast_calloc (num=1, len=402, file=0x81f57dc "/usr/src/asterisk-1.8.11.1/include/asterisk/strings.h", lineno=420, func=0x81f57cc "ast_str_create") at /usr/src/asterisk-1.8.11.1/include/asterisk/utils.h:480
#7  0x081958db in ast_str_create (init_len=390) at /usr/src/asterisk-1.8.11.1/include/asterisk/strings.h:405
#8  0x4057c0e8 in __sip_reliable_xmit (p=0x9eb68a0, seqno=102, resp=0, data=0x9ee1510, fatal=0, sipmethod=14) at chan_sip.c:3770
#9  0x4057e2a1 in send_request (p=0x9eb68a0, req=0x4216fdd8, reliable=XMIT_RELIABLE, seqno=102) at chan_sip.c:4229
#10 0x405ab9e0 in transmit_request (p=0x9eb68a0, sipmethod=14, seqno=102, reliable=XMIT_RELIABLE, newbranch=0) at chan_sip.c:13600
#11 0x40588a09 in sip_hangup (ast=0xa247338) at chan_sip.c:6336
#12 0x080ae3bb in ast_hangup (chan=0xa247338) at channel.c:2831
#13 0x4055ad52 in hanguptree (outgoing=0xa65e8b0, exception=0x0, answered_elsewhere=0) at app_dial.c:700


(gdb) bt
#0  0x40000410 in __kernel_vsyscall ()
#1  0x007b4df0 in raise () from /lib/libc.so.6
#2  0x007b6701 in abort () from /lib/libc.so.6
#3  0x007ed3ab in __libc_message () from /lib/libc.so.6
#4  0x007f6721 in _int_malloc () from /lib/libc.so.6
#5  0x007f7fb7 in malloc () from /lib/libc.so.6
#6  0x08195461 in _ast_malloc (len=229, file=0x81e05f4 "manager.c", lineno=5029, func=0x81e322a "append_event") at /usr/src/asterisk-1.8.11.1/include/asterisk/utils.h:457
#7  0x0812d971 in append_event (
   str=0x8e05744 "Event: AGIExec\r\nPrivilege: agi,all\r\nTimestamp: 1337788773.143723\r\nSubEvent: Start\r\nChannel: Local/2@enswitch-call-exten-9e55;2\r\nCommandId: 656304319\r\nCommand: SET VARIABLE TIMEOUT(absolute) \"86398\"\r\n\r"..., category=8192) at manager.c:5029
#8  0x0812de3d in __ast_manager_event_multi
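
Backtraces like the ones above, which abort inside malloc/calloc/free at unrelated call sites, are what one would usually expect when heap memory was overwritten earlier and the damage is only noticed at a later allocator call. The following is a minimal sketch of that effect, not Asterisk or glibc code: every name in it (guarded_alloc, guard_intact, unchecked_fill, GUARD_LEN, GUARD_BYTE) is hypothetical and introduced only for illustration. It places a canary after the usable bytes of a buffer, so that an overrun performed by one function is detected only when a different function checks the canary later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; none of this is Asterisk or glibc API. */
#define GUARD_LEN  8
#define GUARD_BYTE 0xA5

/* Allocate len usable bytes followed by GUARD_LEN canary bytes. */
static void *guarded_alloc(size_t len)
{
    unsigned char *p = malloc(len + GUARD_LEN);

    if (p == NULL) {
        return NULL;
    }
    memset(p, 0, len);
    memset(p + len, GUARD_BYTE, GUARD_LEN);
    return p;
}

/* Return 1 if the canary after the len usable bytes is intact, else 0. */
static int guard_intact(const void *ptr, size_t len)
{
    const unsigned char *p = ptr;
    size_t i;

    assert(ptr != NULL);
    for (i = 0; i < GUARD_LEN; i++) {
        if (p[len + i] != GUARD_BYTE) {
            return 0;
        }
    }
    return 1;
}

/*
 * Stand-in for a buggy writer that does not check the buffer capacity.
 * To keep this sketch free of undefined behaviour, callers must keep n
 * within the usable length plus GUARD_LEN.
 */
static void unchecked_fill(void *buf, size_t n)
{
    memset(buf, 'x', n);
}
```

As a usage example, calling guarded_alloc(16), then unchecked_fill(buf, 18), leaves the program running normally; only a later guard_intact(buf, 16) call reports the damage by returning 0. The place where the problem is reported therefore says little about where the bad write happened, which is the reason tools such as valgrind, which check each write as it occurs, were suggested in this thread.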



By: David Cunningham (dcunningham) 2012-06-14 02:56:32.713-0500

We've had to downgrade another customer due to regular crashes in 1.8.12.0.
Can we be of any assistance with fixing this?


By: Russell Bryant (russell) 2012-06-14 11:07:38.712-0500

You could try running Asterisk under valgrind in a test environment similar to what your customers are using and see if you get any output when running test calls. :-)

By: Russell Bryant (russell) 2012-06-19 16:21:23.572-0500

Is Voicemail being used?  If so, give the patch on this issue a try:

ASTERISK-19923

By: Matt Jordan (mjordan) 2012-06-25 08:45:39.280-0500

David:

Have you had a chance to try the patch on ASTERISK-19923?

Matt

By: David Cunningham (dcunningham) 2012-06-25 18:42:33.451-0500

We're waiting to hear from the customer. Will update you, thanks.

By: Igor Goncharovsky (igorg) 2012-07-08 23:59:42.605-0500

The customer reports that the patch helps with this issue.

By: David Cunningham (dcunningham) 2012-07-09 00:15:42.086-0500

Sorry, yes, the customer has reported no crashes since the patch was applied.
What version of Asterisk will this patch be put into please?


By: David Cunningham (dcunningham) 2012-07-09 00:15:55.107-0500

Sending back.

By: Igor Goncharovsky (igorg) 2012-07-09 00:26:32.092-0500

It is in 1.8.14.0-rc2 and should be released in 1.8.14.0