[Home]

Summary:ASTERISK-25212: [patch]Segfault when using DEBUG_FD_LEAKS
Reporter:Walter Doekes (wdoekes)Labels:
Date Opened:2015-06-30 09:31:18Date Closed:2015-07-02 07:53:20
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_dial Channels/chan_sip/General Core/General
Versions:11.19.0 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-25213 [patch]Possibility of deadlock in chan_sip INVITE early Replace code
Environment:Attachments:( 0) bt.txt
( 1) btall.txt
( 2) btallfull.txt
( 3) btfull.txt
( 4) infothreads.txt
( 5) issueA25212-fix_fdleak_overflow_and_cleanup.patch
Description:[Edit by Rusty - copying part of Walter's comment up here so it will be seen quickly for those first looking at the issue]
Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot.
Same when disabling DEBUG_THREADS.

Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue.
[end copy paste]

When attempting to reproduce a deadlock, I enabled DEBUG_THREADS and DETECT_DEADLOCKS.

Instead of detecting the deadlock, sometimes Asterisk crashes instead.

Recent crash contains this:

{noformat}
(gdb) bt
#0  pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S:69
#1  0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond",
   cond=0x885680 <as_cond>) at lock.c:508
#2  0x0000000000456ec8 in ast_autoservice_start (chan=0x3b31188) at autoservice.c:222
#3  0x00000000004573ba in ast_autoservice_chan_hangup_peer (chan=0x3b31188, peer=0x7eff1038ef68) at autoservice.c:317
#4  0x00007efed32345c7 in dial_exec_full (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob", peerflags=0x7efec6cc64c0, continue_exec=0x0) at app_dial.c:3078
#5  0x00007efed323499a in dial_exec (chan=0x3b31188, data=0x7efec6cc6b10 "SIP/bob") at app_dial.c:3130
#6  0x0000000000533b49 in pbx_exec (c=0x3b31188, app=0x27c0440, data=0x7efec6cc6b10 "SIP/bob") at pbx.c:1677
#7  0x000000000053e7c8 in pbx_extension_helper (c=0x3b31188, con=0x0, context=0x3b31fd8 "default", exten=0x3b32028 "bob", priority=1, label=0x0, callerid=0x3df43b0 "alice", action=E_SPAWN,
   found=0x7efec6ccabfc, combined_find_spawn=1) at pbx.c:4970
{noformat}

{noformat}
(gdb) disass
Dump of assembler code for function pthread_cond_signal@@GLIBC_2.3.2:
...
  0x00007eff16ab3c4e <+62>: je     0x7eff16ab3c67 <pthread_cond_signal@@GLIBC_2.3.2+87>
  0x00007eff16ab3c50 <+64>: mov    0x20(%r8),%rcx
=> 0x00007eff16ab3c54 <+68>: mov    0x10(%rcx),%r11d
  0x00007eff16ab3c58 <+72>: and    $0x30,%r11d
  0x00007eff16ab3c5c <+76>: cmp    $0x20,%r11d
...
(gdb) up
#1  0x0000000000509909 in __ast_cond_signal (filename=0x5c993b "autoservice.c", lineno=222, func=0x5c9a10 <__PRETTY_FUNCTION__.13263> "ast_autoservice_start", cond_name=0x5c9967 "&as_cond",
   cond=0x885680 <as_cond>) at lock.c:508
508 return pthread_cond_signal(cond);
(gdb) print cond
$55 = (ast_cond_t *) 0x885680 <as_cond>
(gdb) print *cond
$56 = {__data = {__lock = 1, __futex = 1638, __total_seq = 8389759012580623496, __wakeup_seq = 7809628147932885606, __woken_seq = 6874024004411486821, __mutex = 0x7069700074696e69, __nwaiters = 101,
   __broadcast_seq = 2063597568}, __size = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{", __align = 7035156430849}
(gdb) p *(char(*)[48])cond
$57 = "\001\000\000\000f\006\000\000\210\004\000\000_intfrnal_alertpipe_init\000pipe\000\000\000\000\000\000{"
{noformat}

That doesn't look like valid values in the cond struct.

{noformat}
(gdb) info registers
rax            0xca 202
rbx            0x10 16
rcx            0x7069700074696e69 8100128552057925225
rdx            0x1 1
rsi            0x5 5
rdi            0x885684 8935044
rbp            0x7efec6cc5520 0x7efec6cc5520
rsp            0x7efec6cc54f0 0x7efec6cc54f0
r8             0x885680 8935040
r9             0x843510 8664336
r10            0x7efec4456de0 139632679677408
r11            0x7efec6ccafc8 139632722096072
r12            0x3b32028 62070824
r13            0x1 1
r14            0x3caa942 63613250
r15            0x27c04a0 41682080
rip            0x509909 0x509909 <__ast_cond_signal+39>
eflags         0x10217 [ CF PF AF IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0 0
es             0x0 0
fs             0x0 0
gs             0x0 0
{noformat}

Cheers,
Walter Doekes
OSSO B.V.
Comments:By: Asterisk Team (asteriskteam) 2015-06-30 09:31:19.320-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Walter Doekes (wdoekes) 2015-06-30 10:55:50.239-0500

Ok. So DETECT_DEADLOCKS wasn't the cause. After disabling it, it still crashes on the same spot.
Same when disabling DEBUG_THREADS.

Details how to reproduce are in ASTERISK-25213. Crash happens with and without an attempt at fixing the refer_locked issue.

Just sit back and wait, and it dies after half a minute (at around 600+ calls):
{noformat}
*CLI> Segmentation fault (core dumped)

real 0m23.660s
user 0m1.661s
sys 0m0.609s
{noformat}

Crashes do not always happen in the same spot, cause looks like overwritten memory in all cases. Enabled MALLOC_DEBUG and of course all the crashes went away. *sigh*

(Or was it the DEBUG_FD_LEAKS option?)

By: Walter Doekes (wdoekes) 2015-07-01 05:31:39.045-0500

Hm. Appears to be the DEBUG_FD_LEAKS indeed. After finishing the work on ASTERISK-25213 and observing no crashes all morning, I recompiled with DEBUG_FD_LEAKS.

It did not take long until I got this:
{noformat}
*CLI> ==21778== Thread 25:
==21778== Invalid read of size 4
==21778==    at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69)
==21778==    by 0x456E17: ast_autoservice_start (autoservice.c:222)
==21778==    by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317)
==21778==    by 0x200BA5C6: dial_exec_full (app_dial.c:3078)
==21778==    by 0x200BA999: dial_exec (app_dial.c:3130)
==21778==    by 0x52F6EA: pbx_exec (pbx.c:1677)
==21778==    by 0x53A369: pbx_extension_helper (pbx.c:4970)
==21778==    by 0x53D937: ast_spawn_extension (pbx.c:6100)
==21778==    by 0x53F047: __ast_pbx_run (pbx.c:6575)
==21778==    by 0x540A8B: pbx_thread (pbx.c:6905)
==21778==    by 0x5A000B: dummy_start (utils.c:1223)
==21778==    by 0x6264181: start_thread (pthread_create.c:312)
==21778==  Address 0x7069700074696e79 is not stack'd, malloc'd or (recently) free'd
{noformat}

{noformat}
==21778== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==21778==  General Protection Fault
==21778==    at 0x6268C54: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.S:69)
==21778==    by 0x456E17: ast_autoservice_start (autoservice.c:222)
==21778==    by 0x457309: ast_autoservice_chan_hangup_peer (autoservice.c:317)
==21778==    by 0x200BA5C6: dial_exec_full (app_dial.c:3078)
==21778==    by 0x200BA999: dial_exec (app_dial.c:3130)
==21778==    by 0x52F6EA: pbx_exec (pbx.c:1677)
==21778==    by 0x53A369: pbx_extension_helper (pbx.c:4970)
==21778==    by 0x53D937: ast_spawn_extension (pbx.c:6100)
==21778==    by 0x53F047: __ast_pbx_run (pbx.c:6575)
==21778==    by 0x540A8B: pbx_thread (pbx.c:6905)
==21778==    by 0x5A000B: dummy_start (utils.c:1223)
==21778==    by 0x6264181: start_thread (pthread_create.c:312)
{noformat}

By: Walter Doekes (wdoekes) 2015-07-02 04:16:32.929-0500

This {{issueA25212-fix_fdleak_overflow_and_cleanup.patch}} ought to fix it.

The problem was that the pipe() function did not do any bounds checking, overwriting past the fdleaks static structure.

I also cleaned some, amongst others consistently making use of ARRAY_LEN.