ASTERISK-15406: [patch] Asterisk crashes in ast_rtcp

Summary: ASTERISK-15406: [patch] Asterisk crashes in ast_rtcp_write at rtp.c:3536

Reporter: Nicu Farmache (nicuro) Labels:

Date Opened: 2010-01-06 08:12:20.000-0600 Date Closed: 2011-06-07 14:00:22

Priority: Critical Regression? No

Status: Closed/Complete Components: Core/RTP

Versions: Frequency of Occurrence:

Related Issues: is related to ASTERISK-18570 Crashes in RTCP handling

Environment: Attachments:
( 0) 20100119__issue16556.diff.txt
( 1) bt-2010-01-06T11:52:38+0200.txt
( 2) bt-2010-01-06T15:52:23+0200.txt
( 3) bt-2010-01-06T16:46:35+0200.txt
( 4) p_rtp.txt
( 5) p_rtp2.txt
( 6) p_rtp3.txt

Description: I have 3 servers running Asterisk, and all of them crash approximately once or twice a day.
The backtrace is always the same on all the servers.

****** ADDITIONAL INFORMATION ******

Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal 11, Segmentation fault.
#0 0x08118899 in ast_rtcp_write (data=0xc63b7b8) at rtp.c:3536
3536 if (rtp->txcount > rtp->rtcp->lastsrtxcount)
(gdb) bt full
#0 0x08118899 in ast_rtcp_write (data=0xc63b7b8) at rtp.c:3536
rtp = 0xc63b7b8
res = 1000
#1 0x0813f5e0 in ast_sched_runq (con=0x8d96928) at sched.c:489
current = 0xb3c39790
when = {tv_sec = 1262785942, tv_usec = 482994}
numevents = 0
res = 1262785942
__PRETTY_FUNCTION__ = "ast_sched_runq"
#2 0xb6a87519 in do_monitor (data=0x0) at chan_sip.c:21441
res = 0
t = 1262785942
reloading = 0
__PRETTY_FUNCTION__ = "do_monitor"
#3 0x08152423 in dummy_start (data=0xb68256b8) at utils.c:968
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1217806348, 0, 4001536, -1243798632, 1568098554, -847851647}, __mask_was_saved = 0}}, __pad = {0xb5dd2454, 0x0, 0x6, 0xb5dd23a8}}
__cancel_routine = 0x8076296 <ast_unregister_thread>
__cancel_arg = 0xb5dd2b70
not_first_call = 0
ret = 0xb768b636
a = {start_routine = 0xb6a8723b <do_monitor>, data = 0x0, name = 0xb6824f00 "do_monitor", ' ' <repeats 11 times>, "started at [21468] chan_sip.c restart_monitor()"}
#4 0xb768b80e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#5 0xb77707ee in clone () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
(gdb)

Comments: By: Nicu Farmache (nicuro) 2010-01-06 09:06:05.000-0600

Is it OK if I comment out this function? I do not need RTCP, and it is very important for me that Asterisk is stable in a production environment.

It is running an inbound call center and all calls fail when it segfaults.
By: Leif Madsen (lmadsen) 2010-01-06 09:12:30.000-0600

It is OK for you to do anything you want to the code :) If you want to know if there would be any side effects, I'd ask on the asterisk-dev mailing list.
By: Tilghman Lesher (tilghman) 2010-01-12 18:18:26.000-0600

In gdb, what is the output of:

p *rtp
p *rtp->rtcp
By: Nicu Farmache (nicuro) 2010-01-13 05:10:54.000-0600

I attached the output.
I think the problem is:

(gdb) p *rtp->rtcp
Cannot access memory at address 0x2e363235
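
A note on the address gdb reports: read as four bytes in little-endian memory order (the backtrace shows an i686 build), 0x2e363235 is the printable ASCII text "526.". That is consistent with the rtcp pointer having been overwritten by string data rather than pointing at a real structure, although it does not prove it. A minimal C sketch of that check follows; pointer_as_text is a hypothetical helper for illustration, not part of Asterisk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Asterisk): write the four bytes of a
 * 32-bit value into out in little-endian memory order, replacing any
 * non-printable byte with '?'. Returns 1 if all four bytes are printable
 * ASCII, 0 otherwise.
 */
int pointer_as_text(uint32_t value, char out[5])
{
    int all_printable = 1;

    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        /* Byte i in memory on a little-endian machine. */
        unsigned char byte = (unsigned char) ((value >> (8 * i)) & 0xff);

        if (byte >= 0x20 && byte <= 0x7e) {
            out[i] = (char) byte;
        } else {
            out[i] = '?';
            all_printable = 0;
        }
    }
    out[4] = '\0';
    return all_printable;
}
```

Called with 0x2e363235, the helper fills the buffer with "526." and returns 1. The valid rtp pointer from the backtrace, 0xc63b7b8, does not decode to printable text, so the helper returns 0 for it.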
By: Tilghman Lesher (tilghman) 2010-01-13 09:38:04.000-0600

It appears that this schedule ID is being run after the memory has been freed and allocated to some other thread. Would it be possible for you to run one of these servers under Valgrind (see doc/valgrind.txt)?
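
To illustrate the failure mode described here, the following is a minimal sketch of a scheduler whose periodic callback holds a pointer to an object, together with a destroy function that cancels the scheduled entry before freeing the memory. Every name in it (toy_sched, toy_rtp, toy_rtcp_write, and so on) is hypothetical; it is not Asterisk's scheduler or RTP code and not the attached patch. The only detail taken from the report is the comparison from rtp.c:3536.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_TASKS 8

typedef int (*toy_task_cb)(void *data);

struct toy_task { int active; toy_task_cb cb; void *data; };
struct toy_sched { struct toy_task tasks[TOY_MAX_TASKS]; };

struct toy_rtcp { unsigned int lastsrtxcount; int schedid; };
struct toy_rtp { unsigned int txcount; struct toy_rtcp *rtcp; };

/* Register a periodic callback; returns its schedule ID, or -1 if full. */
int toy_sched_add(struct toy_sched *s, toy_task_cb cb, void *data)
{
    for (int id = 0; id < TOY_MAX_TASKS; id++) {
        if (!s->tasks[id].active) {
            s->tasks[id].active = 1;
            s->tasks[id].cb = cb;
            s->tasks[id].data = data;
            return id;
        }
    }
    return -1;
}

/* Cancel a schedule ID; out-of-range IDs (such as -1) are ignored. */
void toy_sched_del(struct toy_sched *s, int id)
{
    if (id >= 0 && id < TOY_MAX_TASKS) {
        s->tasks[id].active = 0;
    }
}

/* Run every active callback once; returns how many callbacks ran. */
int toy_sched_runq(struct toy_sched *s)
{
    int ran = 0;

    for (int id = 0; id < TOY_MAX_TASKS; id++) {
        if (!s->tasks[id].active) {
            continue;
        }
        ran++;
        /* A callback returning 0 asks not to be rescheduled. */
        if (!s->tasks[id].cb(s->tasks[id].data)) {
            s->tasks[id].active = 0;
        }
    }
    return ran;
}

/* Periodic callback: dereferences rtp and rtp->rtcp, so both must be live. */
int toy_rtcp_write(void *data)
{
    struct toy_rtp *rtp = data;

    assert(rtp != NULL && rtp->rtcp != NULL);
    if (rtp->txcount > rtp->rtcp->lastsrtxcount) {
        rtp->rtcp->lastsrtxcount = rtp->txcount;
    }
    return 1; /* keep running */
}

struct toy_rtp *toy_rtp_new(struct toy_sched *s)
{
    struct toy_rtp *rtp = calloc(1, sizeof(*rtp));

    if (!rtp) {
        return NULL;
    }
    rtp->rtcp = calloc(1, sizeof(*rtp->rtcp));
    if (!rtp->rtcp) {
        free(rtp);
        return NULL;
    }
    rtp->rtcp->schedid = toy_sched_add(s, toy_rtcp_write, rtp);
    return rtp;
}

/* Cancel the scheduled callback BEFORE freeing the memory it points at. */
void toy_rtp_destroy(struct toy_sched *s, struct toy_rtp *rtp)
{
    if (!rtp) {
        return;
    }
    if (rtp->rtcp) {
        toy_sched_del(s, rtp->rtcp->schedid);
        free(rtp->rtcp);
    }
    free(rtp);
}
```

If toy_rtp_destroy skipped the toy_sched_del call, the next toy_sched_runq would invoke toy_rtcp_write on freed memory. In a multithreaded program that memory may already belong to another allocation, which would match a garbage rtcp pointer like the one in the gdb output.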
By: Tilghman Lesher (tilghman) 2010-01-15 15:01:16.000-0600

Note that it is possible for Asterisk to cease crashing when run under Valgrind, though the output may still provide us the information necessary to track down the error.
By: Nicu Farmache (nicuro) 2010-01-16 15:42:48.000-0600

I tried to run it under Valgrind, but it runs too slowly. Since this is a production server and the crash usually happens at peak hours, I cannot run it like this.

I have other servers that do not crash, and the only difference between them that I can think of is that these servers (the ones that crash) regularly use the Redirect action through AMI.

The calls come from 10 E1 channels (DAHDI) and go to an extension that plays MOH. Later, a script takes some of the channels and redirects them to another part of the dialplan that eventually calls a SIP extension.
By: Tilghman Lesher (tilghman) 2010-01-19 13:47:18.000-0600

If the AMI Redirect is what is causing this crash, then this patch may fix it. Please try this out on a single server and see if it makes a positive difference.
By: Nicu Farmache (nicuro) 2010-01-19 15:32:18.000-0600

I applied the patch and enabled full logging to a file so that I can see the last thing it does if it crashes again.

I will return tomorrow with the results.
By: Tilghman Lesher (tilghman) 2010-01-20 23:57:48.000-0600

So did it crash again?
By: Nicu Farmache (nicuro) 2010-01-21 03:58:51.000-0600

Unfortunately, it did. I'll wait for it to crash again so that I can check whether there is something similar in the logs before the crash.
By: Tilghman Lesher (tilghman) 2010-01-26 02:06:49.000-0600

nicuro: did it crash again, enough that you have something in the logs to show me?

BTW, I also have a new tool that may assist in debugging this. It's similar in detecting memory errors to Valgrind, but it does not have the same slowdown. It's located in a developer branch in SVN: http://svn.digium.com/svn/asterisk/team/tilghman/malloc_hold
By: Tilghman Lesher (tilghman) 2010-01-31 19:34:34.000-0600

nicuro: ping
By: Tilghman Lesher (tilghman) 2010-02-04 13:15:31.000-0600

Suspended, pending further response.