ASTERISK-25540: Core dump related to RTCP report

[Home]

Summary: ASTERISK-25540: Core dump related to RTCP report

Reporter: Morten Tryfoss (mtryfoss) Labels:

Date Opened: 2015-11-11 02:34:12.000-0600 Date Closed: 2018-02-28 07:54:11.000-0600

Priority: Critical Regression?

Status: Closed/Complete Components: Core/RTP

Versions: 13.6.0 Frequency of
Occurrence One Time

Related
Issues:

Environment: Attachments: ( 0) backtrace.txt
( 1) backtrace2.txt

Description: I've opened several reports regarding core dumps in 13.5 and 13.6, but I haven't seen exactly this one before. These nodes have relatively high amount of traffic and has a couple of weeks uptime. I can't reproduce the crash intentionally.

I'm just uploading to let you know about potential issues. Feel free to close again if you want.

Comments: By: Asterisk Team (asteriskteam) 2015-11-11 02:34:13.776-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Morten Tryfoss (mtryfoss) 2015-11-11 02:35:22.178-0600

See attached backtrace.
By: Morten Tryfoss (mtryfoss) 2015-11-11 02:39:18.784-0600

Could this be related to ASTERISK-25449, egen though it should be fixed in 13.6?
By: Morten Tryfoss (mtryfoss) 2015-11-11 07:54:29.824-0600

Another crash today, on another node.
By: Morten Tryfoss (mtryfoss) 2015-11-12 03:24:34.973-0600

I have a feeling that this is a sort of race condition between a call answer and a scheduled RTCP report. The last AMI event asterisk sent before crashing was a SIP-call entering a bridge (BridgeEnter). The call was answered exactly 5 seconds after the initial INVITE.

I'm going to try adjusting the interval (default - 5000 msec now) to see what happens.
By: Matt Jordan (mjordan) 2015-11-18 19:24:01.072-0600

There have been sporadic reports of a similar crash, but we haven't yet been able to reproduce it. If you can provide step by step instructions with configuration that can produce the crash, that'd be great. Thanks!
By: Morten Tryfoss (mtryfoss) 2015-11-19 01:24:57.751-0600

I adjusted the RTCP interval from 5000 to 10000 msec. We've had no crash since then, but they might and will probably come at some point.

As I wrote earlier I've not been able to reproduce this, but it happens on a node with just simple SIP-SIP calls between two users. Maybe it could be reproduced if the interval is lowered and a tool for stress testing is used to initiate a lot of calls being answered after around the same time the the report is sent?
By: Rusty Newton (rnewton) 2015-11-21 13:26:28.796-0600

Can you provide the dialplan and configuration of both systems with a description of the call flow?

Simple sip to sip calls between two different systems could mean a million different things in terms of configuration and what happens.
By: Morten Tryfoss (mtryfoss) 2015-11-23 08:54:39.276-0600

The first crash is a very simple scenario:
{noformat}
exten => _XX.,1,Agi(agi://${DB(fastagi/hostname)}/check_callerid_vbox.agi)
exten => _XX.,n,hangup
{noformat}

The AGI log is like this:
{noformat}
write: GET VARIABLE SIP_HEADER(X-NCOM-SRCIP)
read: 200 result=1 (x.x.x.x)
write: EXEC SIPAddHeader "P-Asserted-Identity: <sip:11223344@x.x.x.x\;user=phone>"
read: 200 result=0
write: EXEC SIPAddHeader "X-NCOM-CALLTYPE: 01"
read: 200 result=0
write: EXEC SIPAddHeader "X-NCOM-ACCOUNT: 11223344"
read: 200 result=0
write: EXEC SIPAddHeader "X-NCOM-CUSTOMERID: 1"
read: 200 result=0
write: SET VARIABLE "CALLERID(name)" "11223344"
read: 200 result=1
write: GET VARIABLE SIPPEER(sipic2,status)
read: 200 result=1 (OK (5 ms))
write: EXEC Dial "SIP/sipic2/99887766,,"
{noformat}

Inbound peer:
{noformat}
[sip2]
username=sip2
type=friend
context=cc_customer
host=x.x.x.x
deny=0.0.0.0/0.0.0.0
permit=x.x.x.x
progressinband=no
nat=no
canreinvite=no
{noformat}

Outbound peer:
{noformat}
[sipic2]
username=sipic2
type=friend
context=default
host=y.y.y.y
deny=0.0.0.0/0.0.0.0
permit=y.y.y.y
disallow=all
allow=alaw
progressinband=no
canreinvite = no
usereqphone=yes
qualify=yes
qualifyfreq=15
{noformat}
By: Rusty Newton (rnewton) 2015-11-24 08:43:51.080-0600

Thanks for the additional information.

We do have some people performing a variety of stress testing with Asterisk right now so I will provide this information to them and if we find something similar to this we'll re-open or open a new issue. Thanks!

In the meantime if someone can find how to reproduce this reliably please comment here on the issue or of course open a new one if the issue is significantly different.
By: Rusty Newton (rnewton) 2015-11-24 11:42:44.136-0600

This is similar enough to some other issues we have seen - so we will keep this open for now just in case someone arrives with more data or gets a chance to put more effort into reproduction.
By: Morten Tryfoss (mtryfoss) 2015-12-21 04:55:28.134-0600

Got another crash today, exactly the same as last time. We've not had any crash for 4-5 weeks, and the Asterisk processes has not been restarted since. Last time we had two the same day.

Could the risk of a crash increase when the process gets some uptime? It will be interesting to see if this happens with the other ones which were restarted at the exact same time, or if this is just random.
By: Morten Tryfoss (mtryfoss) 2016-02-16 02:14:28.693-0600

After adjusting RTCP interval from 5 to 10 seconds, the crashes happens much less frequent. But I had another one last night. What seems to happen is "200 OK" for the call just the same time as the RTCP report should be sent. In my case, 10 seconds after INVITE. There were very little load on the server when it crashed.
By: Aaron An (aaron) 2016-05-05 01:10:11.682-0500

I have the similar crash at 13.8.0.

(gdb) bt
#0 0x00007f6f5b151a2d in malloc_consolidate () from /lib64/libc.so.6
#1 0x00007f6f5b153ed5 in _int_malloc () from /lib64/libc.so.6
#2 0x00007f6f5b15626c in malloc () from /lib64/libc.so.6
#3 0x00007f6f5b7a899f in __libc_res_nsend () from /lib64/libresolv.so.2
#4 0x00007f6f5b7a5cce in __libc_res_nquery () from /lib64/libresolv.so.2
#5 0x00007f6f5b7a6784 in __libc_res_nsearch () from /lib64/libresolv.so.2
#6 0x00007f6ef3a47c53 in _nss_dns_gethostbyname4_r () from /lib64/libnss_dns.so.2
#7 0x00007f6f5b1b2bb8 in gaih_inet () from /lib64/libc.so.6
#8 0x00007f6f5b1b627d in getaddrinfo () from /lib64/libc.so.6
#9 0x0000000000639b48 in ast_sockaddr_resolve (addrs=0x7f6e9caf75d0, str=0x7f6e9caf76b0 "ip-10-10-6-111", flags=768, family=0) at netsock2.c:304
#10 0x000000000043a7d2 in resolve_first (addr=0x7f6e9caf7930, name=0x7f6e9caf76b0 "ip-10-10-6-111", flag=768, family=0) at acl.c:792
#11 0x000000000043b3ec in ast_find_ourip (ourip=0x7f6e9caf7930, bindaddr=0x7f6e9caf7810, family=0) at acl.c:970
#12 0x00007f6eae2c20bb in ast_rtcp_write_report (instance=instance@entry=0x7f6ddbfe6858, sr=sr@entry=1) at res_rtp_asterisk.c:3199
#13 0x00007f6eae2c284d in ast_rtcp_write (data=0x7f6ddbfe6858) at res_rtp_asterisk.c:3236
#14 0x00000000006c00e3 in ast_sched_runq (con=0x2acc5c0) at sched.c:783
#15 0x00000000006bd465 in sched_run (data=0x2acc5c0) at sched.c:170
#16 0x000000000072de0f in dummy_start (data=0x2acb1c0) at utils.c:1237
#17 0x00007f6f5bee1dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f6f5b1ccc9d in clone () from /lib64/libc.so.6
By: Morten Tryfoss (mtryfoss) 2016-09-30 03:05:54.048-0500

It keeps crashing from time to time. I've analyzed the sip message flow on each, and what seems to happen is that the call receives "200 OK" at just the same time Asterisk sends an rtcp report.
rtcpinterval is configured to 10 seconds, and this happens when the timing is just right at the calls is answered after for example 10, 20 or 30 seconds. Eearlier rtcpinterval was 5 seconds, and the crashes were far more often (which is logical).

The last event in the AMI debug log is "Event: BridgeEnter" for the two sip calls.
By: Jatin Jain (jatinjain) 2017-07-31 07:07:24.142-0500

Faced same problem with 13.13 as well.
By: Richard Mudgett (rmudgett) 2017-07-31 10:42:17.846-0500

This problem may be caused by the reentrancy issue fixed for these issues: ASTERISK-26835, ASTERISK-26853, and ASTERISK-26912. Please test with at least the versions containing that fix.
By: Morten Tryfoss (mtryfoss) 2018-02-28 05:06:04.680-0600

Those commits seems to have worked. I've had no crashes like this after applying those.