[Home]

Summary:ASTERISK-26692: res_rtp_asterisk: Crash in dtls_srtp_handle_timeout at res_rtp_asterisk (using chan_sip)
Reporter:Sebastian Gutierrez (sum)Labels:
Date Opened:2017-01-04 09:29:00.000-0600Date Closed:2017-04-27 10:05:52
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/General Resources/res_rtp_asterisk
Versions:13.13.0 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-26831 res_rtp_asterisk: Race condition when RTCP and WebRTC is used
Environment:Ubuntu 16.04Attachments:( 0) backtrace.txt
( 1) backtrace.txt
( 2) backtrace.txt
( 3) backtracefull.txt
( 4) dump1.txt
( 5) full.txt
( 6) logbeforedump.txt
( 7) logtimeofbt.txt
Description:Only dumps occasionally:
{noformat}
#0  dtls_srtp_handle_timeout (instance=instance@entry=0x7fc860007180, rtcp=rtcp@entry=1) at res_rtp_asterisk.c:1910
       rtp = 0x7fc86001c220
       dtls = 0x240
       dtls_timeout = {tv_sec = 140498581643312, tv_usec = 140498715030608}
#1  0x00007fc838bd5823 in dtls_srtp_handle_rtcp_timeout (data=0x7fc860007180) at res_rtp_asterisk.c:1941
       instance = 0x7fc860007180
       reschedule = <optimized out>
#2  0x00000000005b00fb in ast_sched_runq (con=0x24a53f0) at sched.c:783
       current = 0x24b5140
       numevents = 11
       res = <optimized out>
       __PRETTY_FUNCTION__ = "ast_sched_runq"
#3  0x00007fc8312e28ee in do_monitor (data=data@entry=0x0) at chan_sip.c:29502
       res = <optimized out>
       t = 1483976610
       reloading = <optimized out>
       __PRETTY_FUNCTION__ = "do_monitor"
{noformat}

dmesg:
{noformat}
[6126811.297251] asterisk[21118]: segfault at 278 ip 00007fad67faa822 sp 00007fad62bb6ce0 error 4 in res_rtp_asterisk.so[7fad67f9f000+1a000]

[5438520.048513] asterisk[17423]: segfault at 278 ip 00007f8d3b777772 sp 00007f8d363ffce0 error 4 in res_rtp_asterisk.so[7f8d3b76c000+1a000]

[5003538.737797] asterisk[28892]: segfault at 278 ip 00007fc3757f8772 sp 00007fc37030cce0 error 4 in res_rtp_asterisk.so[7fc3757ed000+1a000]
{noformat}
in messages I can see this that is the only thing that could be the issue:
{noformat}
[Jan  4 10:08:21] WARNING[29724][C-00000f32] chan_sip.c: Insufficient information for SDP (m= not found)
{noformat}
may be related to ASTERISK-9041 ?

Comments:By: Asterisk Team (asteriskteam) 2017-01-04 09:29:05.602-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2017-01-04 09:30:56.608-0600

There's not much to go on here. We really need to see console output and SIP traffic at the very least as well as configuration. There should also be a core dump if it really did segfault.

By: Sebastian Gutierrez (sum) 2017-01-04 10:21:48.515-0600

The issue is once or twice per day so Is difficult to have a sip trace because will be huge the peers configuration is same for all (see below)
are all webrtc peers. and no core dump, I will try to get more information.
{noformat}
[1002]
description=Agente2-agent
host=dynamic
nat=force_rport,comedia
type=friend
accountcode=Agente2
call-limit=2000
callgroup=
callerid=1002
canreinvite=no
context=agentes
dtmfmode=rfc2833
fromuser=
fromdomain=
insecure=
language=es
mailbox=1002@voicemail
pickupgroup=
qualify=yes
secret=123456
disallow=all
allow=alaw
allow=ulaw
defaultuser=1002
subscribecontext=default
transport=wss
encryption=yes
directmedia=no
port=5061
avpf=yes
force_avp=yes
icesupport=yes
dtlsenable=yes
dtlsverify=fingerprint
dtlscertfile=/etc/nginx/ssl/server.pem
dtlscafile=/etc/nginx/ssl/server.crt
dtlssetup=actpass
{noformat}

By: Rusty Newton (rnewton) 2017-01-05 09:39:18.312-0600

Is Asterisk running with -g ?

Remember to attach debug to the issue with .txt extension and do not paste debug into the comments. Thanks.

By: Sebastian Gutierrez (sum) 2017-01-05 10:07:28.033-0600

I'm using safe_asterisk so is running:  asterisk -f -vvvg -c

By: Rusty Newton (rnewton) 2017-01-05 21:31:07.829-0600

Quite strange. Yeah, we'll need the full debug log to see if there is anything we can spot there. Otherwise we don't have much to go on unless you can describe a reliable way to reproduce the issue step by step.

By: Sebastian Gutierrez (sum) 2017-01-09 10:04:50.825-0600

this time dumped, let me know if more verbosity is needed

By: Rusty Newton (rnewton) 2017-01-09 19:53:35.676-0600

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

If you run the following command for the backtrace, that is exactly what we want each time.

{noformat}
gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c core > /tmp/backtrace.txt
{noformat}

Did you have a log to go along with this crash? We always need the log if you have one. Especially with the mentioned debug settings in https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Sebastian Gutierrez (sum) 2017-01-09 20:03:40.956-0600

attached backtrace

By: Sebastian Gutierrez (sum) 2017-01-09 20:14:03.083-0600

attach log portion`

By: Sebastian Gutierrez (sum) 2017-01-12 09:41:25.938-0600

new backtrace

By: Sebastian Gutierrez (sum) 2017-01-13 07:55:44.837-0600

do you need a better backtrace with dont optimize or something like that? let me know, this crashes every day more than once.




By: Rusty Newton (rnewton) 2017-01-14 13:58:20.197-0600

Yeah, in the instructions previously linked the wiki describes what we need. It mentions DONT_OPTIMIZE and BETTER_BACKTRACES compiler flags.

In addition, grab a full debug log to go along with the trace. The instructions for that are here: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Make sure it includes the last thousand lines or so before the crash, and warning,error,notice,verbose and debug log channels with verbose and debug both turned up to 5 or higher.



By: Asterisk Team (asteriskteam) 2017-01-29 12:00:01.375-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Sebastian Gutierrez (sum) 2017-02-01 15:36:37.383-0600

attached bt full

By: Sebastian Gutierrez (sum) 2017-02-01 15:37:25.084-0600

full bt

By: Sebastian Gutierrez (sum) 2017-02-01 16:36:34.456-0600

open

By: Asterisk Team (asteriskteam) 2017-02-01 16:36:34.798-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Rusty Newton (rnewton) 2017-02-02 07:01:02.496-0600

RE-attaching new logs with .txt extension so others don't have to download.

By: Rusty Newton (rnewton) 2017-02-02 07:07:06.796-0600

Thanks for the additional trace, however you still didn't provide the debug log as requested.

In this case the trace isn't enough, developers can't tell how it got into the funky state that resulted in the crash.

Please reference my previous comments for instructions on gathering a debug log leading up to the crash. As it right now, we don't have enough data for a developer to investigate so I would be hesistant to open the issue up.

By: Sebastian Gutierrez (sum) 2017-02-02 07:12:36.817-0600

as you can understand is impossible to get debug log for this issue, is random and I cannot reproduce all the time, so getting a debug log makes it impossible....

so this is a real issue that is why I think this should be remain open, maybe others have the same issue and can share more logs. closing this issue wont make the bug disappear....  


I will try anyway to see if I can have lucky and get some debug logs....

By: Rusty Newton (rnewton) 2017-02-02 07:23:27.054-0600

bq. as you can understand is impossible to get debug log for this issue, is random and I cannot reproduce all the time, so getting a debug log makes it impossible....

If the issue is disk space, turn up logging, rotate and delete old logs so they don't get out of control, when Asterisk crashes copy the current log to a new directory.

Now, if the problem is performance issues due to the additional logging then I understand. Not much you can do there unless you can build a clone system and run test traffic on it to simulate and trigger the issue.

bq. so this is a real issue that is why I think this should be remain open, maybe others have the same issue and can share more logs. closing this issue wont make the bug disappear....

We open up and track issues that have enough information for investigation. If we don't have enough data for a developer to investigate then there is no reason to keep it open. In a closed state the issue is still visible and searchable - we can always open it up if someone gathers additional data in the future.

If we kept every issue open then the issue tracker would be filled with issues that never led to any resolution which causes confusion when searching through open issues.

bq. I will try anyway to see if I can have lucky and get some debug logs....

I'll leave this open for a few more weeks. Thanks.

By: Sebastian Gutierrez (sum) 2017-02-16 11:12:55.549-0600

{noformat}
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  dtls_srtp_handle_timeout (instance=instance@entry=0x7f8bc018df70, rtcp=rtcp@entry=1) at res_rtp_asterisk.c:1910
1910 DTLSv1_handle_timeout(dtls->ssl);
[Current thread is 1 (Thread 0x7f8b710ae700 (LWP 7303))]
(gdb)
(gdb)
(gdb) bt
#0  dtls_srtp_handle_timeout (instance=instance@entry=0x7f8bc018df70, rtcp=rtcp@entry=1) at res_rtp_asterisk.c:1910
#1  0x00007f8b814cc813 in dtls_srtp_handle_rtcp_timeout (data=0x7f8bc018df70) at res_rtp_asterisk.c:1941
#2  0x00000000005d7251 in ast_sched_runq (con=0x24baaa0) at sched.c:783
#3  0x00007f8b79bbef42 in do_monitor (data=0x0) at chan_sip.c:29502
#4  0x000000000061c7b7 in dummy_start (data=0x24c8990) at utils.c:1235
#5  0x00007f8bcb7d46ba in start_thread (arg=0x7f8b710ae700) at pthread_create.c:333
#6  0x00007f8bcaa7882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
{noformat}

the log is configured this way:

{noformat}
full => notice,warning,error,debug,verbose,dtmf,fax
{noformat}

The dump was:  147M Feb 16 11:46 core

I attach the log file

the dump is the same as the others

By: Richard Mudgett (rmudgett) 2017-04-12 10:45:24.618-0500

The patch for ASTERISK-26835 and ASTERISK-26853 is likely to fix this as well.

By: Sebastian Gutierrez (sum) 2017-04-18 17:04:07.411-0500

new backtrace

By: Richard Mudgett (rmudgett) 2017-04-18 18:50:35.199-0500

I think I see what is going on.  This is actually not related to the ASTERISK-26835 and ASTERISK-26853 patch.  We seem to be forgetting to stop a scheduled DTLS timer when we disable RTCP because a channel is put on hold or we try to do direct media before the DTLS handshake completes.

By: Sebastian Gutierrez (sum) 2017-04-18 21:20:10.804-0500

ok, let me know if some patch needs to be tested

By: Friendly Automation (friendly-automation) 2017-04-27 10:05:52.883-0500

Change 5504 merged by Jenkins2:
res_rtp_asterisk.c: Fix crash in RTCP DTLS operation.

[https://gerrit.asterisk.org/5504|https://gerrit.asterisk.org/5504]

By: Friendly Automation (friendly-automation) 2017-04-27 10:06:37.260-0500

Change 5505 merged by Jenkins2:
res_rtp_asterisk.c: Fix crash in RTCP DTLS operation.

[https://gerrit.asterisk.org/5505|https://gerrit.asterisk.org/5505]

By: Friendly Automation (friendly-automation) 2017-04-27 14:00:01.306-0500

Change 5503 merged by Joshua Colp:
res_rtp_asterisk.c: Fix crash in RTCP DTLS operation.

[https://gerrit.asterisk.org/5503|https://gerrit.asterisk.org/5503]