ASTERISK-25127: DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check

[Home]

Summary: ASTERISK-25127: DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending

Reporter: Dade Brandon (dade) Labels:

Date Opened: 2015-05-24 13:24:20 Date Closed: 2015-07-07 14:57:00

Priority: Critical Regression?

Status: Closed/Complete Components: Resources/res_rtp_asterisk

Versions: 11.18.0 Frequency of
Occurrence Frequent

Related
Issues:
is related to ASTERISK-25103 Roundup - investigate Asterisk DTLS crashes

Environment: Linux kernel "3.13.0-24-generic" Ubuntu 14.04, Asterisk 11.18.0-rc1, Compiler flags: DONT_OPTIMIZE, LOADABLE_MODULES, BETTER_BACKTRACES, BUILD_NATIVE, G711_NEW_ALGORITHM Openssl: 1.0.1f-1ubuntu2.11 libuuid1: 2.20.1-5.1ubuntu20.4 SIP Realtime: Module loaded & not in use Timer: res_timing_timerfd (res_timing_pthread also loaded) See attached 'environment.txt' for output of 'core show settings' and 'module show' All calls are: [Peer] <->ch1<-> Asterisk <->ch2<-> [Misc ITSPs] ch1 transport is always SIP over WSS using sipjs on Chrome (stable/M42/M43 and/or Canary M45) with ulaw codec. Peer is almost always NATd, Asterisk is never NATd, ch2 transport is always Plain Old SIP (5060, no TLS) with RTP (no [d]TLS) Attachments: ( 0) backtrace_latest_patch.txt
( 1) core-1.txt
( 2) core-2.txt
( 3) core-all_thread_bt_jun10.txt
( 4) core-fullbt-jun22#14.txt
( 5) core-fullbt-jun24#15
( 6) core-jun10#2.txt
( 7) debug.21.txt
( 8) debug-crash_jun10.txt
( 9) debug-jun10#2.txt
(10) debug-jun22#14.txt
(11) debug-jun24#15
(12) environment.txt

Description: h2. Preface
First I just want to say, I am very familiar with the other DTLS crash issues on Jira, I believe that if this is related, it's probably a precursor to the crashes that create the later segfaults, because since upgrading to trunk I haven't had a core dump, but have continued to experience crashes (asterisk restarting via safe_asterisk-- unknown signal since there's no core). This is likely more suited as a parent issue to some of the other DTLS crash issues.

We get about 5-10 crashes per production day (across 35 servers) and I only did the latest update Friday evening, so I will probably know if there's useful core dumps by the end of Tuesday, due to US holidays on Monday. We service businesses, so the volume is extremely low right now due to the long weekend in the US market.

Also, we notice that the crashes seem to target certain servers, and that there appears to be a correlation between the affected servers, and the latency of the peers connected to that server.

h2. Details:

Preceding an asterisk crash, we receive "Unable to cancel schedule ID nnnnnn. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line NNNN)"

(The line number is 1811 on trunk, we have other patches applied above which are mostly logging related, causing our line number to be less relevant. The line is "AST_SCHED_DEL_UNREF(rtp->sched, rtp->dtlstimerid, ao2_ref(instance, -1));"

Asterisk does not die immediately after. In messages, there is anywhere from 2 seconds to a full minute remaining before each crash. Note the timing in the example debug logs below.

On servers that we've added extra logging to, we find logs reporting that for each of these issues, an ast_debug call we inserted from main/rtp_engine.c, for the same thread and call, indicating that the rtp instance->engine is NULL:

h3. Example 1
{noformat}
[14:55:16] WARNING[11973][C-0000e496] res_rtp_asterisk.c: Unable to cancel schedule ID 539829. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
[14:55:55] DEBUG[11973][C-0000e496] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f7c50076da8'
- this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
{noformat}
h3. Example 2
{noformat}
[12:18:54] WARNING[10203][C-0000a76c] res_rtp_asterisk.c: Unable to cancel schedule ID 436177. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).

[12:18:59] DEBUG[23236][C-0000a76c] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
- this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).

[12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG1.2 ast_rtp_instance_write-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
- this debug line asserts (instance && !instance->engine) before instance->engine->write(instance, frame)

[12:18:59] VERBOSE[11061][C-0000a76c] app_verbose.c: Caller party hung up MinDur running on SIP/Pir1-0000fadf -- Answered time :atime 1431112732:talk 7 -- Route 7503 -- MinDur request 0

[12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG29.2 -- NULL INSTANCE ENGINE for Instance '0x7f6492f03a98'
- this asserts (instance && !instance->engine) before instance->engine->stop(instance)
{noformat}

Comments: By: Dade Brandon (dade) 2015-05-25 18:34:44.910-0500

Got a couple of related crashes today to update, both with core dumps

\[May 25 13:39:25\] WARNING\[6951\]\[C-0000023c\] res_rtp_asterisk.c: Unable to cancel schedule ID 9800. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1836).

35 seconds later, crash in ast_rtp_instance_get_stats (print instance->engine = (struct ast_rtp_engine *) 0x0)

attached as "core-1.txt" [^core-1.txt]

(this is a persistent crash location since switching to WebRTC from our previous SIP without \[d\]TLS configuration; it does not happen on servers without webrtc peers)

By: Dade Brandon (dade) 2015-05-25 18:43:12.411-0500

Core #2 for today,

\[May 25 09:57:18\] WARNING\[16886\]\[C-000000e9\] res_rtp_asterisk.c: Unable to cancel schedule ID 4383. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1836).
Followed

#2 0x00007f47c872138f in OpenSSLDie () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0

Interestingly, this one doesn't have rtp->engine == null (in backtrace I added print instance->engine from frame 7), however it's still preceded by the same "Unable to cancel schedule ID" message. This is another crash location we've been seeing since we switched on webrtc back in 11.15, but I'd never drawn the connection to this debug message before today.

If anybody knows how I can compile with debug symbols for openssl, I'd be happy to do so.

Attached as core-2.txt [^core-2.txt]
By: Dade Brandon (dade) 2015-05-25 18:46:18.146-0500

Sorry re core #2 w/ abort/assertion failure in openssl, I just realized that abort was actually a different crash 8 min before the one that followed the log @ May 25 09:57:18. There's no core for the crash @ May 25 09:57, but it occurred @ 09:57:33 (15 seconds after the 'Unable to cancel schedule ID' message)
By: Dade Brandon (dade) 2015-05-25 20:05:39.529-0500

Re core #1 - I inspected the packet capture, nothing looks overly abnormal about it, but I was able to confirm that the channel/leg that was undergoing sip_hangup within the pbx thread at the time of the segfault, which had !instance->engine, was not a DTLS channel, it was the ITSP leg. No signalling to end was received from that leg. The other leg, which was using DTLS, was the one with the "Unable to cancel schedule ID" set.

I can't see anywhere that would set instance->engine to NULL, or any reason why the dialog would have its ->rtp set to non-NULL if the engine didn't initialize, so I suspect some sort of memory corruption or double free happening in the DTLS code, in a way that affects the bridged channel.

Hopefully this helps in some way.
By: Rusty Newton (rnewton) 2015-05-28 19:25:43.988-0500

Thanks! As always, if you can get traces from Asterisk compiled with DONT_OPTIMIZE and BETTER_BACKTRACES then that is helpful. Also it would be great if you can provide full logs with "DEBUG" turned up to 5 captured right before the crash.
By: Dade Brandon (dade) 2015-06-03 19:09:30.195-0500

We're still working on getting copies of this crash with full debug logs, it's been oddly elusive by hitting before our network admin gets in to start monitoring with higher log levels. I should have an update by or on the weekend.
By: Nicole McIntosh (atna99) 2015-06-12 17:01:14.133-0500

1 coredump with full backtrace, and 1 debug log.

no 'a bug' message present but, caused by null engine.

see:
debug-crash_jun10.txt
core-all_thread_bt_jun10.txt
By: Nicole McIntosh (atna99) 2015-06-12 17:38:45.522-0500

coredump and debug log attached, for bug message:

[Jun 10 12:47:59] WARNING[13371][C-00008300] res_rtp_asterisk.c: Unable to cancel schedule ID 387151. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1836).

see:
core-jun10#2.txt
debug-jun10#2.txt

By: Nicole McIntosh (atna99) 2015-06-22 13:15:52.399-0500

Another crash.

'a bug' message present on first line of debug log. attached backtrace..

core-fullbt-jun22#14.txt
debug-jun22#14.txt
By: Nicole McIntosh (atna99) 2015-06-24 14:59:38.318-0500

From trace:
#0 0x000000000055835c in ast_rtp_instance_get_stats (instance=0x7f95d872da88, stats=0x7f968ea6aba0,
stat=AST_RTP_INSTANCE_STAT_ALL) at rtp_engine.c:2104
2104 return instance->engine->get_stat ? instance->engine->get_stat(instance, stats, stat)

[Jun 24 10:02:44] WARNING[28148][C-00000a8a] res_rtp_asterisk.c: Unable to cancel schedule ID 8922 0. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1836).

Backtrace from coredump and debug log also.
core-fullbt-jun24#15
debug-jun24#15

By: Dade Brandon (dade) 2015-06-24 15:21:52.530-0500

Nicole from the previous messages posted the requested debug logs and correlated backtraces, let us know if you need anything else. This is over my head. One thing I did notice is that the git logs for pjproject appear to resolve similar issues at some point later than the version included in asterisk-11 trunk.
By: Joshua C. Colp (jcolp) 2015-07-06 06:00:35.655-0500

A change is now up for review at the following addresses for a fix to this problem. While our code review process is pretty fast these days if anyone would like to test the change and provide feedback on this issue it would be welcome:

11: https://gerrit.asterisk.org/#/c/786/
13: https://gerrit.asterisk.org/#/c/787/
master: https://gerrit.asterisk.org/#/c/788/

The patch can be downloaded by clicking the "Download" dropdown and selecting the method you wish.
By: Dade Brandon (dade) 2015-07-06 17:20:32.864-0500

Backtrace attached

Reproducible crash (assert) occurs when bridging to an external line, while using early media + invite with SDP. (Our other case of not using early media and inviting without SDP, causes "No remote address on RTP instance" and caused a deadlock on 4 of 5 tests, I haven't included that one here since the first case is 100% reproducible and I'm unsure if the internal patches we normally apply would resolve the invite without SDP issue)

Environment is: checkout on '11' followed by overwriting res/res_asterisk_rtp.c with the patch candidate (make clean; make distclean; ./configure; menuconfig /w better backtraces + new g711 + dont_optimize

Debug log in the next attach
By: Dade Brandon (dade) 2015-07-06 17:21:05.770-0500

Debug log for previous backtrace pertaining to the latest patch
By: Dade Brandon (dade) 2015-07-06 17:31:12.590-0500

Add to the last backtrace, if I do 'frame 1; print dtls->ssl' the result is (SSL*)0x0

By: Dade Brandon (dade) 2015-07-06 17:34:42.717-0500

Additional note: by adding 'if (!dtls->ssl) return;' to the top of dtls_perform_setup, I had three calls complete, with very little delay, but then I fell in to a deadlock on the fourth try
By: Dade Brandon (dade) 2015-07-06 17:37:54.564-0500

Deadlock threads in case its relevant - multiple threads of the following (this is after adding if (!dtls->ssl) return -- line numbers here may be off as this layer of testing is being done including our usual patches (which have a lot of added debug statements and some skips in SDP parsing for private IP ranges)
Thread 37 (Thread 0x7fa983721700 (LWP 17717)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fa9f26df657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fa9f26df480 in __GI___pthread_mutex_lock (mutex=0x7fa9540644f0) at ../nptl/pthread_mutex_lock.c:79
#3 0x00000000005033f1 in __ast_pthread_mutex_lock (filename=0x7fa99a7432ab "res_rtp_asterisk.c", lineno=1862, func=0x7fa99a745f30 <__PRETTY_FUNCTION__.29879> "dtls_srtp_stop_timeout_timer", mutex_name=0x7fa99a74350c "&dtls->lock",
t=0x7fa9540644f0) at lock.c:313
#4 0x00007fa99a701516 in dtls_srtp_stop_timeout_timer (instance=0x7fa954070978, rtp=0x7fa954075b80, rtcp=1) at res_rtp_asterisk.c:1862
#5 0x00007fa99a7004db in ast_rtp_dtls_stop (instance=0x7fa954070978) at res_rtp_asterisk.c:1402
#6 0x00007fa99a70396b in ast_rtp_destroy (instance=0x7fa954070978) at res_rtp_asterisk.c:2554
#7 0x00000000005505fd in instance_destructor (obj=0x7fa954070978) at rtp_engine.c:217
#8 0x000000000044d1e5 in internal_ao2_ref (user_data=0x7fa954070978, delta=-1, file=0x5be3db "astobj2.c", line=551, func=0x5be691 <__FUNCTION__.8503> "__ao2_ref") at astobj2.c:469
#9 0x000000000044d4ed in __ao2_ref (user_data=0x7fa954070978, delta=-1) at astobj2.c:551
#10 0x00007fa99a7015d2 in dtls_srtp_stop_timeout_timer (instance=0x7fa954070978, rtp=0x7fa954075b80, rtcp=1) at res_rtp_asterisk.c:1863
#11 0x00007fa99a70f4e5 in ast_rtp_stop (instance=0x7fa954070978) at res_rtp_asterisk.c:4854
#12 0x00000000005538f4 in ast_rtp_instance_stop (instance=0x7fa954070978) at rtp_engine.c:1182
#13 0x00007fa9ab08bbc0 in stop_media_flows (p=0x7fa954065748) at chan_sip.c:23926
#14 0x00007fa9ab03492d in sip_hangup (ast=0x7fa954055cf8) at chan_sip.c:6942
#15 0x0000000000479500 in ast_hangup (chan=0x7fa954055cf8) at channel.c:2842
#16 0x000000000053b3a8 in __ast_pbx_run (c=0x7fa954055cf8, args=0x0) at pbx.c:6813
#17 0x000000000053b74c in pbx_thread (data=0x7fa954055cf8) at pbx.c:6905
#18 0x000000000059b45b in dummy_start (data=0x7fa9540656d0) at utils.c:1223
#19 0x00007fa9f26dd182 in start_thread (arg=0x7fa983721700) at pthread_create.c:312
#20 0x00007fa9f384a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

By: Joshua C. Colp (jcolp) 2015-07-06 18:23:10.783-0500

I've uploaded a new version of the change which should resolve both problems.