Summary: ASTERISK-28576: res_rtp_asterisk: ICE Completion Crash when sent packet length doesn't match
Reporter: Joshua Elson (joshelson)
Labels: patch pjsip webrtc
Date Opened: 2019-10-09 15:44:37
Date Closed: 2019-11-18 09:16:18.000-0600
Priority: Minor
Regression?
Status: Closed/Complete
Components: Resources/res_rtp_asterisk
Versions: 16.6.0
Frequency of Occurrence:
Related Issues:
is duplicated by ASTERISK-28737 Asterisk 13.28.0 repeatable crashes
is duplicated by ASTERISK-28630 Asterisk crash
is duplicated by ASTERISK-28728 Asterisk crash in RTP stack (segfault)
Environment:
Attachments:
( 0) ASTERISK-28576.diff
( 1) core.brief.txt
( 2) core.full.txt
( 3) core-brief.txt
( 4) core-full.txt
( 5) core-locks.txt
( 6) core-thread1.txt
( 7) locks.txt
( 8) thread1.txt
Description: Crash on ICE completion. Backtrace forthcoming.
Comments:

By: Asterisk Team (asteriskteam) 2019-10-09 15:44:37.766-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua Elson (joshelson) 2019-10-09 15:45:01.158-0500

Backtraces here.

By: Kevin Harwell (kharwell) 2019-10-09 16:48:16.583-0500

*Edit* Commented on wrong issue - comment removed

By: Joshua Elson (joshelson) 2019-10-09 16:52:01.719-0500

I think you might have meant this reply for ASTERISK-28575....

By: Kevin Harwell (kharwell) 2019-10-09 16:53:49.980-0500

[~joshelson] yup was looking into that issue and had this one up. I've moved the comment over to that one.

By: Joshua Elson (joshelson) 2019-10-09 18:31:04.806-0500

Alright. Good deal. For this one, this appears quite frequently and only with WebRTC clients. We will crash multiple times per hour under moderate load and it happens more frequently when the remote agents are > 200ms away from the Asterisk instance.

By: George Joseph (gjoseph) 2019-10-11 09:21:43.384-0500

If you still have the actual coredump (or the next time it happens), can you run ast_coredumper on it with the --tarball-coredumps option?

{{ast_coredumper --tarball-coredumps --no-default-search <path_to_coredump>}}

This will create a tarball that contains the coredump itself plus the asterisk binary and modules so we can run gdb ourselves and examine things in more detail.

The file will be too large to attach but if you can host it on Dropbox or Google Drive, etc., and give us the link that'd be great.


By: Joshua Elson (joshelson) 2019-10-14 16:15:59.228-0500

Here is the full tarball with the binaries:

https://drive.google.com/file/d/1pC6LlM2O-syskNFq78K_tyydntot2epn/view?usp=sharing

By: Sean Bright (seanbright) 2019-10-15 06:44:16.610-0500

Distro & openssl version (the package version ideally)?

By: Joshua Elson (joshelson) 2019-10-15 12:03:35.162-0500

Using latest CentOS 7 with yum installed openssl 1.0.2k.

By: Benjamin Keith Ford (bford) 2019-10-15 13:07:04.387-0500

Is this a new system with new installs, or has it been upgraded and that's when you noticed the crash? If it's a system that's been upgraded over time and you've just now noticed the crash, I'm curious if it occurred after either an upgrade to Asterisk, or an upgrade to the system itself (packages, like openssl).

By: Joshua Elson (joshelson) 2019-10-15 15:13:14.212-0500

This was built from automation on this version, so there have been no changes or upgrades of Asterisk or OpenSSL in this environment.

By: Benjamin Keith Ford (bford) 2019-10-16 08:32:19.791-0500

If that's the case, would it be possible to try on an older version of openssl to potentially eliminate that library from the equation? I'm going to run some tests of my own and see if I can replicate the problem.

By: Joshua Elson (joshelson) 2019-10-16 11:38:06.569-0500

I can. Is there a version that you think is best? Anything with some miles on it?

By: Benjamin Keith Ford (bford) 2019-10-17 08:07:25.890-0500

An earlier version of 1.0.2 or 1.0.1 would be as far back as I would go, personally.

By: Asterisk Team (asteriskteam) 2019-10-31 12:00:01.348-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Joshua Elson (joshelson) 2019-10-31 12:36:49.535-0500

Apologies on the slow response, but the same issue was reproducible on OpenSSL 1.0.2e.

Any other areas we can try?

By: Asterisk Team (asteriskteam) 2019-10-31 12:36:49.918-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Kevin Harwell (kharwell) 2019-10-31 13:29:19.665-0500

Yes, install the debug symbols for openssl, and then attach a new backtrace when it happens again. We can then hopefully see what is being passed into the libssl functions.

By: Joshua Elson (joshelson) 2019-11-12 17:35:49.421-0600

After some work getting this set up, here's a backtrace with openssl symbols attached. Finally!

By: Kevin Harwell (kharwell) 2019-11-12 18:04:55.054-0600

Well this doesn't seem good. From the [source|https://github.com/openssl/openssl/blob/b39c0475a671879e2dd6c7a29de1127139f2dc0d/ssl/d1_both.c#L419]:
{noformat}
           /*
            * bad if this assert fails, only part of the handshake message
            * got sent.  but why would this happen?
            */
           OPENSSL_assert(len == (unsigned int)ret);
{noformat}
The comment makes it sound like they are not even sure why a partial write may occur.

By: Joshua Elson (joshelson) 2019-11-12 18:21:43.089-0600

So this is an unusual environment, where the webrtc client is over a 250ms transoceanic link at times, so it's possible this is a networking issue. But the service we provide does generally work pretty well. We think this call quite clearly is bad, but when we spin up load this can be duplicated dozens of times an hour with only a handful of concurrent calls.

By: Sean Bright (seanbright) 2019-11-13 09:33:27.667-0600

What about going the other way - upgrading to the latest (1.0.2t at the time I write this)? Given that an assert is firing well inside OpenSSL, is it conceivable that this is just a bug in older versions?

By: Joshua C. Colp (jcolp) 2019-11-13 13:14:54.338-0600

Can you try this patch and see if it alters things? I'm wondering if in certain cases we're returning a length different than what is passed in, and then it implodes. This patch essentially lies but retransmissions should still ensure it gets sent eventually.
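A minimal sketch of the idea described above, not the contents of ASTERISK-28576.diff: a write callback that always reports the length it was handed, even when the underlying send was short. The names {{fake_send}}, {{fake_accept_limit}} and {{dtls_write_sketch}} are hypothetical and exist only for illustration; the real code in res_rtp_asterisk and its BIO plumbing are assumed, not shown.

```c
#include <stdio.h>

/* Hypothetical knob: how many bytes the simulated network accepts per send. */
static int fake_accept_limit = 8;

/* Hypothetical transport that may accept only part of a packet. */
static int fake_send(const char *buf, int len)
{
    (void)buf; /* payload is irrelevant to the simulation */
    return len > fake_accept_limit ? fake_accept_limit : len;
}

/*
 * Hypothetical write callback. The caller (the DTLS layer in the real
 * issue) asserts that the returned value equals len, so a short send is
 * logged and len is returned anyway. DTLS handshake retransmission is
 * relied on to deliver the message eventually.
 */
static int dtls_write_sketch(const char *buf, int len)
{
    int sent = fake_send(buf, len);

    if (sent != len) {
        fprintf(stderr, "short send: %d of %d bytes\n", sent, len);
    }
    return len; /* always the provided length, never the partial count */
}
```

With {{fake_accept_limit}} set to 8, calling {{dtls_write_sketch}} on a 16-byte buffer still returns 16, so a caller comparing the return value to the requested length would not abort.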

By: Joshua Elson (joshelson) 2019-11-13 16:37:16.829-0600

We're testing this patch out now. This also has been discussed on the OpenSSL list in the last month, if that's helpful:

http://openssl.6102.n7.nabble.com/Crash-in-OpenSSL-v1-0-1-from-dtls1-do-write-OPENSSL-assert-len-unsigned-int-ret-td77379.html

It appears more recent versions changed the assert to simply a warning, but updating OpenSSL in CentOS/RHEL is quite difficult and unsupported, generally, so an Asterisk-only solution might be easier to implement.

By: Joshua Elson (joshelson) 2019-11-13 23:23:28.947-0600

Based on testing today, this appears to resolve the crash. We were not able to crash the box under load with the patch applied. We're still sorting through options on OpenSSL version changes, but if this patch is acceptable, it does seem to prevent the hard assert.

Thanks a bunch for all the work on this!

By: Friendly Automation (friendly-automation) 2019-11-18 09:16:19.780-0600

Change 13177 merged by Friendly Automation:
res_rtp_asterisk: Always return provided DTLS packet length.

[https://gerrit.asterisk.org/c/asterisk/+/13177|https://gerrit.asterisk.org/c/asterisk/+/13177]

By: Friendly Automation (friendly-automation) 2019-11-18 10:40:17.155-0600

Change 13206 merged by Friendly Automation:
res_rtp_asterisk: Always return provided DTLS packet length.

[https://gerrit.asterisk.org/c/asterisk/+/13206|https://gerrit.asterisk.org/c/asterisk/+/13206]

By: Friendly Automation (friendly-automation) 2019-11-18 13:05:52.045-0600

Change 13205 merged by George Joseph:
res_rtp_asterisk: Always return provided DTLS packet length.

[https://gerrit.asterisk.org/c/asterisk/+/13205|https://gerrit.asterisk.org/c/asterisk/+/13205]

By: Friendly Automation (friendly-automation) 2019-11-18 13:07:13.662-0600

Change 13207 merged by George Joseph:
res_rtp_asterisk: Always return provided DTLS packet length.

[https://gerrit.asterisk.org/c/asterisk/+/13207|https://gerrit.asterisk.org/c/asterisk/+/13207]