[Home]

Summary:ASTERISK-25103: Roundup - investigate Asterisk DTLS crashes
Reporter:Rusty Newton (rnewton)Labels:
Date Opened:2015-05-19 15:25:11Date Closed:2015-07-07 14:56:59
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_rtp_asterisk
Versions:Frequency of
Occurrence
Related
Issues:
is related toASTERISK-22805 res_rtp_asterisk: Crash when calling BIO_ctrl_pending in dtls_srtp_check_pending when dialed by JSSIP
is related toASTERISK-24651 [patch] Fix race condition in DTLS
is related toASTERISK-25127 DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending
is related toASTERISK-24550 res_rtp_asterisk: Crash in ast_rtp_on_ice_complete during DTLS handshake
is related toASTERISK-24832 [patch]DTLS-crashes within openssl
is related toASTERISK-25099 res_rtp_asterisk: Crash when using DTLS
Environment:Asterisk 11, 13, MasterAttachments:
Description:A issue for an investigation into the various DTLS crashes currently hanging about.

I'll link the issues currently on the tracker to this issue rather than linking them all to each other.
Comments:By: Thomas Guebels (tguescaux) 2015-05-20 03:10:26.263-0500

Hi,

Shouldn't you also have a look at ASTERISK-24651 ? The patch proposed there, while not pretty, fixed most of the crashes I had during DTLS handshakes (see my comment 224900).

By: Rusty Newton (rnewton) 2015-05-28 19:23:21.574-0500

Linked!

By: Stefan Engström (StefanEng86) 2015-06-28 07:00:46.241-0500

I still think the unsynchronized call to SSL_do_handshake(dtls->ssl); on a ssl object on which  SSL_set_accept_state(ssl) has been called is the main problem.

When we have initialized the dtls->ssl to act as server, all the code needed to progress and finish the handshake is essentially already in __rtp_recvfrom. That function will at some point receive a client handshake message such as 'client hello', then write it to the input buffer (BIO_write(dtls->read_bio, buf, len)) and call SSL_read to make openssl progress the handshake (write server hello to output buffer).

When we are dtls-clients it is appropriate to call SSL_do_handshake after ice completion because that kicks off the handshake by producing a client hello so this call cannot race with the ssl-processing within __rtp_recvfrom.

I do not understand all the possible call-flows within asterisk so i created a test program at http://pastebin.com/8TGqN10j based off the example at http://www.roxlu.com/2014/042/using-openssl-with-memory-bios

Indeed my program crashes half the time when SSL_do_handshake and other SSL related calls race. I noted that in roxlu's example they call SSL_do_handshake within the krx_ssl_handle_traffic function instead of our SSL_read; this seems to be two alternative ways of progressing the handshake (i.e. both those calls makes ssl write server hello to output buffer given that we have written client hello to input buffer), but this is different from our call to SSL_do_handshake, i believe one should only call SSL_do_handshake (or SSL_read) as server  whenever there's a possible change in the BIO buffers, not from a asynchronously callback.

renegotiation is a different story...

I might have missed something as I base my conclusions on empirical tests on a very narrow use case as well as looking at unofficial documentation/guides on google for openssl :)

By: Joshua C. Colp (jcolp) 2015-07-06 06:00:00.787-0500

A change is now up for review at the following addresses for a fix to this problem. While our code review process is pretty fast these days if anyone would like to test the change and provide feedback on this issue it would be welcome:

11: https://gerrit.asterisk.org/#/c/786/
13: https://gerrit.asterisk.org/#/c/787/
master: https://gerrit.asterisk.org/#/c/788/

The patch can be downloaded by clicking the "Download" dropdown and selecting the method you wish.

By: Stefan Engström (StefanEng86) 2015-07-07 09:11:00.866-0500

I tested applying patchsets 1-3 from 788 to a 13.1.0 installation and then ran some webrtc-calling tests using calling robots. It seemed to work fine for 1000+ calls (within 4 hours).

By: Joshua C. Colp (jcolp) 2015-07-07 10:59:34.648-0500

It's probably the ICE negotiation that's slowing things down, nothing really to be done there except to implement trickle ICE which is a SUBSTANTIAL amount of work. The DTLS negotiation also takes some time. Just the way it is.

By: Joshua C. Colp (jcolp) 2015-07-07 11:59:19.459-0500

DTLS is all OpenSSL territory, we start the negotiation as soon as we can.

By: JoshE (n8ideas) 2015-07-07 12:36:34.653-0500

@Dade- would you be willing to share the hacks for ICE that you've done?  We have almost the exact list on our end as well, especially around reducing candidate generation for Internet-facing multi-NIC setups, and it might be nice to try to get some of this standardized and into trunk, where appropriate.

On the core issue Mr. Colp's patch addresses... it also looks good in our test suite.  We were able to repro the crashes easily before and we're clean now.  Thanks a bunch for your work on this!