[Home]

Summary:ASTERISK-24832: [patch]DTLS-crashes within openssl
Reporter:Stefan Engström (StefanEng86)Labels:
Date Opened:2015-02-26 08:57:06.000-0600Date Closed:2015-07-07 14:56:58
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_rtp_asterisk
Versions:13.1.0 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-25103 Roundup - investigate Asterisk DTLS crashes
Environment:Fedora 20 x86_64, openssl-1.0.1e-41.fc20.x86_64, Asterisk 13.1.0, Chrome SIPML5 chan_sip peers with transport WSSAttachments:( 0) crash1.txt
( 1) crash2.txt
( 2) crash3.txt
( 3) crash4.txt
( 4) crash5.extralog
( 5) crash5.txt
( 6) CUSTOMERRORDEBUGLOG
( 7) DTLSfailure6ErrorlogNocrashUsingNewPatch.txt
( 8) DTLSREVIEWME.patch
( 9) DTLSREVIEWMEcodecontribution.patch
(10) SIPCONF.txt
(11) TESTDTLS.patch
Description:I'm using 4 chan sip peers with transport WSS. They all use Chrome SIPml5 webrtc. 2 of them call a queue and the other 2 answer. Every 100-1000 calls or so, asterisk gets a crash due to segmentation fault or abort signal within openssl.

Since it's load-related it's hard to provide enough information but ill try add more continuously.

ISSUE-0
First thing i noticed was  that dtls_perform_handshake was called too many times but that was fixed by compensating for ASTERISK-24830

By code inspection and tracing logs; it looks like the crashes mostly occur for dtls->ssl instances where asterisk has role: server, (SSL_set_accept_state(dtls->ssl) has been called.)

EDIT -- This JIRA is getting a little bigger. It seems there are many sub-problems which are all related to DTLS though... not all sub-issues below may be real issues, some are just me asking questions about code. I'd be happy if a developer took a look at it and answered questions or discussed some of the issues and possible fixes.

ISSUE-1 - crash3 seems to prove a concurrency issue:

thread 5 leaving asterisk code at dtls_perform_handshake is performing ssl3_clear on the same ssl struct as that which is sent to ssl_read from __rtp_recvfrom in thread 1

ISSUE-2: Im curious about the behavior of ast_rtp_on_ice_complete() {
...
       dtls_perform_handshake(instance, &rtp->dtls, 0);
       if (rtp->rtcp) {
               dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
       }
...
}
chan_sip seems to call process_sdp which eventually calls res_asterisk_rtp::dtls_set_setup which ultimately sets SSL_set_connect_state(ssl) OR SSL_set_accept_state(ssl) on both (RTP+RTCP) ssl sessions. But this races with the firing of  dtls_perform_handshake(instance, &rtp->dtls, 0); from ast_rtp_on_ice_complete. I'm not sure if this is a problem but in my last crash crash4 the ast_ice_on_ice_complete fired before dtls_set_setup which i have never noticed during non-crash-calls,

the big question is why is dtls_perform_handshake() called at all if we are passive? After I added a patch it doen't seem to crash anymore, but I'm pretty sure the patch is not a full fix -- note that the patches labeled test are just for the debugging process


Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711
Requires patch from ASTERISK-24830  (the obvious fix of replacing USE_PJPROJECT WITH HAVE_PJPROJECT...)





Comments:By: Stefan Engström (StefanEng86) 2015-02-26 09:03:24.957-0600

added error logs produced by attached patch. file contains 1000 lines before crash1 and 1000 lines before crash2

By: Rusty Newton (rnewton) 2015-02-26 09:12:49.021-0600

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that lead to the crash.
3. All configuration information necessary to reproduce the crash.

Thanks!

{quote}
Since it's load-related it's hard to provide enough information but ill try add more continuously.
{quote}

We understand, but, if you can, recompile Asterisk with the options described in the documentation linked such that when you do have a crash again it will be more helpful.

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace



By: Rusty Newton (rnewton) 2015-02-26 09:13:03.732-0600

Your backtrace appears to contain a memory corruption. We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.



By: Stefan Engström (StefanEng86) 2015-02-26 13:20:10.126-0600

I will recompile with MALLOC_DEBUG , DONT_OPTIMIZE and BETTER_BACKTRACES but there is no telling when the next crash occurs.

I attach the sip.conf for the relevant peers -- I'm unable to provide the exact steps for how to produce sipml5 chrome webrtc calling robots calling in the same pattern as mine, because they run in an environment thats not mine (or public). Ill see if im allowed to at least acquire pcap files. Maybe I can find a way to reproduce the crash using only the sipml5 demo page and manual calling but basically it should be 2 webrtc callers always calling the same queue and hanging up after a random time less than 3 minutes whereas the 2 agents in the queue attempt to answer calls instantly and then hang up after random time less than 3 minutes


I realize at the moment it's hard for anyone to reproduce my environment -- I'm sort of hoping to find to root of the issue myself

By: Rusty Newton (rnewton) 2015-02-27 06:59:30.369-0600

Okay. Be sure to use the "Send Back" button when you have the new crash traces and malloc_debug output.

By: Stefan Engström (StefanEng86) 2015-02-27 08:41:51.539-0600

I followed instructions on https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag

menuselect/menuselect --disable-category MENUSELECT_CORE_SOUNDS --disable-category MENUSELECT_EXTRA_SOUNDS --disable-category MENUSELECT_MOH --enable-category MENUSELECT_ADDONS --enable app_meetme --enable app_page --disable chan_mgcp --disable chan_mobile --enable OPTIONAL_API --enable MALLOC_DEBUG --enable DONT_OPTIMIZE

I even ran the command asterisk -rx "memory atexit list on"

But after core is dumped mmlog.txt only contains 3 lines without information 1421693682 - New session. I suspect the reason was mmlog being owned by root so i did chown asterisk:asterisk /var/log/asterisk/mmlog

What "memory atexit"-commands are required?




By: Stefan Engström (StefanEng86) 2015-02-28 14:27:44.760-0600

(still no mmlog output (dont know why), but i'd be happy if somene commented the last crashes because they seem interesting)

By: Stefan Engström (StefanEng86) 2015-03-01 08:07:05.194-0600

uploading a new draft for patch. It's been built to compensate for issues found in crashes. Parts of the patch from 24651 was used but not all. It has unresolved TODOs and it has the fix for ASTERISK-24830 included

By: Stefan Engström (StefanEng86) 2015-03-01 09:01:25.507-0600

Not any crashes for a while now but we get DTLS-Failures, see The DTLSFailure6...-log (with comments in it). At first I thought  this might suggest that we also need to do some more concurrency handling between calls to dtls_srtp_check_pending from __rtp_rcvfrom and from the dtls_srtp_handle_timeout like the russian dude said, but I'm not sure. Increasing the timeout for dtls_srtp_check_pending  from (old value(=999 ms)) to 3*(old value) seemed to decrease failures...
Adding mutexes around all calls to dtls_srtp_check_pending seemed to lead to another issue, possibly a deadlock...

By: Richard Mudgett (rmudgett) 2015-03-02 09:41:29.563-0600

None of the attached patches are currently marked as code contributions to the project and thus are not licensed.  If you have not already done so please sign a licence agreement and mark the patches as code contributions.

By: Stefan Engström (StefanEng86) 2015-03-02 14:21:47.243-0600

Only the DTLSREVIEWME.patch is possible candidate for code contribution; the other was just for aiding the debugging process. I signed the license agreement today and I'm waiting for an OK. The patch is still messy - I suspect only the part about not calling dtls_perform_handshake was significant for reducing crashes.

By: Stefan Engström (StefanEng86) 2015-03-04 03:22:31.788-0600

There, license fixed. This patch has only my fixes and some comments. I left out the mutexes around ssl-related calls . I'll leave it to you to judge if that's needed (cf russian guy's patch). With this patch applied I reduced crashes to 0 in 2 days from once every 1-4 hours for my particular use case. Next step I guess is looking over other use cases such as using pjsip channel instead or NO pjprojects installed...

By: Rusty Newton (rnewton) 2015-03-06 17:50:53.349-0600

Thanks Stefan. The next step in the process is to get the patch up on Reviewboard for peer review. Here is a link to the contribution process on the wiki: https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process

You can edit this issue and add a link to the Reviewboard request URL after you make one.

By: Stefan Engström (StefanEng86) 2015-03-14 06:32:45.138-0500

When i tried to add my patch to review board manually with https://reviewboard.asterisk.org/r/new/ it complained about the format of my patch Unable to parse diff revision header '2015-02-25 14:47:25.772839327 +0100'

When i tried to use the other way with svn and rbt i didn't know what svn checkout command that would checkout the same source code as was tar'ed by (Havent used those tools before)
wget http://downloads.asterisk.org/pub/telephony/asterisk/releases/asterisk-13.1.0.tar.gz

Moreover, the wiki said that if the patch is not in a final form it should be documented as such but i saw no such field in the form in https://reviewboard.asterisk.org/r/new/

Also what should be written in the Base Directory "The absolute path in the repository the diff was generated in." for me? The patch wasn't generated in a repository, just in an untarred asterisk source folder

By: Rusty Newton (rnewton) 2015-03-23 07:55:01.100-0500

bq. When i tried to add my patch to review board manually with https://reviewboard.asterisk.org/r/new/ it complained about the format of my patch Unable to parse diff revision header '2015-02-25 14:47:25.772839327 +0100'

Was the patch in unified diff format? It must be. 'svn diff' should get what you want, or otherwise :

{noformat}diff -u original/some_file.c modified/some_file.c > my_fancy_patch.diff{noformat}

bq. When i tried to use the other way with svn and rbt i didn't know what svn checkout command that would checkout the same source code as was tar'ed by (Havent used those tools before)
bq. wget http://downloads.asterisk.org/pub/telephony/asterisk/releases/asterisk-13.1.0.tar.gz

You should make your patch on the latest checkout of 13, not on the archived release files.

{noformat}
svn co http://svn.asterisk.org/svn/asterisk/branches/13 asterisk-13
{noformat}

That command would checkout the latest of the 13 branch to the 'asterisk-13' folder.

bq. Moreover, the wiki said that if the patch is not in a final form it should be documented as such but i saw no such field in the form in https://reviewboard.asterisk.org/r/new/

Just make a comment in the description field.

bq. Also what should be written in the Base Directory "The absolute path in the repository the diff was generated in." for me? The patch wasn't generated in a repository, just in an untarred asterisk source folder

If using the latest SVN revision of the 13 branch as instructed above, then 'branches/13/' should work if you generated the patch from the root of the Asterisk checkout. You could probably put the same thing if your patch was from the root of your unpacked Asterisk tarball for 13.

By: Stefan Engström (StefanEng86) 2015-03-25 07:56:18.875-0500

I created https://reviewboard.asterisk.org/r/4521/diff/
This is the patch that stopped my crashes, but it may have some garbage in it.

By: Rusty Newton (rnewton) 2015-05-19 15:28:16.572-0500

Stefan, sorry you got caught up in our move to Git/Gerrit.

Do you want to move your patch over into Gerrit? https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

It should get more attention once you get it over there.

By: Stefan Engström (StefanEng86) 2015-05-20 10:39:44.054-0500

migrated patch from reviewboard to gerrit


By: Joshua C. Colp (jcolp) 2015-07-06 06:00:55.869-0500

A change is now up for review at the following addresses for a fix to this problem. While our code review process is pretty fast these days if anyone would like to test the change and provide feedback on this issue it would be welcome:

11: https://gerrit.asterisk.org/#/c/786/
13: https://gerrit.asterisk.org/#/c/787/
master: https://gerrit.asterisk.org/#/c/788/

The patch can be downloaded by clicking the "Download" dropdown and selecting the method you wish.