[Home]

Summary:ASTERISK-28472: Asterisk occasionally passes a NULL as srtp->session to srtp_protect/unprotect causing SEGV
Reporter:Jonas Swiatek (jonasswiatek)Labels:pjsip
Date Opened:2019-07-06 12:07:06Date Closed:2019-08-09 07:49:16
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:General
Versions:16.4.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Amazon Linux 2Attachments:( 0) core-2019-07-10T17-36-23+0000-brief.txt
( 1) core-2019-07-10T17-36-23+0000-full.txt
( 2) core-2019-07-10T17-36-23+0000-locks.txt
( 3) core-2019-07-10T17-36-23+0000-thread1.txt
( 4) core-2019-07-19T13-40-30+0000-brief.txt
( 5) core-2019-07-19T13-40-30+0000-full.txt
( 6) core-2019-07-19T13-40-30+0000-locks.txt
( 7) core-2019-07-19T13-40-30+0000-thread1.txt
( 8) core-brief.txt
( 9) core-full.txt
(10) core-locks.txt
(11) core-thread1.txt
(12) extensions.conf
(13) pjsip.conf
Description:We're seeing this crash every now and then - last time was around 3 weeks ago, on Asterisk 16.3, though we've upgraded to 16.4 since then, and enabled BETTER_BACKTRACES and DONT_OPTIMIZED to try and get to the bottom of it.

I'm at a loss for what might be causing it. When this server crashed it was serving 6 channels, and the only exotic thing being used by any of them were ARI, but I don't believe any of the active channels were actually in the Stasis application.

I'm not really qualified to read the crash dump, but there is some mention of libsrtp in one of the files, so I've included which version of that we're using (1.5.4, as recommended on the wiki).

Any insight would be greatly appreciated, and I can provide whatever else information might be valuable.
Comments:By: Asterisk Team (asteriskteam) 2019-07-06 12:07:07.455-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Kevin Harwell (kharwell) 2019-07-08 15:51:54.089-0500

Hrm, this is a tough one especially since it happens sporadically.

1. Can you think of anything that changed just before noticing the crash? For instance, had you upgraded Asterisk, or libsrtp?

2. It looks like a channel is attempting to join a bridge. Please post the relevant dialplan, and endpoint(s) configuration for the call.

3. Speaking of endpoints, what kind of endpoints/phones are used in the crash scenario? Is that something that changed recently, and then you started seeing the crash?

4. Please attach the Asterisk log leading up to the crash (even better if debug 5 is enabled but understand not wanting to enable debugging for an extended period if this is a production system).

5. Lastly, if you can, install srtp with debug symbols. The crash occurred down in the library itself. Those symbols might help show what/where it was actually crashing on.


By: Jonas Swiatek (jonasswiatek) 2019-07-10 17:01:28.540-0500

Another core dump

By: Jonas Swiatek (jonasswiatek) 2019-07-10 17:15:09.518-0500

> 1. Can you think of anything that changed just before noticing the crash? For instance, had you upgraded Asterisk, or libsrtp?

The before was Asterisk 13.25 with chan_sip, so the changes are quite substantial and probably not related ;)

> Please post the relevant dialplan, and endpoint(s) configuration for the call.

Done, see attachments

> what kind of endpoints/phones are used in the crash scenario?

Could be anything really - our customers can connect whatever SIP compatible device they want. I can try to figure out which exact models were in use at the time of the crash if you think it'll be helpful.

> Please attach the Asterisk log leading up to the crash

Relevant output of /var/log/asterisk/messages

[Jul 10 12:28:17] WARNING[12242][C-000014df] app.c: No audio available on Recorder/ARI-000019a9;1??
[Jul 10 16:32:13] NOTICE[32371][C-00001746] res_stasis_playback.c: ast011-1562776306.31006: Playback stopped for sound:/var/audio/custom-speeks/queue/queue-yourcallisfirst
[Jul 10 16:57:09] WARNING[6014][C-000017a9] app.c: No audio available on Recorder/ARI-00001cf4;1??
[Jul 10 17:36:27] Asterisk 16.4.0 built by root @ ip-172-31-30-45.ec2.internal on a x86_64 running Linux on 2019-06-17 08:20:52 UTC
[Jul 10 17:36:27] NOTICE[15360] loader.c: 251 modules will be loaded.
[Jul 10 17:36:28] NOTICE[15360] cdr.c: CDR simple logging enabled.
[Jul 10 17:36:28] NOTICE[15360] confbridge/conf_config_parser.c: Adding default_menu menu to app_confbridge
[Jul 10 17:36:28] NOTICE[15360] res_parking.c: Adding default profile to res_parking

The crash today happened at 17:36:27 - there appears to have been no lines printed prior to the crash.

Enabling level 5 debug might be a tall order - this is a pretty busy production server during peak hours.

> Lastly, if you can, install srtp with debug symbols.

Yes... though I'm unsure how to go about doing that. Can you provide instructions on how I'd go about doing so? I've compiled libsrtp from source manually.

By: Jonas Swiatek (jonasswiatek) 2019-07-10 17:15:42.841-0500

Added extensions and pjsip.conf

By: Kevin Harwell (kharwell) 2019-07-10 18:28:30.009-0500

Thanks for the added info. srtp related problems, and crashes can be quite challenging to figure out unfortunately.
{quote}
> 1. Can you think of anything that changed just before noticing the crash? For instance, had you upgraded Asterisk, or libsrtp?

The before was Asterisk 13.25 with chan_sip, so the changes are quite substantial and probably not related
{quote}
Ha yeah quite the substantial change. You are correct that's not gonna help us here.
{quote}
> what kind of endpoints/phones are used in the crash scenario?

Could be anything really - our customers can connect whatever SIP compatible device they want. I can try to figure out which exact models were in use at the time of the crash if you think it'll be helpful.
{quote}
I'll leave whether or not to track down that info up to you (it may or may not be helpful). As mentioned srtp problems can be a pain to figure out. Any little data point might help with a solution and/or workaround. In this case as the crash seems to be occurring in the "unprotect" function it's possible the endpoint is sending a bad packet at times, which may cause the crash. If you knew it was the same endpoint involved each time it might point us in a direction to try (for instance, upgrade phone's firmware, avoid using those types of endpoints if possible, easier to packet capture only that endpoint, etc...).
{quote}
Enabling level 5 debug might be a tall order - this is a pretty busy production server during peak hours.
{quote}
No worries quite understandable. In any case I'd double check the log on each crash. Logging is done in a separate thread so it's possible more logging could be written prior to a crash vs another crash. It's all about the timing. So perhaps in one instance something might turn up.
{quote}
> Lastly, if you can, install srtp with debug symbols.

Yes... though I'm unsure how to go about doing that. Can you provide instructions on how I'd go about doing so? I've compiled libsrtp from source manually.
{quote}
hrm, how did you configure and build it? Something like the following?
{noformat}
$ ./configure --enable-openssl
$ make shared_library
{noformat}
If so try something like the following then (or similar - the main thing is adding the -g3 to CFLAGS):
{noformat}
$ CFLAGS='-fPIC -Wall -g3 -funroll-loops' ./configure --enable-openssl
$ make shared_library
{noformat}
One more thing. You could try upgrading your version of srtp. 1.5.4 to something higher. I believe 1.5.4 is the minimum version we support. We should also allow higher versions as well.

By: Jonas Swiatek (jonasswiatek) 2019-07-11 12:27:28.847-0500

Do you reckon libsrtp v2.2.0 be an appropriate version to try with?

By: Kevin Harwell (kharwell) 2019-07-11 13:41:57.872-0500

If you're going to upgrade it's usually beneficial to be on the latest, but as the [wiki|https://wiki.asterisk.org/wiki/display/AST/libsrtp] states while supported we've not tested 2.x extensively.

That said another thing to consider is upgrading to a 2.x version may require updating other dependent libraries like openssl. If that's the case and you don't want to also introduce another change then you might stick with 1.6.0

By: Jonas Swiatek (jonasswiatek) 2019-07-11 14:06:21.141-0500

One thing worth noting actually, is that I see plenty of lines like this:

{code}
 == SRTCP unprotect failed on SSRC 2104028816 because of unable to perform desired validation
{code}

As far as I remember, initially when I set up asterisk 16 some month ago (it's been a lengthy upgrade project), I saw those regardless of which version libsrtp I installed, and I just settled on 1.5.4 because that's what the wiki recommends.

I'll try 1.6 with debugging symbols tomorrow, and then we'll see how that goes.

By: Jonas Swiatek (jonasswiatek) 2019-07-12 03:19:58.526-0500

Alright it's running with libsrtp 1.6.0 w/ debugging symbols now - so now we play the waiting game...

By: Kevin Harwell (kharwell) 2019-07-12 11:23:24.412-0500

I was looking back at the backtraces. It looks like the crash is occurring either very early in a call (before it is completely setup), or during a reinvite/renegotiation.

A packet capture might prove useful as well if one can be obtained. To limit the size you might be able to just start, stop, and overwrite old data if no crash occurs each day or throughout the day.

By: Kevin Harwell (kharwell) 2019-07-12 11:26:24.147-0500

Putting this back in waiting for feedback as we await another crash.

[~jonasswiatek] Since this could take a while the issue might autoclose (after 2 weeks I think of no updates). If it does once you comment again it'll re-open it.

By: Jonas Swiatek (jonasswiatek) 2019-07-19 11:18:12.073-0500

Core dump files for new crash

By: Jonas Swiatek (jonasswiatek) 2019-07-19 11:19:29.593-0500

Just had a new crash.

This is with libsrtp v1.6.0 w/ debugging symbols compiled.

The only interesting thing I found in messages was this line, though it didn't happen at the time of the crash, but an hours time later
{code}
ERROR[32668] chan_pjsip.c: Session already DISCONNECTED [reason=200 (OK)]
{code}

By: George Joseph (gjoseph) 2019-07-19 11:59:05.457-0500

It _looks_ like there some condition where we're passing a NULL context to libsrtp.  I've acknowledged the issue so we can investigate further.


By: Jonas Swiatek (jonasswiatek) 2019-07-19 12:11:23.328-0500

Cool! Let me know if there is anything at all I can to help. I've still got the full dumb file if you need to me do GDB stuff to it or something.

By: George Joseph (gjoseph) 2019-07-19 12:31:42.554-0500

Actually, running {{ast_coredumper --tarball-coredumps --no-default-search /tmp/core-2019-07-19T13-40-30+0000}} (assuming that's the name of the latest actual coredump) would help a lot.  You won't be able to attach the tarball to the issue (too big) but you can host it on dropbox or someplace similar and send the link to "asteriskteam@asterisk.org" with the issue id in the title.  I'll then download it and attach it to the internal issue.


By: Jonas Swiatek (jonasswiatek) 2019-07-19 13:03:31.005-0500

I'm getting this error when running that command:

{code}
Processing /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13:40:30+0000
Creating /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13-40-30+0000-thread1.txt
Creating /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13-40-30+0000-brief.txt
Creating /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13-40-30+0000-full.txt
Creating /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13-40-30+0000-locks.txt
Creating /tmp/core.ip-172-31-29-190.ec2.internal-2019-07-19T13-40-30+0000.tar.gz
cp: cannot stat ‘/usr/lib64/libasterisk*’: No such file or directory
{code}

By: Kevin Harwell (kharwell) 2019-08-02 11:58:47.512-0500

[~jonasswiatek],

Do you still have the log around from a crash (if not can you check next time it crashes)? If so do you see any warnings in it along the lines of either:
{noformat}
Could not set policies when setting up DTLS-SRTP on ...
{noformat}
or
{noformat}
"Could not set SRTP policies"
{noformat}
Also what about the following (would be an error in the log):
{noformat}
SRTP session could not be re-created after unprotect failure
{noformat}

By: Jonas Swiatek (jonasswiatek) 2019-08-02 12:10:24.033-0500

No not really seeing anything like that. The closest to that I'm getting are a ton of these:

{code}
 == SRTCP unprotect failed on SSRC 2104028816 because of unable to perform desired validation
{code}

I think some of Granstream's phones provoke that when using SRTP. At least my GXP2140 does so.

By: Kevin Harwell (kharwell) 2019-08-07 18:21:05.552-0500

[~jonasswiatek]

This has been an interesting one, and I'm not entirely sure if the patch I've created will fix your problem. After a deep dive into the code it is hard to find a place where your situation could occur. Anyway I have created a patch that _might_ fix things.

If you can please [download the patch|https://gerrit.asterisk.org/changes/asterisk~11697/revisions/2/patch?zip] [1], and apply it and test it out.

Thanks!

[1] https://gerrit.asterisk.org/c/asterisk/+/11697

By: Jonas Swiatek (jonasswiatek) 2019-08-08 03:46:17.447-0500

Hi Kevin,

Cool - I'll apply this patch this coming weekend! Is this meant to be applied to the code at the 16.4 tag, or against the current master of the git repository?

By: Kevin Harwell (kharwell) 2019-08-08 09:55:08.510-0500

It's meant to be applied against the 16 branch so should work fine at 16.4. Let me know if it doesn't apply for some reason.

By: Friendly Automation (friendly-automation) 2019-08-09 07:49:18.588-0500

Change 11700 merged by Friendly Automation:
srtp: Fix possible race condition, and add NULL checks

[https://gerrit.asterisk.org/c/asterisk/+/11700|https://gerrit.asterisk.org/c/asterisk/+/11700]

By: Friendly Automation (friendly-automation) 2019-08-09 07:53:43.259-0500

Change 11698 merged by George Joseph:
srtp: Fix possible race condition, and add NULL checks

[https://gerrit.asterisk.org/c/asterisk/+/11698|https://gerrit.asterisk.org/c/asterisk/+/11698]

By: Friendly Automation (friendly-automation) 2019-08-09 07:53:59.839-0500

Change 11697 merged by George Joseph:
srtp: Fix possible race condition, and add NULL checks

[https://gerrit.asterisk.org/c/asterisk/+/11697|https://gerrit.asterisk.org/c/asterisk/+/11697]

By: Friendly Automation (friendly-automation) 2019-08-09 07:54:16.775-0500

Change 11699 merged by George Joseph:
srtp: Fix possible race condition, and add NULL checks

[https://gerrit.asterisk.org/c/asterisk/+/11699|https://gerrit.asterisk.org/c/asterisk/+/11699]

By: Friendly Automation (friendly-automation) 2019-08-09 07:54:32.717-0500

Change 12729 merged by George Joseph:
srtp: Fix possible race condition, and add NULL checks

[https://gerrit.asterisk.org/c/asterisk/+/12729|https://gerrit.asterisk.org/c/asterisk/+/12729]

By: Kevin Harwell (kharwell) 2019-08-09 09:38:44.337-0500

[~jonasswiatek] The patch I created has gone through the review process and has been merged (the issue gets auto closed). If in your testing the problem still occurs please comment and the issue should reopen.