[Home]

Summary:ASTERISK-26973: bridge: Crash when freeing frame and snooping
Reporter:Michel R. Vaillancourt (jkl5group)Labels:
Date Opened:2017-04-28 13:23:21Date Closed:2017-06-07 08:00:17
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Bridges/bridge_simple Resources/res_stasis_snoop
Versions:13.15.0 Frequency of
Occurrence
Constant
Related
Issues:
is caused byASTERISK-26419 audiohooks: Remove redundant codec translations when using audiohooks
is duplicated byASTERISK-27050 Crash on Transcoded Audio in PERIODIC_HOOK Function
Environment:Linux 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Attachments:( 0) ARI.snoop.code
( 1) ASTERISK-26973.diff
( 2) backtrace-and-debug.tgz
( 3) core.22813.backtrace.txt
( 4) core.asterisk.15474.backtrace.txt
( 5) Snoop_Crash_Process.txt
Description:{noformat}
[Apr 26 00:11:24]     -- Channel Snoop/[REDACTED]-00000002 joined 'simple_bridge' stasis-bridge <br6308>
[Apr 26 00:11:28]   == Setting global variable 'SIPDOMAIN' to '[REDACTED]'
[Apr 26 00:11:28]     -- Executing [[REDACTED]@inbound:1] Verbose("PJSIP/[REDACTED]-0000395a", "2, Call from "[REDACTED]" <[REDACTED]> to [REDACTED]") in new stack
[Apr 26 00:11:28]   ==  Call from "[REDACTED]" <[REDACTED]> to [REDACTED]
[Apr 26 00:11:28]     -- Executing [[REDACTED]@inbound:2] Stasis("PJSIP/[REDACTED]-0000395a", "[REDACTED],external") in new stack
[Apr 26 00:11:30]     -- <PJSIP/[REDACTED]-00003894> Playing 'beep.slin16' (language 'en')
[Apr 26 00:11:30]     -- Channel PJSIP/[REDACTED]-0000395a joined 'simple_bridge' stasis-bridge <[REDACTED]>
[Apr 26 00:11:30]        > 0x7fbc741cd8a0 -- Probation passed - setting RTP source address to [REDACTED]:17564
asterisk2*CLI>
Disconnected from Asterisk server
[Apr 26 00:11:31] Asterisk cleanly ending (0).
[Apr 26 00:11:31] Executing last minute cleanups
Asterisk ending (0).
{noformat}

We upgraded one dev PBX and one production PBX to 13.15.0 and have discovered that suddenly a function that worked just fine before now causes Asterisk to SegFault.  Note that this ARI code function has been stable / safe for over a year;  the behavior began as soon as we upgraded.  

Our DEV and PRD machines still running prior versions of Asterisk using the same ARI code do not exhibit this behavior.
Comments:By: Asterisk Team (asteriskteam) 2017-04-28 13:23:22.047-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Michel R. Vaillancourt (jkl5group) 2017-04-28 13:26:14.836-0500

Core dump from testing PBX;  illustrates an example of the issue.

By: Rusty Newton (rnewton) 2017-04-28 16:08:31.185-0500

Thanks for the report. I have a few requests:

* Can you provide a Asterisk debug log that correlates with a new backtrace?
* What version were you using previous to upgrading to 13.15.0?
* Can you attach the code making the ARI call? Or at least the relevant part? (either way, point out the relevant part).


By: Michel R. Vaillancourt (jkl5group) 2017-05-02 08:22:43.833-0500

Rusty,

I'll get the rest of the information for you in the next 24h, but the prior Asterisk version we were running across the enterprise was Asterisk 13.10.0

By: Michel R. Vaillancourt (jkl5group) 2017-05-02 08:36:57.466-0500

Snoop / Stop-Snoop ARI code segment

By: Michel R. Vaillancourt (jkl5group) 2017-05-02 12:19:47.957-0500

Recreated issue on our Dev / Testing PBX ... backtrace and debug log associated with test.

By: Sean Bright (seanbright) 2017-05-03 13:26:24.140-0500

Re-attached backtrace from backtrace-and-debug.tgz as a text file.

By: Kevin Harwell (kharwell) 2017-05-23 13:41:58.888-0500

[~jkl5group] I've been trying to reproduce the crash, however I have been unsuccessful. Does this happen for you every time you run the scenario? Also what exactly are the steps that occur in the scenario?

For example based on the log something like the following might be happening?
alice dials (or channel is created) into app and then is placed into bridge 'A'.
a snoop channel is added to alice
the snoop channel is placed into bridge 'B'.
bob dials into app and then is placed into bridge 'B'.

By: Kevin Harwell (kharwell) 2017-05-23 17:19:31.060-0500

Also, since I am unable to currently replicate, and if it is possible try installing another version of Asterisk between 13.15 and your previous version (13.10). If it doesn't crash that will help narrow down what version the problem was introduced. Then it might be easier to find exactly what patch caused it.

For instance, I'd suggest starting with 13.13.0. If you install 13.13.0 and it does not crash then that tells us the problem was introduced in 13.14 or 13.15. If it does then the problem was probably introduced in 13.13 or below (13.12 or 13.11). Then you can install one of those versions and run again.

You can find the various versions you would need at the following:
https://downloads.asterisk.org/pub/telephony/asterisk/old-releases/

By: Michel R. Vaillancourt (jkl5group) 2017-05-24 12:17:56.353-0500

Ok, we're going to do some testing this afternoon and document the exact use-case that triggers the issue 100% of the time.

Once that is done, we'll start testing 13.13.0 to see what we get.  I'll report back with results.

By: Michel R. Vaillancourt (jkl5group) 2017-05-24 13:48:23.239-0500

Step-by-step reproduction of problem.

By: Michel R. Vaillancourt (jkl5group) 2017-05-25 11:08:33.598-0500

13.10.0 & 13.11.0 do not exhibit the crash-on-snoop behavior.
Asterisk 13.12.0 or higher consistently does exhibit the crash-on-snoop behavior.

Let me know if you need any further information.

By: Kevin Harwell (kharwell) 2017-05-25 18:01:18.015-0500

Thanks for the information and testing!

Unfortunately I am still unable to duplicate the crash myself. However, I hope now that I have tracked down the offending patch - commit # for Asterisk 13:
{noformat}
430f6e5388b6e18b31e1a08582112838f6cc7557
{noformat}
It's related to issue ASTERISK-26419

I've attached a patch, ASTERISK-26973.diff. If you could apply the patch and run your test to see if it fixes your crash it would be most helpful. If the attached patch does not fix it then next I suggest removing the changes (https://gerrit.asterisk.org/#/c/4024/) made by the above commit and see if that fixes it.

If neither of those work then it's something else. If that is the case then, if you can, try running Asterisk with valgrind and post the results.

By: Michel R. Vaillancourt (jkl5group) 2017-05-26 08:17:09.271-0500

Patch rejects that .DIFF out of hand for 13.15.0 ... which version did you want me to run it against?

{code:title=PATCH results|borderStyle=solid}
asterisk-dev asterisk-13.15.0]# patch -b -p1 <./ASTERISK-26973.diff
patching file main/channel.c
Hunk #1 FAILED at 5231.
1 out of 1 hunk FAILED -- saving rejects to file main/channel.c.rej
{code}

By: Kevin Harwell (kharwell) 2017-05-30 16:07:28.130-0500

It's odd that the patch would not apply. I tried using the exact command you did on 13.15.0 and it applied fine with no errors.

That being said there was a bug in the last patch anyway, but I have since updated it and re-attached the patch. This one I specifically built and created the patch against 13.15.0, so hopefully you'll have better luck applying it.

By: Michel R. Vaillancourt (jkl5group) 2017-06-02 08:09:04.449-0500

[root@asterisk-dev asterisk-13.16.0]# patch -b -p1 <../ASTERISK-26973.diff
patching file main/channel.c

asterisk-dev*CLI> core show version
Asterisk 13.16.0 built by root @ asterisk-dev.ansspc.com on a x86_64 running Linux on 2017-06-02 12:57:22 UTC

... so we'll test against the latest version, given the recent security notices that have come out.  I'll have results back today.

Do you want a core-dump analysis / backtrace from this test, if it segfaults with the patch applied?

By: Kevin Harwell (kharwell) 2017-06-02 09:12:08.892-0500

{quote}
Do you want a core-dump analysis / backtrace from this test, if it segfaults with the patch applied?
{quote}
Yes

By: Michel R. Vaillancourt (jkl5group) 2017-06-02 15:08:28.602-0500

Good news!  Issue seems resolved with this patch on the testing environment.

We'll roll this out to production on Monday or Tuesday and see how it holds.

By: Friendly Automation (friendly-automation) 2017-06-07 08:00:18.406-0500

Change 5757 merged by Jenkins2:
channel: ast_write frame wrongly freed after call to audiohooks

[https://gerrit.asterisk.org/5757|https://gerrit.asterisk.org/5757]

By: Friendly Automation (friendly-automation) 2017-06-07 08:08:46.093-0500

Change 5759 merged by Jenkins2:
channel: ast_write frame wrongly freed after call to audiohooks

[https://gerrit.asterisk.org/5759|https://gerrit.asterisk.org/5759]

By: Friendly Automation (friendly-automation) 2017-06-07 08:13:16.186-0500

Change 5758 merged by Joshua Colp:
channel: ast_write frame wrongly freed after call to audiohooks

[https://gerrit.asterisk.org/5758|https://gerrit.asterisk.org/5758]