[Home]

Summary:ASTERISK-29134: app_mixmonitor: Crash in AMI StopMixMonitor
Reporter:Joshua Elson (joshelson)Labels:
Date Opened:2020-10-19 15:48:03Date Closed:2021-01-13 10:10:08.000-0600
Priority:MinorRegression?
Status:Closed/CompleteComponents:Applications/app_mixmonitor
Versions:17.7.0 Frequency of
Occurrence
Related
Issues:
is related toASTERISK-28947 Segmentation fault in mixmonitor_ds_destroy
Environment:Attachments:( 0) asterisk-core-20201019-112326-brief.txt
( 1) asterisk-core-20201019-112326-full.txt
( 2) asterisk-core-20201019-112326-info.txt
( 3) asterisk-core-20201019-112326-locks.txt
( 4) asterisk-core-20201019-112326-thread1.txt
( 5) asterisk-core-20201110-154106-brief.txt
( 6) asterisk-core-20201110-154106-full.txt
( 7) asterisk-core-20201110-154106-info.txt
( 8) asterisk-core-20201110-154106-locks.txt
( 9) asterisk-core-20201110-154106-thread1.txt
(10) mmlog
(11) mmlog.txt
Description:Seeing a new crash that is occurring occasionally when issuing an out of band StopMixMonitor via AMI on a channel that is recording a call.

Don't have a complete repro scenario, but looking to see if I can generate one.
Comments:By: Asterisk Team (asteriskteam) 2020-10-19 15:48:03.782-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Joshua Elson (joshelson) 2020-10-19 15:48:56.730-0500

All the backtrace things attached here.

By: Joshua C. Colp (jcolp) 2020-10-21 03:55:53.174-0500

If you are able to create a case that reproduces this please respond so, and this seems to be memory corruption too so may be same underlying issue as your other one.

By: Asterisk Team (asteriskteam) 2020-11-06 12:00:00.965-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Joshua Elson (joshelson) 2020-11-10 17:02:25.010-0600

Got it! Finally. Managed to get the issue reproduced with mmlog on. Attaching the mmlog output and a new set of backtraces.

This box was updated to 17.8.1 for the last backtrace, so it's happening on a fully updated version of Asterisk.

Hope this helps identify what we've got going on here.

By: Asterisk Team (asteriskteam) 2020-11-10 17:02:26.008-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Sean Bright (seanbright) 2020-11-11 09:16:55.282-0600

How are you calling MixMonitor? Can you should a sample invocation?

By: Joshua Elson (joshelson) 2020-11-11 10:45:08.241-0600

This is definitely being called out of band via AMI, though it's possible there might be a dial plan invocation as well. Let me see if I can get some exact code for this.

By: Joshua Elson (joshelson) 2020-11-11 12:39:54.349-0600

Just did a PCAP of this in motion here. We are using a super basic invocation of StopMixMonitor via AMI. The channel may or may not be in a bridged state:

......
##
T 10.100.a.b:36858 -> 10.100.x.y:5038 [AP] #117
Action: StopMixMonitor.
Channel: PJSIP/SBC-00000a41.
ActionID: 11185-1605101125.
.

#
T 10.100.x.y:5038 -> 10.100.a.b:36858 [AP] #118
Response: Success.

By: George Joseph (gjoseph) 2020-11-13 08:00:08.706-0600

How are you starting MixMonitor though?  Also from AMI?
I can't seem to reproduce a simple scenario so could you give me a dialplan snippet, AMI command used to start MixMonitor,  and a scenario walkthrough?

I tried a simple call from Alice to Bob then used AMI to start MixMonitor on Alice's channel and stopped it with no problem.  Then I tried MixMonitor on Bob's channel and stopped it with no problem.

Also if you still have those coredumps around, can you run ast_coredumper again with the following options:
{{/var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps --no-default-search <path_to_coredump>}}

The resulting tarball will be big so you can host it on Google Drive, DropBox, etc and send the link to asteriskteam@digium.com.

Do this for both coredumps if you still have them.

{panel:title=Private Submission of Data Disclaimer}
You have indicated that you wish to submit unredacted information privately. It is not recommended to do this as it will substantially restrict the number of individuals who can help with your issue, as submitted information is only available to Sangoma. Note that submission of such information does not change the priorization of this issue. If you still wish to proceed you may do so by sending it to asteriskteam@digium.com with the issue number in the subject. For large files please send a link where they can then be downloaded. By sending this information you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/. Any exchange of private information between you and an Asterisk community member outside of the Asterisk JIRA is not subject to the Website Terms of Use and should be privately discussed between yourself and the Asterisk community member.
{panel}

By: Joshua Elson (joshelson) 2020-11-13 12:08:07.035-0600

MixMonitor is being started via the dialplan on the outbound call leg and stopped via AMI. In some cases, the start and the stop may happen in very rapid sequence.

I believe I saved this core, so I'll get the whole dump over to you in just a few minutes here.

By: George Joseph (gjoseph) 2020-11-16 08:53:41.816-0600

Well, the latest coredump definitely shows a memory corruption...

{{0x00007faebb41faa6 in _int_malloc (ar_ptr=0x7fae00000020, ptr=0x7fae00000012, str=0x7faebb525c3a "malloc(): memory corruption", action=<optimized out>) at malloc.c:4967}}

Whether MixMonitor is the victim or culprit I can't tell right now. Was there anything of interest in kernel or system logs at the time of the crash?

Anyway, I'm opening an internal issue and will hang on to the coredumps.  The last one had a lot of good info in it.  You can delete them from google drive at any time.


By: Joshua Elson (joshelson) 2020-11-16 11:44:08.661-0600

It looks my kernel/system logs have rotated out, so I'm not totally sure about this, but we've been having this crash a dozen or more times a day, so I'll be able to check for it today and will post back here.

Given how frequent it is, is there anything else it would be helpful to instrument or set up for testing to get this reproduced?

Thanks as always for the help!

By: Joshua Elson (joshelson) 2020-11-16 16:58:58.392-0600

So got a bunch more crashes with malloc_debug on. Here's what appears in kernel logs:

Dmesg:

[34021.798540] asterisk[77363]: segfault at 7feffffffffe ip 0000000000462bc5 sp 00007fefe2df6980 error 4 in asterisk[400000+2ee000]
[34168.186284] asterisk[77517]: segfault at 10 ip 00007fab000e60d5 sp 00007faa887676b8 error 4 in libpthread-2.17.so[7fab000da000+17000]
[34338.317698] asterisk[86112]: segfault at 7f16fffffffe ip 0000000000462bc5 sp 00007f179abf9980 error 4 in asterisk[400000+2ee000]
[36186.976858] traps: asterisk[2394] general protection ip:4e0164 sp:7f671db686e0 error:0 in asterisk[400000+2ee000]
[36392.238229] asterisk[2521]: segfault at fffffffc ip 000000000046233b sp 00007f8fc80e4690 error 4 in asterisk[400000+2ee000]

Messages:

2020-11-16 15:45:24.694 ubluprdasv32v asterisk[2521]: segfault at fffffffc ip 000000000046233b sp 00007f8fc80e4690 error 4 in asterisk[400000+2ee000]
2020-11-16 15:45:24.694 ubluprdasv32v node=ubluprdasv32v type=ANOM_ABEND msg=audit(1605566724.693:4323): auid=4294967295 uid=1001 gid=1001 ses=4294967295 pid=2496 comm="asterisk" reason="memory violation" sig=11
2020-11-16 15:45:24.723 ubluprdasv32v Process 9 (asterisk) of user 1001 killed by SIGSEGV - dumping core

2020-11-16 15:08:20.338 ubluprdasv32v node=ubluprdasv32v type=ANOM_ABEND msg=audit(1605564500.335:4040): auid=4294967295 uid=1001 gid=1001 ses=4294967295 pid=77483 comm="asterisk" reason="memory violation" sig=11
2020-11-16 15:08:20.338 ubluprdasv32v asterisk[77517]: segfault at 10 ip 00007fab000e60d5 sp 00007faa887676b8 error 4 in libpthread-2.17.so[7fab000da000+17000]
2020-11-16 15:08:20.366 ubluprdasv32v Process 9 (asterisk) of user 1001 killed by SIGSEGV - dumping core

If a whole core file tarball is needed for any of these, let me know.

By: Kevin Harwell (kharwell) 2021-01-06 16:41:11.343-0600

While looking into ASTERISK-28947 I also took a look to see what might be causing this issue. Nothing obvious as to the cause stood out unfortunately. However, while the backtraces are different between the two issues the [^mmlog.txt] on this issue shows bad memory at a similar place in the code where ASTERISK-28947 was crashing.

[~joshelson] wrote:
{quote}In some cases, the start and the stop may happen in very rapid sequence.{quote}

This too was essentially what would triggered ASTERISK-28947 combined with a hangup on the channel.

Not entirely sure, but it's possible the fix for ASTERISK-28947 might fix this issue too. I'd recommend trying out the [patch|https://gerrit.asterisk.org/c/asterisk/+/15235] to see if it helps.


By: Joshua Elson (joshelson) 2021-01-06 18:44:38.501-0600

Awesome! I was just looking back at causes of this and was planning to reengage on the resolution here.

I'll test with this. Thanks a bunch for the update here!

By: Joshua Elson (joshelson) 2021-01-12 18:46:04.225-0600

Hey [~kharwell]!

This patch resolves the issue I'd been seeing as well. There is one other path I believe is causing an issue, but it is quite rare for us compared to this one.

This ticket can be closed.

Josh

By: Kevin Harwell (kharwell) 2021-01-13 10:09:31.133-0600

Very good to hear it fixed this issue too. Thanks for reporting back.