[Home]

Summary:ASTERISK-29604: ari: Segfault with lots of calls
Reporter:Danila Evgrafov (EvgrDan)Labels:
Date Opened:2021-08-18 04:37:06Date Closed:2023-01-30 08:26:20.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_stasis Channels/chan_sip/General
Versions:16.18.0 16.20.0 18.6.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-30383 Asterisk crashes on snoop channel handle
Environment:Asterisk version: 16.20.0 OS: Ubuntu 16.04 LTS Hardware: 4 core CPU, 8Gb RAM, Digital ocean VPSAttachments:( 0) core-brief.txt
( 1) core-brief-2.txt
( 2) core-brief-3.txt
( 3) core-brief-5.txt
( 4) core-brief-6.txt
( 5) core-full.txt
( 6) core-full-2.txt
( 7) core-full-3.txt
( 8) core-full-5.txt
( 9) core-full-6.txt
(10) core-info.txt
(11) core-info-2.txt
(12) core-info-3.txt
(13) core-info-5.txt
(14) core-info-6.txt
(15) core-locks.txt
(16) core-locks-2.txt
(17) core-locks-3.txt
(18) core-locks-5.txt
(19) core-locks-6.txt
(20) core-thread1.txt
(21) core-thread1-2.txt
(22) core-thread1-3.txt
(23) core-thread1-5.txt
(24) core-thread1-6.txt
(25) debug_log_29604_5.txt
(26) debug_log_29604_6.txt
(27) debug_log_29604.txt
Description:Asterisk randomly crush when processing multiple calls on stasis app.
I noticed that crushes occures more often when processing playbacks with long audio. But i am not sure if there is some relation.
My ARI app creating outgoing calls. Crush seems to occure when many calls are answered at the same time and start playbacks. I don't know exact count of simultaneous calls, sometimes problem happens on 40 calls, sometimes on 100.

Protocol: SIP
Audio format: gsm

Syslog:
kernel: [1998361.116212] asterisk[18177]: segfault at 15c ip 0000000000511aef sp 00007fc593c1d480 error 4 in asterisk[400000+326000]

When call answered:
1) app creates bridge1 and append called channel to it
2) app starts recording on bridge1
3) app creates bridge2
2) app creates snoop channel from called channel
3) app creates external media channel
4) app appends snoop and media channels to bridge2
5) app starts playback on bridge1
Comments:By: Asterisk Team (asteriskteam) 2021-08-18 04:37:09.956-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Joshua C. Colp (jcolp) 2021-08-18 04:40:50.620-0500

You've marked this as a regression. Did this previously work in an older version and now is not? If so, what version?

Can you explain your exact usage of ARI? You've mentioned creating outgoing calls, but the crash is actually in Snoop channels.

By: Danila Evgrafov (EvgrDan) 2021-08-18 04:54:29.224-0500

Sorry, regression is a mistake.
When call answered:
1) app creates bridge1 and append called channel to it
2) app starts recording on bridge1
3) app creates bridge2
2) app creates snoop channel from called channel
3) app creates external media channel
4) app appends snoop and media channels to bridge2
5) app starts playback on bridge1

By: Joshua C. Colp (jcolp) 2021-08-18 05:12:49.740-0500

We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Danila Evgrafov (EvgrDan) 2021-08-18 05:37:53.916-0500

Ok, sure. I attached debug log file to the issue https://issues.asterisk.org/jira/secure/attachment/60715/debug_log_29604.txt

By: Joshua C. Colp (jcolp) 2021-08-18 06:12:16.419-0500

The information at the end of the debug log shows chan_sip retransmissions and destructions. Have you tried using chan_pjsip as a way to eliminate chan_sip as a possible cause of memory corruption?

By: Joshua C. Colp (jcolp) 2021-08-18 06:18:16.665-0500

And how often is this happening? Do you have other backtraces to see if it is in the same place or different?

By: Danila Evgrafov (EvgrDan) 2021-08-18 06:28:33.451-0500

On my production servers i face this problem about once a week. But on testing enviroment i am able to reproduce it every time when i sent many calls.
No, i haven't try pjsip, but i can do it. Yes i have other backtraces with different places. I will attach it to the issue within a minute.

By: Joshua C. Colp (jcolp) 2021-08-18 06:36:23.622-0500

That would indicate memory corruption in general. I'd first try switching away from chan_sip to eliminate that as a variable, as it is community supported.

By: Danila Evgrafov (EvgrDan) 2021-08-19 00:52:34.120-0500

I switched to chan_pjsip. Nothing changed. Same crushes

By: Danila Evgrafov (EvgrDan) 2021-08-23 01:26:01.492-0500

Anything else i can do for help solving this issue?

By: Benjamin Keith Ford (bford) 2021-08-23 08:43:14.822-0500

Can you provide logs of the crash while using chan_pjsip? Also, same question as with chan_sip, are the crashes occurring in multiple locations?

By: Danila Evgrafov (EvgrDan) 2021-08-24 01:06:49.978-0500

Attaching backtraces and debug logs for PJSIP
Number 5s are corresponding following syslog error:
kernel: [2507787.116398] asterisk[10176]: segfault at 15c ip 0000000000511aef sp 00007fb39b7c1480 error 4 in asterisk[400000+326000]
Number 6s are corresponding following syslog error:
kernel: [2508145.822659] traps: asterisk[15761] general protection ip:7f3f8a8d58e0 sp:7f3ed4279610 error:0 in res_stasis.so[7f3f8a8c7000+22000]

By: Danila Evgrafov (EvgrDan) 2021-08-24 01:12:01.060-0500

For chan_sip i already attached 3 backtraces for different locations:
Backtraces without numbers are corresponding following syslog error:
kernel: [1998361.116212] asterisk[18177]: segfault at 15c ip 0000000000511aef sp 00007fc593c1d480 error 4 in asterisk[400000+326000]
Number 2s are corresponding following syslog error:
kernel: [1985423.203495] traps: asterisk[18206] general protection ip:7fa463deb8e0 sp:7fa3f7c6e610 error:0 in res_stasis.so[7fa463ddd000+22000]
Number 3s are corresponding following syslog error:
kernel: [1986028.157977] asterisk[25618]: segfault at b0545 ip 00007fe4b4796d44 sp 00007fe3f9723890 error 4 in libpthread-2.23.so[7fe4b478d000+18000

By: Benjamin Keith Ford (bford) 2021-08-24 07:38:02.778-0500

As Josh mentioned, this could be memory corruption. The crashes are not occurring in the same place. We can try to produce the issue if you are willing to provide your ARI script / any dialplan setup you are doing and that will hopefully be enough, but these tend to be nasty issues to track down.

By: Danila Evgrafov (EvgrDan) 2021-08-24 23:07:48.114-0500

Ok, i will try to simplify my ari script so that i can provide it to you

By: Danila Evgrafov (EvgrDan) 2021-08-25 04:31:31.588-0500

This is very simplified app. App only creates bridge, add channel to it and play sound file on this bridge. But with 100 simultaneous calls it produce same crashes.
https://github.com/EvgrafovDR/ari_test

UPD: on my local machine asterisk crashes on 200 calls

By: Benjamin Keith Ford (bford) 2021-08-25 10:18:31.324-0500

Thanks for providing that! I'm going to open a ticket for this to investigate further.

By: Adrian Hand (adrianhand) 2022-06-20 17:10:49.004-0500

Hello! I have an issue that looks very similar to this. Can you tell me if it was ever resolved? Thank you!

By: Joshua C. Colp (jcolp) 2022-06-20 17:18:31.906-0500

This issue itself is open, so no fix tagged against it has been merged.

By: Artem (ari_segfault) 2022-07-30 07:31:54.867-0500

I have the same problem. I noticed that version Asterisk 17.7.0 works more stably.
Now I will try to test it on real calls.

I'll be back as soon as I get the results.

By: Adrian Hand (adrianhand) 2022-07-30 12:40:25.167-0500

I’ll be curious to see what you come back with, Artem. I’ve sunk a lot of time into trying to understand the underlying issue in code with no progress. My “remedy” so far has been a systemd setup that automatically restarts asterisk when the crash happens.

By: Artem (ari_segfault) 2022-07-30 12:55:30.398-0500

It has been working stably on the production server for two hours now.
Perhaps it is not the asterisk version that is affected, but the system version. This time the container was built on debian.
The results will be known for sure tomorrow

using the container below:
docker pull lucasvs/asterisk:17.7.0-v1

By: Artem (ari_segfault) 2022-07-31 05:40:19.204-0500

Adrian Hand, Good news. I usually get a segfault at this time, but it's not there right now.
I also found out that the system version is not affected.
My experiments are below:

Debian + Asterisk 18.5.0  segfault at 15c ip 000055f2aa279a69 sp 00007fc67cd42710 error 4 in asterisk[55f2aa0c9000+309000]
Ubuntu + Asterisk 18.4.0  segfault at 15c ip 00005579948ae39d sp 00007fdec1bf63f0 error 4 in asterisk[55799472f000+235000]
Debian + Asterisk 18.2.0 segfault at 15c ip 0000556373e85256 sp 00007f7de7ffd3f0 error 4 in asterisk[556373d07000+234000]
Ubuntu + Asterisk 18.13.0 segfault at 15c ip 00005556538fcd3f sp 00007f76e94c6710 error 4 in asterisk[55565379d000+206000]

Debian + Asterisk 18.1.0 looks like OK
Debian + Asterisk 18.0.0 looks like OK

Debian + Asterisk 17.7.0 OK - production server

By: Artem (ari_segfault) 2022-07-31 08:40:23.555-0500

I went even further and found a patch and even a specific file that caused the problems.
All because of the patch for the task below:
https://issues.asterisk.org/jira/browse/ASTERISK-28883

The patch itself:
https://gerrit.asterisk.org/c/asterisk/+/14889

The file causing the error:
https://gerrit.asterisk.org/c/asterisk/+/14889/6/res/res_stasis_snoop.c#70

What specific change is not yet clear. I don't understand C

By: Adrian Hand (adrianhand) 2022-07-31 16:00:16.401-0500

This is mega, Artem, thank you 🙏🙏🙏 I feel inclined to just push 18.1.0 to all my servers to buy stability for now. I am in the same boat as you regards understanding the code enough to explain what’s happening, but it’s fantastic that your detective work has revealed the source through change history. What remedy are you hoping for here?

By: Artem (ari_segfault) 2022-07-31 16:35:25.768-0500

I removed the changes to this file from the patch before building Asterisk 18.2.0.
I didn't find any problems, but it's better to wait for the next working day and get recommendations from Joshua Colp

By: Joshua C. Colp (jcolp) 2022-07-31 18:13:01.742-0500

The only thing I can say is that this line:

{code}
spyee_snapshot = ast_channel_snapshot_get_latest(ast_channel_uniqueid(snoop->spyee_chan));
{code}

Assumes that snoop->spyee_chan is non-NULL. If that's not the case, then it would crash.

By: Artem (ari_segfault) 2022-07-31 18:50:49.403-0500

There were no problems after adding a NULL check to all three places
If at least one is removed, then segfault crashes on my tests

{code}
 if(snoop->spyee_chan) {
   ast_channel_cleanup(snoop->spyee_chan);
 }
{code}

{code}
 if(snoop->spyee_chan) {
   spyee_snapshot = ast_channel_snapshot_get_latest(ast_channel_uniqueid(snoop->spyee_chan));
   if (spyee_snapshot) {
     ast_multi_channel_blob_add_channel(payload, "spyee_channel", spyee_snapshot);
   }
 }
{code}
{code}
 /* Keep a reference to the channel we are spying on */
 if(chan) {
   snoop->spyee_chan = ast_channel_ref(chan);
 }
{code}

By: Adrian Hand (adrianhand) 2022-08-25 11:19:07.024-0500

Are you happy with your resolution after a month, Artem? Still running 18.2.0 with your manual adjustments to the one file?

By: Danila Evgrafov (EvgrDan) 2022-09-06 23:15:58.372-0500

Artem, thank you for your investigation, i'm running asterisk 18.1.0 for about a week on a couple of my production servers and it looks like everything is alright!

By: Artem (ari_segfault) 2022-09-07 04:41:23.950-0500

Adrian Hand, I have compiled version 18.13 with my changes and it has been working for a week without failures
Danila Evgrafov, I was glad to help

By: Artem (ari_segfault) 2022-11-28 03:32:23.090-0600

The version with my fixes started to crash "segfalt" with an error at high load.
It helps to completely solve the problem only by rolling back all changes in res_stasis_snoop.c
It is not yet clear whether this will cause other problems

By: Friendly Automation (friendly-automation) 2023-01-30 08:26:21.667-0600

Change 19863 merged by Friendly Automation:
res_stasis_snoop: Fix snoop crash

[https://gerrit.asterisk.org/c/asterisk/+/19863|https://gerrit.asterisk.org/c/asterisk/+/19863]

By: Friendly Automation (friendly-automation) 2023-01-30 08:28:35.353-0600

Change 19781 merged by Friendly Automation:
res_stasis_snoop: Fix snoop crash

[https://gerrit.asterisk.org/c/asterisk/+/19781|https://gerrit.asterisk.org/c/asterisk/+/19781]

By: Friendly Automation (friendly-automation) 2023-01-30 08:31:31.489-0600

Change 19862 merged by Friendly Automation:
res_stasis_snoop: Fix snoop crash

[https://gerrit.asterisk.org/c/asterisk/+/19862|https://gerrit.asterisk.org/c/asterisk/+/19862]