ASTERISK-26868: ARI: Asterisk crash - frame copy into invalid memory during bridging operations

[Home]

Summary: ASTERISK-26868: ARI: Asterisk crash - frame copy into invalid memory during bridging operations

Reporter: Marin Odrljin (modrljin) Labels:

Date Opened: 2017-03-14 05:00:04 Date Closed:

Priority: Major Regression? No

Status: Open/New Components: Bridges/bridge_simple Resources/res_ari_bridges Resources/res_stasis

Versions: 13.11.2 14.2.1 Frequency of
Occurrence Occasional

Related
Issues:

Environment: Debian 8 Attachments: ( 0) 20170317_2_backtrace.txt
( 1) 20170317_2_debug_log_crash.txt
( 2) 20170317_2_var-log-messages.txt
( 3) 20170317_backtrace.txt
( 4) 20170317_debug_log.txt
( 5) 20170317_var-log-messages.txt
( 6) 20170318_backtrace.txt
( 7) 20170318_debug_log_crash.txt
( 8) 20170318_var-log-messages.txt
( 9) 20170319_backtrace.txt
(10) 20170319_debug_log_crash.txt
(11) 20170319_var-log-messages.txt
(12) backtrace.txt
(13) cli_log_-_simulation.txt
(14) debug_log_-_simulation.log
(15) java_app_log.txt

Description: Asterisk occasionally crashes while adding channel to bridge with ARI. I'm not sure if this is because of ARI or not, and also not 100% sure if the real problem is adding channel to bridge, but I think so. This happens in one of thousands of calls so it is not so obvious and easy to reproduce. But I got segfault several times in last few months on both asterisk 13.11.2 and 14.2.1.

I have analyzed my app's log and it looks like every time it happened after the same two actions: 1. enter bridge, 2. answer call. Currently I'm mostly working on my routing app which receives IN call, finds route for OUT call, dials it and joins them together in bridge. So basically the flow of the call is always the same:

- IN call comes to asterisk Stasis app
- App sends Ring to IN call (channel)
- Create bridge
- Add IN call (channel) to bridge
- Dial on OUT call
- Add OUT call (channel) to bridge (after it is answered)
- Answer IN call

Crash always happens on last two steps: add OUT call to bridge and answer IN call.

Finally, after 1-2 months I managed to catch core dump so I'm attaching backtrace file.

Comments: By: Asterisk Team (asteriskteam) 2017-03-14 05:00:05.949-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Joshua C. Colp (jcolp) 2017-03-14 05:49:50.395-0500

It crashed because copying of a frame went into invalid memory, but there's no information about how it got there, where it came from, etc. This is very much like looking for a needle in a haystack.

To help narrow things down we'll need more information. A console log showing the interaction before the crash would be useful. Information about the configuration would also be useful - for example, what channel driver are you using? Are you using WebRTC?
By: Marin Odrljin (modrljin) 2017-03-14 10:39:35.732-0500

Since this crash happens very rarely, in one of 10 000 to 100 000 calls, I can't turn on debug log whole the time. First time I got this crash, SIP channel (chan_sip) was dialing in, out call was dialed to DAHDI and they were put in the bridge. This happened on asterisk 13.11.2. One month later, the same crash happened on asterisk 14.2.1 and the call flow was exactly the same. Since then I made test app in the environment that will simulate this situation in loop (mass test) and hopefuly catch a backtrace on crash. After probably 30.000 calls and X days it finally happened. If I turned on debug log from the start, I'm not sure how big a debug file would be. Probably too big for the system.

What I have is following (attached):
- java app log: java_app_log.txt
- simulation of cli log: cli_log - simulation.txt
- simulation of debug log: debug_log - simulation.log

Simulation files are made afterwards with new call that didn't crash asterisk, but with exactly the same call flow as it happened while it crashed. I have marked possible crash position in the files with ############ ... Asterisk crash ... ############

I'm not using WebRTC.

Hopefuly this helps.
By: Rusty Newton (rnewton) 2017-03-14 11:01:59.290-0500

bq. If I turned on debug log from the start, I'm not sure how big a debug file would be. Probably too big for the system.

You should automatically rotate the log when it reaches a reasonable size and delete the old if a crash hasn't happened yet. There is no reason to allow the log to grow out of control. When the crash happens provide the current debug log.
By: Marin Odrljin (modrljin) 2017-03-15 07:53:32.469-0500

OK, I'll do that. Please leave this issue opened until we get this log.
By: Marin Odrljin (modrljin) 2017-03-17 04:22:57.180-0500

It looks like I became an expert for crashing asterisk :-) New files are starting with 20170317*
By: Marin Odrljin (modrljin) 2017-03-17 09:16:39.757-0500

Yet another crash, but this time I was using early bridging and got different problem in res_stasis.so

Mar 17 14:40:30 maxvoip kernel: [5725818.768795] traps: asterisk[11735] general protection ip:7f6b1bdf5e02 sp:7f6ae745ea40 error:0 in res_stasis.so[7f6b1bddf000+20000]

Attached files are starting with 20170317_2_*
By: Rusty Newton (rnewton) 2017-03-17 15:01:37.983-0500

Thanks Marin! You are an expert now! :)
By: Sean Bright (seanbright) 2017-03-17 19:43:42.117-0500

[~modrljin], if you still have the core file, could you run the following:

{noformat}
(gdb) f 0
(gdb) p control->bridge
{noformat}

And let us know what the output is?
By: Marin Odrljin (modrljin) 2017-03-18 02:01:34.142-0500

I have core files. I'm not sure what else I have to execute before, but when I do it in following order:

# gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c core > /tmp/backtrace.txt
# gdb -se "asterisk" -c core | tee /tmp/backtrace.txt
set logging on
(gdb) f 0
(gdb) p control->bridge

I get following:

For backtrace 20170317_backtrace.txt:
(gdb) p 0
$1 = 0
(gdb) p control->bridge
No symbol "control" in current context.

For backtrace 20170317_2_backtrace.txt:
(gdb) p 0
$1 = 0
(gdb) p control->bridge
$2 = (struct ast_bridge *) 0x632e6c6f72746e6f

Let me know if I have to do anything else differently before this. Also, I can send you core files if you like and then you can do whatever you want.
By: Marin Odrljin (modrljin) 2017-03-18 02:28:13.992-0500

What I have noticed in file '20170317_2_debug_log_crash.txt' is line: [Mar 17 14:40:29] ERROR[9860] tcptls.c: Accept failed: Too many open files

FYI in my start script I use: ulimit -n 4096

I'm attaching new crash 20170318_* which looks different to me.
By: Marin Odrljin (modrljin) 2017-03-20 03:56:48.713-0500

I've also tried to replace SIP with PJSIP just to eliminate old chan_sip driver problems, but new crash came quickly (20170319_*). P.S. my app made last outdial to IAX2 because there was a problem with call to DAHDI and in configuration I had failover settings to IAX2 trunk.

I had an idea to test different combinations of SIP, PJSIP, IAX2 and DAHDI calls with bridging, just to see if the same problems will happen and to eliminate eventual bugs in specific drivers. But since I got this last crash quite fast without using chan_sip, I suppose the problem is not there but in some general bridging operation. Can anyone confirm that, so I don't go on with all those tests?
By: Joshua C. Colp (jcolp) 2017-03-20 05:18:53.753-0500

The issue is in off-nominal bridge code:

{noformat}
[Mar 19 23:53:55] WARNING[18782] bridge_channel.c: Can't create pipe! Try increasing max file descriptors with ulimit -n
[Mar 19 23:53:55] ERROR[18782] stasis/control.c: Error adding channel to bridge
{noformat}

In fact this issue was already triaged and accepted.
By: Marin Odrljin (modrljin) 2017-03-20 11:41:36.582-0500

OK. Does it mean that we have only one problem here or probably two: one in very first report and in 20170317_* (where early bridging is not used) and other in crashes reported later (with early bridging)? Crash with early bridging I can reproduce much faster, it happens on every few thousends calls, while other without early bridging doesn't occur so often and is harder to reproduce.
By: Joshua C. Colp (jcolp) 2017-03-20 11:47:38.633-0500

The latest backtrace you have provided is a separate issue.
By: Marin Odrljin (modrljin) 2017-07-17 06:35:12.911-0500

I think I have new valuable input in scenario without early bridging.

If I put a delay of 100 ms after OUT channel is answered and before it is added to bridge, I can't reproduce asterisk crash any more. To me it looks like newly created and answered OUT channel is sometimes not immediately "ready" yet to be put in bridge and then crash happens. Before I was adding channel to bridge as soon as possible after it is answered (after event ChannelStateChange 'Up') and then crash was occasionaly occuring.
By: Marin Odrljin (modrljin) 2019-05-07 03:17:09.017-0500

Is issue ASTERISK-28377 related to this one?
By: Joshua C. Colp (jcolp) 2019-05-07 04:40:02.963-0500

The change which fixed ASTERISK-28377 was only applicable to 16 and above (that is only where the code existed that caused the problem). You have tagged this issue with 13, so it is not the same problem.