ASTERISK-29120: Crash: ast_bridge_channel_queue

Summary: ASTERISK-29120: Crash: ast_bridge_channel_queue_frame on bridge softmix

Reporter: Joshua Elson (joshelson)
Labels:
Date Opened: 2020-10-09 16:04:51
Date Closed: 2020-11-04 12:00:01.000-0600
Priority: Major
Regression?
Status: Closed/Complete
Components: Bridges/bridge_softmix
Versions: 17.7.0
Frequency of Occurrence:
Related Issues:
Environment: CentOS 7 fully patched
Attachments:
( 0) asterisk-core-20201009-144056-brief.txt
( 1) asterisk-core-20201009-144056-full.txt
( 2) asterisk-core-20201009-144056-info.txt
( 3) asterisk-core-20201009-144056-locks.txt
( 4) asterisk-core-20201009-144056-thread1.txt

Description: Experiencing this crash daily as well on multiple identical servers in this bridged configuration. In this case, pjproject.conf cache_pools was set to no.

Comments:

By: Asterisk Team (asteriskteam) 2020-10-09 16:04:51.661-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Joshua Elson (joshelson) 2020-10-09 16:05:14.067-0500

Backtrace info attached.

By: Kevin Harwell (kharwell) 2020-10-12 15:34:12.649-0500

This looks like a memory corruption.

Would it be possible to run Asterisk with Valgrind [1]? If not that, then can you run Asterisk with _MALLOC_DEBUG_ [2] enabled?

Note, do *not* execute Asterisk with both _MALLOC_DEBUG_ enabled and Valgrind at the same time.

And then attach the results and/or new backtraces and logs here.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Valgrind
[2] https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag

By: Joshua Elson (joshelson) 2020-10-12 17:27:32.261-0500

Valgrind is probably out of the question for production, but I'm going to flip on MALLOC_DEBUG and see if we can get a repro. Should take us just a day or two to reproduce.

By: Kevin Harwell (kharwell) 2020-10-12 17:53:23.708-0500

heh yeah while Valgrind can be quite useful it will certainly slow things down.

Hopefully MALLOC_DEBUG will turn something up.

By: Joshua Elson (joshelson) 2020-10-19 13:14:41.857-0500

Well... so I was able to run it on a few nodes, but virtually all of them began to experience a new issue under production load before I could get any fence violations from MALLOC_DEBUG.

The symptom was that the whole system would lock up after a few hours of use, with these errors repeated a number of times in the logs:

```
NOTICE[48]: res_pjsip/pjsip_transport_management.c:170 idle_sched_cb: Shutting down transport 'WS to 127.0.0.1:51368' since no request was received in 32 seconds
WARNING[117930]: res_http_websocket.c:559 ws_safe_read: Web socket closed abruptly
```
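To gauge how often each of these messages occurs in the hours before a lockup, the log can be tallied per level and function. The following is only a minimal sketch: the regular expression is modeled on the two lines quoted above, `summarize` and `sample` are hypothetical names, and other log layouts (for example, lines carrying a call ID after the thread ID) would need a looser pattern.

```python
import re
from collections import Counter

# Pattern modeled on the two quoted log lines:
#   LEVEL[thread]: source_file:line function: message
# It is searched for anywhere in the line, so a leading timestamp is tolerated.
LOG_RE = re.compile(
    r"(?P<level>[A-Z]+)\[(?P<thread>\d+)\]: "
    r"(?P<source>[^:\s]+):(?P<lineno>\d+) (?P<func>\w+): (?P<msg>.*)"
)


def summarize(lines):
    """Count log entries per (level, function) pair, skipping unparsable lines."""
    counts = Counter()
    for raw in lines:
        m = LOG_RE.search(raw)
        if m:
            counts[(m.group("level"), m.group("func"))] += 1
    return counts


# Sample input: the two lines from this comment plus one line that does not parse.
sample = [
    "NOTICE[48]: res_pjsip/pjsip_transport_management.c:170 idle_sched_cb: "
    "Shutting down transport 'WS to 127.0.0.1:51368' since no request was "
    "received in 32 seconds",
    "WARNING[117930]: res_http_websocket.c:559 ws_safe_read: "
    "Web socket closed abruptly",
    "not a log line",
]

# Print one row per (level, function) pair, most frequent first.
for (level, func), n in summarize(sample).most_common():
    print(f"{level:8} {func:16} {n}")
```

Against a real log, passing an open file object to `summarize` in place of `sample` should work the same way, since it only iterates over lines.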

We're running with pjproject pool caching disabled (cache_pools=no), but I'm not sure if there's anything I can do to prevent that issue or work around it to get debug output on this.

Any ideas?

By: Joshua C. Colp (jcolp) 2020-10-21 03:52:54.020-0500

If Asterisk appears to be locking up, then you'd need to get a deadlock backtrace. It may be that MALLOC_DEBUG is still too much overhead for your specific usage patterns.

Nothing else really comes to mind for getting information on this specific issue, besides a test environment that reproduces the issue where Valgrind can be used.

By: Asterisk Team (asteriskteam) 2020-11-04 12:00:00.961-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened, please create a new issue instead. If the new issue is related to this one, a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines