Summary: ASTERISK-26219: segfault in ast_manager_build_channel_state_string_prefix
Reporter: Michel R. Vaillancourt (jkl5group)
Labels:
Date Opened: 2016-07-19 15:28:36
Date Closed: 2016-10-31 09:55:05
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Core/Channels
Versions: 13.11.0
Frequency of Occurrence: Occasional
Related Issues: is related to ASTERISK-26421 Segmentation Fault with ARI originate into mixing bridge with 43 clients
Environment: CentOS release 6.8 (Final); Linux asterisk2.ansspc.com 2.6.32-573.1.1.el6.x86_64 #1 SMP Sat Jul 25 17:05:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux; Asterisk GIT-13-ac6824eM built by root @ asterisk2.ansspc.com on a x86_64 running Linux on 2016-06-01 13:44:27 UTC
Attachments:
( 0) core.asterisk.10405.backtrace.txt
( 1) core.asterisk.10405.console.log
( 2) core.asterisk.24809.backtrace.txt
( 3) core.asterisk.26599.backtrace.txt
( 4) core.asterisk.26919.backtrace.txt
( 5) sept06segfault.full.log
( 6) sept07segfault.full.log
Description: Two segfaults in Asterisk within the past week.

{code}
Jul 12 18:23:05 asterisk2 kernel: asterisk[19715]: segfault at 108 ip 000000000053cd1f sp 00007ff376f24a70 error 4 in asterisk[400000+25e000]
Jul 19 14:48:04 asterisk2 kernel: asterisk[32177]: segfault at 108 ip 000000000053cd1f sp 00007f45e5b29a50 error 4 in asterisk[400000+25e000]
{code}

We'll provide a core dump with appropriate debugging enabled as soon as we can.
Comments:

By: Asterisk Team (asteriskteam) 2016-07-19 15:28:37.050-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2016-07-20 11:28:42.765-0500

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that led to the crash.
3. All configuration information necessary to reproduce the crash.

Thanks!

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace



By: Michel R. Vaillancourt (jkl5group) 2016-07-21 08:36:01.204-0500

Another crash:

{code}
Jul 20 16:36:34 asterisk2 kernel: asterisk[16328]: segfault at 108 ip 000000000055fd67 sp 00007f9cdb31f8e0 error 4 in asterisk[400000+34f000]
{code}

Rather frustratingly, even though both config files are set up to force Asterisk to generate a core dump on a segfault, it did not produce one.

In `/etc/asterisk/asterisk.conf` we have `dumpcore = yes` and in `/etc/sysconfig/asterisk` we have `COREDUMP=yes`

We'll schedule a restart tonight and manually pass it -g on the command line.
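
For reference, on Linux the kernel only writes a core file if the process's RLIMIT_CORE allows it, so a "dumpcore" style option ultimately has to translate into a setrlimit(2) call inside the process (and the surrounding shell's ulimit -c and the kernel's core_pattern must cooperate too). A minimal, purely illustrative C sketch of that call:

{code}
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	/* Raise the core-file size limit for this process; without this
	 * (or an equivalent "ulimit -c unlimited" before exec) the kernel
	 * writes no core file regardless of application config. */
	struct rlimit lim = { RLIM_INFINITY, RLIM_INFINITY };

	if (setrlimit(RLIMIT_CORE, &lim) < 0) {
		perror("setrlimit(RLIMIT_CORE)");
		return 1;
	}

	printf("core dumps enabled for this process\n");
	return 0;
}
{code}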

By: Asterisk Team (asteriskteam) 2016-08-04 12:00:01.935-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened, please create a new issue instead. If the new issue is related to this one, a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Michel R. Vaillancourt (jkl5group) 2016-09-02 10:19:58.171-0500

After a lot of arguing with the box, and about six more crashes of this kind, we finally got a core dump we could process.

By: Rusty Newton (rnewton) 2016-09-02 14:24:38.198-0500

Thanks! Can you provide any additional information about the crash? The state of the system? Call flow? Dialplan? Message/full logs with debug turned up?



By: Matt Jordan (mjordan) 2016-09-03 16:40:58.124-0500

The issue is that somehow, something is raising a hold indication without a channel. You can see that in the backtrace here:

{code}
#0  0x0000000000561d93 in ast_manager_build_channel_state_string_prefix (snapshot=0x0, prefix=0x6ce4db "")
   at manager_channels.c:417
       out = 0x7f23bc04bbe0
       res = 0
       caller_name = 0x7f241dd00aa0 "\300\n\320\035$\177"
       connected_name = 0x563b05 "UH\211\345SH\203\354xH\211}\270H\211u\260H\211U\250H\213E\250H\211\307\350\345\340\a"
#1  0x0000000000562273 in ast_manager_build_channel_state_string (snapshot=0x0) at manager_channels.c:483
No locals.
#2  0x0000000000563b54 in channel_hold_cb (data=0x0, sub=0x1d01c88, message=0x7f2380141e38) at manager_channels.c:1104
       obj = 0x7f23802caa68
       musicclass_string = 0x7f23bc009600
       channel_event_string = 0x1d020e8
       __PRETTY_FUNCTION__ = "channel_hold_cb"
#3  0x00000000005e2365 in router_dispatch (data=0x1d01f78, sub=0x1d01c88, message=0x7f2380141e38)
   at stasis_message_router.c:201
       router = 0x1d01f78
       route = {message_type = 0x1c09fa8, callback = 0x563b05 <channel_hold_cb>, data = 0x0}
{code}

Note that in {{ast_manager_build_channel_state_string}}, the snapshot is NULL. That's bad: the snapshot has to be pulled directly out of the {{obj}} that was passed onto the Stasis message bus, which means that when the message to be published was created, we were unable to get a snapshot for the channel out of the Stasis cache:

{code}
struct stasis_message *ast_channel_blob_create_from_cache(const char *channel_id,
	struct stasis_message_type *type,
	struct ast_json *blob)
{
	RAII_VAR(struct ast_channel_snapshot *, snapshot, NULL, ao2_cleanup);

	if (!type) {
		return NULL;
	}

	snapshot = ast_channel_snapshot_get_latest(channel_id);

	return create_channel_blob_message(snapshot, type, blob);
}
{code}

That's pretty odd, as I'm not sure how you'd get a hold indication raised on a channel whose state wasn't being tracked appropriately.
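
For illustration only, a guard of the following shape would have turned the NULL dereference into a dropped Manager event. This is a simplified, self-contained stand-in (the struct fields and function name are abbreviated from the real manager_channels.c code), not the actual fix:

{code}
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct ast_channel_snapshot. */
struct channel_snapshot {
	const char *name;
	const char *uniqueid;
};

/* Hypothetical defensive guard: if the snapshot is NULL (channel
 * missing from the Stasis cache), refuse to build the event string
 * instead of dereferencing the pointer and crashing. */
static char *build_channel_state_string_prefix(const struct channel_snapshot *snapshot,
	const char *prefix)
{
	char *out;

	if (!snapshot) {
		return NULL;
	}

	if (asprintf(&out, "%sChannel: %s\r\n%sUniqueid: %s\r\n",
			prefix, snapshot->name, prefix, snapshot->uniqueid) < 0) {
		return NULL;
	}
	return out;
}

int main(void)
{
	/* A NULL snapshot now yields NULL rather than a segfault. */
	printf("%s\n", build_channel_state_string_prefix(NULL, "") ? "built" : "dropped");
	return 0;
}
{code}

The real manager code builds such strings with the ast_str API; the sketch uses plain asprintf() to stay self-contained.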

What channel technologies are you using?

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 07:30:17.831-0500

Segfault that happened yesterday.

By: Joshua C. Colp (jcolp) 2016-09-07 07:32:51.862-0500

As [~mjordan] mentioned - what channel technologies are you using? Is it also possible to get a console log?

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 07:33:10.189-0500

So we are using ARI/Stasis to manage the entire dialplan from ring to click. The channel technology we are using, end-to-end, is PJSIP. The calls come in on VoIP trunks and are passed to VoIP agents; both are attached via PJSIP.

The agent is waiting in a bridge, and new calls being delivered to them are transferred by the ARI/Stasis application into the bridge.

Does that help?
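
For readers following along: the "transferred into the bridge" step described above corresponds to the ARI route POST /ari/bridges/{bridgeId}/addChannel. A minimal libcurl sketch of that call, with placeholder host, credentials, and IDs:

{code}
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
	CURL *curl;
	CURLcode res;
	char url[256];

	curl_global_init(CURL_GLOBAL_DEFAULT);
	curl = curl_easy_init();
	if (!curl) {
		return 1;
	}

	/* Placeholder bridge and channel IDs. */
	snprintf(url, sizeof(url),
		"http://localhost:8088/ari/bridges/%s/addChannel?channel=%s",
		"agent-bridge", "chan6262");

	curl_easy_setopt(curl, CURLOPT_URL, url);
	curl_easy_setopt(curl, CURLOPT_USERPWD, "ariuser:aripass");
	curl_easy_setopt(curl, CURLOPT_POSTFIELDS, ""); /* empty POST body */

	res = curl_easy_perform(curl);
	if (res != CURLE_OK) {
		fprintf(stderr, "addChannel failed: %s\n", curl_easy_strerror(res));
	}

	curl_easy_cleanup(curl);
	curl_global_cleanup();
	return res == CURLE_OK ? 0 : 1;
}
{code}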

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 07:33:59.520-0500

Good morning, Joshua. I'll see if I can pull the console log for yesterday's segfault. I'll post back shortly.

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 07:50:35.627-0500

Unfortunately, we did not have "verbose" enabled in our "messages" log file, so the only place that CLI output is being captured in any fashion is the "full" log file.

I've attached an extraction of the five minute window surrounding the crash.  I hope there is something valuable you can find in there.

We've changed our logger.conf to capture CLI output at Verbose(3) for the next time around.

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 12:25:45.591-0500

Segfault, Sep 7, 11:08 AM EDT.

Log extract to follow.

By: Michel R. Vaillancourt (jkl5group) 2016-09-07 12:33:29.444-0500

"Full" log of the 2 minutes around the SegFault.

By: Michel R. Vaillancourt (jkl5group) 2016-09-13 13:22:09.155-0500

We had another segfault today at 5:30 AM Pacific time. Do you folks want the backtrace for this one as well?

By: Rusty Newton (rnewton) 2016-09-13 19:04:24.370-0500

Another trace never hurts, and the accompanying log too. Thanks!

By: Michel R. Vaillancourt (jkl5group) 2016-09-14 08:06:59.005-0500

Segfault at Sep 13 08:29:47 EDT.

By: Michel R. Vaillancourt (jkl5group) 2016-10-06 13:42:35.740-0500

We get bitten in production by this about every three days.  Any progress so far?  Anything we can do to help?

By: Rusty Newton (rnewton) 2016-10-07 17:30:09.168-0500

Any updates will be posted on the issue. Core developers haven't had a chance to look into it yet; we have other issues affecting more people that are higher priority at the moment.

If you need quicker resolution you can always try to rally some community developers with a bug bounty: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

Otherwise we just have to wait until core developers are available. I don't have an ETA.


By: Joshua C. Colp (jcolp) 2016-10-29 12:27:03.765-0500

I'm looking into this issue, and at least one of the crashes (and maybe all of them) is a result of two channels having the same unique ID. Is it possible your ARI application has a race condition where two channels are created with the same unique ID at the same time? Unfortunately, with the logs you've provided I can't tell if this is the case. Note that Asterisk 13.13.0 will include a change which will detect and prevent originated calls from having the same unique ID as an existing channel.
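
The shape of such a detect-and-prevent check is roughly the following. This is a self-contained sketch, not the actual 13.13.0 patch, and channel_exists_by_uniqueid() here is a toy stand-in for a lookup against Asterisk's real channel container:

{code}
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the live channel container. */
static const char *live_uniqueids[] = { "chan6262", NULL };

static bool channel_exists_by_uniqueid(const char *uniqueid)
{
	for (int i = 0; live_uniqueids[i]; i++) {
		if (!strcmp(live_uniqueids[i], uniqueid)) {
			return true;
		}
	}
	return false;
}

/* Refuse an originate that reuses a live channel's unique ID, rather
 * than creating a second channel with the same identity. */
static int originate_with_uniqueid(const char *uniqueid)
{
	if (uniqueid && channel_exists_by_uniqueid(uniqueid)) {
		fprintf(stderr, "refusing originate: uniqueid '%s' already in use\n",
			uniqueid);
		return -1;
	}
	/* ... proceed with the normal originate path ... */
	return 0;
}

int main(void)
{
	originate_with_uniqueid("chan6262"); /* rejected: duplicate */
	originate_with_uniqueid("chan9999"); /* allowed */
	return 0;
}
{code}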

By: Joshua C. Colp (jcolp) 2016-10-31 07:48:23.381-0500

[~jkl5group] I just reproduced this using the scenario I outlined: two channels with the same uniqueid. The first one is hung up; the second one is told to play MOH. Crash.
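
In ARI terms the reproduction is roughly the sequence below. The routes (POST /channels, DELETE /channels/{id}, POST /channels/{id}/moh) are real ARI endpoints, while the host, credentials, SIP endpoints, and app name are placeholders:

{code}
#include <stdio.h>
#include <curl/curl.h>

/* Fire a single ARI request; placeholder host and credentials. */
static void ari(const char *method, const char *path)
{
	CURL *c = curl_easy_init();
	char url[512];

	if (!c) {
		return;
	}
	snprintf(url, sizeof(url), "http://localhost:8088/ari%s", path);
	curl_easy_setopt(c, CURLOPT_URL, url);
	curl_easy_setopt(c, CURLOPT_USERPWD, "ariuser:aripass");
	curl_easy_setopt(c, CURLOPT_CUSTOMREQUEST, method);
	curl_easy_perform(c);
	curl_easy_cleanup(c);
}

int main(void)
{
	curl_global_init(CURL_GLOBAL_DEFAULT);

	/* Two originates racing to the same channelId... */
	ari("POST", "/channels?endpoint=PJSIP/alice&app=myapp&channelId=chan6262");
	ari("POST", "/channels?endpoint=PJSIP/bob&app=myapp&channelId=chan6262");

	/* ...hang up the first... */
	ari("DELETE", "/channels/chan6262");

	/* ...then start MOH on the survivor: crash on an unpatched 13.x. */
	ari("POST", "/channels/chan6262/moh");

	curl_global_cleanup();
	return 0;
}
{code}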

By: Michel R. Vaillancourt (jkl5group) 2016-10-31 08:12:40.785-0500

Good morning, Joshua Colp. I'll check with our dev team today and get back to you. I'm not sure how we are doing channel creation.

By: Joshua C. Colp (jcolp) 2016-10-31 08:17:29.474-0500

[~jkl5group] You do appear to be specifying the uniqueid, and in the case of one of the crashes it was "chan6262". I'd guess it is based on an identity, and your application expects that only one channel with that ID will be up at a time, until a race condition causes that assumption to become false.

By: Michel R. Vaillancourt (jkl5group) 2016-10-31 09:38:24.900-0500

I've heard back from our development lead, and it is possible that this is triggered from our side. In short, the channel ID is based on the agent UID, and the current thought is that there might be a rare situation where an agent could get themselves logged in twice, thus creating the duplicate channel ID.

Our dev team is going to patch that out once they prove it is actually a thing.  In the meantime we'll keep our eyes open for 13.13.0 to come out.
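
One cheap client-side mitigation, independent of the Asterisk-side fix: keep the agent UID in the channelId for readability, but append something unique per session so even a double login cannot yield two live channels with the same unique ID. A sketch with a hypothetical naming scheme:

{code}
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical scheme: "<agent-uid>-<pid>-<sec>.<nsec>". */
static void make_channel_id(char *buf, size_t len, const char *agent_uid)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	snprintf(buf, len, "%s-%ld-%ld.%09ld", agent_uid,
		(long)getpid(), (long)ts.tv_sec, ts.tv_nsec);
}

int main(void)
{
	char id[128];

	make_channel_id(id, sizeof(id), "chan6262");
	printf("channelId: %s\n", id);
	return 0;
}
{code}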

By: Joshua C. Colp (jcolp) 2016-10-31 09:55:05.430-0500

I'm marking this as not a bug for now; if it still occurs after your changes and testing, we can investigate again.