ASTERISK-21356: Segfault during bridge channel proxy inspection in a masquerade caused by an AMI Redirect of two channels

[Home]

Summary: ASTERISK-21356: Segfault during bridge channel proxy inspection in a masquerade caused by an AMI Redirect of two channels

Reporter: William luke (greenlightcrm) Labels:

Date Opened: 2013-03-29 11:08:07 Date Closed: 2013-04-22 11:45:13

Priority: Critical Regression?

Status: Closed/Complete Components: Core/Bridging Core/Channels

Versions: 11.3.0 Frequency of
Occurrence Frequent

Related
Issues:
must be completed before resolving ASTERISK-21773 Asterisk 1.8.22.0 Open Blockers

must be completed before resolving ASTERISK-21774 Asterisk 11.4.0 Open Blockers

is duplicated by ASTERISK-21793 Segmentation fault when dealing with Agent channels

Environment: CentOS 6.4 Attachments: ( 0) backtrace.txt
( 1) debuglog.tar.gz
( 2) jira_asterisk_21356_v11.patch

Description: Asterisk segfaulted during normal operation.

After speaking to mjordan on #asterisk he advised that it looks like the a redirect thread (Thread 1) is trying to access a bridged channel using a NULL pointer.

The scenario is that two channels are in a bridge, and then we execute a dual Redirect (via AMI) to put them both into a ConfBridge, so that we can bring a 3rd party into the call.
This happens successfully hundreds of times a day on this system, so must be some sort of edge case when it goes wrong.

I also *should* have a full debug log available, but it's quite a busy system and would be rather a large file (Circa 50Gb/day). I'm sure I could make it available if it would help.

{noformat}
[gl-agentconf]
exten => _X.,1,Set(ConfBridgeNumber=99${EXTEN})
same => n,NoOp(Putting AGENT into ConfBridge ${ConfBridgeNumber})
; same => n,Set(JITTERBUFFER(fixed)=default);Adaptive with defaults.
same => n,Answer()
same => n,ConfBridge(${ConfBridgeNumber})
same => n,NoOp(Putting AGENT back into their Dialling Hub. Exten: ${EXTEN})
same => n,Goto(gl-diallinghub,${EXTEN},agentinhub)

[gl-customerconf]
exten => _X.,1,Set(ConfBridgeNumber=99${EXTEN})
; same => n,Set(JITTERBUFFER(fixed)=default);Adaptive with defaults.
same => n,NoOp(Putting CUSTOMER into ConfBridge ${ConfBridgeNumber})
same => n,Answer()
same => n,ConfBridge(${ConfBridgeNumber})
{noformat}

Comments: By: Rusty Newton (rnewton) 2013-03-29 12:40:32.914-0500

If you can provide an *excerpt* from the full DEBUG Log around the time of the core dump that may be helpful.
By: William luke (greenlightcrm) 2013-03-29 13:17:51.166-0500

In the full debug there's millions of entries such as "[Mar 29 14:44:47] DEBUG[9921] audiohook.c: Write factory 0x7f1bac3b0618 was pretty quick last time, waiting for them".

I've removed any lines with "audiohook.c" in them to filter these out. In the debug log the actual segfault (and then Asterisk restart) happens at line 455398.

Please let me know if you require any additional details.

By: William luke (greenlightcrm) 2013-03-29 13:19:09.962-0500

Debug log from time of crash.
By: William luke (greenlightcrm) 2013-04-02 11:42:25.202-0500

This issue has occurred again today, so it's certainly happening a LOT more frequently that I'd initially suspected.

I'll not put another backtrace in, as it's almost identical looking over it.

One thing I did notice (not sure if it's relevance) is that in each crash the channel which seemed to lead to the null pointer being dereferenced after a dual redirect (the SIP/gl-asgw... channel) had started of life in a Queue. It had then been ripped out of the Queue by means of an AMI bridge. The two channels in the bridge had been speaking away for some time before then being Dual Redirected and crashing. This *may* just be coincidence, as I suspect 90% or so of our calls which are Dual Redirected in this manner come from a Queue, but thought I'd mention it anyway.

As this is happening with alarming frequency now, I'd be immensely grateful for any insight into ways I could even go about avoiding the issue pending a more longterm fix. I am just not familiar enough with the code, and especially the masquerade stuff to tell what might help and what might not.
Would doing a single redirect rather than a dual redirect, combined with some dialplan logic to get the second channel there be a possible fix? Am thinking we'd still end up inside "__ast_channel_masquerade" and so hit the same problem.

At first glance it seems the section of code where things go a bit crazy is checking for channel proxy (chan_agent is mentioned in the comments); since I'm not using any chan_agent's can I comment this code out?

Upon deeper inspection, this must be some sort of race condition (not sure how as we've locked both original and clongchan), but:

if (ast_channel_internal_bridged_channel(clonechan)
&& (ast_channel_internal_bridged_channel(clonechan) != ast_bridged_channel(clonechan))
&& (ast_channel_internal_bridged_channel(ast_channel_internal_bridged_channel(clonechan)) != clonechan)) {
final_clone = ast_channel_internal_bridged_channel(clonechan);
}

So the check is done first against ast_channel_internal_bridged_channel(clonechan) being null (and it must not be null at that point, or else lazy eval would stop right there).
Then by the time we evaluate the third condition, it's now returning null, which is then dereferenced when passed to ast_channel_internal_bridged_channel as an argument.

So.... could we do a check against null pointer inside ast_channel_internal_bridged_channel itself? Or would that be equally susceptible to whatever rogue thread is changing things without getting a lock first?

By: Richard Mudgett (rmudgett) 2013-04-02 12:37:54.893-0500

[^jira_asterisk_21356_v11.patch] - This should fix the crash.

The two party bridging loops were changing the bridge peer pointers without the channel locks held. Thus when ast_channel_massquerade() tested and used the pointer there is a small window of opportunity for the pointers to become NULL even though the masquerade code has the channels locked.

Please test and report.
By: William luke (greenlightcrm) 2013-04-03 04:17:08.925-0500

Thanks for the patch. The not locking while changing pointers fits in exactly with what it looked like was happening, I just had no idea how to find out where.

I've applied this to our system where we were seeing the issues last night, and will let you know how it runs for the next few days.

Again, many thanks for the swift turnaround on this!
By: William luke (greenlightcrm) 2013-04-08 07:11:36.347-0500

So far so good... 5 days 12 hours without a crash.

By: Rusty Newton (rnewton) 2013-04-22 09:31:25.628-0500

William, still looking good?

If so, we'll consider this fixed.
By: William luke (greenlightcrm) 2013-04-22 09:34:26.633-0500

Yup:

[root@asterisk ~]# asterisk -rx "core show uptime"
Setting max files open to 6506
System uptime: 2 weeks, 5 days, 15 hours, 6 minutes, 6 seconds
Last reload: 2 weeks, 5 days, 15 hours, 6 minutes, 6 seconds
[root@asterisk ~]#

I'm happy to consider this fixed as well - thanks again!

Will this make it into next release?

By: Richard Mudgett (rmudgett) 2013-04-22 11:09:00.354-0500

It will be in v11.5.0 since v11.4.0-rc1 has already been tagged.