Summary: ASTERISK-24521: [patch] Segfault due to null pointer in ast_bridged_channel
Reporter: Ben Smithurst (bensmithurst)
Labels:
Date Opened: 2014-11-14 08:15:04.000-0600
Date Closed: 2014-12-22 13:21:58.000-0600
Priority: Major
Regression?
Status: Closed/Complete
Components: Core/Channels
Versions: 11.8.1
Frequency of Occurrence: Occasional
Related Issues:
Environment:
Attachments: ( 0) ast_bridged_channel.diff
( 1) backtrace.txt
Description: We have observed a crash in ast_bridged_channel due to a null pointer. We do not know at present how to reproduce it; it is something we hadn't really seen before, but we then saw it several times in a single day.

The cause appears to be a bridged channel existing without a 'tech' field, so the ast_bridged_channel function dereferences a null pointer. The fix is quite simple and seems to work for us; we've seen no further occurrences of the crash.
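
For readers without the attachment, here is a minimal sketch of the kind of guard being described. It is not the attached diff: the `demo_*` types and functions are hypothetical stand-ins, not the Asterisk structures, and the layout (a `tech` pointer holding an optional `bridged_channel` callback) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real channel types; names are illustrative only. */
struct demo_channel;

struct demo_tech {
    const char *type;
    /* Optional callback: lets a driver report the "real" bridged peer. */
    struct demo_channel *(*bridged_channel)(struct demo_channel *chan,
                                            struct demo_channel *bridge);
};

struct demo_channel {
    const struct demo_tech *tech;   /* NULL in the situation described above */
    struct demo_channel *bridge;    /* peer this channel is bridged to, if any */
};

/* Return the channel bridged to chan, tolerating a peer whose tech is NULL. */
struct demo_channel *demo_bridged_channel(struct demo_channel *chan)
{
    struct demo_channel *bridged;

    if (chan == NULL) {
        return NULL;
    }
    bridged = chan->bridge;
    /* Check the tech pointer and the callback before calling through them. */
    if (bridged != NULL && bridged->tech != NULL
        && bridged->tech->bridged_channel != NULL) {
        bridged = bridged->tech->bridged_channel(chan, bridged);
    }
    return bridged;
}
```

With a peer whose `tech` is NULL, `demo_bridged_channel` returns the peer itself instead of dereferencing the null pointer. A guard like this only hides the symptom; it does not explain how a channel ended up without a tech in the first place.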

*Hopefully* I still have the backtrace/core file. If not, apologies: as I say, we don't know how to reproduce it.
Comments:
By: Ben Smithurst (bensmithurst) 2014-11-14 08:19:31.143-0600

Backtrace and dump of the offending channel structure attached; let me know if you need any other info from the core dump.

By: Matt Jordan (mjordan) 2014-11-14 14:06:19.254-0600

It looks like {{sip_hangup}} somehow got called on a channel that was bridged with a channel that was allocated but clearly not populated in any usable fashion - the bridged channel has no {{name}}, {{uniqueid}}, or most other properties that are assigned during allocation. The only thing it does have that shows where it came from is the {{appl}}/{{data}} field.

It may be that your patch is correct, but it's almost impossible to say. It's equally likely that some other part of the code in {{app_dial}} is not checking for errors appropriately, or is bridging two channels together earlier than it should. There could also be a race condition between fully populating the outbound channel with its information and some other event in {{chan_sip}} and {{app_dial}}.

If someone encounters the same issue your patch may help them, but I'm not sure it's the right solution to the problem.

It would be extremely helpful to get a log showing how this occurred, or any information that would help us understand how the system got into this state.

By: Ben Smithurst (bensmithurst) 2014-11-14 14:30:05.218-0600

I noticed this in the log before the crashes happened:

{code}
[2014-11-12 15:22:51] WARNING[51912][C-00000505] channel.c: No path to translate from SIP/sip-in-f2.voip.thw.gradwell.net-00001134 to IAX2/aa.bb.cc.dd:4569-281
[2014-11-12 15:22:51] WARNING[51912][C-00000505] channel.c: Can't make SIP/sip-in-f2.voip.thw.gradwell.net-00001134 and IAX2/aa.bb.cc.dd:4569-281 compatible
[2014-11-12 15:22:51] WARNING[51912][C-00000505] features.c: Bridge failed on channels SIP/sip-in-f2.voip.thw.gradwell.net-00001134 and IAX2/aa.bb.cc.dd:4569-281
{code}

(customer IP address removed)

By: Ben Smithurst (bensmithurst) 2014-11-14 14:36:53.359-0600

These are consistently the last 3 lines before each crash on that day (same IP address). Looking through our records, this is a customer number that was migrated onto this server just before the crashes started happening. Unfortunately, as it is a customer endpoint, there's only so much we can find out from that side about why our Asterisk doesn't like it. Based on the above, can you suggest any possibilities or areas we should look at when trying to reproduce this?

By: Matt Jordan (mjordan) 2014-12-21 21:10:37.966-0600

Are you sure you are running an unmodified version of Asterisk?

{{ast_channel_tech}} should *never* return NULL. It is set to a 'fake tech' before a channel is linked into the channels container to prevent this. Channel technologies are static, immutable structures, and are never removed from a channel. Grepping through the source for {{ast_channel_tech_set}} shows that it is never set to NULL.

Your check should never be needed, as {{ast_channel_tech(bridged)}} should - in an unmodified version of Asterisk - always return a valid pointer.
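
The 'fake tech' idea can be sketched as a placeholder-object pattern. This is an illustration under assumptions, not the Asterisk source: the `demo_*` names are hypothetical, and having the setter reject NULL is a design choice made for this sketch only.

```c
#include <stddef.h>

/* Hypothetical stand-in for a channel technology descriptor. */
struct demo_tech {
    const char *type;
    const char *description;
};

/* Static, immutable placeholder installed so the tech pointer is never NULL. */
static const struct demo_tech demo_null_tech = {
    .type = "NULL",
    .description = "Placeholder channel driver",
};

struct demo_channel {
    const struct demo_tech *tech;
};

/* Give a freshly allocated channel the placeholder before it becomes visible. */
void demo_channel_init(struct demo_channel *chan)
{
    chan->tech = &demo_null_tech;
}

/* Setter that falls back to the placeholder, keeping "tech is valid" invariant. */
void demo_channel_tech_set(struct demo_channel *chan, const struct demo_tech *tech)
{
    chan->tech = (tech != NULL) ? tech : &demo_null_tech;
}

/* Accessor: under the invariant above, this never returns NULL. */
const struct demo_tech *demo_channel_tech(const struct demo_channel *chan)
{
    return chan->tech;
}
```

Because every channel starts with the placeholder and the setter never stores NULL, callers of `demo_channel_tech` can skip the null check. A NULL or garbage value seen at runtime would then point to memory corruption or a use-after-free rather than a missing assignment.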

By: Ben Smithurst (bensmithurst) 2014-12-22 04:00:35.671-0600

We do have some modifications - I wouldn't expect any of them to cause this, but you never know. We have since observed the pointer being corrupt in other ways rather than null, so it could be something being freed too early and then overwritten. We've now updated to 11.15.0 anyway, so it might be worth closing this; we'll look into it further at our end if it carries on with the latest version.

By: Matt Jordan (mjordan) 2014-12-22 13:21:50.575-0600

If you are running a modified version of Asterisk, there's a good chance there isn't anything we can do to help. If you can find a way to reproduce the issue reliably - or if you can narrow down the problem - that would help a lot. As it is, I don't see how the patch provided would alleviate the problem, or how an {{ast_channel_tech}} is going to become a garbage pointer. I _suppose_ that could happen if you are unloading a channel driver while a call using that channel driver is in progress - although all channel drivers also bump their module ref count to prevent that very thing from occurring.
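
The module ref count safeguard can be illustrated with a minimal sketch. The `demo_*` names are hypothetical and do not correspond to the Asterisk module API, and the code is single-threaded for brevity; a real implementation would need atomic operations or locking around the counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a loadable channel driver module. */
struct demo_module {
    int usecount;   /* number of calls currently using this driver */
    bool loaded;    /* whether the driver's code and data are still valid */
};

/* Called when a call starts using the driver. */
void demo_module_ref(struct demo_module *mod)
{
    mod->usecount++;
}

/* Called when that call is torn down. */
void demo_module_unref(struct demo_module *mod)
{
    if (mod->usecount > 0) {
        mod->usecount--;
    }
}

/* Refuse to unload while any call still holds a reference. */
bool demo_module_unload(struct demo_module *mod)
{
    if (mod->usecount > 0) {
        return false;
    }
    mod->loaded = false;
    return true;
}
```

While a call holds a reference, `demo_module_unload` returns false and the driver's static structures stay valid, so a channel cannot be left pointing at unloaded memory through this route.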

I'll suspend this for now, barring more information.