[Home]

Summary:ASTERISK-26421: Segmentation Fault with ARI originate into mixing bridge with 43 clients
Reporter:Andrew Nagy (tm1000)Labels:
Date Opened:2016-09-28 19:53:48Date Closed:2016-10-19 08:33:41
Priority:MinorRegression?Yes
Status:Closed/CompleteComponents:
Versions:13.11.2 Frequency of
Occurrence
Related
Issues:
is related toASTERISK-26219 segfault in ast_manager_build_channel_state_string_prefix
is related toASTERISK-26550 segfault error 6 in asterisk[400000+2b8000]
Environment:Attachments:( 0) backtrace.txt
Description:Asterisk crashed in the middle of our dangerous demo. We setup a system that would get an SMS then make an outbound call and bridge that call in. At around 43 calls Asterisk was at about 177% cpu. Since we lost the Dangerous Demo because of an Asterisk crash we'd like this moved to our L4 support for immediate fixing. Then please bring back everyone for Astricon's Dangerous Demos so that we can try this again and win, of course.
.
.
.
.
.
.
.
.
.
.
.
.






Just kidding. This is not a priority I am just attaching the backtrace incase it's relevant for anyone.

Thanks for all you do and thank for Astricon and Asterisk!
Comments:By: Asterisk Team (asteriskteam) 2016-09-28 19:53:48.777-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: xrobau (xrobau) 2016-09-28 19:59:00.526-0500

Yo dawg, I heard you like internal_ao2_refs, so I put cdr_object_dtor in your internal_ao2_refs so you could internal_ao2_refs while you cdr_object_dtor

By: George Joseph (gjoseph) 2016-10-03 08:13:02.557-0500

I think this backtrace holds the record for the number of frames in a single stack: 6567 :)


By: Mark Michelson (mmichelson) 2016-10-12 16:05:17.265-0500

This issue has made it up the queue and is now assigned to me. When I saw how many frames were in the stack trace, my initial thought was that it had to be a loop in the CDR list. *It's not*. There actually is a chain of distinct CDRs that is that long somehow. The unreferencing pattern for a CDR is to call the destructor for the CDR, which unreferences the next CDR in the chain, which calls the destructor for that CDR, and repeat down the line. Usually, this chain is not very long so this pattern does not overflow the stack. In this particular case, though, there somehow got to be 1639 CDRs in the chain. That resulted in the 6500+ frame stack trace you saw and the eventual stack overflow.

So, there are two avenues to go down when solving this:
# Change the nature of CDR destruction. Rather than growing the stack as CDRs are destroyed, use an iterative approach, performing all destruction within the same stack frame. This probably is not incredibly difficult, and it will solve the case of the crash happening.
# Figure out how the chain got to be that long, and determine if that is a valid case. For this, I think I'll need to have some smaller test case to be able to reproduce the issue. I understand the basic premise that Asterisk was used to place calls into a mixing bridge. And I'm pretty sure based on the stack trace that we are talking about an ARI mixing bridge here and not ConfBridge. I'm curious if you can describe (just with words, no code necessary -- yet) what your dangerous demo was doing under the hood. From the stack trace, I see some channels running the ARI "smsconference" application, but I can't tell what types of channels these are (Local?, SIP?). I also see some SIP channels dialing out to PSTN numbers. I'm not sure where these SIP channels are connected to, though, on the other side (presumably they're in the ARI mixing bridge?).

By: Mark Michelson (mmichelson) 2016-10-12 16:10:20.803-0500

I have a follow-up question. I noticed that in the backtrace, the channel unique ID that we used to look up the root CDR in the chain was "74a06be5-f378-4f5e-a9c8-c688e6863589". That isn't the pattern that Asterisk uses for setting channel unique IDs, so I'm guessing that the unique ID was set by the ARI application instead. Is it possible, somehow, that you used the same unique ID for every channel that you originated? If so, I could see how we would end up with a single long chain of CDRs rather than lots of small chains of CDRs.

By: Mark Michelson (mmichelson) 2016-10-17 09:23:35.080-0500

While I don't have confirmation that using the same unique ID is what tanked your case, I have confirmed that when I run a test where I use the same uniqueID for every channel, I get a loooong chain of CDR destruction as soon as one hangs up.

By: Andrew Nagy (tm1000) 2016-10-17 10:50:24.441-0500

I don't see us generating that uuid anywhere.

However in the official node library for ari (which we use). It does use node-uuid which generates a uuid:

https://github.com/asterisk/node-ari-client/blob/0d31568c3b6ed895f988b954716c5d1233f8b1a4/package.json#L30
https://github.com/asterisk/node-ari-client/blob/99146554786d8e7d8a5e868c0640d93e22952c3a/lib/resources.js#L84


Here is the source of our application:
{code}
var ampconf = {'FPBX_ARI_USER':'freepbxuser', 'FPBX_ARI_PASSWORD': '074cdbe438331c9622910f803249ac99'};
var express = require('express');
var Q = require('q');
var app = express();
var ari = null;
var client = require('ari-client');
var bridgelist = [];
var channellist = [];
//Express Server
app.get('/', function(req, res) {
   var bridge = req.query['bridge'];
   var number = req.query['number'];
 createBridge({'bridgeName':bridge,'number':number})
 .then(originateCall)
 .catch(function(err){console.log(err)});
   res.send('frobulated');
});
app.listen(3001);
console.log('Listening on port 3001...');
//Ari Connection
//
client.connect('http://localhost:8088', ampconf["FPBX_ARI_USER"], ampconf["FPBX_ARI_PASSWORD"])
 .then(function (ariconn) {
   ari = ariconn
     ari.start('smsconference');
 })
 .catch(function (err) {
   throw err;
 });
function createBridge(options){
 var bridgeName = options['bridgeName'];
 var number = options['number'];
 var deferred = Q.defer();
 console.log(bridgeExist(bridgeName));
 if(bridgeExist(bridgeName)){
   deferred.resolve({'bridgeName':bridgeName, 'number':number});
   return deferred.promise;
 }
 bridgelist[bridgeName] = bridgeName;
 ari.bridges.createWithId({
   bridgeId: bridgeName,
   type: 'mixing, dtmf_events, proxy_media'
 })
 .then(function (bridge) {
   console.log('bridge created ' + bridge.id);
       /*
       bridge.on('ChannelLeftBridge', function(event, instances) {
           if(bridge.channels.length === 0){
               bridge.destroy(function (err) {
                   delete bridgelist[bridgeName];
               });
           }
       });
       */
   deferred.resolve({'bridgeName':bridgeName, 'number':number});
 })
 .catch(function (err) {
   deferred.reject(new Error(err));
 });
 return deferred.promise;
}
function bridgeExist(bridgeName){
 if(typeof bridgelist[bridgeName] == "undefined"){
   return false;
 } else{
   return true;
 }
}
function originateCall(options){
 var bridgeName = options['bridgeName'];
 var number = options['number'];
 var deferred = Q.defer();
 var channel = ari.Channel();
 channel.originate({endpoint: 'local/'+number+'@from-internal', app: 'smsconference', appArgs: 'dialed', formats:'ulaw'})
 .then(function (channel) {
   channel.answer(function(err){
     channel.on('StasisStart', function (event, channel) {
       if(typeof channellist[bridgeName] == "undefined"){
         channellist[bridgeName] = []
       }
       channellist[bridgeName].push(channel.id);
       addToBridge({'bridgeName':bridgeName, 'number':channel});
     });
     deferred.resolve({'bridgeName':bridgeName, 'number':channel});
   });
   console.log("Answered "+ number);
 })
 .catch(function (err) {
   deferred.reject(new Error("Couldn't call "+ number));
 });
 return deferred.promise;
}
function addToBridge(options){
 var bridgeName = options['bridgeName'];
 var channelID = options['number'];
 var deferred = Q.defer();
 ari.bridges.addChannel({
   bridgeId: bridgeName,
   channel: channellist[bridgeName]
 })
 .then(function () {
   console.log("added "+channelID.id+" to "+bridgeName);
   deferred.resolve({'bridgeName':bridgeName, 'number':number});
 })
 .catch(function (err) {
   deferred.reject(new Error("Couldn't add "+channelID.id+" to "+bridgeName));
 });
 return deferred.promise;
}
{code}

By: James Finstrom (TheJames) 2016-10-17 13:13:15.767-0500

The basic flow.

User sends SMS with the word "Danger" we call back the number that sent the SMS. This went through the normal freepbx dialplan putting them into an ARI mixing bridge.

Example:

480-555-1212 text "Danger"

A Hook calls express passing it "Danger" and "4805551212"

The script checks for a bridge with the id "Danger" and if it doesn't exist it makes a bridge called "danger"

Then it originates a call to local/4805551212@from-internal which goes through the freepbx dialplan landing ultimately at a chan_sip trunk.  The channel is dropped in to the brudge with all other users who sent the "danger" text.

Relevant code:

origination
{code}
 channel.originate({endpoint: 'local/'+number+'@from-internal', app: 'smsconference', appArgs: 'dialed', formats:'ulaw'})
{code}
Bridge Creation
{code}
//bridgename "danger"
ari.bridges.createWithId({
   bridgeId: bridgeName,
   type: 'mixing, dtmf_events, proxy_media'
 })
{code}


By: Mark Michelson (mmichelson) 2016-10-18 17:20:48.375-0500

I have two sets of reviews up relating to this issue.

https://gerrit.asterisk.org/#/c/4150/ is a bigger patch that fixes the channel unique ID duplication issue. It's tangentially related to this issue since I found that re-using the same unique ID would result in big CDR chains. I don't necessarily think that's what caused your issue, but it's still a problem.

https://gerrit.asterisk.org/#/c/4151/ more directly addresses this issue with a simple change to how CDRs are destroyed. Rather than following a recursive pattern, we now use an iterative pattern so the stack will not grow out of control no matter the length of the CDR chain.