Summary: ASTERISK-17969: Asterisk 1.8.2.3 crashes when dialling from IAX2 to IAX2
Reporter: ppower (ppower)
Labels:
Date Opened: 2011-06-06 14:20:22
Date Closed: 2014-02-13 22:26:59.000-0600
Priority: Major
Regression?
Status: Closed/Complete
Components: Channels/chan_iax2
Versions: 1.8.2
Frequency of Occurrence: Constant
Related Issues: is duplicated by ASTERISK-19597 Failure to pass NULL data pointer with AST_CONTROL_HOLD frame causes crash when MOH is started
Environment: OS: gentoo 2.6.36; 8GB RAM; intel 2.6GHz dual quad core hyper threaded
Attachments: (0) backtrace.txt, (1) valgrind.txt
Description:
When I do something like this:
IAX -> dial(some_number)

some_number,1,Dial(SIP/XXX,10) ; do not answer
some_number,n,Answer()
some_number,n,Wait(1)
some_number,n,Dial(IAX/some_number_on_the_same_server,60)

Asterisk crashes. When I removed the Answer and Wait, there was no crash.

A long story brought me here. I am trying to upgrade our systems from 1.2.31 to 1.8.2.3.
Running 30 to 40 phones on a server worked and continues to work just fine.
At roughly triple that number, periodic problems started: SIP registrations failed, IAX calls failed, servers became unresponsive at the CLI, "Max retries exceeded to host XXX on IAX2/XXX" messages showed up, and killing Asterisk and restarting it was the only way to get back control.

Since this problem appears to be IAX related, a little IAX torture test was created.
When IAX calls are directed back to the server the number of active IAX channels goes up.
Eventually the "Max retries exceeded" messages start appearing, the CPU gets very busy, and vmstat reports a very large number of context switches. If Asterisk retains some control, the number of IAX channels sometimes goes down.

After looking at other posted issues (ASTERISK-16711, ASTERISK-16258, ASTERISK-13156 and possibly others), I now have only the DAHDI timing module loaded and the number of IAX threads set to 1. I am left with the short dialplan shown above and a regularly crashing server. When IAX debugging is turned on, Asterisk does not crash. When Asterisk retains control, I notice that CPU usage and context switching increase significantly when one dial ends and another begins. The number of IAX channels required to do this is less than 40.

DONT_OPTIMIZE and DEBUG_THREADS are set ON

I have a core dump and will provide a back trace when i figure out how to use this jira thing :)
here is a snippet:
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7eb34518d9 in free () from /lib/libc.so.6
#0  0x00007f7eb34518d9 in free () from /lib/libc.so.6
No symbol table info available.
#1  0x00007f7e97ac8fc6 in free_signaling_queue_entry (s=0x7f7ea4e0c650) at chan_iax2.c:1823
No locals.
#2  0x00007f7e97ac9027 in send_signaling (pvt=0x7f7ea5826a68) at chan_iax2.c:1835
       s = 0x7f7ea4e0c650
#3  0x00007f7e97af2ed5 in socket_process (thread=0x7f7e867923f0) at chan_iax2.c:10252

This is a very reproducible problem.
Let me know what I can do to help out with this. Please let me know what I may need to do differently (first time posting an issue).
Comments:By: ppower (ppower) 2011-06-09 13:56:43.149-0500

I have found a case where signaling frames get queued up in an IAX channel to be processed later and when they are processed, asterisk crashes.
The crash happens in free_signaling_queue_entry, which says:
   ast_free(s->f.data.ptr);
   ast_free(s);
This code assumes that s->f.data.ptr is valid. In my case s->f.data.ptr was NULL.
I have changed the code on my box to deal with this. Perhaps the main repository could be remedied as well.




By: ppower (ppower) 2011-06-09 14:08:41.822-0500

My apologies, s->f.data.ptr was not NULL.  Crashing continues.

By: Richard Mudgett (rmudgett) 2011-06-09 14:14:32.721-0500

Freeing a NULL pointer is safe so you do not need to check if the pointer is NULL before calling ast_free()/free().
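
To illustrate the point about NULL: the C standard defines free(NULL) as a no-op, so the dangerous case is a non-NULL pointer that did not come from the allocator. A minimal sketch follows; `struct entry`, `entry_new` and `entry_free` are hypothetical names for illustration and are not the chan_iax2 types or functions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a queue entry; not the chan_iax2 type. */
struct entry {
    void *data;   /* payload owned by the entry, or NULL */
    int datalen;  /* payload length in bytes */
};

/* Allocate a zeroed entry. calloc clears every byte, so on mainstream
 * platforms data starts out as a null pointer and datalen as 0. */
struct entry *entry_new(void)
{
    return calloc(1, sizeof(struct entry));
}

/* Safe when e->data is NULL, because free(NULL) does nothing.
 * Not safe if e->data holds a stale or foreign pointer. */
void entry_free(struct entry *e)
{
    if (!e) {
        return;
    }
    free(e->data);
    free(e);
}
```

So a NULL check before the free adds nothing; what matters is that the pointer is either NULL or a block the entry actually owns.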

By: ppower (ppower) 2011-06-09 16:57:31.856-0500

I am about to be gone from work for a week.
This is what the day has wrought:
When crashes occur queue_signaling_frame does get called, and when the subsequent send_signaling gets called, the crash happens.
Sometimes the crash does not occur (after a new compile with debugging statements put in). In this case queue_signaling_frame does NOT get called.

One thing i noticed that does not seem right in queue_signaling_frame:
memcpy(&new->f, f, sizeof(new->f)); /* copy ast_frame into our queue entry */

if (new->f.datalen) { /* if there is data in this frame copy it over as well */
   if (!(new->f.data.ptr = ast_calloc(1, new->f.datalen))) {
       free_signaling_queue_entry(new);
       return -1;
   }
   memcpy(new->f.data.ptr, f->data.ptr, sizeof(*new->f.data.ptr));
}

If the ast_calloc for new->f.data.ptr fails, then free_signaling_queue_entry is called, which will try to free data.ptr, something that was not properly allocated to begin with.

Since I am editing this after seeing the previous post: in that case the ast_calloc return would be NULL and therefore safe to free.
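
One way to look at the quoted snippet: the first memcpy copies the pointer field along with everything else, so when datalen is zero the new entry keeps whatever pointer the source frame carried. The sketch below is an illustration only. It uses a hypothetical `struct frame` (much simpler than `struct ast_frame`) and a hypothetical `frame_copy`, and it assumes the copy should own its payload: it copies `datalen` bytes and clears the pointer when there is no payload.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified frame; the real struct ast_frame has more fields. */
struct frame {
    int datalen;  /* payload length in bytes */
    void *data;   /* payload; a shallow copy does not own it */
};

/* Deep-copy src into dst. Returns 0 on success, -1 on allocation failure.
 * On return, dst->data is either NULL or a heap block owned by dst. */
int frame_copy(struct frame *dst, const struct frame *src)
{
    memcpy(dst, src, sizeof(*dst));  /* dst->data now aliases src->data */

    if (dst->datalen <= 0 || !src->data) {
        dst->data = NULL;            /* drop the borrowed pointer */
        dst->datalen = 0;
        return 0;
    }

    dst->data = malloc((size_t)dst->datalen);
    if (!dst->data) {
        dst->datalen = 0;
        return -1;
    }
    /* Copy the whole payload: the length is datalen, not sizeof(*ptr). */
    memcpy(dst->data, src->data, (size_t)dst->datalen);
    return 0;
}
```

With this shape, a later free of `dst->data` is always either a no-op or a free of memory the copy allocated itself.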

see you in a week.


By: Russell Bryant (russell) 2011-06-14 19:40:10.992-0500

Running Asterisk under valgrind and trying to reproduce this would be helpful.

By: ppower (ppower) 2011-06-17 12:11:06.971-0500

I followed some instructions I found on wiki.asterisk.org to get the valgrind output (attached as valgrind.txt). I hope it is helpful.

By: ppower (ppower) 2011-06-20 15:21:01.390-0500

I think i have finally tracked this down.
The crash happens in free_signaling_queue_entry, which says:
ast_free(s->f.data.ptr);
ast_free(s);

Sometimes s->f.data.ptr is not NULL while s->f.datalen is zero.
So when I check that datalen > 0 before attempting ast_free, the crashing ceases.
I have not tried to track down why f.data.ptr would be non-NULL with a datalen of zero.
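
The workaround described above can be sketched as follows. This is an illustration with hypothetical types and a hypothetical `queue_entry_free`, not the change actually made on the reporter's box: the payload pointer is only trusted, and freed, when `datalen` says a payload exists.

```c
#include <stdlib.h>

/* Hypothetical, simplified types; not the chan_iax2 definitions. */
struct frame {
    int datalen;  /* payload length in bytes */
    void *data;   /* payload pointer, possibly stale when datalen == 0 */
};

struct queue_entry {
    struct frame f;
};

/* Free a heap-allocated queue entry.
 * Returns 1 if the payload was freed, 0 if it was skipped. */
int queue_entry_free(struct queue_entry *s)
{
    int freed = 0;

    if (!s) {
        return 0;
    }
    if (s->f.datalen > 0) {
        free(s->f.data);  /* only free when a payload is claimed */
        freed = 1;
    }
    free(s);
    return freed;
}
```

A guard like this avoids freeing a stale pointer, but the stale pointer is still there; it addresses the symptom rather than where the pointer came from.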



By: Matt Jordan (mjordan) 2014-02-13 22:26:48.892-0600

I have a feeling that this was actually a duplicate of ASTERISK-19597.

In ASTERISK-19597, it was found that an AST_CONTROL_HOLD frame could be passed without a suggested MoH class. This would leave junk in the data pointer, with a datalen of 0. The relevant portion from the patch on that issue shows where this was fixed:

{noformat}
+    /*
+     * Clear fr->af.data if there is no data in the buffer.  Things
+     * like AST_CONTROL_HOLD without a suggested music class must
+     * have a NULL pointer.
+     */
+    if (!fr->af.datalen) {
+        memset(&fr->af.data, 0, sizeof(fr->af.data));
+    }
{noformat}
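
The idea in that excerpt can be tried in isolation. The sketch below assumes a hypothetical frame type whose payload field is a union; the names `frame_data`, `frame` and `frame_normalize` are illustrative and are not Asterisk's definitions. It clears the whole union when `datalen` is zero, so later code that frees `data.ptr` sees NULL.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical payload union, loosely modelled on a frame data field. */
union frame_data {
    void *ptr;
    uint32_t uint32;
};

/* Hypothetical, simplified frame. */
struct frame {
    int datalen;            /* payload length in bytes */
    union frame_data data;  /* payload pointer or inline value */
};

/* If the frame carries no payload, wipe the data union so that no
 * leftover pointer value survives. On mainstream platforms an
 * all-zero pointer is a null pointer. */
void frame_normalize(struct frame *f)
{
    if (!f->datalen) {
        memset(&f->data, 0, sizeof(f->data));
    }
}
```

Normalizing where the frame is produced means every consumer, including a queue that later frees the payload, can rely on "no payload" meaning a NULL pointer.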

I'm going to go ahead and close this out as a duplicate of ASTERISK-19597. If you or someone else is still running into this problem with IAX2 to IAX2 calls, please let a bug marshal know in #asterisk-bugs and we'll reopen the issue.