[Home]

Summary:ASTERISK-23328: Asterisk crash in ast_cdr_setapp() at cdr.c
Reporter:David Hajek (hajekd)Labels:
Date Opened:2014-02-19 06:21:23.000-0600Date Closed:2014-03-26 14:45:08
Priority:CriticalRegression?
Status:Closed/CompleteComponents:CDR/cdr_adaptive_odbc CDR/General
Versions:1.8.25.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:OS: Linux CentOS 5.5 Kernel: 2.6.18-348.16.1.el5PAE Hardware: Dell PowerEdge R210 IIAttachments:( 0) backtrace_20140304.txt
( 1) backtrace.txt
( 2) callfile.txt
( 3) extensions_cc.conf
( 4) full_20140304.log
Description:We are experiencing random asterisk crash (segfaults) on a busy server, which is doing about 50k calls a day. We see asterisk crash every other day.

This server is running a callcenter and in addition to regular in/out calls this server is processing callfiles (about 3 per second). Some AGI dialplan scripts are processed using FastAGI over tcp sockets.

We see that all crashes are in ast_cdr_setapp() - we tried both 1.8.24 and 1.8.25.

For CDR we use cdr_adaptive_odbc with MySQL backend.
Comments:By: David Hajek (hajekd) 2014-02-19 06:22:46.246-0600

Backtrace, compiled with DONT_OPTIMIZE and BETTER_BACKTRACES

By: Matt Jordan (mjordan) 2014-02-19 08:12:23.658-0600

If you still have the {{core}} file, using gdb, can you print out what chan points to in frame 1? The commands should be:

{noformat}
# frame 1
# p *chan
{noformat}

Somehow, when the channel is returning from the AGI it no longer has a valid CDR object on it. I can see that occurring for two reasons:
# The channel has already been destroyed, which would indicate a ref counting problem somewhere. If that happens to be the case, it will be important to understand the sequence of events that led up to the crash. A DEBUG log may be necessary, or at least steps that reproduce the problem.
# A masquerade occurred and the channel that replaced the channel in the AGI never had a CDR object allocated on it - or something equally strange.

Dumping what chan pointed to will help determine what occurred.

By: David Hajek (hajekd) 2014-02-19 08:22:21.033-0600

Matt, here's the output:

{noformat}
Program terminated with signal 11, Segmentation fault.
#0  0x0809f696 in ast_cdr_setapp (cdr=0xc8, app=0xc47e5e "AGI", data=0xb61339d0 "GET") at cdr.c:826
826                     if (!ast_test_flag(cdr, AST_CDR_FLAG_LOCKED)) {
(gdb) frame 1
#1  0x00c45828 in agi_handle_command (chan=0xa46aa3c8, agi=0xb61342ac, buf=0xb61339d0 "GET", dead=0) at res_agi.c:3381
3381                            ast_cdr_setapp(chan->cdr, "AGI", buf);
(gdb) p *chan
$1 = {tech = 0x7f521c0, tech_pvt = 0x9b52de8, music_state = 0x0, generatordata = 0x0, generator = 0x0, _bridge = 0x0, masq = 0x0, masqr = 0x0,
 blockproc = 0x81d61a2 "ast_waitfor_nandfds", appl = 0x9843218 "AGI", data = 0xb61345b6 "agi://10.230.2.102/cc/dialer_set_user.php", sched = 0xa41ff5d8, stream = 0x0,
 vstream = 0x0, timingfunc = 0, timingdata = 0x0, pbx = 0x0, writetrans = 0xa41d6750, readtrans = 0xa4634ba0, audiohooks = 0x0, framehooks = 0x0, cdr = 0x9809bb8, zone = 0x0,
 monitor = 0x0, __field_mgr_pool = 0xa4086a28, name = 0xa4086a66 "SIP/680-000450bb", language = 0xa4086a8c "cs", musicclass = 0x8234646 "", accountcode = 0x8234646 "",
 peeraccount = 0x8234646 "", userfield = 0x8234646 "", call_forward = 0x8234646 "", uniqueid = 0xa4086a3a "1392806826.1483358", linkedid = 0xa4086a50 "1392806808.1483074",
 parkinglot = 0xa4086a9c "default", hangupsource = 0x8234646 "", dialcontext = 0x8234646 "", __field_mgr = {last_alloc = 0xa4086a9c "default", embedded_pool = 0x0},
 whentohangup = {tv_sec = 0, tv_usec = 0}, blocker = 3054762896, dialed = {number = {str = 0x0, plan = 0}, subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000',
     valid = 0 '\000'}, transit_network_select = 0}, caller = {id = {name = {str = 0x9822c90 "HodoninCC 680", char_set = 1, presentation = 0, valid = 1 '\001'}, number = {
       str = 0x9b39470 "680", plan = 0, presentation = 0, valid = 1 '\001'}, subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000', valid = 0 '\000'},
     tag = 0x98c3bd0 ""}, ani = {name = {str = 0x0, char_set = 1, presentation = 0, valid = 0 '\000'}, number = {str = 0x0, plan = 0, presentation = 0, valid = 0 '\000'},
     subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000', valid = 0 '\000'}, tag = 0x0}, ani2 = 0}, connected = {id = {name = {
       str = 0x9accfa0 "Auto dial - 606264598", char_set = 1, presentation = 0, valid = 1 '\001'}, number = {str = 0x99dd2d8 "606264598", plan = 0, presentation = 0,
       valid = 1 '\001'}, subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000', valid = 0 '\000'}, tag = 0x0}, ani = {name = {str = 0x0, char_set = 1,
       presentation = 0, valid = 0 '\000'}, number = {str = 0x9a148c0 "606264598", plan = 0, presentation = 0, valid = 1 '\001'}, subaddress = {str = 0x0, type = 0,
       odd_even_indicator = 0 '\000', valid = 0 '\000'}, tag = 0x0}, ani2 = 0, source = 0}, redirecting = {from = {name = {str = 0x0, char_set = 1, presentation = 0,
       valid = 0 '\000'}, number = {str = 0x0, plan = 0, presentation = 0, valid = 0 '\000'}, subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000',
       valid = 0 '\000'}, tag = 0x0}, to = {name = {str = 0x0, char_set = 1, presentation = 0, valid = 0 '\000'}, number = {str = 0x0, plan = 0, presentation = 0,
       valid = 0 '\000'}, subaddress = {str = 0x0, type = 0, odd_even_indicator = 0 '\000', valid = 0 '\000'}, tag = 0x0}, count = 0, reason = 0}, dtmff = {frametype = 0,
   subclass = {integer = 0, codec = 0}, datalen = 0, samples = 0, mallocd = 0, mallocd_hdr_len = 0, offset = 0, src = 0x0, data = {ptr = 0x0, uint32 = 0,
     pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 0, tv_usec = 0}, frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0}, varshead = {
   first = 0xa40ab030, last = 0x98d3fd8}, callgroup = 0, pickupgroup = 0, readq = {first = 0xa41bb908, last = 0xa68029f8}, jb = {conf = {flags = 0, max_size = -1,
     resync_threshold = -1, impl = '\000' <repeats 11 times>, target_extra = -1}, impl = 0x0, jbobj = 0x0, timebase = {tv_sec = 0, tv_usec = 0}, next = 0, last_format = 0,
   logfile = 0x0, flags = 0}, dtmf_tv = {tv_sec = 0, tv_usec = 0}, datastores = {first = 0xa41ce0a8, last = 0x98bc888}, autochans = {first = 0x0, last = 0x0}, insmpl = 0,
 outsmpl = 0, fds = {920, 927, -1, -1, -1, -1, -1, -1, 742, -1}, _softhangup = 0, fdno = -1, streamid = -1, vstreamid = 0, oldwriteformat = 0, timingfd = 742,
 _state = AST_STATE_UP, rings = 0, priority = 2, macropriority = 1, amaflags = 3, adsicpe = AST_ADSI_UNKNOWN, fin = 6, fout = 0, hangupcause = 16, flags = 1536, alertpipe = {-1,
   -1}, nativeformats = 4, readformat = 64, writeformat = 64, rawreadformat = 4, rawwriteformat = 4, emulate_dtmf_duration = 0, visible_indication = 0, transfercapability = 0,
 bridge = 0x0, timer = 0xa40983a0, context = "macro-qmagent-answer-inbound", '\000' <repeats 51 times>, exten = "s\000\060", '\000' <repeats 76 times>,
 macrocontext = "qmqueue", '\000' <repeats 72 times>, macroexten = "680", '\000' <repeats 76 times>, emulate_dtmf_digit = 0 '\000', sending_dtmf_digit = 0 '\000',
 sending_dtmf_tv = {tv_sec = 0, tv_usec = 0}}
(gdb)
{noformat}

By: David Hajek (hajekd) 2014-02-19 08:27:11.240-0600

Here's *chan->cdr from frame 1

{noformat}
(gdb) p *chan->cdr
$1 = {clid = "\"HodoninCC 680\" <680>", '\000' <repeats 58 times>, src = "680", '\000' <repeats 76 times>, dst = "s", '\000' <repeats 78 times>,
 dcontext = "from-internal", '\000' <repeats 66 times>, channel = "SIP/680-000450bb", '\000' <repeats 63 times>, dstchannel = '\000' <repeats 79 times>,
 lastapp = '\000' <repeats 79 times>, lastdata = '\000' <repeats 79 times>, start = {tv_sec = 1392806826, tv_usec = 732459}, answer = {tv_sec = 0, tv_usec = 0}, end = {
   tv_sec = 0, tv_usec = 0}, duration = 0, billsec = 0, disposition = 1, amaflags = 3, accountcode = '\000' <repeats 19 times>, peeraccount = '\000' <repeats 19 times>,
 flags = 0, uniqueid = "1392806826.1483360", '\000' <repeats 131 times>, linkedid = "1392806808.1483074", '\000' <repeats 13 times>, userfield = '\000' <repeats 255 times>,
 sequence = 1942255, varshead = {first = 0x9e56208, last = 0x9e56208}, next = 0x0}
{noformat}

By: David Hajek (hajekd) 2014-02-19 08:36:40.474-0600

This is example of callfile, we're using.

By: David Hajek (hajekd) 2014-02-19 08:44:34.402-0600

My feedback in comments. Next time will sent here. :)

By: Matt Jordan (mjordan) 2014-02-24 09:46:51.451-0600

Hm. This doesn't make a lot of sense.

In your crash, the {{cdr}} parameter in {{ast_cdr_setapp}} is clearly a bad pointer.

{noformat}
#0  0x0809f696 in ast_cdr_setapp (cdr=0xc8, app=0xc47e5e "AGI", data=0xb61339d0 "GET") at cdr.c:826
826                     if (!ast_test_flag(cdr, AST_CDR_FLAG_LOCKED)) {
{noformat}

{{0xc8}} is clearly not going to be valid memory. I would expect, then, that someone had a junk CDR sitting on their structure when they passed it off to {{ast_cdr_setapp}}. However, your dumping of {{*chan->cdr}} in frame 1 shows that isn't the case. The CDR in frame 1 - which is the thing passed to {{ast_cdr_setapp}} - is perfectly fine and valid.

That does appear to be the pointer value that will be passed to {{ast_cdr_setapp}} as well - from {{res_agi}}:

{noformat}
if (chan->cdr && !ast_check_hangup(chan) && strcasecmp(argv[0], "EXEC"))
ast_cdr_setapp(chan->cdr, "AGI", buf);
{noformat}

The only other thing that it could be at that point is some CDR on the chain of CDRs has gone bad. However, that's not going to be the case here either - if we look at the next pointer on {{chan->cdr}}, we can see that it points to NULL, and not {{0xc8}}.

I'm not sure how your system is getting into this state, but I can't see where the bug may be here. I'm not sure how the cdr pointer decided to jump off to a random location in memory.

As a random thing to look at, are you sure that all of the modules that are loaded on your system were compiled against the same version of Asterisk?


By: Matt Jordan (mjordan) 2014-02-24 09:48:55.596-0600

I threw this back into Triage because quite frankly, I have no idea how this got into this state - and I have no idea how we would go about fixing something that doesn't seem to mesh with what the backtrace is showing us.

By: David Hajek (hajekd) 2014-02-24 10:33:04.220-0600

We do some investigation as well. Our dialplan was this:

;; This context is called from Queue()
;; Our member interface is similar to this: Local/10001@dakqueue/n
[dakqueue]
exten => _X.,1,DoSomething()
exten => _X.,n,Dial(...,M(answer))
exten => h,1,DoCleanup()

[macro-answer]
exten => s,1,DoSometing()
exten => s,2,Macro(somethingelse)
exten => s,3,Gosub(cleanup)

We rewrote macro-answer to remove Macro(somethingelse) and Gosub(cleanup). We think that calling Gosub() from Macro() can be a bad idea. Can you confirm? Can you imagine this can be the issue?

We will keep running this configuration for couple of days and I will get back with results. Next step can be to completly get rid of Macro() and use Gosub() only.





By: Matt Jordan (mjordan) 2014-02-27 09:07:01.133-0600

Hm. Your dialplan doesn't seem to mesh with the backtrace either, as the crash occurred while the channel was in an AGI.

I'm not aware of any problems with calling a GoSub from a Macro; however, there is a reason why Macros have been deprecated completely in Asterisk 11+. They're dangerous. They cause stack smashing issues - and while your backtrace doesn't appear to show any evidence of the usual stack smashing that occurs when Macros do this, remove Macros is always a good idea.

Was the AGI called from the macro?


By: David Hajek (hajekd) 2014-02-27 09:28:00.987-0600

Yes, AGI (where it crashed) was called from [macro-answer] context. Actually there are two AGI scripts called.

;; This is how it looks when it crashes
[macro-answer]
exten => s,1,AGI(script1)
exten => s,n,AGI(cc/dialer_set_user.php) ;; It always crashed here
exten => s,n,Macro(somework)
exten => s,n,Gosub(somework)

;; This is after rewriten
[macro-answer]
exten => s,1,AGI(script1)
exten => s,n,AGI(cc/dialer_set_user.php)
exten => s,n,SomeDialPlanWorkHere   ; No Macro
exten => s,n,SomeOtherDialPlanWorkHere ; No Gosub

Its a bit early to do some final resolutions, but its running for 6 days without crash.

Regarding calling Gosub from Macro I'm refering to http://www.voip-info.org/wiki/view/Asterisk+cmd+Macro

"You really shouldn't be calling a Gosub routine from Macro. We've already had to deal with some really odd interactions between the two. If you're going to make the jump to Gosub, go completely over and only use Gosub. Don't use Macro in conjunction with Gosub (Tilghman, May 2010)."







By: David Hajek (hajekd) 2014-03-04 15:26:36.441-0600

Unfortunatelly we got crash today, after 10 days. It crashed again in ast_cdr_setapp().

By: Rusty Newton (rnewton) 2014-03-04 19:25:05.384-0600

Can you show a backtrace, logs and dialplan from the latest crash?

You might also try to get "agi set debug on" output for the next crash. Make sure that is captured along with a full log file where DEBUG and VERBOSE levels are turned up to 5 or higher. [See the wiki instructions|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information]

By: David Hajek (hajekd) 2014-03-09 14:55:48.053-0500

Backtrace for crash from 04/03/2014

By: David Hajek (hajekd) 2014-03-09 15:10:19.258-0500

Asterisk full log from the crash 2014/03/04. Crash occured at Mar  4 14:26:43, and after this crash astdb was corrupted. Safe asterisk was not able to run asterisk again. We had to manually repair the astdb from the backup.

By: David Hajek (hajekd) 2014-03-09 15:21:44.586-0500

Attached is the relevant dialplan. Context [qmqueue] is called from Queue() when call is directed to available Queue member, which is logged with this interface to the Queue():

Example: Local/10001@qmqueue/n

In the [qmqueue] context is called SIP interface of the logged member and Macro(qmagent-answer-inbound) is called and crash occures. Every time on the same place. Dialplan code for this macro is also in uploaded file.

By: Rusty Newton (rnewton) 2014-03-10 16:50:01.439-0500

David, thanks for the additional information. However, I don't see DEBUG message types or AGI debug in your full log. Is there a reason you were unable to get those?

{quote}
You might also try to get "agi set debug on" output for the next crash. Make sure that is captured along with a full log file where DEBUG and VERBOSE levels are turned up to 5 or higher. See the wiki instructions
{quote}

By: David Hajek (hajekd) 2014-03-10 17:04:12.546-0500

Rusty, this is server is quite busy and generates easily over 2-3 GB asterisk full logs daily (with the current verbosity, no debugs etc). We didn't have agi debug and debug verbosity set to on before the last crash. I can turn it on now while we're waiting for the next crash, but I have to watch for performance issues with additional logging. ;(

By: Rusty Newton (rnewton) 2014-03-11 17:14:22.164-0500

I understand the difficulty. I'm not sure we'll be able to track it down without being able to understand how to reliably reproduce the issue, or else with that additional information.

By: Rusty Newton (rnewton) 2014-03-26 14:45:08.950-0500

Closing this for now as cannot reproduce. As we can't reproduce or identify the issue yet with the current debug and backtraces. If further information can be provided or a method for reproduction found, please contact a bug marshal in #asterisk-bugs on irc.freenode.net to re-open the issue.