[Home]

Summary:ASTERISK-23319: Segmentation fault in queue_exec at app_queue.c
Reporter:Vadim (cron333)Labels:
Date Opened:2014-02-17 09:52:49.000-0600Date Closed:2015-04-08 21:05:52
Priority:MajorRegression?
Status:Closed/CompleteComponents:Applications/app_queue
Versions:1.8.25.0 Frequency of
Occurrence
Related
Issues:
Environment:CentoOS 6.2Attachments:( 0) coredumpcrashqueuetimeronpatchedasterisk13.txt
( 1) irrelevantqueuepatch1.txt
( 2) queuepatch2.txt
( 3) trace.txt
Description:gdb /usr/sbin/asterisk /tmp/coreXX.XXX
....
[Edit by Rusty Newton - removed inline debug as per the guidelines...]

Comments:By: Vadim (cron333) 2014-02-17 09:59:57.414-0600

queues.conf
--------------

[callcenter-msk]
...
defaultrules=msk
...


queuerules.conf
---------------

[msk]
penaltychange => 30,8
penaltychange => 60,10



P.S. the same in version 1.8.18.0

By: Rusty Newton (rnewton) 2014-02-19 16:30:01.815-0600

Attaching reporters trace.

By: Matt Jordan (mjordan) 2014-02-19 17:00:50.532-0600

Can you go ahead and attach your full {{queues.conf}} and {{queuerules.conf}}? There may be some combination of settings that is leading up to this, and I'd like to make sure that we can reproduce it using the correct configuration.

It would also be good to have a DEBUG log leading up to the crash, as somehow the queue penalty rules got blown out. Did a reload operation occur?

By: Rusty Newton (rnewton) 2014-02-19 20:11:54.701-0600

I attempted to reproduce earlier and could not get it to crash using what you have supplied here, so we will definitely need the items Matt requested.

In addition, the option "defaultrules=msk" that you mention has a typo that would prevent it from taking effect. The correct option is "defaultrule" with no "s". I'm not sure whether you had a typo when typing that in the comment, or if you are using that in your configuration.

{noformat}
; If you wish to implement a rule defined in queuerules.conf (see
; configs/queuerules.conf.sample from the asterisk source directory for
; more information about penalty rules) by default, you may specify this
; by setting defaultrule to the rule's name
;
; defaultrule = myrule
{noformat}

You might as well include the extensions.conf section that is executed up to the crash as well. Thanks!

By: Vadim (cron333) 2014-02-20 02:45:49.695-0600

defaultrule =  ! (no 's')

queues.conf
-------------
{noformat}
[general]
persistentmembers = yes
monitor-type = MixMonitor
shared_lastcall=yes
wrapuptime=10
setqueuevar=yes
eventwhencalled = yes
ringinuse = no


[callcenter-msk]
monitor-format = wav
monitor-type = MixMonitor
musicclass = default
context = callback-queue-quest
timeout = 15
retry = 3
ringinuse = no
autofill = yes
defaultrule = msk
joinempty = paused,invalid,unavailable,unknown
leavewhenempty = paused,invalid,unavailable,unknown
timeoutrestart = no
setinterfacevar = yes
setqueuevar = yes
eventwhencalled = yes
strategy = rrmemory
periodic-announce = 999/SERV_dolgo
periodic-announce-frequency = 90
servicelevel = 60
wrapuptime = 7
weight=2
memberdelay=3
{noformat}

******************


queuerules.conf
----------------
{noformat}
[msk]
penaltychange => 30,8
penaltychange => 60,10


[kred]
penaltychange => 60,10
{noformat}

**********************

By: Vadim (cron333) 2014-02-20 02:58:27.556-0600

DEBUG_LOG(DEBUG log leading up to the crash):
-----------------

{noformat}
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:1] NoOp("Local/1100@callback-00014881;2", "CALL_ID: 17587601") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:2] GotoIf("Local/1100@callback-00014881;2", "0?500") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:3] Set("Local/1100@callback-00014881;2", "__AUTO_SYS=YES") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:4] MYSQL("Local/1100@callback-00014881;2", "Set timeout 3") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:5] MYSQL("Local/1100@callback-00014881;2", "Connect connid db-master.local asterisk shuxieru default_sm") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:6] GotoIf("Local/1100@callback-00014881;2", "1?9") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:9] MYSQL("Local/1100@callback-00014881;2", "Query resultid 15 UPDATE orders_status_otzvon SET `asterix_count`=asterix_count+1, `asterix_status`=asterix_status+1 WHERE `id_zv`='17587601'") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:10] MYSQL("Local/1100@callback-00014881;2", "Disconnect 15") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:11] Set("Local/1100@callback-00014881;2", "__TRANSFER_CONTEXT=transfer") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:12] Set("Local/1100@callback-00014881;2", "__CON_TEXT=callback") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:13] Set("Local/1100@callback-00014881;2", "CALLERID(name)=bc_backcall_tov-578009-522107-923281") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:14] GotoIf("Local/1100@callback-00014881;2", "0?:50") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:50] NoOp("Local/1100@callback-00014881;2", "") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:51] Set("Local/1100@callback-00014881;2", "MONITOR_FILENAME=79224541482_1100_16-02-2014_1392578446.297449") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:52] Set("Local/1100@callback-00014881;2", "MONITOR_EXEC=/bin/rm /var/spool/asterisk/monitor/^{MONITOR_FILENAME}.wav") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:53] Goto("Local/1100@callback-00014881;2", "60") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [1100@callback:60] Macro("Local/1100@callback-00014881;2", "auto-system,1100") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-auto-system:1] NoOp("Local/1100@callback-00014881;2", "") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-auto-system:2] Macro("Local/1100@callback-00014881;2", "queue-directory,1100") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:1] NoOp("Local/1100@callback-00014881;2", "1") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:2] GotoIf("Local/1100@callback-00014881;2", "1?que-1100:que-1100") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:101] Set("Local/1100@callback-00014881;2", "ARRAY(Q_NAME,Q_EXIT,Q_NUMBER_MOB)=callcenter-msk,spb,") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:102] Set("Local/1100@callback-00014881;2", "QUEUE_MAX_PENALTY=5") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:103] Goto("Local/1100@callback-00014881;2", "exit") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:329] NoOp("Local/1100@callback-00014881;2", "") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-queue-directory:330] Set("Local/1100@callback-00014881;2", "__QUEUENAME=callcenter-msk") in new stack
[Feb 16 23:20:46] VERBOSE[633] pbx.c:     -- Executing [s@macro-auto-system:3] Queue("Local/1100@callback-00014881;2", "callcenter-msk,t,,,60") in new stack
[Feb 16 23:20:46] VERBOSE[633] res_musiconhold.c:     -- Started music on hold, class 'default', on Local/1100@callback-00014881;2
[Feb 16 23:20:49] VERBOSE[633] res_musiconhold.c:     -- Stopped music on hold on Local/1100@callback-00014881;2
[Feb 16 23:20:49] VERBOSE[633] res_musiconhold.c:     -- Started music on hold, class 'default', on Local/1100@callback-00014881;2
{noformat}


By: Vadim (cron333) 2014-02-20 03:11:20.168-0600

extensions.conf
------------------
{noformat}
[callback]
;
exten => _X.,1,NoOp(CALL_ID: ${CALL_ID})
same => n,GotoIf($["${SERV_NAME}" != "${CUR_SERV}"]?500)
same => n,Set(__AUTO_SYS=YES)
same => n,MYSQL(Set timeout ${MYSQL_TIMEOUT})
same => n,MYSQL(Connect connid ${MYSQL_SERV} asterisk shuxieru default_sm)
;;
same => n,GotoIf($["${QUERY_LOG_ALL}" != "1"]?$[${PRIORITY}+3])
same => n,Set(QUERY_DB=UPDATE orders_status_otzvon SET asterix_count=asterix_count+1, asterix_status=asterix_status+1 WHERE id_zv='${CALL_ID}')
same => n,Macro(query-db-log)
;;
same => n,MYSQL(Query resultid ${connid} UPDATE orders_status_otzvon SET asterix_count=asterix_count+1, asterix_status=asterix_status+1 WHERE id_zv='${CALL_ID}')
same => n,MYSQL(Disconnect ${connid})
same => n,Set(__TRANSFER_CONTEXT=transfer)
same => n,Set(__CON_TEXT=callback)
same => n,Set(CALLERID(name)=bc_${COMMENT})
same => n,GotoIF($[${LEN(${MANAGER})} > 4]?:50)
...
;
exten => _XXXX,50,NoOp()
;same => n,Set(MONITOR_FILENAME=callback-${UNIQUEID})
same => n,Set(MONITOR_FILENAME=${CALLERID(num)}_${EXTEN}_${STRFTIME(,,%d-%m-%G)}_${UNIQUEID})
same => n,Set(MONITOR_EXEC=/bin/rm /var/spool/asterisk/monitor/^{MONITOR_FILENAME}.wav)
same => n,GoTo(60)
;
...
;
exten => _11XX,60,Macro(auto-system,${EXTEN})
same => n,Hangup()
;
exten => _X.,n,Hangup



[macro-auto-system]
exten => s,1,NoOp()
same => n,Macro(queue-directory,${ARG1})
same => n,Queue(${Q_NAME},t,,,60)


[macro-queue-directory]
exten => s,1,NoOp(${DIALPLAN_EXISTS(macro-queue-directory,s,que-${ARG1})})
same => n,GotoIf($["${DIALPLAN_EXISTS(macro-queue-directory,s,que-${ARG1})}" = "1"]?que-${ARG1}:que-1100)
...
same => n(que-1100),Set(ARRAY(Q_NAME,Q_EXIT,Q_NUMBER_MOB)=callcenter-msk,spb,${DB(DOP/${EVAL(${ARG1})})})
same => n,Set(QUEUE_MAX_PENALTY=5)
same => n,GoTo(exit)
...
same => n(exit),NoOp()
same => n,Set(__QUEUENAME=${Q_NAME})
{noformat}


By: Vadim (cron333) 2014-02-20 03:13:12.716-0600

!, the problem is not constant ...

P.S.

>>Did a reload operation occur?

Yes.

By: Vadim (cron333) 2014-02-26 01:30:53.468-0600

Waiting for Feedbatsk status? Something else you need from me?

By: David Brillert (aragon) 2014-02-26 08:00:24.106-0600

In the ticket actions section you will see a button to enter feedback.  Click on it and 'enter feedback'

By: Vadim (cron333) 2014-02-27 06:18:00.365-0600

>>Matt Jordan added a comment - 20/Feb/14 3:00 AM
Can you go ahead and attach your full queues.conf and queuerules.conf? There may be some combination of settings that is leading up to this, and I'd like to make sure that we can reproduce it using the correct configuration. It would also be good to have a DEBUG log leading up to the crash, as somehow the queue penalty rules got blown out. Did a reload operation occur?



I put it all ...

By: Rusty Newton (rnewton) 2014-03-04 09:56:54.196-0600

bq. I put it all ...

What you posted in the comments is the entire contents of those two files?

By: Vadim (cron333) 2014-03-05 13:44:38.408-0600

The number of queues option defaultrule - 2.
And so, the amount of queues, much more.

By: Rusty Newton (rnewton) 2014-03-24 16:05:43.205-0500

Vadim thanks for posting the additional information. However when we say attach to the issue, we do mean attach as files to the issue and not post logs into the comment fields. See "More" "Attach Files".

Anyway, you said:

{quote}
!, the problem is not constant ...
{quote}

Can you give us an idea of what "not constant" is?  If I make 100 calls to simulate the call flow, could I expect 1 of them to cause a crash, or 90 of them? How many calls do you run through it to reproduce the issue?

I am unable to reproduce the issue so far, but I'd like to be able to reproduce it so the developers can have an easier time investigating the crash.

By: Rusty Newton (rnewton) 2014-04-08 09:44:10.302-0500

Closing, as we don't have a response on the previous question, we can't reproduce so far and don't have an accurate guide on how to reproduce this crash in order to investigate.

By: Matt Jordan (mjordan) 2015-03-11 17:28:18.743-0500

Dropping this back into Triage, as Stefan26 thinks he can reproduce the issue.

Note that I took a gander through {{app_queue}} with him, and there may be some paths where {{qe.pr}} goes garbage on us.

{quote}
{noformat}
(04:55:35 PM) Stefan26: Can leave_queue(qe) and queue_exec(chan, data) run concurrently (where qe->chan=chan)? It seems leave_queue(qe) can do a ast_free on my qe.pr
(04:59:22 PM) sruffell left the room (quit: Ping timeout: 265 seconds).
(05:00:39 PM) Stefan26: that ast_free is enclosed in a block with ao2_lock(q) {...} ao2_unlock(q); what is that meant to protect it from? i wonder if i should test putting a similar ao2_lock/ao2_unlock in queue_exec
(05:00:51 PM) Stefan26: but it's such a wild goose chase if i can't reproduce it with confidence
(05:02:00 PM) Stefan26: reproduce the crash*
(05:03:53 PM) superscrat [~asanders@173-17-133-2.client.mchsi.com] entered the room.
(05:04:37 PM) sruffell [~sruffell@asterisk/the-kernel-guy/sruffell] entered the room.
(05:04:37 PM) mode (+o sruffell) by ChanServ
(05:04:49 PM) sruffell left the room (quit: Client Quit).
(05:07:12 PM) mjordan: Stefan26: so, qe.pr points to the first rule in qe.qe_rules
(05:07:30 PM) mjordan: which is a list allocated without a lock. That implies something else should provide the synchronization.
(05:08:31 PM) mjordan: leave_queue can be called in multiple locations.
(05:09:09 PM) mjordan: in wait_our_turn, queue_exec (which shouldn't be a problem), and try_calling
(05:10:17 PM) mjordan: since those in turn are called from queue_exec, my guess would be that there shouldn't need to be any synchronization.
(05:10:34 PM) mjordan: do you still have the core file from the crash?
(05:11:48 PM) mjordan: now...
(05:11:49 PM) mjordan: hm.
(05:12:01 PM) Stefan26: no but I can get it from a colleague tomorrow, only he managed to produce it. from my understanding the use case was that some kind of hang_up action was performed on the chan_sip channel which corresponded to the queue_ent qe for which the seg-fault happend
(05:12:39 PM) mjordan: One way this could go all nutso on us is if, somehow, leave_queue was called from try_calling/wait_our_turn, *AND* somehow we managed to loop back around in the queue such that the 'see if we need to move to the next penalty level' was called again
(05:13:58 PM) mjordan: Stefan26: which version of Asterisk?
(05:14:39 PM) Stefan26: Asterisk 13.1.0
(05:14:44 PM) Stefan26: but source from 13.1.0 looked similar
(05:14:45 PM) mjordan: so, it isn't impossible for try_calling to call leave_queue and return non-zero
(05:14:51 PM) Stefan26: 13.2.0*
(05:14:51 PM) mjordan: er, zero
(05:15:21 PM) mjordan: Assuming everything actually is 'answered', try_calling returns the value of ast_bridge_call_with_flags
(05:15:53 PM) mjordan: which returns 0 on success, -1 on failure
(05:16:50 PM) mjordan: that could cause a problem if it returns successfully and we don't actually leave the queue somehow.
(05:17:24 PM) mjordan: assuming the inbound channel never hung up.
(05:19:15 PM) mjordan: the same issue exists in wait_our_turn however
(05:19:36 PM) Stefan26: I can log (to file) all return values of res = try_calling(&qe, opts, opt_args, args.announceoverride, args.url, &tries, &noption, args.agi, args.macro, args.gosub, ringing) and check that log next crash, what else can I log to test your ideas?
(05:19:56 PM) mjordan: I would log out the return value of try_calling and wait_our_turn
(05:20:20 PM) Stefan26: and calls to leave_queue?
(05:20:22 PM) mjordan: yes
(05:20:39 PM) mjordan: I'm pretty sure that's the problem though.
(05:21:01 PM) mjordan: Both try_calling and wait_our_turn need to make sure that their callers treat a call to leave_queue as "get the hell out"
(05:21:17 PM) mjordan: to that point, leave_queue should also set qe->pr to NULL when done
(05:21:24 PM) mjordan: leaving it as garbage in this case is bad
(05:21:55 PM) [TK]D-Fender [~joe@64.235.216.2] entered the room.
(05:23:07 PM) Stefan26: OK, ill log those items, if it ever crashes again, ill try patching qe->pr to NULL after next crash if log suggests it would have helped
(05:23:17 PM) putnopvut left the room (quit: Quit: Leaving).
(05:24:06 PM) Stefan26: I can't make a JIRA if the use-case is unknown?
(05:24:39 PM) mjordan: well, since we had the same crash elsewhere, we can just comment on that issue
(05:24:49 PM) mjordan: and re-open it
(05:24:56 PM) mjordan: of course, in that case, it was against 1.8
(05:25:15 PM) mjordan: so it may be worth looking at 11 to see if it has the same issues. I'd suspect that it does.
(05:25:34 PM) mjordan: it would be nice to know how it occurred before we try modifying app_queue: it has a tendency to be fragile :-)
(05:25:45 PM) Stefan26: If you re-open it I can post my crashlog tomorrow if collague gives it too much
(05:26:06 PM) Stefan26: I can't comment on a closed JIRA otherwise?
(05:26:24 PM) Stefan26: worst case scenario it never happens again and ull have to close it once more :)
(05:26:38 PM) pnlarsson [~niklas@fw1.gml.g.icnet.infracom.se] entered the room.
(05:26:42 PM) Stefan26: s/too much/to me/
{noformat}
{quote}

By: Stefan Engström (StefanEng86) 2015-03-11 17:43:48.017-0500

This crash with a nearly identical footprint happend on an Asterisk 13.1.0 installation. When I get to the office tomorrow I'll try locate the coredump, and if it happens again, I'll provide additional logs, too.

By: Rusty Newton (rnewton) 2015-03-13 17:47:24.477-0500

Stefan, thanks. Make sure, if you can to recompile Asterisk with the appropriate flags so that the backtrace will be useful. See the instructions here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace



By: Stefan Engström (StefanEng86) 2015-03-14 06:07:05.563-0500

Unfortunately this crash has NOT happend again in the last week despite robots calling the queue with queue-timers.

Moreover, unfortunately the one crash was done on an asterisk-13.1.0 with a few custom small patches applied to it. I did find the crash-log and I cannot see how this crash could not have happend with untouched asterisk-13.1.0-code as the patches does not really alter logical flow. I attach those patches too but note they are not code contributions, they are just old patches applied to a special installation.

What is the real possible harm in putting a qe.pr=NULL; in the leave_queue code? (cf mjordans post)

By: Matt Jordan (mjordan) 2015-03-23 09:42:51.394-0500

[~StefanEng86]: I think putting a {{qe.pr = NULL; }} would not be a bad idea. Since other locations check to see if that pointer is NULL before dereferencing it, setting it to NULL after being freed is probably appropriate.

If you'd like to provide that patch, that'd be great - you can find information on providing a patch back to the project at [https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Stefan Engström (StefanEng86) 2015-03-25 08:11:16.213-0500

https://reviewboard.asterisk.org/r/4522/

By: Rusty Newton (rnewton) 2015-04-08 09:52:25.686-0500

Stefan or Vadim, have either of you been able to reproduce this crash again?

By: Matt Jordan (mjordan) 2015-04-08 21:05:40.697-0500

I've committed [~StefanEng86]'s patch on review board to 11+. If this issue happens again with this patch, we can re-open the issue.

[~StefanEng86]: thanks for the contribution!