ASTERISK-23745: Call file problem, DelayedRetry/retrying spite MaxRetries: 0

[Home]

Summary: ASTERISK-23745: Call file problem, DelayedRetry/retrying spite MaxRetries: 0

Reporter: Mikael Fredin (Zedekul) Labels:

Date Opened: 2014-05-16 03:24:53 Date Closed: 2014-06-25 09:00:28

Priority: Major Regression?

Status: Closed/Complete Components: Core/PBX PBX/pbx_spool

Versions: 1.8.16.0 1.8.27.0 Frequency of
Occurrence Occasional

Related
Issues:

Environment: Red Hat Enterprise Linux Server release 5.9 (Tikanga) HP ProLiant DL360 G7, 24 GB RAM, 24 cores Attachments:

Description: Following problem arises from time to time, a call will successfully terminate:

[May 14 14:31:41] VERBOSE[3274] pbx_realtime.c: -- Executing [t@project_init:1] Hangup("SIP/peer-2-00002f7e", "")
[May 14 14:31:41] VERBOSE[3274] pbx.c: == Spawn extension (project_init, t, 1) exited non-zero on 'SIP/peer-2-00002f7e'

<bye message, Really destroying SIP dialog, etc>

This is the call file:

Channel: SIP/peer-2/00numberhere
CallerID: "" <+calleridhere>
Extension: 123
SetVar: someid=123
Context: setup
WaitTime: 30
MaxRetries: 0
RetryTime: 300
Account: 123
Priority: 1

Some time after the call has hung up, the call file is still there and this is appended to the file:
StartRetry: 20354 1 (1400070906) (My note: Wed May 14 14:35:06 CEST 2014)

DelayedRetry: 20354 0 (1400070906) same time...

DelayedRetry: 20354 0 (1400071206) five minutes...

DelayedRetry: 20354 0 (1400071506) and so on...

DelayedRetry: 20354 0 (1400071806) never deleting this file

DelayedRetry: 20354 0 (1400072106) are we?

DelayedRetry: 20354 0 (1400072406) nope....

DelayedRetry: 20354 0 (1400072706) waiting for someone....

DelayedRetry: 20354 0 (1400073006) to do manual work

Asterisk log:
[May 14 14:35:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 14:40:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 14:45:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 14:50:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 14:55:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 15:00:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 15:05:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'
[May 14 15:10:06] DEBUG[20421] pbx_spool.c: Delaying retry since we're currently running '/var/spool/asterisk/outgoing/callfile'

It happens only once in a while, my guess it happens when many call files are generated at once (I do use unique names for them, based on an auto_increment in a database). This will happen to a lot of call files, also, so it's not just one call file (seen 40+ get stuck like this).

core show channels will report far too many "active calls", but very few channels compared to "active calls" (and "core show calls" will show the same, too high, number).

So somehow, sometimes, something stops registering that calls are hanging up and that a call file should be removed.

Been thinking of skipping the check:
if (o->retries <= o->maxretries) {

On row 446 in pbx/pbx.c just to get the calls flowing, because I never want a retry on a call file (it would just skip to delete the file and report "expired without completion" instead). In fact, I have specified "MaxRetries: 0", so IMO I shouldn't get to the "Delaying retry" part anyways.

But not sure if that would be a good idea - I don't know the inner workings of asterisk and maybe things would start burning anyways. Especially considering something goes wrong with the active call count.

Comments: By: Mikael Fredin (Zedekul) 2014-05-16 04:37:33.087-0500

Noticed something in the log now also:

In pbx_spool.c, line 385, the calls never seem to reach this:
ast_log(LOG_NOTICE, "Call completed to %s/%s\n", o->tech, o->dest);

in attempt_thread.

This happens normally (just checked calls going through at the moment), but the log capture at the time of failure has no such messages.
By: Mikael Fredin (Zedekul) 2014-05-19 10:52:05.596-0500

I think it's safe to conclude that this is not a call file problem but a problem with channels not hanging up. Some output during error:

[root@s tests]# asterisk -rx "core show channels"
Channel Location State Application(Data)
SIP/peer-00002e24 s@macro-call-transfe Up Dial(SIP/provider/00123456,
SIP/peer-00002e28 (None) Up AppDial((Outgoing Line))
2 active channels
158 active calls

While "sip show channels" shows:
<output with lots of ACK messages, and two BYE>
44 active SIP dialogs

I can see in the log that there's two calls who are stuck in a long running Hangup-channel, as there's many of these messages:
[May 14 15:12:25] WARNING[20406] chan_sip.c: Autodestruct on dialog '3d0d0e305260dc1240bf974510ed50b7@123.123.123.123' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms
[May 14 15:12:28] WARNING[20406] chan_sip.c: Autodestruct on dialog 'dEZoUSApp_IJ0-ZBbG-ytw..' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms

These correspond with the two BYE's in the "sip show channel" output (and with the only channels showing in "core show calls"). No idea if that can cause a massive problem like this.

During time of failure I would expect to have 24 calls running (24 call files), but the call files had already completed and hanged up their calls - without channels getting destroyed (meaning the call files never get removed). And with "active calls" reaching 158 for some reason, while we only had a few incoming calls active.

I'm using NAT with:

externaddr = 212.x.x.x in sip.conf and
nat=yes, qualify=yes for the peers

In case it may be related to NAT.

I should know sooner or later if it's related, since I have made a new setup without NAT.
By: Mikael Fredin (Zedekul) 2014-05-19 11:00:48.441-0500

The SIP dialogues do get destroyed (really destroying SIP dialogue-message), but maybe it's some problem with ACK:s since there's a lot of those stuck in the "sip show channels" output.
By: Rusty Newton (rnewton) 2014-05-21 18:13:54.583-0500

Can you provide a full Asterisk log? With VERBOSE, DEBUG, everything, both verbose and debug turned up to 10, showing the time up until the issue starts occurring?

[Collecting Debug Information|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information]

I'm not sure that anything but having exact steps for reproduction will help, but you might as well follow the instructions for getting traces for a deadlock, just in case. [Getting a backtrace|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace?src=search#GettingaBacktrace-GettingInformationForADeadlock]
By: Mikael Fredin (Zedekul) 2014-05-22 01:50:36.800-0500

Thank you for helping me along. I will talk with my colleagues about this since I am now using a new server (without a router/NAT), where currently the issue does not seem to persist. I do have the old machine/setup available so I will see if I can make some tests and try to reproduce this while getting a backtrace.
By: Rusty Newton (rnewton) 2014-06-09 07:25:30.240-0500

Thanks. I'll put this in Waiting For Feedback for a couple weeks to give you time.
By: Rusty Newton (rnewton) 2014-06-25 09:00:03.670-0500

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested. Further information can be found at http://www.asterisk.org/developers/bug-guidelines