ASTERISK-20416: Unknown issue forces maximum calls limit reached at 500 calls - Autodestruct warnings spamming CLI and extension to extension calls not possible

[Home]

Summary: ASTERISK-20416: Unknown issue forces maximum calls limit reached at 500 calls - Autodestruct warnings spamming CLI and extension to extension calls not possible

Reporter: David Brillert (aragon) Labels:

Date Opened: 2012-09-13 17:49:15 Date Closed: 2012-11-07 11:51:18.000-0600

Priority: Critical Regression? No

Status: Closed/Complete Components: Channels/chan_local

Versions: SVN Frequency of
Occurrence Frequent

Related
Issues:

Environment: Centos 5.8 x64 quadcore 8 GB RAM Attachments: ( 0) asterisk.stack.txt
( 1) autodestruct_agi_debug.txt
( 2) autodestructverbosecli.txt
( 3) autodestructverbosecli2.txt
( 4) cli_debug_autodestruct.rar
( 5) cli_debug_autodestruct.rar
( 6) log.rar
( 7) production_ref_count_log_1.txt
( 8) production_ref_count_log_2.txt
( 9) reflog.rar

Description: Every single day Asterisk cannot complete transfers or local extensions cannot call.
While incoming and outgoing calls are working...
Autodestruct warnings spam the CLI with hundreds of WARNINGS per minute.
Local calls become possible only after the Autodestruct messages go away or Asterisk is restarted.
It is speculated that Autodestructs frequently occur during dynamic ACD logins or outgoing calls requiring PINs so this could be related to some ast_waitfordigit issue discovered at ASTERISK-20375.
All incoming calls terminate to ACD queue.
All outgoing calls require a PIN to complete.
No IVR's in use.
No DTMF *2 transfers being done by users.
Opening a new ticket since I am not 100% sure they are related issues.

Comments: By: David Brillert (aragon) 2012-09-13 18:52:04.227-0500

Two snippets of the messages log from a 200 MB file I collected during the outages.
I hope there is something of value there.
By: David Brillert (aragon) 2012-09-14 08:01:51.385-0500

Comparing events between CDR and the appearances of Autodestruct warnings I think the effected code path could be ast_waitfordigit during ACD logins that require a password.
By: Rusty Newton (rnewton) 2012-09-15 15:27:37.230-0500

You indicate that this is a regression? What was the last version you were using before 373047 ?
By: David Brillert (aragon) 2012-09-16 23:11:51.206-0500

Rusty - I have tried revisions as old as 1.8.12 without a working version to fix this problem.
I don't know how far back I would have to go to find a working version (I assume a working version does exist and hence the regression) and I don't know which revision causes the problem :(
Regardless due to some other required patches I don't think I could go to anything older than 1.8.12

I'll try to get some more CLI output tomorrow that could help narrow down cause or steps to reproduce.
The problem affects multiple sites of mine and the warning always seems to appear shortly after an ACD login. Not every ACD login is affected so there are some bizarre conditions. I have seen on systems using dummy and DAHDI timing modules.
The problem is so service affecting that the site is unworkable many times per day... it's a very busy system and we have already tried running a DEBUG_THREADS version with zero success. We cannot run in Valgrind either.
By: David Brillert (aragon) 2012-09-17 10:16:47.404-0500

Today I captured debug CLI with autodestruct warnings and I'm hoping someone can analyze for root cause of this problem. Uploaded as cli_debug_autodestruct.rar
By: David Brillert (aragon) 2012-09-19 11:51:31.251-0500

Another DTMF possibility exists in entering a wrapup code

[2012-09-19 12:42:33] -- Executing [502@defaultmain-restricted:7] Set("SIP/266-00004174", "ASTERISK_DBPUT(defaultmain/wrapup/266/lastcall)=1348072953.38713") in new stack
[2012-09-19 12:42:34] WARNING[18298]: chan_sip.c:3925 __sip_autodestruct: Autodestruct on dialog '0752902accaba9eaafac60c267b19aab@10.0.50.170' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms
By: Rusty Newton (rnewton) 2012-09-25 16:08:16.893-0500

Matt Jordan indicated this may be a reference counting issue.

Please follow the instructions very carefully here: https://wiki.asterisk.org/wiki/display/~mjordan/Reference+Count+Debugging. Enable for both chan_sip and chan_local, then after the issue occurs attach the resulting files to this issue.
By: David Brillert (aragon) 2012-09-25 16:18:58.049-0500

ooooh that does looks useful. I gotta ask what kind of performance hit will this put on a production system?... this is a very busy call center
By: Matt Jordan (mjordan) 2012-09-25 16:27:26.678-0500

Performance hit is potentially noticeable, but nowhere near the level of DEBUG_THREADS. The real issue is that you have to write out to another file, typically on the same thread that services the ao2 objects in general (SIP's do_monitor, various pbx_threads, etc.) So there is an impact.

The real issue is that this *will* create a massive file. (Which is okay, it can be tar'd and split up) However, you'll want to monitor it and stop it as soon as you hit a problem or if the file gets to be too big.
By: David Brillert (aragon) 2012-09-27 07:50:19.082-0500

We had some problems this morning and were able to collect some data with this option.
ps -LlFm -p `pidof asterisk`
pstack `pidof asterisk` > /tmp/asterisk.stack.txt

Attaching /tmp/asterisk.stack.txt
By: David Brillert (aragon) 2012-09-30 11:00:23.645-0500

We have narrowed it down to some interaction with all-hangup agi script.
which does:
sip*CLI> dialplan show all-hangup
[ Context 'all-hangup' created by 'pbx_config' ]
's' => 1. Set(CDR(hangupside)=${HANGUPSIDE}) [pbx_config]
2. Set(CDR(hangupcause)=${HANGUPCAUSE}) [pbx_config]
3. Set(CDR(codec1)=${CHANNEL(audioreadformat)}) [pbx_config]
4. Set(CDR(codec2)=${CHANNEL(audiowriteformat)}) [pbx_config]
5. Set(CDR(recvip)=${CHANNEL(recvip)}) [pbx_config]
6. Set(CDR(fromchan)=${CHANNEL(from)}) [pbx_config]
7. Set(CDR(uri)=${CHANNEL(uri)}) [pbx_config]
8. Set(CDR(useragent)=${CHANNEL(useragent)}) [pbx_config]
9. Set(CDR(llp)=${CHANNEL(rtpqos,audio,local_maxrxploss)}) [pbx_config]
10. Set(CDR(rlp)=${CHANNEL(rtpqos,audio,remote_maxrxploss)}) [pbx_config]
11. Set(CDR(ljitt)=${CHANNEL(rtpqos,audio,local_maxjitter)}) [pbx_config]
12. Set(CDR(rjitt)=${CHANNEL(rtpqos,audio,remote_maxjitter)}) [pbx_config]
13. GotoIf($["${FAXID}xxx" != "xxx"]?all-faxnotify,s,1:$[${PRIORITY} + 1]) [pbx_config]
14. ResetCDR(w) [pbx_config]
15. NoCDR() [pbx_config]
16. Hangup() [pbx_config]

The only relevant lines here are 14-16
Where I think we try to write CDR and Hangup on a leg that is already disappeared and that causes failed to write frame and Autodestructs. I have no idea why.

Unfortunately I think we are weeks away from being able to run asterisk in ref count debugging mode.

Also I think I found a duplicate ticket on bug tracker ASTERISK-20486 and it looks like that reporter is much further along in diagnosing the problem than I am.
By: David Brillert (aragon) 2012-10-02 08:19:44.748-0500

Someone on my team came across this reference.
http://forums.asterisk.org/viewtopic.php?f=1&t=80755&p=166484&hilit=chan_sip.c%3A+Autodestruct+on+dialog#p166484

We don't use RADIUS on our systems but I speculate the other comments in this thread (regarding BYE's) are applicable to what is going on here.
We are using cdr_adaptive_odbc

I quote:
'1. When a call (either inbound or outbound...doesn't matter which leg of the call) is in progress, and one of the legs hangs up, Asterisk just sits there like a lump without sending a BYE message to the remaining active channel.

2. If you wait and watch, and the remaining channel keeps its RTP stream to Asterisk open (phone off-hook still), the remaining leg will FINALLY get sent a BYE from Asterisk exactly 30 seconds after the first leg hangs up. It doesn't matter how long the call lasted (seconds or hours), you can count on needing to wait 30 seconds from the instant the first channel is dead. When this happens, the channel is cleaned up as expected.

3. If you hang up the remaining channel (put phone back on-hook) BEFORE 30 seconds are up, Asterisk really doesn't like this. A few seconds after you do this, you will see a "WARNING: chan_sip.c: __sip_autodestruct: Autodestruct on dialog '[...]' with owner in place (Method: BYE)" show on the console, and the BYE dialog will remain "stuck" in 'sip show channels' until you restart Asterisk. After a while, these really begin to accumulate.'

So without being an expert I assume we lock the CDR tables and the ODBC driver is locked waiting for Asterisk to hang up the orphaned leg.

This is what our cdr_pre locking script looks like
{noformat}
Index: cron/cdr.php
===================================================================
--- cron/cdr.php (revision 9992)
+++ cron/cdr.php (working copy)
@@ -1,8 +1,27 @@
<?php
+function isTableLocked($table) {
+
+ global $asterisk;
+
+ $sql = "SHOW OPEN TABLES WHERE In_use > 0 like '$table'";
+ $res = $asterisk->_db_report_read->query($sql);
+ $r = $res->fetchRow(DB_FETCHMODE_ASSOC);
+
+ if (!is_null($r)) {
+ print " * TABLE '$table' IS LOCKED ...\n";
+ return true;
+ }
+ return false;
+}
+
function FixCDRTenant($table = 'cdr') {
global $asterisk;
+ if (isTableLocked($table)) {
+ return false;
+ }
+
$sql = "SELECT channel,dstchannel,dcontext,uniqueid FROM $table WHERE tenant IS NULL OR tenant = ''";
$res = $asterisk->_db_report_read->query($sql);
@@ -44,6 +63,10 @@
function FixCDRLocation($table = 'cdr') {
global $asterisk;
+
+ if (isTableLocked($table)) {
+ return false;
+ }
$sql = "SELECT dst,dstchannel,uniqueid FROM $table WHERE location IS NULL OR location = ''";
$res = $asterisk->_db_report_read->query($sql);
@@ -88,11 +111,18 @@
// *******************************************************************
print "- Move data from CDR-pre to the CDR table\n";
+ if (isTableLocked('cdr_pre')) {
+ return false;
+ }
+
+ $asterisk->_db_report_write->query("LOCK TABLES cdr_pre WRITE");
+
$sql = "SELECT * FROM cdr_pre";
$res = $asterisk->_db_report_read->query($sql);
if (is_a($res, 'PEAR_Error')) {
print " - ERROR GETTING DATA FROM DATABASE\n";
+ $asterisk->_db_report_write->query("UNLOCK TABLES");
return false;
}
@@ -122,5 +152,8 @@
$sql = "DELETE FROM cdr_pre WHERE uniqueid = " . $asterisk->_db->quote($r['uniqueid']);
$asterisk->_db_report_write->query($sql);
}
+
+ $asterisk->_db_report_write->query("UNLOCK TABLES");
+
print ".";
}
{noformat}
By: David Brillert (aragon) 2012-10-02 09:48:38.515-0500

In regards to the other reporters comment=
3. If you hang up the remaining channel (put phone back on-hook) BEFORE 30 seconds are up, Asterisk really doesn't like this. A few seconds after you do this, you will see a "WARNING: chan_sip.c: __sip_autodestruct: Autodestruct on dialog '[...]' with owner in place (Method: BYE)" show on the console, and the BYE dialog will remain "stuck" in 'sip show channels' until you restart Asterisk. After a while, these really begin to accumulate.'

Inadvertently we have reproduced the reporters steps to reproduce.
In step 16 our all-hangup agi script forces a hangup on the channel waiting for the BYE which causes the channel to hang until we restart Asterisk. I presume the hangup() cancels Asterisk ever sending the BYE... so I am left pondering why Asterisk waits so long to send a BYE in the first place.
By: Doug Reid (doug_ir) 2012-10-08 12:57:16.354-0500

Hi Guys

Just a thought, what codecs are being used when trying to re-create the problem? This problem seems to happen only when using the G729 Codec. With the Autodestruct warnings we see the channels in a "BYE" state and eventually they will destruct but sometimes under high call volume they build up to fast and reach the maximum licenses in this case 180 concurrent calls. When the error starts we also notice a sometimes significant delay in local channel calls 10+ seconds before the call will terminate.

So to add my 2 cents worth I would say this occurs when using the G729 codec.

1/180 encoders/decoders of 180 licensed channels are currently in use

Licenses Found:
File: G729-XXXXXXXXXX.lic – Key: G729-XXXXXXXXXXX – Host-ID: cb:e8:4c:e6:50:07:49:40:1a:a3:a3:5e:2c:b1:4e:d1:79:25:9e:59 – Channels: 60 (Expires: 2032-10-02) (OK)
File: G729-XXXXXXXXXX.lic – Key: G729-XXXXXXXXXXX – Host-ID: cb:e8:4c:e6:50:07:49:40:1a:a3:a3:5e:2c:b1:4e:d1:79:25:9e:59 – Channels: 120 (Expires: 2032-09-28) (OK)

[2012-10-08 10:43:21] WARNING[25966] codec_g729a.c: out of G.729 decoder licenses
[2012-10-08 10:43:21] WARNING[25966] translate.c: g729tolin did not update samples 0
[2012-10-08 10:43:21] WARNING[25804] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[26327] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[25825] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[25736] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[22435] chan_sip.c: Autodestruct on dialog '31794e800637dccc1d6756c01c91462d@172.17.1.2:5060' with owner SIP/voip-00004dfd in place (Method: BYE). Rescheduling destruction for 10000 ms

By: Doug Reid (doug_ir) 2012-10-10 09:36:40.892-0500

Scrap previous G729 comment we have seen this in Alaw as well.

By: Rusty Newton (rnewton) 2012-10-10 17:59:59.796-0500

Unselected regression, since we haven't identified the root problem and whether it doesn't occur in a previous version.
By: David Brillert (aragon) 2012-10-10 21:06:41.604-0500

OK we set up a lab environment to test.
DONT_OPTIMIZE DEBUG_THREADS and chan_sip and chan_local ref counting options enabled on SVN-branch-1.8-r374802M
Hammered the lab server but could not reproduce the Autodestruct warnings.
Uploading reflog.rar
By: David Brillert (aragon) 2012-10-11 09:52:44.607-0500

This ref log was captured on a production server today and has at least one autodestruct event. The system didn't peak during our maintenance window so the log is not as spammy as I would have liked. But here it is...
By: David Brillert (aragon) 2012-10-11 10:22:16.319-0500

production ref count log 1.txt

By: David Brillert (aragon) 2012-10-11 10:23:35.685-0500

A second log in case it proves useful.
I don't think it captures a period where autodestructs were visible in the CLI...
production ref count log 2.txt
By: David Brillert (aragon) 2012-10-15 08:22:15.456-0500

PING...

Is anything else required to move this along?
I have provided every log and trace I can think of...
The problem for us is quite critical and appears on multiple sites.
By: Rusty Newton (rnewton) 2012-10-15 09:20:32.511-0500

aragon, thanks for all the debug and information. The core team and community developers are already working on many issues. The amount of information you have provided will certainly speed things up when someone has time allocated to dive into the issue.

By: David Brillert (aragon) 2012-10-15 09:28:31.905-0500

Rusty, I can appreciate the workload on over 700 open tickets but this issue is causing us and our partners to lose thousands of dollars per day in lost calls and sales. Anything that can be done to advance the progress on this issue would be much appreciated.
By: Doug Reid (doug_ir) 2012-10-17 05:45:19.271-0500

I have added a cli debug log "autodestruct_agi_debug" with agi debug enabled and allot of Autodestruct spamming, hope this will help.
By: Doug Reid (doug_ir) 2012-10-17 05:46:31.427-0500

added "autodestruct_agi_debug"
By: David Brillert (aragon) 2012-10-17 09:06:18.801-0500

A couple of observations.

1. Step 16 in this context does not execute an AGI it executes Asterisk Hangup macro and if we remove it from the context the Autodestructs still appear.
{noformat}
[ Context 'all-hangup' created by 'pbx_config' ]
's' => 1. Set(CDR(hangupside)=${HANGUPSIDE}) [pbx_config]
2. Set(CDR(hangupcause)=${HANGUPCAUSE}) [pbx_config]
3. Set(CDR(codec1)=${CHANNEL(audioreadformat)}) [pbx_config]
4. Set(CDR(codec2)=${CHANNEL(audiowriteformat)}) [pbx_config]
5. Set(CDR(recvip)=${CHANNEL(recvip)}) [pbx_config]
6. Set(CDR(fromchan)=${CHANNEL(from)}) [pbx_config]
7. Set(CDR(uri)=${CHANNEL(uri)}) [pbx_config]
8. Set(CDR(useragent)=${CHANNEL(useragent)}) [pbx_config]
9. Set(CDR(llp)=${CHANNEL(rtpqos,audio,local_maxrxploss)}) [pbx_config]
10. Set(CDR(rlp)=${CHANNEL(rtpqos,audio,remote_maxrxploss)}) [pbx_config]
11. Set(CDR(ljitt)=${CHANNEL(rtpqos,audio,local_maxjitter)}) [pbx_config]
12. Set(CDR(rjitt)=${CHANNEL(rtpqos,audio,remote_maxjitter)}) [pbx_config]
13. GotoIf($["${FAXID}xxx" != "xxx"]?all-faxnotify,s,1:$[${PRIORITY} + 1]) [pbx_config]
14. ResetCDR(w) [pbx_config]
15. NoCDR() [pbx_config]
16. Hangup() [pbx_config]
{noformat}
2. Our MySQL locking scripts have been re-written to be certain we are not blocking on CDR writes.
3. Channels are getting hung and the Autodestruct still leaves them hanging around.
4. Executing the Hangup macro doesn't kill the hanging channels.

The attached log.rar file already in the ticket includes full pcap traces of both SIP and AGI during the Autodestruct periods.
By: David Brillert (aragon) 2012-10-17 09:13:13.925-0500

In order to be certain our [all-hangup] context was not causing problems.
In [all-hangup] from /etc/asterisk/all/extensions.conf ...

We tested:

[all-hangup]
exten => s,1,Set(CDR(hangupside)=${HANGUPSIDE})
exten => s,2,Set(CDR(hangupcause)=${HANGUPCAUSE})
exten => s,3,Set(CDR(codec1)=${CHANNEL(audioreadformat)})
exten => s,4,Set(CDR(codec2)=${CHANNEL(audiowriteformat)})
exten => s,5,Set(CDR(recvip)=${CHANNEL(recvip)})
exten => s,6,Set(CDR(fromchan)=${CHANNEL(from)})
exten => s,7,Set(CDR(uri)=${CHANNEL(uri)})
exten => s,8,Set(CDR(useragent)=${CHANNEL(useragent)})
exten => s,9,Set(CDR(llp)=${CHANNEL(rtpqos,audio,local_maxrxploss)})
exten => s,10,Set(CDR(rlp)=${CHANNEL(rtpqos,audio,remote_maxrxploss)})
exten => s,11,Set(CDR(ljitt)=${CHANNEL(rtpqos,audio,local_maxjitter)})
exten => s,12,Set(CDR(rjitt)=${CHANNEL(rtpqos,audio,remote_maxjitter)})
exten => s,13,GotoIf($["${FAXID}xxx" != "xxx"]?all-faxnotify,s,1:$[${PRIORITY} + 1])
exten => s,14,ResetCDR(w)
exten => s,15,NoCDR()
exten => s,16,Hangup()

to -->

[all-hangup]
exten => s,1,Hangup()

And we still suffer from the Autodestruct problems.
By: David Brillert (aragon) 2012-10-17 09:16:31.988-0500

In order to be certain our [all-hangup] context was not causing problems.
In [all-hangup] from /etc/asterisk/all/extensions.conf ...

We tested:

{noformat}
[all-hangup]
exten => s,1,Set(CDR(hangupside)=${HANGUPSIDE})
exten => s,2,Set(CDR(hangupcause)=${HANGUPCAUSE})
exten => s,3,Set(CDR(codec1)=${CHANNEL(audioreadformat)})
exten => s,4,Set(CDR(codec2)=${CHANNEL(audiowriteformat)})
exten => s,5,Set(CDR(recvip)=${CHANNEL(recvip)})
exten => s,6,Set(CDR(fromchan)=${CHANNEL(from)})
exten => s,7,Set(CDR(uri)=${CHANNEL(uri)})
exten => s,8,Set(CDR(useragent)=${CHANNEL(useragent)})
exten => s,9,Set(CDR(llp)=${CHANNEL(rtpqos,audio,local_maxrxploss)})
exten => s,10,Set(CDR(rlp)=${CHANNEL(rtpqos,audio,remote_maxrxploss)})
exten => s,11,Set(CDR(ljitt)=${CHANNEL(rtpqos,audio,local_maxjitter)})
exten => s,12,Set(CDR(rjitt)=${CHANNEL(rtpqos,audio,remote_maxjitter)})
exten => s,13,GotoIf($["${FAXID}xxx" != "xxx"]?all-faxnotify,s,1:$[${PRIORITY} + 1])
exten => s,14,ResetCDR(w)
exten => s,15,NoCDR()
exten => s,16,Hangup()
{noformat}
to -->
{noformat}
[all-hangup]
exten => s,1,Hangup()
{noformat}
And we still suffer from the Autodestruct problems.
By: David Brillert (aragon) 2012-11-07 11:18:09.482-0600

I was able to reproduce the problem reliably by dialing the phrase management feature code to record an IVR prompt.
If the prompt was password protected and I didn't enter the password before hanging up I got an autodestruct right away and every time.
Basically what was happening is that Asterisk would hangup and then the all-hangup script did not catch the hangup. Trying to execute the hangup again caused an infinite loop.

We modified the all-hangup script to catch the Asterisk hangup so we could avoid the infinite loop.
So my first theory was right about trying to dial out on an outgoing line that was PIN protected since the users did not always enter a PIN before hanging up.

Not sure if this is still a bug in Asterisk or not but our workaround is successful.
A bug marshal can make the judgement call on whether or not to close this out. But if I don't hear back in a week or so I will close this out myself.

By: David Brillert (aragon) 2012-11-07 11:51:18.523-0600

Actually, since a workaround now exists I am just going to close this out now.
By: David Brillert (aragon) 2012-11-07 11:55:53.658-0600

Also if I had to guess I think the patch being written at https://reviewboard.asterisk.org/r/2186/ might actually address the root cause.
And possibly some stuff in https://reviewboard.asterisk.org/r/2135/