[Home]

Summary:ASTERISK-20335: Crash in ast_cel_report_event
Reporter:David Brillert (aragon)Labels:
Date Opened:2012-08-29 09:05:31Date Closed:2012-11-07 11:44:44.000-0600
Priority:CriticalRegression?Yes
Status:Closed/CompleteComponents:Channels/chan_local
Versions:1.8.16.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Centos server 5.8x64, 8 core CPU, 8GB RAMAttachments:( 0) AST18-core-verbose.txt
( 1) asterisk.txt
( 2) ASTERISK-20335.patch
( 3) autodestruct.txt
( 4) backtrace-20140724.txt
( 5) core_show_channels.txt
( 6) default-dial-cav-joh-002-000918.txt
( 7) gdbsegfault1.txt
( 8) gdbsegfault2.txt
( 9) gdbsegfault3.txt
(10) optimized_backtrace.txt
(11) SIP_show_channels.txt
(12) verbose_CLI_sip_set_debug_on.txt
Description:On a pretty busy system we get deadlocks and crashes daily since installing Asterisk 1.8.16rc1
Upgraded from 1.8.12 because we were having problems with leaking bye's fixed in ASTERISK-19455
We were able to collect verbose CLI, core show channels, sip show channels, and Asterisk CLI with sip set debug on.
Also back traced a core dump file but this is an optimized build since we could not run non-optimized in this environment.
Including the back trace anyway since it might help diagnose the problem.
Comments:By: Matt Jordan (mjordan) 2012-08-29 09:16:23.871-0500

So I can see the crash in ast_cel_report_event () in the backtrace, but I don't see where a deadlock would be occurring.  We'd need to get a core show locks or - at the very least - a gdb backtrace from a system that is currently experiencing a deadlock symptom.

By: David Brillert (aragon) 2012-08-29 09:21:28.508-0500

gdb asterisk --pid=`pidof asterisk`
And doing "thread apply all bt" would give the needed information?


By: David Brillert (aragon) 2012-08-29 09:24:53.464-0500

Also in the verbose CLI sip set debug on.txt file there are lots of these warnings, are these relevant?
WARNING[2799] chan_sip.c: Autodestruct on dialog '278b4fd635749fac45447ce8162d1070@41.221.0.38' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms

By: Matt Jordan (mjordan) 2012-08-29 09:27:01.579-0500

Lets take this one problem at a time.  Attempting to track multiple issues on a single JIRA issue never ends well.

This issue we'll track the crash occurring in ast_cel_report_event.

If you think you're having a deadlock, then you'll need to attempt to gather the information as per the wiki guidelines:

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock

Since we're tracking your crash on this issue, you should file a separate issue for your deadlock, once you have the necessary information.

By: Matt Jordan (mjordan) 2012-08-29 09:30:03.959-0500

For your crash: please try to run with the DONT_OPTIMIZE compiler flag.  You don't need to enable DEBUG_THREADS for the crash issue, but it would be helpful to get some additional symbol information.

By: Matt Jordan (mjordan) 2012-08-30 09:15:09.741-0500

It appears as if your deadlock is occurring in cdr_mysql.  I would recommend filing a separate JIRA issue for that problem.

That being said, cdr_mysql is not a core supported module.  As such, response times for that problem will reflect that.  You can get more information about module support states here:

https://wiki.asterisk.org/wiki/display/AST/Asterisk+Module+Support+States

By: Matt Jordan (mjordan) 2012-08-30 09:15:52.246-0500

If you can, please attach a backtrace for your crash with the DONT_OPTIMIZE flag enabled.  If you can, BETTER_BACKTRACES would also be helpful.  Thanks!

By: David Brillert (aragon) 2012-08-30 09:42:08.799-0500

asterisk.txt log attached ... gdb output from running asterisk that have lock ...

By: David Brillert (aragon) 2012-08-30 10:40:32.920-0500

rebuilding * with DONT_OPTIMIZE and BETTER_BACKTRACES and will test on server.

By: David Brillert (aragon) 2012-08-30 20:29:33.925-0500

I opened a separate ticket ASTERISK-20347 to report the deadlock.

By: David Brillert (aragon) 2012-08-31 07:48:00.847-0500

Hi Matt,

This is an 8 core 64 bit machine with 8 GB RAM.
And even so the system was so bogged down by the non-optimized build I installed last night that no phones could even register.  The system was completely unusable.

I hope there is enough data already provided to get a handle on this issue and get it fixed because I don't know what else to do...

Having said that can you offer any advice on disabling the cel module since we do not use cel in this system for the time being?

By: David Brillert (aragon) 2012-08-31 08:33:20.245-0500

I went through all of the Asterisk 1.8.16rc1 and svn change logs and recent changes were made to cel.
Therefore I wonder if this is a regression...

2012-07-18 19:12 +0000 [r370183-370205]  Kevin P. Fleming <kpfleming@digium.com>

* main/cel.c: Resolve severe memory leak in CEL logging modules. A
 customer reported a significant memory leak using Asterisk 1.8.
 They have tracked it down to
 ast_cel_fabricate_channel_from_event() in main/cel.c, which is
 called by both in-tree CEL logging modules (cel_custom.c and
 cel_sqlite3_custom.c) for each and every CEL event that they log.
 The cause was an incorrect assumption about how data attached to
 an ast_channel would be handled when the channel is destroyed;
 the data is now stored in a datastore attached to the channel,
 which is destroyed along with the channel at the proper time.
 (closes issue AST-916) Review:
 https://reviewboard.asterisk.org/r/2053/

By: Matt Jordan (mjordan) 2012-08-31 09:29:28.485-0500

Yes, I would suspect that is the cause, although its a little confusing as to how its happening in your scenario.  If you can't provide an optimized backtrace, can you provide the portion of the dialplan that the call is in when the crash occurs?  It appears as if its in macro default-dial-cav-joh-002-000918, called from an AGI.  That information should help us reproduce the problem.

By: David Brillert (aragon) 2012-08-31 11:12:35.765-0500

attaching to ticket

By: David Brillert (aragon) 2012-08-31 13:23:59.330-0500

Again today,first peers become unreachable then hundreds of scrolling messages like:
[2012-08-31 12:14:34] WARNING[18284] channel.c: Exceptionally long voice queue length queuing to Local/412@default-agent-9d84;1
[2012-08-31 12:14:34] WARNING[4564] chan_sip.c: Autodestruct on dialog '87755c9b8645ad4341d95bf2c323121d@192.168.1.141' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms

I think Asterisk is hitting maximum call limit

By: Joel Vandal (joel_vandal) 2012-09-06 09:44:29.587-0500

We disabled CEL on /etc/asterisk/cel.conf and no crash (coredump) on the server since last 5 days.

Each day, from 8-17h, we answer about 15000 calls on multiple Queue.

The original cel.conf file is :

{code:title=cel.conf}
[general]
enable          =  yes
apps            =  dial,queue,park
events          =  ALL

[manager]
enabled         =  yes
{code}


By: Matt Jordan (mjordan) 2012-09-06 09:49:53.787-0500

Were you running 1.8.16.0-rc1 as well?  If so, can you provide a backtrace with BETTER_BACKTRACES and DONT_OPTIMIZE enabled?

By: Joel Vandal (joel_vandal) 2012-09-06 09:59:54.449-0500

Sorry Matt, but this is the same server that aragon reported. So we (aragon and me) do not have others backtrace.


By: Mark Michelson (mmichelson) 2012-09-06 12:55:12.744-0500

I've been working on this the past day, and I've unfortunately not made any good progress with it.

I've run thousands of calls through Asterisk 1.8.16.0-rc1 with CEL enabled and I have not managed to get a crash yet. I have tried dialing from within and from outside of a macro. I've tried dialing extensions that just play back audio, and I've tried dialing extensions that dial out to SIP endpoints. I have issued reload commands on the CLI at random times and I have logged in and out over AMI to try to elicit some sort of change.

I have tried running with both with and without valgrind enabled as well. Valgrind reports no invalid memory usage during my tests.

I've also looked through the CEL code, including the code as it exists in 1.8.16.0-rc1 as well as all changes that were made between 1.8.12.0 and 1.8.16.0-rc1. I'm not finding anything that seems like it would have caused a crash in the place where your backtrace indicates.

When you ran an unoptimized build and it was unusable, did you also have DEBUG_THREADS enabled? If you did, then disabling that might make the system capable of handling your load. Unfortunately, I have a feeling that you're not going to be re-enabling CEL any time soon since it has caused issues for you. If you have any other backtraces handy, uploading them might help a bit, because it may be that while the crashes were happening in similar places in the code, the circumstances leading to the crash might be different.

In an attempt to replicate your environment a bit more, I think my next move will be to set up a CentOS 5.8 VM and see if I can reproduce the problem there.

Do you happen to know circumstances behind the crash? Did it happen when a reload was issued? I can tell that the crash is occurring while calling Dial() from within your macro, but nothing seems out of the ordinary in the code path leading to the crash, nor does there seem to be anything out of the ordinary in any of the CEL code, as far as I can tell.

By: David Brillert (aragon) 2012-09-06 17:00:21.875-0500

The only thing I find unusual is that there are a lot of ACK's hanging around sip show channels which you can see in the previous attachments (sip show channels)
There are also a lot of zombies hanging around; I think because of the ACK's (I hate zombies).  I think this leads to the locks and crashes because the call limit is reached.  If I had to guess it is because something is waiting on cel to write data to MySQL (pure speculation).
We had DEBUG_THREADS enabled to also capture the core show locks info.
You are correct when you assume it will be some time before we attempt to enable cel any time soon.
But we are willing to test a patch if you can figure things out.
We also haven't had any locks since disabling cel.

By: Mark Michelson (mmichelson) 2012-09-06 17:34:46.834-0500

Two things

1) Looking at your logs, I see messages like:

{noformat}
[2012-08-29 12:19:44] WARNING[2799] chan_sip.c: Autodestruct on dialog '36142e6e-e7e7e10b-18ec8044@192.168.2.179' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms
{noformat}

This is indicative of a channel reference leak, and unfortunately it's not easy to pinpoint where this might be happening. One potential cause of this is that an AGI may not be exiting properly, therefore holding onto a channel forever. There are other potential corner cases for this sort of thing as well. I recently opened ASTERISK-20375, where I have figured out an odd case using transfers and local channels that results in a channel reference leak. While the steps to reproduce are really bizarre, I imagine that the same reference leak could be triggered through more conventional means as well.

2) Once I started bringing transfers into the mix, I was able to get valgrind to start spewing errors in {{ast_cel_report_event()}} when I attempt to do an attended transfer. I suspect that this could be the same root cause of the crash you are seeing. I'm going to look into it a bit more to try to find the source of the problem. Basically, it appears that a channel being passed to {{ast_cel_report_event()}} has already been freed. Thus attempting to access that channel's data is causing invalid read errors in valgrind.

By: Mark Michelson (mmichelson) 2012-09-10 11:51:59.898-0500

I have uploaded ASTERISK-20335.patch

This corresponds to a review request I have put on review board at https://reviewboard.asterisk.org/r/2103

This fixes a misuse of channel data during a DTMF-triggered attended transfer. I do not know that this is the same problem that you are experiencing, but the invalid reads that valgrind reported to me were happening for me in {{ast_cel_report_event}}. If you are willing to try it, that would be great. If you do still get a crash with this patch applied, please upload a backtrace from the new crash. Thank you!

By: David Brillert (aragon) 2012-09-10 14:15:26.613-0500

Hi Mark,

I have patched two installations for testing.
The first patched system is local in which I was easily able to reproduce the Autodestruct warning (attached autodestruct.txt).
(I don't know the significance of the Autodestruct warning so I am reporting the result.
I have no crash or significant uptime to report yet.)
Steps to reproduce Autodestruct warning:
1. Answer incoming call
2. Extension 300 presses *2 to start transfer
3. 300 Enter digits = 220
4. 300 Hangup to complete transfer before 220 answers
5. After 2 rings at 220 unanswered Autodestruct warning appears in CLI

By: David Brillert (aragon) 2012-09-10 14:23:01.642-0500

And I often see the Autodestruct warning when adding or removing an agent from a queue.

[2012-09-07 09:44:18] NOTICE[20092] app_queue.c: Added interface 'Local/409@debcomainbtn-agent/n' to queue 'debcomainbtn-scrubbedqueue1'
[2012-09-07 09:44:18] NOTICE[20092] app_queue.c: Added interface 'Local/409@debcomainbtn-agent/n' to queue 'debcomainbtn-scrubbedqueue2'
[2012-09-07 09:44:31] WARNING[4564] chan_sip.c: Autodestruct on dialog 'dca2af964963e4914e73aebf9d0dda7f@10.0.50.181' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms


By: Mark Michelson (mmichelson) 2012-09-10 15:24:57.774-0500

The autodestruct warning may be the same code path that I used to produce the channel ref leak on issue ASTERISK-20375. It also uses a DTMF attended transfer and local channel combination.

I think it's just a coincidence that you're seeing that warning appear when adding members to a queue because there are no channels are created or even touched during such an operation.

By: David Brillert (aragon) 2012-09-10 16:55:19.307-0500

Mark, should I be expecting another patch based on those warnings?
On another note I am going to try producing those warnings with some queue tests.  Like an overflowed queue call... IVR transfers (anything else I can think of that would cause a transfer)
BTW my sites users aren't doing any Asterisk transfers, I happened on the warnings testing something else.

By: David Brillert (aragon) 2012-09-10 17:16:07.244-0500

I manually ran every other transfer scenario I could think of but could not reproduce the Autodestruct warning outside of DTMF *2 transfer.

By: Mark Michelson (mmichelson) 2012-09-11 09:23:43.325-0500

{quote}
Mark, should I be expecting another patch based on those warnings?
{quote}

Not on this issue report. I'm trying to limit the scope of this report to the CEL crash. As far as I know, the two problems are not directly related. I opened a separate issue for the channel reference leak that I found, and while I cannot say for sure, I'd be willing to believe that you are seeing the same problem as me (both problems are based on DTMF *2 transfers). If I were you, I'd watch that issue (ASTERISK-20375) and try whatever patch gets posted there.

By: David Brillert (aragon) 2012-09-11 09:30:55.804-0500

Thanks for clearing that up, I am already watching 20375 so I'll watch for any patch.

By: Mark Michelson (mmichelson) 2012-09-11 10:02:44.632-0500

An update:

The patch on this issue, while not yet confirmed by you to fix the reported crash, definitely fixes a bad use of memory and the review board posting has been approved.

I am going to commit the patch since it needs to go in anyway. I am not going to close this issue since it is not confirmed that it fixes your reported crash.

By: David Brillert (aragon) 2012-09-13 16:58:24.725-0500

4 crashes today and all in CEL.
3 backtraces uploaded with NON_OPTIMIZED SVN build but some values still optimized out of trace.
I'm back to disabling CEL so I hope you can use these backtrace files...

By: Mark Michelson (mmichelson) 2012-09-17 08:43:04.223-0500

Yep, all the crashes you reported are happening the same way that your previous crashes occurred. My patch fixed an issue, but not yours unfortunately.

By: David Brillert (aragon) 2012-09-17 16:27:34.932-0500

Going through /var/log/asterisk/messages today and found a bunch of these with different linkedid's:
[2012-09-11 15:27:56] ERROR[28865] cel.c: Something weird happened, couldn't find linkedid 1347391474.43843
[2012-09-11 15:27:56] ERROR[28865] cel.c: Something weird happened, couldn't find linkedid 1347391474.43843
[2012-09-11 15:27:56] ERROR[26215] cel.c: Something weird happened, couldn't find linkedid 1347391474.43843
[2012-09-11 15:27:56] ERROR[26215] cel.c: Something weird happened, couldn't find linkedid 1347391474.43843

Using the CDR records I tracked the lifetime of this linked ID to: enter queue, connect, completeagent

I tracked down another cel error linkedid to an ACD dynamic login.
All the "weird" errors were related to ACD calls or logins.


By: Doug Reid (doug_ir) 2012-10-08 11:17:20.835-0500

Hi Guys

Just a thought, what codecs are being used when trying to re-create the problem? This problem seems to happen only when using the G729 Codec. There may be some relation to CEL but I think that is a result of the underlying bug. With this situation if CEL is removed as a module we experience the same problem with the Autodestruct warnings but there is no core dump we see the channels in a "BYE" state and eventually they will destruct but sometimes under high call volume they build up to fast and reach the maximum licenses in this case 180 concurrent calls. When the error starts we also notice a sometimes significant delay in local channel calls 10+ seconds before the call will terminate.

So to add my 2 cents worth I would say this occurs when using the G729 codec and the core dump is a result of the underlying bug when using CEL.

1/180 encoders/decoders of 180 licensed channels are currently in use

Licenses Found:
File: G729-XXXXXXXXXX.lic -- Key: G729-XXXXXXXXXXX -- Host-ID: cb:e8:4c:e6:50:07:49:40:1a:a3:a3:5e:2c:b1:4e:d1:79:25:9e:59 -- Channels: 60 (Expires: 2032-10-02) (OK)
File: G729-XXXXXXXXXX.lic -- Key: G729-XXXXXXXXXXX -- Host-ID: cb:e8:4c:e6:50:07:49:40:1a:a3:a3:5e:2c:b1:4e:d1:79:25:9e:59 -- Channels: 120 (Expires: 2032-09-28) (OK)

[2012-10-08 10:43:21] WARNING[25966] codec_g729a.c: out of G.729 decoder licenses
[2012-10-08 10:43:21] WARNING[25966] translate.c: g729tolin did not update samples 0
[2012-10-08 10:43:21] WARNING[25804] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[26327] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[25825] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[25736] translate.c: No translator path from alaw to unknown
[2012-10-08 10:43:21] WARNING[22435] chan_sip.c: Autodestruct on dialog '31794e800637dccc1d6756c01c91462d@172.17.1.2:5060' with owner SIP/voip-00004dfd in place (Method: BYE). Rescheduling destruction for 10000 ms




 

By: David Brillert (aragon) 2012-11-07 11:44:44.745-0600

Since we disabled the CEL module in Asterisk and there are no more crashes as a result, I can no longer assist with debugging this issue.
I'm closing this out as workaround available.

By: rsw686 (rsw686) 2014-07-28 18:12:49.415-0500

I'm seeing this same crash on Certified Asterisk 1.8.15-cert5 and have attached backtrace-20140724.txt. It occurs about once a week during high calling periods. We usually see around 10k calls per day with rates up to 1 call per second. CEL logging is enabled and we do use G729 for some calls offloaded to a Sangoma D100 card.