[Home]

Summary:ASTERISK-25426: Core dump in CDR handler
Reporter:Morten Tryfoss (mtryfoss)Labels:
Date Opened:2015-09-28 02:12:16Date Closed:2015-11-11 10:39:36.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:CDR/General
Versions:13.5.0 Frequency of
Occurrence
One Time
Related
Issues:
Environment:Centos 6.6, x86_64Attachments:( 0) backtrace.txt
Description:We have a custom developed queue solution. This also have a callback-module which tries to originate a two-way call between an agent on the callcenter and the customer. For this specific case the call was not answered by the agent before the callcenter closed, and the agent did not log out of the queue so the system continued to try to deliver the call through the night.

I think this led to a build up of cdr variables and at some point, it crashes.

Partial output from gdb:
{noformat}
#6919 0x000000000045c759 in __ao2_ref (user_data=0x7feb94050ed8, delta=-1) at astobj2.c:516
#6920 0x000000000045c7d4 in __ao2_cleanup (obj=0x7feb94050ed8) at astobj2.c:529
#6921 0x000000000049d10e in _dtor_cdr (v=0x7febd12d0af8) at cdr.c:2070
#6922 0x000000000049d4cc in handle_channel_cache_message (data=0x0, sub=0x1cf68e8, message=0x7feac4d950a8) at cdr.c:2070
#6923 0x00000000005d3cc0 in router_dispatch (data=0x1cf21f8, sub=0x1cf68e8, message=0x7feac4d950a8) at stasis_message_router.c:201
#6924 0x00000000005c3359 in subscription_invoke (sub=0x1cf68e8, message=0x7feac4d950a8) at stasis.c:433
#6925 0x00000000005c3eaf in dispatch_exec_async (local=0x7febd12d0c40) at stasis.c:695
#6926 0x00000000005de2e4 in ast_taskprocessor_execute (tps=0x1cf6608) at taskprocessor.c:766
#6927 0x00000000005dca7f in default_tps_processing_function (data=0x1cf6548) at taskprocessor.c:184
#6928 0x00000000005f20d4 in dummy_start (data=0x1cf6870) at utils.c:1237
#6929 0x000000385f2079d1 in start_thread () from /lib64/libpthread.so.0
#6930 0x000000385eee88fd in clone () from /lib64/libc.so.6
{noformat}
Comments:By: Asterisk Team (asteriskteam) 2015-09-28 02:12:17.667-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Morten Tryfoss (mtryfoss) 2015-09-28 02:14:34.681-0500

File too large for upload.

http://code.tryfoss.no/files/core.27275.gz

By: Rusty Newton (rnewton) 2015-10-01 06:42:19.833-0500

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that lead to the crash.
3. All configuration information necesary to reproduce the crash.

Thanks!

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace



By: Rusty Newton (rnewton) 2015-10-01 06:42:35.567-0500

We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Rusty Newton (rnewton) 2015-10-01 06:44:13.485-0500

Please attach the requested backtrace (following the instructions from wiki page linked) as well as the Asterisk debug log (or whatever logs you have available that correlate with the crash).

Attachments should be .txt for easy accessibility. The debug log can be the last 10000 lines before the crash.

By: Morten Tryfoss (mtryfoss) 2015-10-01 06:50:00.910-0500

Backtrace attached.

By: Rusty Newton (rnewton) 2015-10-01 14:25:27.177-0500

Thanks for the trace. Will you be able to provide a debug log as well or is that unavailable?

If you have to turn up logging and wait until it crashes again that will be alright too.

By: Morten Tryfoss (mtryfoss) 2015-10-02 03:12:14.589-0500

Sorry. I don't have any more detailed logging of what happened. It was an agi script doing several Dial attempts in a loop. I've tried to reproduce with some simple dialplan logic, but can't make it happen.

Also, I don't use the CDRs to anything so I simply turned it off on the production servers. It has not crashed since. I just wanted to report the bug if anyone else experience the same.

By: Matt Jordan (mjordan) 2015-10-06 10:54:46.281-0500

Hm. I wish we knew how you got it to do that; that was basically an infinitely created CDR (yikes).

Do you have any aspect of the AGI script that you could share?

By: Morten Tryfoss (mtryfoss) 2015-10-06 14:12:32.365-0500

I had a log of all executed AGI-commands, but unfortunately some cron job did as it was supposed to.. I'm trying to reproduce it again now (with asterisk debug as well).

But in general the script is executed on a two-way originate from asterisk. When the AGI manages to get answered by an agent, it will call the customer on the other leg.
The AGI is a big while(true) which dials local channels to different end points/agents (either mobile through a sip trunk or a local sip phone).

The agents had not logged out after work and the queue was not emptied as it should so this had been doing a little over 4000 attempts before it crashed. That would be a really large CDR, I guess!

After what i remember from the log, there was a little mix of hangup causes generated from the different phones.


By: Morten Tryfoss (mtryfoss) 2015-10-06 14:14:34.975-0500

One extra detail. We've still not migrated to pjsip. Don't know if that has anything to do with the case.

By: Morten Tryfoss (mtryfoss) 2015-10-07 02:19:06.826-0500

Managed to make asterisk crash again now. Answered the two-way call that has been ringing through the night.

One thing I noticed in the debug log for the call that has been running over about 10-11 hours:
[Oct  7 08:09:05] DEBUG[1625] cdr.c: Finalized CDR for Local/1_morten@ncomqueue-000623c6;2 - start 1444198139.266233 answer 1444198144.500025 end 1444198145.510915 dispo ANSWERED
[Oct  7 08:09:05] DEBUG[1625] cdr.c: Finalized CDR for Local/1_morten@ncomqueue-000623c6;2 - start 1444198145.515189 answer 1444198145.515189 end 1444198145.516975 dispo ANSWERED

While when asterisk doesn't crash:
[Oct  7 08:26:38] DEBUG[14737] cdr.c: Finalized CDR for Local/1_morten@ncomqueue-00000049;2 - start 1444199191.033973 answer 1444199194.872059 end 1444199198.514642 dispo ANSWERED
[Oct  7 08:26:38] DEBUG[14737] cdr.c: Finalized CDR for Local/1_morten@ncomqueue-00000049;2 - start 1444199198.517425 answer 1444199198.517426 end 1444199198.518279 dispo ANSWERED
[Oct  7 08:26:38] DEBUG[14737] cdr.c: Finalized CDR for Local/1_morten@ncomqueue-00000049;1 - start 1444199194.872478 answer 1444199194.872112 end 1444199198.518948 dispo ANSWERED

The CDR for the local channel ending with ";1" is not written.
This particular channel is the one involved directly in the two-way originate. The CDR data (level XX) is building up on the channel ending with ";2".
All other CDR's involved in the call seems to be written correctly based on a compare of the two logs.

I'm starting to think that this might have something to do with a long running originate instead? Are there any clean up routines running which messes up the CDR?

By: Rusty Newton (rnewton) 2015-11-06 07:51:49.734-0600

Were you able to get the AGI log now that you can make it crash?

By: Morten Tryfoss (mtryfoss) 2015-11-10 04:12:12.257-0600

Unfortunately, I did not keep that. We've disabled the CDR engine on our nodes now since it wasn't used for anything.

By: Rusty Newton (rnewton) 2015-11-11 10:39:36.862-0600

[~asanders] looked over the issue with me. We don't have enough data to investigate this any further at the current time. Especially since we can't get an AGI log along with debug or reproduce the issue.

If these issues occur for someone else in the future we'll open the issue again or create a new issue on the version where it occurs. Hopefully in that case we'll get further debug.

Thanks!