
Summary: ASTERISK-25025: Periodic crashes (in ast_channel_snapshot_create at stasis_channels.c) with Certified Asterisk 13.
Reporter: Chet Stevens (cwstevens)
Labels:
Date Opened: 2015-04-28 12:28:34
Date Closed: 2015-05-12 14:41:04
Priority: Major
Regression?
Status: Closed/Complete
Components: Channels/General
Versions: 13.1.0
Frequency of Occurrence: Occasional
Related Issues: is a clone of ASTERISK-25080 res_pjsip_refer: Refer code invoked with unexpected NULL channel on session
Environment: Asterisk 13.1-cert2; Ubuntu Server 14.04.1 LTS (GNU/Linux 3.13.0-35-generic x86_64); HP ProLiant DL380p Gen8 with 16 GB memory; Wildcard AEX2400: wctdm24xxp+; Wildcard TE420 (5th Gen): wct4xxp+
Attachments:
( 0) 20150506_backtrace.txt
( 1) 20150506_debug.txt
( 2) 2-backtrace.txt
( 3) 2-core_dump_debug.zip
( 4) 2-refs.zip
( 5) atlantic_street_backtrace_2.txt
( 6) atlantic_street_backtrace.txt
( 7) atlantic_street_verbose_2.txt
( 8) backtrace.txt
( 9) facil_and_audit_backtrace.txt
(10) facil_and_audit_debug.zip
(11) gdb_bt_full.txt
(12) gdb_bt.txt
(13) gdb_thread_apply_all_bt.txt
(14) user_support_backtrace.txt
(15) user_support_debug.zip
Description: We have been experiencing periodic crashes with Certified Asterisk 13 that we did not see before upgrading from Certified Asterisk 11. A backtrace will be attached to this issue. Any additional information can be provided. Thank you.

Comments:

By: Chet Stevens (cwstevens) 2015-04-28 12:30:20.424-0500

Backtrace files.

By: Rusty Newton (rnewton) 2015-04-29 08:13:54.394-0500

Chet,

It would be helpful if you could also provide the following:

* [Asterisk debug log|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information] showing the state of the system for the last few minutes running up to the crash.
* [REF_DEBUG log|https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging] since we see the bad magic number warnings

Thanks!

By: Chet Stevens (cwstevens) 2015-04-29 15:38:54.200-0500

Thank you, Rusty. I have turned up the debugging and will compile with REF_DEBUG tonight. It may take a day or two for the next backtrace.

By: Chet Stevens (cwstevens) 2015-04-30 11:28:43.309-0500

These are new backtrace, debug, and ref files from another of our systems that experienced an Asterisk crash this morning. I hope this is useful. The system hardware is identical with the exception of the Wildcard boards, and the OS and Asterisk version (Asterisk 13.1-cert2) are identical. This one, however, produced a segfault:

{noformat}
Apr 30 08:06:41 0106-Area-Service-Center-2 kernel: [5867168.133543] asterisk[40450]: segfault at 0 ip 00007fa3d9efb67a sp 00007fa350aa7b58 error 4 in libc-2.19.so[7fa3d9e72000+1bb000]
{noformat}

I am including the refs and debug for the hour leading up to the segfault. I don't know if backtrace.txt is useful or not. gdb is aborting with the core file which can be seen in the backtrace file.

We have recently upgraded all of our systems (~100) from Asterisk 11.6-certX to Asterisk 13.1-cert2 and see, across all the systems, Asterisk stop about 1-3 times a day. I have recompiled Asterisk for backtraces on 93 of the systems, so we should be able to add additional backtraces as needed.

Thank you.
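For readers decoding kernel lines like the one above, here is a minimal sketch in C. It is not part of Asterisk: `struct segv_info`, `parse_segv_line`, and `access_kind` are hypothetical names introduced only for illustration. It assumes the x86 page-fault error-code layout, where bit 1 (0x2) is set for a write and clear for a read, and it computes the offset of the faulting instruction inside the mapped object as `ip - base`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields pulled out of a kernel "segfault at ..." line (hypothetical struct). */
struct segv_info {
    unsigned long long addr;    /* faulting address */
    unsigned long long ip;      /* instruction pointer at the fault */
    unsigned long long err;     /* page-fault error code (printed in hex) */
    unsigned long long base;    /* load address of the mapped object */
    unsigned long long offset;  /* ip - base: offset inside the object */
    char object[64];            /* e.g. "libc-2.19.so" or "asterisk" */
};

/* Returns 0 on success, -1 if the line does not have the expected shape. */
static int parse_segv_line(const char *line, struct segv_info *out)
{
    unsigned long long sp, size;
    const char *p;
    int n;

    assert(line != NULL && out != NULL);

    p = strstr(line, "segfault at ");
    if (p == NULL) {
        return -1;
    }

    n = sscanf(p, "segfault at %llx ip %llx sp %llx error %llx in %63[^[][%llx+%llx]",
               &out->addr, &out->ip, &sp, &out->err, out->object, &out->base, &size);
    if (n != 7) {
        return -1;
    }

    out->offset = out->ip - out->base;
    return 0;
}

/* Assumes the x86 error-code layout: bit 1 set means the access was a write. */
static const char *access_kind(unsigned long long err)
{
    return (err & 0x2) ? "write" : "read";
}
```

Applied to the Apr 30 line, this reports a read of address 0 at offset 0x8967a inside libc-2.19.so, which is the kind of offset a symbol lookup tool can resolve when debug symbols for that library are installed.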

By: Richard Mudgett (rmudgett) 2015-04-30 12:27:20.688-0500

[^2-backtrace.txt] -  The file is useless as there is nothing in it.  gdb did not abort on the core file.  It said that asterisk was terminated because of a seg fault.  You then told gdb to quit without telling gdb to generate the backtrace.

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Chet Stevens (cwstevens) 2015-04-30 12:38:41.308-0500

Thank you, Richard. I am following the directions per the page you referenced. gdb stops almost immediately with:

{noformat}
telecom@0106-Area-Service-Center-2:~/coredump$ gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c core > backtrace.txt
62      ../sysdeps/x86_64/multiarch/../strcspn.S: No such file or directory.
Aborted (core dumped)
telecom@0106-Area-Service-Center-2:~/coredump$
{noformat}

We should be able to provide a new backtrace soon though.

By: Mark Michelson (mmichelson) 2015-04-30 14:59:54.642-0500

Thanks for all the excellent debugging information. I believe I have zeroed in on the problem here. The crash likely happened on a transfer operation. Asterisk was attempting to log information about the dial attempt between two of the parties, but one of the parties had already been hung up. The problem appears to be that the bookkeeping code is not making proper use of channel reference counts. The fix should be pretty easy. Once I have a fix prepared, I'll put it up for review and link the reviews on this issue.
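The failure mode described here (bookkeeping that keeps a channel pointer without holding a reference) can be illustrated with a minimal, self-contained sketch. This is not the merged fix and not Asterisk's astobj2 API: `struct chan`, `chan_ref`, `chan_unref`, and `struct dial_record` are hypothetical stand-ins, and the counter is deliberately not thread-safe to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Count of allocated-but-not-freed channels, so the lifetime is observable. */
static int live_chans = 0;

/* Hypothetical reference-counted channel (illustration only). */
struct chan {
    int refs;
};

static struct chan *chan_alloc(void)
{
    struct chan *c = malloc(sizeof(*c));
    if (c == NULL) {
        return NULL;
    }
    c->refs = 1;            /* the creator owns the first reference */
    live_chans++;
    return c;
}

static struct chan *chan_ref(struct chan *c)
{
    assert(c != NULL && c->refs > 0);
    c->refs++;
    return c;
}

static void chan_unref(struct chan *c)
{
    assert(c != NULL && c->refs > 0);
    if (--c->refs == 0) {   /* last holder frees the object */
        live_chans--;
        free(c);
    }
}

/* Bookkeeping for a dial attempt: it holds its OWN references. */
struct dial_record {
    struct chan *caller;
    struct chan *peer;
};

static void dial_record_init(struct dial_record *r, struct chan *caller, struct chan *peer)
{
    r->caller = chan_ref(caller);
    r->peer = chan_ref(peer);
}

static void dial_record_cleanup(struct dial_record *r)
{
    chan_unref(r->caller);
    chan_unref(r->peer);
    r->caller = NULL;
    r->peer = NULL;
}
```

Because the record takes a reference on each party, one party hanging up (dropping its own reference) does not free the object while the record can still read from it. Storing the bare pointer without `chan_ref` would leave the record pointing at freed memory in the same sequence.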

By: Mark Michelson (mmichelson) 2015-04-30 15:44:45.964-0500

Reviews are up at the following locations:

https://gerrit.asterisk.org/#/c/321/
https://gerrit.asterisk.org/#/c/322/
https://gerrit.asterisk.org/#/c/323/


By: Mark Michelson (mmichelson) 2015-05-01 08:39:53.032-0500

The fix has been merged into Asterisk now, so I am closing this issue.

By: Chet Stevens (cwstevens) 2015-05-06 15:08:02.829-0500

We compiled and installed the latest snapshot that included the "Prevent potential crash on blond transfer." and "stasis: Fix dial masquerade datastore lifetime" changes to stasis_channels.c but experienced another segfault of the Asterisk process. The system ran for about 20 hours after installation of the snapshot. I will be attaching the backtrace and debug logs for the hour leading up to the crash. I have refs as well, but the text file is 2.5 GB. Would a section of it be useful? For reference, the segfault occurred at:

{noformat}
May  6 10:36:46 0190-Telecommunication-Services kernel: [3554274.649448] asterisk[4346]: segfault at 0 ip 00007f42152a3aea sp 00007f412ab92b98 error 4 in libc-2.19.so[7f421521b000+1bb000]
{noformat}

By: Chet Stevens (cwstevens) 2015-05-06 15:09:44.915-0500

Here are the backtrace and debug log for the last hour prior to the segfault. Refs.txt is available but is 2.5 GB.

By: Chet Stevens (cwstevens) 2015-05-06 15:50:28.839-0500

We experienced an Asterisk segfault at another facility this morning as well. This is another location that was recently upgraded from Asterisk 11.6-cert9 to Asterisk 13.1-cert2 and is now seeing segfaults and "bad magic number". This system is not using the latest snapshot. I am attaching a backtrace and debug log of the hour leading up to the segfault. For reference it occurred at:

{noformat}
May  6 10:15:23 0099-User-Support kernel: [17660295.117673] asterisk[27318]: segfault at 954 ip 00000000004ce7b4 sp 00007fcadee0ebc0 error 6 in asterisk[400000+2c1000]
{noformat}

By: Chet Stevens (cwstevens) 2015-05-06 15:59:12.933-0500

The backtrace and debug of one hour prior to the segfault are attached. For reference the time of the segfault is:

{noformat}
May  6 10:15:23 0099-User-Support kernel: [17660295.117673] asterisk[27318]: segfault at 954 ip 00000000004ce7b4 sp 00007fcadee0ebc0 error 6 in asterisk[400000+2c1000]
{noformat}

By: Richard Mudgett (rmudgett) 2015-05-06 16:35:45.241-0500

[^user_support_backtrace.txt] - This appears to be an instance of ASTERISK-21893
[^20150506_backtrace.txt] - This appears to be an instance of ASTERISK-24869 / ASTERISK-24884

By: Chet Stevens (cwstevens) 2015-05-06 16:44:05.246-0500

Thank you, Richard. I don't know if it is important, but I notice in ASTERISK-24884 that the issue is associated with Chan_SIP. The system that produced https://issues.asterisk.org/jira/secure/attachment/52362/52362_20150506_backtrace.txt is running PJSIP. https://issues.asterisk.org/jira/secure/attachment/52364/52364_user_support_backtrace.txt is from a system running Chan_SIP. Please let me know if I can provide any more information.

By: Richard Mudgett (rmudgett) 2015-05-06 16:56:58.660-0500

The problem shown by [^20150506_backtrace.txt] is that the channel application pointer is NULL.  This is not related to the channel technology but a race to set the current application on the channel. (ASTERISK-24869 and ASTERISK-24884 are actually duplicates)

The problem shown by [^user_support_backtrace.txt] is crashing in chan_dahdi/sig_pri in the exact same place as the backtrace in ASTERISK-21893.
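As a rough illustration of the first problem (an application pointer that can be NULL at the moment it is read), the sketch below copies a possibly-NULL string into a snapshot without dereferencing NULL. It is an assumption-laden example, not the actual change: `struct chan_state`, `struct snapshot`, `str_or_empty`, and `snapshot_fill` are hypothetical names, and the sketch only guards the read. It does not address the race that leaves the pointer unset, which would need proper locking in real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Substitute an empty string when the pointer is NULL. */
static const char *str_or_empty(const char *s)
{
    return (s != NULL) ? s : "";
}

/* Hypothetical live channel state: appl may be NULL between applications. */
struct chan_state {
    const char *appl;
};

/* Hypothetical immutable snapshot with its own copy of the name. */
struct snapshot {
    char appl[32];
};

static void snapshot_fill(struct snapshot *snap, const struct chan_state *st)
{
    /* Read the pointer once into a local so the NULL check and the copy
     * operate on the same value. */
    const char *appl;

    assert(snap != NULL && st != NULL);
    appl = st->appl;

    /* snprintf always NUL-terminates and truncates if the name is too long. */
    snprintf(snap->appl, sizeof(snap->appl), "%s", str_or_empty(appl));
}
```

With a NULL `appl` the snapshot simply records an empty application name instead of passing NULL to a string function.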

By: Chet Stevens (cwstevens) 2015-05-06 17:16:28.273-0500

Another system that was recently upgraded from Asterisk 11.6-cert9 to Asterisk 13.1-cert2 experienced a general protection fault this morning. Attaching a backtrace and debug log for the hour prior to the crash. For reference:

{noformat}
May  6 07:09:21 0651-Facilities-Audit-Campus kernel: [18626096.632678] traps: asterisk[58551] general protection ip:7f410fccd56a sp:7f40857ee548 error:0 in libc-2.19.so[7f410fc44000+1bb000]
{noformat}

By: Chet Stevens (cwstevens) 2015-05-06 17:19:42.790-0500

Attached the debug log for the ~hour prior to the general protection crash.

{noformat}
May  6 07:09:21 0651-Facilities-Audit-Campus kernel: [18626096.632678] traps: asterisk[58551] general protection ip:7f410fccd56a sp:7f40857ee548 error:0 in libc-2.19.so[7f410fc44000+1bb000]
{noformat}

By: Richard Mudgett (rmudgett) 2015-05-06 17:27:04.456-0500

[^facil_and_audit_backtrace.txt] - is another instance of this issue (ASTERISK-25025)

By: Chet Stevens (cwstevens) 2015-05-06 17:28:18.247-0500

Replaced facil_and_audit_debug.zip. Original one was for wrong date.

By: Chet Stevens (cwstevens) 2015-05-06 17:33:17.243-0500

We have another system that was recently installed as Asterisk 13.1-cert2 with an Asterisk segfault about 25 minutes ago. The backtrace will be attached. For reference:

{noformat}
May  6 15:05:07 0652-Atlantic-Street-Complex kernel: [2341043.793097] asterisk[20359]: segfault at 35 ip 00007f5bfc62467a sp 00007f5a538c65b8 error 4 in libc-2.19.so[7f5bfc59b000+1bb000]
{noformat}

By: Richard Mudgett (rmudgett) 2015-05-06 17:44:26.889-0500

[^atlantic_street_backtrace.txt] - gdb is complaining of a corrupt stack.  Haven't seen this before.

By: Chet Stevens (cwstevens) 2015-05-12 14:04:35.221-0500

We experienced another Asterisk segfault on a system running the latest snapshot of certified/13.1 (which included "bridge.c: NULL app causes crash during attended transfer"). The backtrace will be attached. I apologize in advance: the debug level was not set, so the log is verbose only. It will also be attached. For reference:

{noformat}
May 12 11:36:01 0652-Atlantic-Street-Complex kernel: [174566.656461] asterisk[35994]: segfault at 9a0 ip 00000000004cbf79 sp 00007fc5bc826b30 error 4 in asterisk[400000+2ab000]
{noformat}

By: Malcolm Davenport (mdavenport) 2015-05-12 14:41:04.852-0500

Original issue is closed.  Latest backtraces are a different issue.  Cloning into ASTERISK-25080.