[Home]

Summary:ASTERISK-19245: Fresh install of 1.8.9.0-rc3 hangs during module load and pegs one processor at 100%
Reporter:Craig Arno (2craigarno)Labels:
Date Opened:2012-01-25 13:49:50.000-0600Date Closed:2012-01-30 23:52:37.000-0600
Priority:CriticalRegression?
Status:Closed/CompleteComponents:Functions/func_module
Versions:1.8.9.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:OpenSUSE 12.1 (kernel 3.1.0) x86_64, new install with C/C++ development tools ASUS Sabertooth 990FX with AMD FX-8120 8-core processor, 8GB RAMAttachments:( 0) backtrace.txt
( 1) modules.conf
( 2) threadtrace.txt
Description:*REPEATABLE*:

{{tar xzf asterisk-1.8.9.0-rc3.tar.gz}}
{{rm /etc/asterisk/*}}
{{rm /usr/lib64/asterisk/modules/*}}
{{./configure}}
{{./configure --disable-xmldoc}}    {color:green};_./configure says this option must be used.  YaST says lib is installed_{color}
{{make}}
{{make install}}
{{make samples}}
{{asterisk -vvvc}}
...
 {{== Registered application 'CallCompletionRequest'}}
 {{== Registered application 'CallCompletionCancel'}}
 {{== Parsing '/etc/asterisk/ccss.conf':   == Found}}
{{Asterisk Dynamic Loader Starting:}}
 {{== Parsing '/etc/asterisk/modules.conf':   == Found}}
[Jan 25 10:57:00] NOTICE[12325]: loader.c:1127 load_modules: 180 modules will be loaded.

At this point one of the processors will go to 100% utilization.  I left this running overnight and asterisk eventually crashes.  The only way to stop it before crashing is...

{{^Z}}
[1]+  Stopped                 asterisk -vvvc
{{pluto:/home/install/asterisk/asterisk-1.8.9.0-rc3 # kill -9 %1}}

So far I've invested 2 days into resolving this issue with Google, forums, reading documentation, and culminating in this bug report.

*+OFFER to developers+*:
If it will help you resolve what's going on, I'll be happy to build you an account and provide remote ssh (includes ssh -X) access for debug.  Or, less desirable, I'll give you temporary root access so we can figure out what is going on.  This is a _fast_ system running with 8-cores at 3.1GHz with the OS installed on an SSD (solid state drive).  Builds happen quickly.  Without a working phone system I can be reached at *[mailto:2craigarno@gmail.com]*

{color:blue}/var/log/asterisk/messages{color} only contains these two lines:
[Jan 25 10:57:00] NOTICE[12325] cdr.c: CDR simple logging enabled.
[Jan 25 10:57:00] NOTICE[12325] loader.c: 180 modules will be loaded.

Lower directories
{color:gray}_empty_{color}

{color:blue}/var/log/messages{color}:
{color:gray}_nothing new_{color}


Thank you,
Craig Arno
Comments:By: Richard Mudgett (rmudgett) 2012-01-25 14:11:53.546-0600

Thank you for your bug report. In order to move your issue forward, we require a backtrace[1] from the core file produced after the crash. Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

After enabling, reproduce the crash, and then execute the backtrace[1] instructions. When complete, attach that file to this issue report.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

A backtrace while the system is stuck would point to the code that seems to be in an infinite loop.

By: Craig Arno (2craigarno) 2012-01-25 15:06:18.566-0600

This is as far as I can get for the moment (Nice backtrace instructions BTW!):


{{# rm /etc/asterisk/* /usr/lib64/asterisk/modules/*}}
{{# make clean}}
{{# ./configure --disable-xmldoc}}
{{# make menuselect}}

[*] DONT_OPTIMIZE
[*] DEBUG_THREADS
[*] LOADABLE_MODULES
XXX BETTER_BACKTRACES

{{# make}}
{{# make install}}
{{# make samples}}
{{# gdb -v}}
{{GNU gdb (GDB) SUSE (7.3-41.1.2)}}
{{Copyright (C) 2011 Free Software Foundation, Inc.}}
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
{{This is free software: you are free to change and redistribute it.}}
{{There is NO WARRANTY, to the extent permitted by law.  Type "show copying"}}
{{and "show warranty" for details.}}
{{This GDB was configured as "x86_64-suse-linux".}}
{{For bug reporting instructions, please see:}}
<http://www.gnu.org/software/gdb/bugs/>.

{{# asterisk -vvvvvg -c}}
...
{{Asterisk Dynamic Loader Starting:}}
 {{== Parsing '/etc/asterisk/modules.conf':   == Found}}
[Jan 25 12:29:47] NOTICE[29333]: loader.c:1127 load_modules: 180 modules will be loaded.

{{2 # asterisk -rx "core show locks"}}

{{=======================================================================}}
{{=== Currently Held Locks ==============================================}}
{{=======================================================================}}
{{===}}
{{=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)}}
{{===}}
{{=======================================================================}}

{{2 # asterisk -rx "core show taskprocessors"}}

       {{+----- Processor -----+--- Processed ---+- In Queue -+- Max Depth -+}}
  {{core_event_dispatcher                   1            0            0}}
               {{pbx-core                   1            0            0}}
              {{CCSS core                   0            0            0}}
       {{+---------------------+-----------------+------------+-------------+}}
       {{3 taskprocessors}}

{{2 #}}

At this moment Core #6 is at 100%, then Core #8... almost as if it is waiting for all 180 modules to individually timeout in sequential order.

I haven't been able to produce a core file with any method I know to kill the task.  So I wait, possibly overnight for a failure which will produce a core file.

Back on in the cmd window where asterisk was launched, I see:

{{# asterisk -vvvvvg -c}}
   {{-- Remote UNIX connection}}
   {{-- Remote UNIX connection disconnected}}
   {{-- Remote UNIX connection}}
   {{-- Remote UNIX connection disconnected}}
   {{-- Remote UNIX connection}}
   {{-- Remote UNIX connection disconnected}}
...

Which I suppose is to be expected.
I'll attach further information if/when it becomes available.

By: Craig Arno (2craigarno) 2012-01-25 15:20:24.998-0600

Here is a thread trace from a running "hung" system.

By: Craig Arno (2craigarno) 2012-01-25 23:53:17.218-0600

This is the Backtrace output after Asterisk Crashed.  It took roughly 3 hours to crash.  I was not in front of the console when it happened.

Console output looks like this:

{{Asterisk Dynamic Loader Starting:}}
 {{== Parsing '/etc/asterisk/modules.conf':   == Found}}
[Jan 25 02:04:15] NOTICE[9229]: loader.c:1127 load_modules: 180 modules will be loaded.
{{Segmentation fault (core dumped)}}


By: Craig Arno (2craigarno) 2012-01-26 02:51:58.312-0600

*More info*:

* This bug is reproducable in a _VMware Workstation 8_ environment with _OpenSUSE 12.1 x86_64_ and _Asterisk 1.8.9.0-rc3_ configured with one FX-8120 processor.

* A build without "[ ] LOADABLE_MODULES" doesn't crash.  But, it also won't load sip.conf or extensions.conf.

By: Matt Jordan (mjordan) 2012-01-26 10:00:51.507-0600

Craig:

1. Does the crash occur every time you start Asterisk, or only some of the time?  When it does occur, does it only happen when loading/unloading Asterisk modules?
2. If Asterisk does successfully load all modules and you get the "Asterisk Ready." statement, does the CPU usage exhibit expected behavior?
3. Does this behavior occur when you are running Asterisk on a non-virtual machine?

It makes sense that this wouldn't occur without loadable modules, as the crash is occurring in dlclose.  Could you try running with a debug version of libdl/libc?  That might give us some additional information as to why its crashing in this call.

Thanks

Matt

By: Craig Arno (2craigarno) 2012-01-26 11:43:12.491-0600

Matt,

1. The "hang" occurs every time I start Asterisk.  Since the "crash" takes 3+ hours to reproduce, I haven't repeated this more than 3 times, however the crash does occur every time and always when loading modules.  Asterisk always crashes before it starts.

2. Asterisk never successfully loads all modules.

3. To date, all the behaviors described are from a "real" machine; ASUS Sabertooth 990FX, AMD FX-8120 processor, 8GB RAM, 9TB HD, etc.  My comment about VMware virtual machines is intended to help with your "debug" effort.

I'm explaining that late last night I tried a fresh install of OpenSUSE 12.1 x86_64 in a virtual machine, then installed asterisk-1.8.9.0-rc3 in this virtual environment, and received the same results/behavior as I do on the "real" machine.  I did this to make sure this bug can be reproduced with a short list of actions.

To be more direct, I can package this virtual machine up and you can download it for further investigation, after we set the root password to a value you expect.  This is far easier and expedient than shipping an actual computer.  Or, you can reproduce these steps on your development system without having to uninstall or replace your existing environment and tools.  From everything I've seen so far, I expect XEN could be used to the same end with the same results on a 64-bit computer.  Or you can install a new HD in an existing 64-bit computer and do a fresh install to reproduce this bug.

Thanks,
Craig

By: Craig Arno (2craigarno) 2012-01-27 19:30:24.785-0600

Here's something interesting (yes, I'm still banging away on this)... I thought I'd try loading individual modules in modules.conf and get this output (no crash yet):

*history*:
 618  2012-01-27 17:03:59 rm -r asterisk-1.8.9.0-rc3
 619  2012-01-27 17:06:29 tar xzf asterisk-1.8.9.0-rc3.tar.gz
 620  2012-01-27 17:07:03 cd asterisk-1.8.9.0-rc3/
 621  2012-01-27 17:07:11 history|grep configure
 622  2012-01-27 17:07:20 ./configure --disable-xmldoc
 623  2012-01-27 17:08:02 make
 624  2012-01-27 17:11:29 make install
 625  2012-01-27 17:11:40 asterisk -vvvc


*Console*:
...
Asterisk Dynamic Loader Starting:
 == Parsing '/etc/asterisk/modules.conf':   == Found
[Jan 27 17:11:40] NOTICE[8959]: loader.c:1127 load_modules: 25 modules will be loaded.
[Jan 27 17:13:04] WARNING[8959]: loader.c:396 load_dynamic_module: Error loading module 'res_features.so': /usr/lib64/asterisk/modules/res_features.so: cannot open shared object file: No such file or directory
[Jan 27 17:21:37] WARNING[8959]: loader.c:396 load_dynamic_module: Error loading module 'codec_ilbc.so': /usr/lib64/asterisk/modules/codec_ilbc.so: cannot open shared object file: No such file or directory

This is an improvement as before there was no output.

By: Richard Mudgett (rmudgett) 2012-01-27 19:52:20.396-0600

res_features.so has not existed since v1.4

I think codec_ilbc.so has only recently been restored because of a licensing issue with the ilbc source.

Could you have old *.so files installed?

By: Craig Arno (2craigarno) 2012-01-28 11:07:14.209-0600

Interesting idea, however this is a brand new OS installation and brand new Asterisk installation from the latest source release of the 1.8 branch of asterisk.  I also regularly perform an rm -r /usr/lib64/asterisk/modules, so I think old modules are unlikely, unless they came with the distribution and/or exist in some other location.  OpenSUSE 12.1 doesn't contain or offer an Asterisk package.

Does asterisk look for modules in a different location?

I'm now trying to get asterisk going in a virtual environment that mimics my retired server; OpenSUSE 10.3 with asterisk-1.8.9.0-rc3.  asterisk-1.8.9.0-rc3 fires right up in the 10.3 environment, connects to Broadvoice and my Sipura boxes, and I can call out and others can call in, but this solution puts the 10.3 machine behind a NAT which causes a collection of other problems, and I'd rather not run a VM just for Asterisk.  So I haven't found a solution yet.

By: Craig Arno (2craigarno) 2012-01-28 13:15:50.503-0600

I tried bringing Asterisk up in a debugger setting a breakpoint in the loader.c:1130 code since that appears to be where the problem exists, then single stepping through code until the debugger "hangs" waiting for the next "step".  This occurs in *stdtime/localtime.c:290* (I'm not sure this is a valid way to debug in a multi-threaded environment, but thought I'd give it a try).

Then I tried letting it run in the debugger until it "hangs", then interrupt and look at the backtrace...

#4 ... in clone () from libc.so.6
#3 ... in start_thread () from libthread.so.0
#2 ... in dummy_start () at utils.c:1004
#1 ... in inotify_daemon () at *localtime.c:290*
#0 ... in read () from libc.so.6

Interesting that the backtrace appears to confirm (at #1) what I found by single stepping.

This loader code is called successfully multiple times prior to the console message:
Asterisk Dynamic Loader Starting:
 == Parsing '/etc/asterisk/modules.conf':   == Found
[Jan 27 18:10:09] NOTICE[31020]: loader.c:1127 load_modules: 25 modules will be loaded.

I am in way over my head debugging unfamiliar code in a multi-threaded environment.  Maybe you could pass this along to someone at Digium who knows what they're doing in this area of the code and ask, "Do you know what could cause this?" or more importantly, "is there a way to code around this problem?"  I see three names at the top of the loader.c file.  Could one of these answer what's happening?

* Mark Spencer <markster@digium.com>
* Kevin P. Fleming <kpfleming@digium.com>
* Luigi Rizzo <rizzo@icir.org>

This module loading "hang" behavior exists in:

  - asterisk-10.1.0
  - asterisk-1.8.9.0-rc3
  - asterisk-1.6.2.22
  - asterisk-1.4.43

Again, this is a "real" machine running OpenSUSE 12.1 x86_64 on an AMD FX-8120.  And whatever is happening is consistently repeatable (happens every time).

Thanks,
Craig
[mailto:2craigarno@gmail.com]

By: Craig Arno (2craigarno) 2012-01-30 23:44:14.797-0600

h1.{color:blue}*PROBLEM RESOLVED*{color}

*uname -a* reports this version information for a working kernel:

Linux pluto *3.1.9*-1.4-desktop #1 SMP PREEMPT *Fri Jan 27 08:55:10 UTC 2012* (efb5ff4) x86_64 x86_64 x86_64 GNU/Linux


OpenSUSE 12.1 delivered kernel appears to be broken.  A forced software update with YaST and a recompile install with Asterisk solved the problem!  BTW, this is a Linux 3.1.9 kernel, the delivered on is 3.1.0.

FYI, IPKALL IAX and IPKALL SIP methods both work, as does a remote Sipura 2000 box (SIP).

Thank you for your time and attention.  This was a baffling one.

Craig

By: Craig Arno (2craigarno) 2012-01-30 23:52:37.495-0600

Last comment says it all :D  (y)