[Home]

Summary:ASTERISK-14848: crash when calling ao2_unlock inside pthread_timer_disable_continuous
Reporter:Keiron Liddle (keiron)Labels:
Date Opened:2009-09-18 01:03:55Date Closed:2010-07-21 11:20:20
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Resources/res_timing_pthread
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) gdb.txt
( 1) gdb2.txt
Description:When under heavy load we have experienced random crashes (6 times in a day) at the same point in the code. The load is with about 200 channels, 70 users and making a lot of calls. It was very stable until we started reaching a certain level of usage.

****** ADDITIONAL INFORMATION ******

Backtrace is as follows (same for each crash):

#0  0x007c4d28 in ao2_unlock@plt () from /usr/lib/asterisk/modules/res_timing_pthread.so
#1  0x007c5aa4 in pthread_timer_disable_continuous (handle=1187)
   at res_timing_pthread.c:240
#2  0x0809673a in __ast_read (chan=0xb7058600, dropaudio=<value optimized out>)
   at channel.c:2616
#3  0x00d36506 in wait_for_answer (in=<value optimized out>,
   outgoing=<value optimized out>, to=0x4cccd1c, peerflags=0x4cccd58, pa=0x4ccc4e8,
   num_in=0x4ccccd0, result=0x4cccd14) at app_dial.c:681
#4  0x00d3a19a in dial_exec_full (chan=<value optimized out>,
   data=<value optimized out>, peerflags=0x4cccd58, continue_exec=0x0)
   at app_dial.c:1651
ASTERISK-1  0x00d3c2fd in dial_exec (chan=0xb7728cc8, data=0x4ccee3c) at app_dial.c:2061
ASTERISK-2  0x080f5a79 in pbx_exec (c=0xb7728cc8, app=0x9a1c618, data=0x4ccee3c) at pbx.c:961
ASTERISK-3  0x08104129 in pbx_extension_helper (c=0xb7728cc8, con=<value optimized out>,
   context=<value optimized out>, exten=0xb7728f88 "0397802181", priority=2, label=0x0,
   callerid=0x0, action=E_SPAWN, found=0x4cd128c, combined_find_spawn=1) at pbx.c:3202
ASTERISK-4  0x08105a85 in ast_spawn_extension (found=<value optimized out>,
   callerid=<value optimized out>, priority=<value optimized out>,
   exten=<value optimized out>, context=<value optimized out>, c=<value optimized out>,
   c=<value optimized out>, context=<value optimized out>, exten=<value optimized out>,
   priority=<value optimized out>, callerid=<value optimized out>,
   found=<value optimized out>, combined_find_spawn=<value optimized out>) at pbx.c:3652
ASTERISK-5  __ast_pbx_run (found=<value optimized out>, callerid=<value optimized out>,
   priority=<value optimized out>, exten=<value optimized out>,
   context=<value optimized out>, c=<value optimized out>, c=<value optimized out>,
   context=<value optimized out>, exten=<value optimized out>,
   priority=<value optimized out>, callerid=<value optimized out>,
   found=<value optimized out>, combined_find_spawn=<value optimized out>) at pbx.c:3741
ASTERISK-6 0x08107270 in pbx_thread (data=0xb7728cc8) at pbx.c:4028
ASTERISK-7 0x0814708b in dummy_start (data=0xb77ea6e0) at utils.c:968
ASTERISK-8 0x0040e935 in start_thread () from /lib/libpthread.so.0
ASTERISK-9 0x0034282e in clone () from /lib/libc.so.6
Comments:By: Leif Madsen (lmadsen) 2009-09-18 07:14:41

Thank you for your bug report. In order to move your issue forward, we require a backtrace from the core file produced after the crash. Please see the doc/backtrace.txt file in your Asterisk source directory.

Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then 'make install' after enabling, reproduce the crash, and then execute the instructions in doc/backtrace.txt.

When complete, attach that file to this issue report. Thanks!

By: Keiron Liddle (keiron) 2009-09-20 20:21:10

I am not able to run asterisk with DONT_OPTIMIZE as it is in heavy use.

By: Danilo Lotina F. (dlotina) 2009-09-21 09:45:39

Hi, I have the same issue running 1.6.1.6. I will try to recompile with DONT_OPTIMIZE and attach the backtrace.

By: Leif Madsen (lmadsen) 2009-09-21 10:42:46

The amount of overhead with DONT_OPTIMIZE is quite minimal, and is almost always OK to run on a heavy production system, but since another reporter stated they can reproduce and will produce the backtrace, I'll just say OK and leave this in the Feedback state. Thanks!

By: Danilo Lotina F. (dlotina) 2009-09-21 12:30:42

Hi, bt uploaded!.

By: Leif Madsen (lmadsen) 2009-09-22 09:38:04

Your backtrace still has the values optimized out. Be sure you run 'make install' after enabling DONT_OPTIMIZE as you need to install the newly re-compiled modules.

By: Keiron Liddle (keiron) 2009-09-30 20:37:49

I've attached the result with don't optimize on, it shows even less information as it is now getting a corrupt stack. It seems that it is somehow related to making a lot of calls (the initial dial part) at the same time. When I reduce the likelyhood of calls made in parallel it happens less often.

By: Keiron Liddle (keiron) 2009-09-30 20:46:37

By the way, DONT_OPTIMIZE makes a big difference in my situation.

By: Leif Madsen (lmadsen) 2009-10-26 09:58:33

Hmmm, I'm Acknowledging this issue as I'm not sure what else to ask for.

Can you try with the latest from SVN? I know there was a period of time there were I think something was broken in the build system that caused DONT_OPTIMIZE to not produce valid output (I saw it in a few issues).

Thanks!

By: Leif Madsen (lmadsen) 2009-11-17 07:29:17.000-0600

It has been requested that you test with the latest sets of release candidates at the issue may already be resolved. Please test with the latest release candidate, and report back if the issue is still happening. Thanks!

By: Shivam (shivam agarwal) 2009-11-18 03:57:09.000-0600

Hi All,
I am using releaase "asterisk-1.6.1.7-rc2" and when I run heavy load somewhere around 200 calls on this release, I also witness this issue. Kindly find the backtrace for the same as below:

received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1248740464 (LWP 8951)]
0x00d88cbc in ao2_unlock@plt () from /usr/lib/asterisk/modules/res_timing_pthread.so
(gdb) bt
#0  0x00d88cbc in ao2_unlock@plt () from /usr/lib/asterisk/modules/res_timing_pthread.so
#1  0x00d899b6 in pthread_timer_disable_continuous (handle=1123) at res_timing_pthread.c:240
#2  0x08097b7f in __ast_read (chan=0x9454ef0, dropaudio=0) at channel.c:2898
#3  0x0809bb52 in __ast_request_and_dial (type=0xb4cf6298 "SIP", format=64, data=0xb4cf6398, timeout=59995, outstate=0xb591b358,
   cid_num=0xb4cf673c "9838079257", cid_name=0xb4cf683c "", oh=0xb591b2b4) at channel.c:3279
#4  0x08108219 in ast_pbx_outgoing_exten (type=0xb4cf6298 "SIP", format=64, data=0xb4cf6398, timeout=60000, context=0xb4cf66e8 "voice",
   exten=0xb4cf6698 "s", priority=1, reason=0xb591b358, synchronous=2, cid_num=0xb4cf673c "9838079257", cid_name=0xb4cf683c "", vars=0xb4cf6960,
   account=0xb4cf693c "", channel=0x0) at pbx.c:7429
ASTERISK-1  0x007ec6a3 in attempt_thread (data=0xb4cf6180) at pbx_spool.c:341
ASTERISK-2  0x0814858b in dummy_start (data=0xb7302858) at utils.c:986
ASTERISK-3  0x00c062db in start_thread () from /lib/libpthread.so.0
ASTERISK-4  0x001dd14e in clone () from /lib/libc.so.6

However in issue having id "16156" it is mentioned that @plt" string in gdb indicates that the PLT in the linker is at fault. Quite simply, the linker cannot find the ao2_unlock function. So could you throw more light on this.

By: Danilo Lotina F. (dlotina) 2009-11-18 08:13:08.000-0600

same issue  with asterisk-1.6.1.7-rc2 and asterisk-1.6.1.9. I'll try to test 1.6.1.10-rc3

By: Leif Madsen (lmadsen) 2009-11-18 08:15:04.000-0600

Yes, please test with something newer in order to determine if this issue is still present in the latest release candidates. Thanks!

By: Shivam (shivam agarwal) 2009-11-18 23:50:23.000-0600

Hi lmadsen,
  Were you able to reproduce this issue.

By: Leif Madsen (lmadsen) 2009-11-19 10:00:44.000-0600

I'm not the one attempting to reproduce the issue;  it should be the ones who reported the issue and who are having the problem. I do not have a system with the amount of calls required to test this myself.

Anyone who is getting this issue, please test with the latest releases that were put out today (1.4.27, 1.6.0.18, and 1.6.1.10).

By: Danilo Lotina F. (dlotina) 2009-11-24 16:16:00.000-0600

Hi,
 Release 1.6.1.10 is still affected by this issue

By: paul-tg (paul-tg) 2009-11-30 23:37:35.000-0600

Asterisk 1.6.2.0-rc6 is affected.

We have just hit this during a load spike in production; at a load of around 4 calls per second, Asterisk on one of our servers crashed (and was resurrected by safe_asterisk) twice in four minutes.

The backtraces show that a02_unlock() was passed a bad pointer -- 0x04 and 0x80 for the two crashes:

#0  _ao2_unlock (user_data=0x80, file=0x2aaabb0336e0 "res_timing_pthread.c",
   func=0x2aaabb033d20 "pthread_timer_disable_continuous", line=240,
   var=0x2aaabb0336f5 "timer") at astobj2.c:115
#1  0x00002aaabb0309ce in pthread_timer_disable_continuous (
   handle=<value optimized out>) at res_timing_pthread.c:240
#2  0x0000000000470edd in __ast_read (chan=0xbd0eab0, dropaudio=0)
   at channel.c:2693
#3  0x00002aaab7fded23 in wait_for_answer (in=0x2aaac47073e0,
   outgoing=0xbd07620, to=0x44afb35c, peerflags=0x44afb3a0, pa=0x44afaa20,
   num_in=<value optimized out>, result=0x44afb354) at app_dial.c:893
#4  0x00002aaab7fe191b in dial_exec_full (chan=0x2aaac47073e0,
   data=<value optimized out>, peerflags=0x44afb3a0, continue_exec=0x0)
   at app_dial.c:1852

It would appear that ao2_find() in find_timer() is returning a dodgy pointer, but I have not delved further into the core files to find out why.

By: paul-tg (paul-tg) 2009-12-04 00:04:15.000-0600

I've analysed the state of the pthread_timers structure in one of the core files, and found no inconsistency.

Specifically, using the backtrace above I derived that the thread handle passed in to pthread_timer_disable_continuous() was 1218, so the backtrace *should have* looked like:

#0 _ao2_unlock (user_data=0x80, file=0x2aaabb0336e0 "res_timing_pthread.c",
   func=0x2aaabb033d20 "pthread_timer_disable_continuous", line=240,
   var=0x2aaabb0336f5 "timer") at astobj2.c:115
#1 0x00002aaabb0309ce in pthread_timer_disable_continuous (handle=1218)
   at res_timing_pthread.c:240
#2 0x0000000000470edd in __ast_read (chan=0xbd0eab0, dropaudio=0)
   at channel.c:2693

and find_timer() ought to have matched the following entry:

(gdb) print{struct pthread_timer}0x13013208
$105 = {pipe = {1218, 1219}, state = TIMER_STATE_IDLE, rate = 0, interval = 0,
 tick_count = 0, pending_ticks = 0, start = {tv_sec = 0, tv_usec = 0},
 continuous = 0}

This however looks valid, so I still don't know why the comparison returns low values with a single bit set (so far I've seen 0x01, 0x04, and 0x80) instead of a valid memory address.

By: Lorenz Barth (bartpbx) 2010-01-04 09:36:28.000-0600

Hi,

I can confirm this issue on  1.6.2.0. How can I helf fixing this?

By: Tilghman Lesher (tilghman) 2010-01-04 16:09:32.000-0600

As it turns out, the "@plt" is a dead giveaway as to the source of this problem.  This is a problem in the compiler/linker tools for your platform.  Please report this issue upstream to the package maintainers of your distribution environment.  This is NOT a bug in Asterisk, but a compiler (or linker) bug.

By: Tilghman Lesher (tilghman) 2010-01-04 16:17:11.000-0600

Oh, if you're wondering what the error means, it's that the dynamic linker failed to find the symbol called "ao2_unlock" in the PLT (procedure lookup table).  Given that the symbol is clearly there, there is a problem in the dynamic linker.  We cannot solve this problem within Asterisk, as the problem is within the dynamic linking tools.

You might try statically linking Asterisk (see Compiler Flags / Static Build) as a temporary workaround, though since it uses some of the same tools, that isn't guaranteed to work.

By: Digium Subversion (svnbot) 2010-07-21 11:15:00

Repository: asterisk
Revision: 278465

U   trunk/res/res_timing_pthread.c

------------------------------------------------------------------------
r278465 | russell | 2010-07-21 11:14:59 -0500 (Wed, 21 Jul 2010) | 41 lines

Use poll() instead of select() in res_timing_pthread to avoid stack corruption.

This code did not properly check FD_SETSIZE to ensure that it did not try to
select() on fds that were too large.  Switching to poll() removes the limitation
on the maximum fd value.

(closes issue ASTERISK-14848)
Reported by: keiron

(closes issue ASTERISK-15960)
Reported by: Eddie Edwards

(closes issue ASTERISK-15349)
Reported by: Hubguru

(closes issue ASTERISK-14670)
Reported by: flop

(closes issue ASTERISK-12249)
Reported by: falves11

(closes issue ASTERISK-13973)
Reported by: vrban

(closes issue ASTERISK-15971)
Reported by: aleksey2000

(closes issue ASTERISK-14385)
Reported by: kowalma

(closes issue ASTERISK-16185)
Reported by: dcabot

(closes issue ASTERISK-16085)
Reported by: glwgoes

(closes issue ASTERISK-15895)
Reported by: erikje

possibly other issues, too ...

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=278465

By: Digium Subversion (svnbot) 2010-07-21 11:20:18

Repository: asterisk
Revision: 278479

_U  branches/1.6.2/
U   branches/1.6.2/res/res_timing_pthread.c

------------------------------------------------------------------------
r278479 | russell | 2010-07-21 11:20:17 -0500 (Wed, 21 Jul 2010) | 48 lines

Merged revisions 278465 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r278465 | russell | 2010-07-21 11:15:00 -0500 (Wed, 21 Jul 2010) | 41 lines
 
 Use poll() instead of select() in res_timing_pthread to avoid stack corruption.
 
 This code did not properly check FD_SETSIZE to ensure that it did not try to
 select() on fds that were too large.  Switching to poll() removes the limitation
 on the maximum fd value.
 
 (closes issue ASTERISK-14848)
 Reported by: keiron
 
 (closes issue ASTERISK-15960)
 Reported by: Eddie Edwards
 
 (closes issue ASTERISK-15349)
 Reported by: Hubguru
 
 (closes issue ASTERISK-14670)
 Reported by: flop
 
 (closes issue ASTERISK-12249)
 Reported by: falves11
 
 (closes issue ASTERISK-13973)
 Reported by: vrban
 
 (closes issue ASTERISK-15971)
 Reported by: aleksey2000
 
 (closes issue ASTERISK-14385)
 Reported by: kowalma
 
 (closes issue ASTERISK-16185)
 Reported by: dcabot
 
 (closes issue ASTERISK-16085)
 Reported by: glwgoes
 
 (closes issue ASTERISK-15895)
 Reported by: erikje
 
 possibly other issues, too ...
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=278479