[Home]

Summary:ASTERISK-10897: Asterisk 1.4.14 and load average.
Reporter:Fernando Lujan (flujan)Labels:
Date Opened:2007-11-27 08:11:09.000-0600Date Closed:2008-01-29 10:57:18.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20071130__bug11386__2.diff.txt
( 1) 20071130__bug11386.diff.txt
( 2) 20071221__bug11386__otherfix.diff.txt
( 3) 20080116__bug11386__2.diff.txt
( 4) 20080117__bug11386.diff.txt
( 5) 20080124__bug11386__1.4.17.diff.txt
( 6) 20080124__bug11386.diff.txt
( 7) backtrace-1220.txt
( 8) core_show_locks1.txt
( 9) core_show_locks2.txt
(10) core_show_locks3.txt
(11) malloc_debug.txt
(12) malloc_debug-1220.txt
(13) null_ptr_deref-sthen-20080126.log
(14) null_ptr_deref-sthen-20080126.txt
(15) valgrind_patch.txt
(16) valgrind.txt
(17) valgrind-1220.txt
Description:Asterisk starts do use 500% of CPU and incresease the load average of the machine to 61. Generating also a core dump.


Bt output and valgrind.txt.



****** ADDITIONAL INFORMATION ******

(gdb) bt full
#0  0xb7faf2a0 in pthread_mutex_lock () from /lib/libpthread.so.0
No symbol table info available.
#1  0xb6c0b695 in ast_mutex_lock (pmutex=0xdeadbeef) at /root/src/asterisk/asterisk-1.4.14/include/asterisk/lock.h:700
No locals.
#2  0xb6c0ac1c in retrans_pkt (data=0xb5955a50) at chan_sip.c:1892
pkt = (struct sip_pkt *) 0xb5955a50
prev = (struct sip_pkt *) 0x0
cur = (struct sip_pkt *) 0x0
reschedule = 1000
xmitres = 0
__PRETTY_FUNCTION__ = "retrans_pkt"
#3  0x080eea2f in ast_sched_runq (con=0x82ec4a8) at sched.c:359
current = (struct sched *) 0x84c7eb8
tv = {tv_sec = 1196166383, tv_usec = 439930}
numevents = 75
res = 0
#4  0xb6c498e0 in do_monitor (data=0x0) at chan_sip.c:15471
res = 1
sip = (struct sip_pvt *) 0x0
peer = (struct sip_peer *) 0x0
t = 1196166362
fastrestart = 0
lastpeernum = -1
curpeernum = 407
reloading = 0
__PRETTY_FUNCTION__ = "do_monitor"
ASTERISK-1  0x080fd9c0 in dummy_start (data=0x8302220) at utils.c:843
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1208238092, 0, 4001536, -1228917832, 1237988176, -150611568}, __mask_was_saved = 0}}, __pad = {0xb6c03480, 0x0, 0x0, 0x0}}
__cancel_routine = (void (*)(void *)) 0x8069636 <ast_unregister_thread>
__cancel_arg = (void *) 0xb6c03b90
not_first_call = 0
ret = (void *) 0x0
a = {start_routine = 0xb6c492fb <do_monitor>, data = 0x0, name = 0x83022a0 "do_monitor", ' ' <repeats 11 times>, "started at [15525] chan_sip.c restart_monitor()"}
__PRETTY_FUNCTION__ = "dummy_start"
ASTERISK-2  0xb7fad1b3 in start_thread () from /lib/libpthread.so.0
No symbol table info available.
ASTERISK-3  0xb7ef703e in clone () from /lib/libc.so.6
No symbol table info available.
(gdb)


Comments:By: Tilghman Lesher (tilghman) 2007-11-30 12:25:44.000-0600

Okay, try this patch.

By: Tilghman Lesher (tilghman) 2007-12-17 07:17:45.000-0600

flujan:  is this the entire valgrind.txt?  I would expect there to be more, given the contents of malloc_debug.txt.

By: Fernando Lujan (flujan) 2007-12-17 07:36:06.000-0600

Yes, I start the valgrind thread and deleted the previos one.

By: Leif Madsen (lmadsen) 2007-12-20 17:07:48.000-0600

I've uploaded the backtrace-1220, malloc_debug-1220.txt, and valgrind-1220.txt files which were generated after using the 20071130_bug11386__2.diff.txt file. Crash was caused after about 5 minutes of testing with the patch.

Let me know if there is anything else I can get you.

By: Oz Mundim (ozmundim) 2007-12-21 14:08:38.000-0600

Patches 20071130__bug11386__2.diff.txt and 20071221__bug11386__otherfix.diff.txt applied on 2 boxes running 1.4.14. Extensions are connected box1, which places SIP/G711 calls to box2 (box2 has Sangoma A108 on it).

High load still happening on both boxes. Channels are not being disconnected after Hangup (shown state: 'Ringing'). Silent gaps during the whole call duration.

Boxes running 1.4.14 (no patches) reports high load still.

FYI: Switching outgoing calls to terminate on box running 1.2.7 high load and doesn't happen.

By: Eldad Ran (eldadran) 2007-12-24 04:18:18.000-0600

I don't know if it is the same problem, but i have a CPU usage problem that builds up over time, were user cpu builds up. only when I take asterisk down, the CPU user goes back to normal. it usually takes a day to see the build up, after 3 days it reaches 100%.
The problem appeared ever since I've switched from 1.4.7.1 to an higher version, the current is 1.4.15.
I use these modules extensively (this is a low traffic snap shot):
Module                         Description                           UseCount
res_agi.so                     Asterisk Gateway Interface (AGI)         43        
chan_zap.so                    Zapata Telephony                         14        
format_wav.so                  Microsoft WAV format (8000Hz Signed Line 1        
chan_sip.so                    Session Initiation Protocol (SIP)        43        
codec_alaw.so                  A-law Coder/Decoder                      46        
format_pcm.so                  Raw/Sun uLaw/ALaw 8KHz (PCM,PCMA,AU), G. 24        
app_meetme.so                  MeetMe conference bridge                 25

By: Oz Mundim (ozmundim) 2008-01-03 12:46:05.000-0600

I forwarded some calls this morning. Testing box was running 1.4.6.2 with patches applied.

When it reached 90 calls, load average jumped to 15+ and calls started to drop and huge silent gaps and voice distortion were reported.

Looks like nothing was really changed from the old 1.4.14 without the patches applied (since I didn't notice any performance improvement).

By: Tilghman Lesher (tilghman) 2008-01-03 14:11:43.000-0600

ozmundim:  could you compile with DEBUG_THREADS and get us a "core show locks" when the problem occurs?  I want to be sure that this isn't a lock contention issue.  A valgrind output might also help, but I don't think that this is a memory corruption issue anymore (at least, with the patches applied).

By: Oz Mundim (ozmundim) 2008-01-03 15:27:32.000-0600

All previews modules were deleted from /usr/lib/asterisk/modules prior compilation.

Compilation done with DEBUG_THREADS and LOADABLE_MODULES only (all other options were unchecked).

Asterisk 1.4.6.2 with patches applied. Box has a Sangoma A108 with no Hardware Echo Cancelation.

After reaching 74 concurrent calls, load jumped to 6+ and calls started to drop. Attached you'll see 'core show locks' during process.

I also issued 'show channels' so you guys can see that after load comes down, no more calls can be placed through that box. I had to 'stop now' and start Asterisk.

By: Russell Bryant (russell) 2008-01-03 15:32:47.000-0600

The "thread apply all bt" output from gdb of Asterisk when this happens would also be useful

By: Oz Mundim (ozmundim) 2008-01-04 09:23:39.000-0600

Adding to last note:

After patch 20071221__bug11386__otherfix.diff.txt applied, no crashes are being experienced during high load (no core dump).

By: Digium Subversion (svnbot) 2008-01-08 11:58:26.000-0600

Repository: asterisk
Revision: 97077

U   branches/1.4/channels/chan_sip.c
U   branches/1.4/main/asterisk.c

------------------------------------------------------------------------
r97077 | tilghman | 2008-01-08 11:58:25 -0600 (Tue, 08 Jan 2008) | 3 lines

Apply multiple crash fixes, found in issue ASTERISK-10897, but not completely
closing that issue.

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=97077

By: Digium Subversion (svnbot) 2008-01-08 13:03:02.000-0600

Repository: asterisk
Revision: 97125

_U  trunk/
U   trunk/channels/chan_sip.c
U   trunk/main/asterisk.c

------------------------------------------------------------------------
r97125 | tilghman | 2008-01-08 13:03:02 -0600 (Tue, 08 Jan 2008) | 11 lines

Merged revisions 97077 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r97077 | tilghman | 2008-01-08 12:02:13 -0600 (Tue, 08 Jan 2008) | 3 lines

Apply multiple crash fixes, found in issue ASTERISK-10897, but not completely
closing that issue.

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=97125

By: Dmitry Andrianov (dimas) 2008-01-08 17:39:29.000-0600

tilghman,
it is a bit offtopic but I just curious why your patch ast_strdupa every option except 's' ?

By: Tilghman Lesher (tilghman) 2008-01-08 18:52:39.000-0600

dimas:  because that option doesn't exist in 1.4 and I did a straight merge to trunk.

By: Andrew Lindh (andrew) 2008-01-10 14:25:27.000-0600

This 1.4 branch update (97077) to chan_sip.c causes hold problems with polycom and grandstream phones. I was going to open a new bug (as it's a new problem), but I'll stick it here for now. If I back off chan_sip then it works again.

It does not seem to affect cisco or sipura phones I have for testing.

When I try to place a call on hold, the grandstream just disconnects it's side of the call but asterisk does not disconnect the other side. The polycom puts the call on hold on the phone, but asterisk does not see it as on hold. The polycom can't resume the call so it's stuck. If it's disconnected then the phone still sees it on hold.

By: jmls (jmls) 2008-01-10 16:41:42.000-0600

we have the same hold issues, but use cisco 7940 and aastra9133i phones.

By: Tilghman Lesher (tilghman) 2008-01-11 10:12:43.000-0600

The more invasive of these two patches, the one to chan_sip.c, has been reverted.

By: Tilghman Lesher (tilghman) 2008-01-11 16:36:35.000-0600

For anybody having trouble with memory corruption, I would invite you to try out this latest patch, which takes a different tack, while should prevent all corruption caused by the scheduler.



By: Tilghman Lesher (tilghman) 2008-01-16 12:37:00.000-0600

Let's have another go-around on this patch.

By: Fernando Lujan (flujan) 2008-01-23 08:23:30.000-0600

Patched 1.4.17 with the last patch. I see no load average, but the zap channels stop working... I could not place or receive call on my E1 lines. :(

Alas, there is a problema patching 1.4.17 in the chan_sip.c. It is not working propertly and I patched it by hand.

By: Tilghman Lesher (tilghman) 2008-01-23 14:26:59.000-0600

When you have the patch running and it does that, please attach to the process and run a 'thread apply all bt'.  Make sure that you've compiled with DONT_OPTIMIZE so that we can see all of the stack frames clearly.  Then upload the output here.

Attaching to the process is as simple as:
gdb -p <pid-of-asterisk>

By: Tilghman Lesher (tilghman) 2008-01-24 13:47:23.000-0600

Updated patch to current 1.4 SVN

By: Tilghman Lesher (tilghman) 2008-01-24 14:09:54.000-0600

New patch generated only against 1.4.17.

By: Stuart Henderson (stuarth) 2008-01-26 02:14:58.000-0600

Null pointer dereference in ast_bridged_channel() with 20080124__bug11386.diff.txt (new patch against SVN) - see null_ptr_deref-sthen-20080126.txt for backtrace.



By: Digium Subversion (svnbot) 2008-01-27 15:57:14.000-0600

Repository: asterisk
Revision: 100465

U   branches/1.4/channels/chan_h323.c
U   branches/1.4/channels/chan_iax2.c
U   branches/1.4/channels/chan_mgcp.c
U   branches/1.4/channels/chan_misdn.c
U   branches/1.4/channels/chan_sip.c
U   branches/1.4/include/asterisk/sched.h
U   branches/1.4/main/cdr.c
U   branches/1.4/main/dnsmgr.c
U   branches/1.4/main/file.c
U   branches/1.4/main/rtp.c
U   branches/1.4/pbx/pbx_dundi.c

------------------------------------------------------------------------
r100465 | tilghman | 2008-01-27 15:57:10 -0600 (Sun, 27 Jan 2008) | 11 lines

When deleting a task from the scheduler, ignoring the return value could
possibly cause memory to be accessed after it is freed, which causes all
sorts of random memory corruption.  Instead, if a deletion fails, wait a
bit and try again (noting that another thread could change our taskid
value).
(closes issue ASTERISK-10897)
Reported by: flujan
Patches:
      20080124__bug11386.diff.txt uploaded by Corydon76 (license 14)
Tested by: Corydon76, flujan, stuarth`

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=100465

By: Digium Subversion (svnbot) 2008-01-27 16:33:18.000-0600

Repository: asterisk
Revision: 100488

_U  trunk/
U   trunk/channels/chan_h323.c
U   trunk/channels/chan_iax2.c
U   trunk/channels/chan_mgcp.c
U   trunk/channels/chan_misdn.c
U   trunk/channels/chan_sip.c
U   trunk/include/asterisk/sched.h
U   trunk/main/cdr.c
U   trunk/main/dnsmgr.c
U   trunk/main/file.c
U   trunk/main/rtp.c
U   trunk/pbx/pbx_dundi.c

------------------------------------------------------------------------
r100488 | tilghman | 2008-01-27 16:33:18 -0600 (Sun, 27 Jan 2008) | 19 lines

Merged revisions 100465 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r100465 | tilghman | 2008-01-27 15:59:53 -0600 (Sun, 27 Jan 2008) | 11 lines

When deleting a task from the scheduler, ignoring the return value could
possibly cause memory to be accessed after it is freed, which causes all
sorts of random memory corruption.  Instead, if a deletion fails, wait a
bit and try again (noting that another thread could change our taskid
value).
(closes issue ASTERISK-10897)
Reported by: flujan
Patches:
      20080124__bug11386.diff.txt uploaded by Corydon76 (license 14)
Tested by: Corydon76, flujan, stuarth`

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=100488