Summary: | ASTERISK-10897: Asterisk 1.4.14 and load average. | ||
Reporter: | Fernando Lujan (flujan) | Labels: | |
Date Opened: | 2007-11-27 08:11:09.000-0600 | Date Closed: | 2008-01-29 10:57:18.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 20071130__bug11386__2.diff.txt ( 1) 20071130__bug11386.diff.txt ( 2) 20071221__bug11386__otherfix.diff.txt ( 3) 20080116__bug11386__2.diff.txt ( 4) 20080117__bug11386.diff.txt ( 5) 20080124__bug11386__1.4.17.diff.txt ( 6) 20080124__bug11386.diff.txt ( 7) backtrace-1220.txt ( 8) core_show_locks1.txt ( 9) core_show_locks2.txt (10) core_show_locks3.txt (11) malloc_debug.txt (12) malloc_debug-1220.txt (13) null_ptr_deref-sthen-20080126.log (14) null_ptr_deref-sthen-20080126.txt (15) valgrind_patch.txt (16) valgrind.txt (17) valgrind-1220.txt | |
Description: | Asterisk starts do use 500% of CPU and incresease the load average of the machine to 61. Generating also a core dump. Bt output and valgrind.txt. ****** ADDITIONAL INFORMATION ****** (gdb) bt full #0 0xb7faf2a0 in pthread_mutex_lock () from /lib/libpthread.so.0 No symbol table info available. #1 0xb6c0b695 in ast_mutex_lock (pmutex=0xdeadbeef) at /root/src/asterisk/asterisk-1.4.14/include/asterisk/lock.h:700 No locals. #2 0xb6c0ac1c in retrans_pkt (data=0xb5955a50) at chan_sip.c:1892 pkt = (struct sip_pkt *) 0xb5955a50 prev = (struct sip_pkt *) 0x0 cur = (struct sip_pkt *) 0x0 reschedule = 1000 xmitres = 0 __PRETTY_FUNCTION__ = "retrans_pkt" #3 0x080eea2f in ast_sched_runq (con=0x82ec4a8) at sched.c:359 current = (struct sched *) 0x84c7eb8 tv = {tv_sec = 1196166383, tv_usec = 439930} numevents = 75 res = 0 #4 0xb6c498e0 in do_monitor (data=0x0) at chan_sip.c:15471 res = 1 sip = (struct sip_pvt *) 0x0 peer = (struct sip_peer *) 0x0 t = 1196166362 fastrestart = 0 lastpeernum = -1 curpeernum = 407 reloading = 0 __PRETTY_FUNCTION__ = "do_monitor" ASTERISK-1 0x080fd9c0 in dummy_start (data=0x8302220) at utils.c:843 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1208238092, 0, 4001536, -1228917832, 1237988176, -150611568}, __mask_was_saved = 0}}, __pad = {0xb6c03480, 0x0, 0x0, 0x0}} __cancel_routine = (void (*)(void *)) 0x8069636 <ast_unregister_thread> __cancel_arg = (void *) 0xb6c03b90 not_first_call = 0 ret = (void *) 0x0 a = {start_routine = 0xb6c492fb <do_monitor>, data = 0x0, name = 0x83022a0 "do_monitor", ' ' <repeats 11 times>, "started at [15525] chan_sip.c restart_monitor()"} __PRETTY_FUNCTION__ = "dummy_start" ASTERISK-2 0xb7fad1b3 in start_thread () from /lib/libpthread.so.0 No symbol table info available. ASTERISK-3 0xb7ef703e in clone () from /lib/libc.so.6 No symbol table info available. (gdb) | ||
Comments: | By: Tilghman Lesher (tilghman) 2007-11-30 12:25:44.000-0600 Okay, try this patch. By: Tilghman Lesher (tilghman) 2007-12-17 07:17:45.000-0600 flujan: is this the entire valgrind.txt? I would expect there to be more, given the contents of malloc_debug.txt. By: Fernando Lujan (flujan) 2007-12-17 07:36:06.000-0600 Yes, I start the valgrind thread and deleted the previos one. By: Leif Madsen (lmadsen) 2007-12-20 17:07:48.000-0600 I've uploaded the backtrace-1220, malloc_debug-1220.txt, and valgrind-1220.txt files which were generated after using the 20071130_bug11386__2.diff.txt file. Crash was caused after about 5 minutes of testing with the patch. Let me know if there is anything else I can get you. By: Oz Mundim (ozmundim) 2007-12-21 14:08:38.000-0600 Patches 20071130__bug11386__2.diff.txt and 20071221__bug11386__otherfix.diff.txt applied on 2 boxes running 1.4.14. Extensions are connected box1, which places SIP/G711 calls to box2 (box2 has Sangoma A108 on it). High load still happening on both boxes. Channels are not being disconnected after Hangup (shown state: 'Ringing'). Silent gaps during the whole call duration. Boxes running 1.4.14 (no patches) reports high load still. FYI: Switching outgoing calls to terminate on box running 1.2.7 high load and doesn't happen. By: Eldad Ran (eldadran) 2007-12-24 04:18:18.000-0600 I don't know if it is the same problem, but i have a CPU usage problem that builds up over time, were user cpu builds up. only when I take asterisk down, the CPU user goes back to normal. it usually takes a day to see the build up, after 3 days it reaches 100%. The problem appeared ever since I've switched from 1.4.7.1 to an higher version, the current is 1.4.15. I use these modules extensively (this is a low traffic snap shot): Module Description UseCount res_agi.so Asterisk Gateway Interface (AGI) 43 chan_zap.so Zapata Telephony 14 format_wav.so Microsoft WAV format (8000Hz Signed Line 1 chan_sip.so Session Initiation Protocol (SIP) 43 codec_alaw.so A-law Coder/Decoder 46 format_pcm.so Raw/Sun uLaw/ALaw 8KHz (PCM,PCMA,AU), G. 24 app_meetme.so MeetMe conference bridge 25 By: Oz Mundim (ozmundim) 2008-01-03 12:46:05.000-0600 I forwarded some calls this morning. Testing box was running 1.4.6.2 with patches applied. When it reached 90 calls, load average jumped to 15+ and calls started to drop and huge silent gaps and voice distortion were reported. Looks like nothing was really changed from the old 1.4.14 without the patches applied (since I didn't notice any performance improvement). By: Tilghman Lesher (tilghman) 2008-01-03 14:11:43.000-0600 ozmundim: could you compile with DEBUG_THREADS and get us a "core show locks" when the problem occurs? I want to be sure that this isn't a lock contention issue. A valgrind output might also help, but I don't think that this is a memory corruption issue anymore (at least, with the patches applied). By: Oz Mundim (ozmundim) 2008-01-03 15:27:32.000-0600 All previews modules were deleted from /usr/lib/asterisk/modules prior compilation. Compilation done with DEBUG_THREADS and LOADABLE_MODULES only (all other options were unchecked). Asterisk 1.4.6.2 with patches applied. Box has a Sangoma A108 with no Hardware Echo Cancelation. After reaching 74 concurrent calls, load jumped to 6+ and calls started to drop. Attached you'll see 'core show locks' during process. I also issued 'show channels' so you guys can see that after load comes down, no more calls can be placed through that box. I had to 'stop now' and start Asterisk. By: Russell Bryant (russell) 2008-01-03 15:32:47.000-0600 The "thread apply all bt" output from gdb of Asterisk when this happens would also be useful By: Oz Mundim (ozmundim) 2008-01-04 09:23:39.000-0600 Adding to last note: After patch 20071221__bug11386__otherfix.diff.txt applied, no crashes are being experienced during high load (no core dump). By: Digium Subversion (svnbot) 2008-01-08 11:58:26.000-0600 Repository: asterisk Revision: 97077 U branches/1.4/channels/chan_sip.c U branches/1.4/main/asterisk.c ------------------------------------------------------------------------ r97077 | tilghman | 2008-01-08 11:58:25 -0600 (Tue, 08 Jan 2008) | 3 lines Apply multiple crash fixes, found in issue ASTERISK-10897, but not completely closing that issue. ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=97077 By: Digium Subversion (svnbot) 2008-01-08 13:03:02.000-0600 Repository: asterisk Revision: 97125 _U trunk/ U trunk/channels/chan_sip.c U trunk/main/asterisk.c ------------------------------------------------------------------------ r97125 | tilghman | 2008-01-08 13:03:02 -0600 (Tue, 08 Jan 2008) | 11 lines Merged revisions 97077 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r97077 | tilghman | 2008-01-08 12:02:13 -0600 (Tue, 08 Jan 2008) | 3 lines Apply multiple crash fixes, found in issue ASTERISK-10897, but not completely closing that issue. ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=97125 By: Dmitry Andrianov (dimas) 2008-01-08 17:39:29.000-0600 tilghman, it is a bit offtopic but I just curious why your patch ast_strdupa every option except 's' ? By: Tilghman Lesher (tilghman) 2008-01-08 18:52:39.000-0600 dimas: because that option doesn't exist in 1.4 and I did a straight merge to trunk. By: Andrew Lindh (andrew) 2008-01-10 14:25:27.000-0600 This 1.4 branch update (97077) to chan_sip.c causes hold problems with polycom and grandstream phones. I was going to open a new bug (as it's a new problem), but I'll stick it here for now. If I back off chan_sip then it works again. It does not seem to affect cisco or sipura phones I have for testing. When I try to place a call on hold, the grandstream just disconnects it's side of the call but asterisk does not disconnect the other side. The polycom puts the call on hold on the phone, but asterisk does not see it as on hold. The polycom can't resume the call so it's stuck. If it's disconnected then the phone still sees it on hold. By: jmls (jmls) 2008-01-10 16:41:42.000-0600 we have the same hold issues, but use cisco 7940 and aastra9133i phones. By: Tilghman Lesher (tilghman) 2008-01-11 10:12:43.000-0600 The more invasive of these two patches, the one to chan_sip.c, has been reverted. By: Tilghman Lesher (tilghman) 2008-01-11 16:36:35.000-0600 For anybody having trouble with memory corruption, I would invite you to try out this latest patch, which takes a different tack, while should prevent all corruption caused by the scheduler. By: Tilghman Lesher (tilghman) 2008-01-16 12:37:00.000-0600 Let's have another go-around on this patch. By: Fernando Lujan (flujan) 2008-01-23 08:23:30.000-0600 Patched 1.4.17 with the last patch. I see no load average, but the zap channels stop working... I could not place or receive call on my E1 lines. :( Alas, there is a problema patching 1.4.17 in the chan_sip.c. It is not working propertly and I patched it by hand. By: Tilghman Lesher (tilghman) 2008-01-23 14:26:59.000-0600 When you have the patch running and it does that, please attach to the process and run a 'thread apply all bt'. Make sure that you've compiled with DONT_OPTIMIZE so that we can see all of the stack frames clearly. Then upload the output here. Attaching to the process is as simple as: gdb -p <pid-of-asterisk> By: Tilghman Lesher (tilghman) 2008-01-24 13:47:23.000-0600 Updated patch to current 1.4 SVN By: Tilghman Lesher (tilghman) 2008-01-24 14:09:54.000-0600 New patch generated only against 1.4.17. By: Stuart Henderson (stuarth) 2008-01-26 02:14:58.000-0600 Null pointer dereference in ast_bridged_channel() with 20080124__bug11386.diff.txt (new patch against SVN) - see null_ptr_deref-sthen-20080126.txt for backtrace. By: Digium Subversion (svnbot) 2008-01-27 15:57:14.000-0600 Repository: asterisk Revision: 100465 U branches/1.4/channels/chan_h323.c U branches/1.4/channels/chan_iax2.c U branches/1.4/channels/chan_mgcp.c U branches/1.4/channels/chan_misdn.c U branches/1.4/channels/chan_sip.c U branches/1.4/include/asterisk/sched.h U branches/1.4/main/cdr.c U branches/1.4/main/dnsmgr.c U branches/1.4/main/file.c U branches/1.4/main/rtp.c U branches/1.4/pbx/pbx_dundi.c ------------------------------------------------------------------------ r100465 | tilghman | 2008-01-27 15:57:10 -0600 (Sun, 27 Jan 2008) | 11 lines When deleting a task from the scheduler, ignoring the return value could possibly cause memory to be accessed after it is freed, which causes all sorts of random memory corruption. Instead, if a deletion fails, wait a bit and try again (noting that another thread could change our taskid value). (closes issue ASTERISK-10897) Reported by: flujan Patches: 20080124__bug11386.diff.txt uploaded by Corydon76 (license 14) Tested by: Corydon76, flujan, stuarth` ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=100465 By: Digium Subversion (svnbot) 2008-01-27 16:33:18.000-0600 Repository: asterisk Revision: 100488 _U trunk/ U trunk/channels/chan_h323.c U trunk/channels/chan_iax2.c U trunk/channels/chan_mgcp.c U trunk/channels/chan_misdn.c U trunk/channels/chan_sip.c U trunk/include/asterisk/sched.h U trunk/main/cdr.c U trunk/main/dnsmgr.c U trunk/main/file.c U trunk/main/rtp.c U trunk/pbx/pbx_dundi.c ------------------------------------------------------------------------ r100488 | tilghman | 2008-01-27 16:33:18 -0600 (Sun, 27 Jan 2008) | 19 lines Merged revisions 100465 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.4 ........ r100465 | tilghman | 2008-01-27 15:59:53 -0600 (Sun, 27 Jan 2008) | 11 lines When deleting a task from the scheduler, ignoring the return value could possibly cause memory to be accessed after it is freed, which causes all sorts of random memory corruption. Instead, if a deletion fails, wait a bit and try again (noting that another thread could change our taskid value). (closes issue ASTERISK-10897) Reported by: flujan Patches: 20080124__bug11386.diff.txt uploaded by Corydon76 (license 14) Tested by: Corydon76, flujan, stuarth` ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=100488 |