Summary: | ASTERISK-26043: core: Crash when notifying threadpool listener | ||
Reporter: | Javier Riveros (goseeped) | Labels: | |
Date Opened: | 2016-05-19 15:03:22 | Date Closed: | |
Priority: | Major | Regression? | |
Status: | Open/New | Components: | Core/General |
Versions: | 13.8.0 13.9.1 | Frequency of Occurrence | Frequent |
Related Issues: | |||
Environment: | Attachments: | ( 0) asterisk_core_crash.zip ( 1) full_new.log ( 2) gdb_new.txt | |
Description: | Im getting ramdom crashs in my asterisk boxes, at this point i try to dig what type of call make it happen but i don't have any clue what is happening here, i collect the proper core dumps and logs in other to determine what is going here,
I can say that 99% of my calls are to ARI/stasis app. you will see a coredumps like this : {noformat} #0 0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0) at threadpool.c:185 #0 0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0) at threadpool.c:185 #1 0x00000000005eb506 in queued_active_thread_idle (data=0x7fa2c41d3148) at threadpool.c:241 #2 0x00000000005e411b in ast_taskprocessor_execute (tps=0x2a40ed8) at taskprocessor.c:850 #3 0x00000000005e2803 in default_tps_processing_function (data=0x2a40e58) at taskprocessor.c:183 #4 0x00000000005f8fab in dummy_start (data=0x2a40720) at utils.c:1237 #5 0x00007fa3108b3182 in start_thread (arg=0x7fa30c17d700) at pthread_create.c:312 #6 0x00007fa30fb7347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 #0 0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0) at threadpool.c:185 active_size = 1246756870 idle_size = 825112882 #1 0x00000000005eb506 in queued_active_thread_idle (data=0x7fa2c41d3148) at threadpool.c:241 pair = 0x7fa2c41d3148 #2 0x00000000005e411b in ast_taskprocessor_execute (tps=0x2a40ed8) at taskprocessor.c:850 local = {local_data = 0x24500000020, data = 0x2a3b280} t = 0x7fa2c4210f60 size = 6170346 __PRETTY_FUNCTION__ = "ast_taskprocessor_execute" #3 0x00000000005e2803 in default_tps_processing_function (data=0x2a40e58) at taskprocessor.c:183 listener = 0x2a40e58 tps = 0x2a40ed8 pvt = 0x2a3b270 sem_value = 202886864 res = 0 __PRETTY_FUNCTION__ = "default_tps_processing_function" #4 0x00000000005f8fab in dummy_start (data=0x2a40720) at utils.c:1237 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = { 140338259285760, -2854846373934348535, 0, 0, 140338259286464, 140338259285760, -2854846373942737143, 2821387285705049865}, __mask_was_saved = 0}}, __pad = {0x7fa30c17cef0, 0x0, 0x0, 0x0}} __cancel_routine = 0x450d23 <ast_unregister_thread> __cancel_arg = 0x7fa30c17d700 __not_first_call = 0 ret = 0x0 a = {start_routine = 0x5e276d <default_tps_processing_function>, data = 0x2a40e58, name = 0x2a3c300 "default_tps_processing_function started at [ 200] taskprocessor.c default_listener_start()"} #5 0x00007fa3108b3182 in start_thread (arg=0x7fa30c17d700) at pthread_create.c:312 __res = <optimized out> pd = 0x7fa30c17d700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140338259285760, {noformat} *note:* sometimes the crash doesn't dump core files. | ||
Comments: | By: Asterisk Team (asteriskteam) 2016-05-19 15:03:23.185-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Javier Riveros (goseeped) 2016-05-19 15:05:39.100-0500 This are the core dumps files and the full log. By: Richard Mudgett (rmudgett) 2016-05-19 18:01:03.429-0500 Core files are *useless* except to the machine that generated them (or an exact clone of the machine which is unlikely anyone else will have). https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace By: Javier Riveros (goseeped) 2016-05-19 18:42:35.651-0500 Sorry i have an script that collect the files and by error send the whole folder, hope there is something interesting in gdb.txt file though ! By: Rusty Newton (rnewton) 2016-05-24 09:45:56.629-0500 Thanks for all the debug. Can you additionally provide the dialplan in use as well as a description of how calls flow on the system? We typically need as much detail about the system as possible in scenarios like this. How many channels are up when this issue typically occurs? By: Javier Riveros (goseeped) 2016-05-24 10:30:25.106-0500 Normally our calls has the following path. {noformat} Carrier -> SIP Gateway -> Asterisk ( IVR) "Find agent" -> SIP gateway -> WebRTC endpoint. {noformat} So call comes from carrier go to our gateway then it hit the the stasis app it push the call into a AGI (where the IVR logic lives for now) then once it find an agent stasis app create a new call leg and dials a WebRTC agent. Once the call is stablished sometimes the agent put on hold the caller and perform a outbound call and put everybody on the bridge this is made through an API. {noformat} Asterisk ( stablished bridge) +1XXXXXXX -> SIP gateway -> carrier ; then bridge the call with the agent {noformat} So in this process of managing the bridge seems like the crash happens. *This is how our dialplan looks like* {code} [general] static=yes writeprotect=no clearglobalvars=no [entry_point] exten => 100,1,Agi(agi://127.0.0.1/entry_point) [pre_bridge] exten => s,1,Agi(agi://127.0.0.1/prebridge) same => n,Return() [predial] exten => _X.,1,Set(CHANNEL(hangup_handler_push)=hangup_handler,s,1()) same => n,Agi(agi://127.0.0.1/predial) same => n,Return() [page] exten => _X.,1,Agi(agi://127.0.0.1/local_page) same => n,Return() [hangup_handler] exten => s,1,Agi(agi://127.0.0.1/hangup_handler) same => n,Return() ;The following packX monster variable is there since ARI does’t allow to pass more than one var at a time to the stasis app so thats the packX :) [default] exten => _[+a-z0-9].,1,Set(packX=X-callID=${PJSIP_HEADER(read,X-callID)}X,X-customerID=${PJSIP_HEADER(read,X-customerID)}X,X-instigatorID=${PJSIP_HEADER(read,X-instigatorID)}X,X-carrierID=${PJSIP_HEADER(read,X-carrierID)}X,X-endpointAddress=${PJSIP_HEADER(read,X-endpointAddress)}X,) same => n,Set(CallID=${PJSIP_HEADER(read,X-callID)}) same => n,Stasis(switchapp) same => n,Stasis(${PJSIP_HEADER(read,X-callID)}) same => n,Hangup() exten => SPY,1,Agi(agi://127.0.0.1/Dialplan) same => n,Hangup() [outboundwait] exten => _[+*a-z0-9@].,1,wait(2) same => n,hangup() [agentwait] exten => _[+a-z0-9].,1,answer(); same => n,Set(loopcount=${EXTEN}); same => n,Set(loopcount=${MATH(${loopcount}-1)}) same => n,noop(${EXTEN}); same => n,GotoIf($[${loopcount} < 1]?hup) same => n,playback(beep) same => n,wait(2) same => n,Goto(agentwait,s,3) exten => _[+a-z0-9].,100(hup),hangup() {code} Typically it happens few channels no more than ~20 channels . By: Rusty Newton (rnewton) 2016-05-25 18:06:00.416-0500 Thanks for the additional information! By: Javier Riveros (goseeped) 2016-06-01 11:26:29.232-0500 Hey Guys I get more clear Full/coredumps {{full_new.log and gdb_new.txt}} logs this has the SIP traces so is more clear to see the what type of call make it crash, it was a PSTN -> WebRTC call the call was setup fine but at the time the call is hangup by the Caller and leaving the bridge it is when the {{threadpool_send_state_changed}} make asterisk crash. This is how i have the pjsip.conf file just in case. {noformat} [transport_simple_tcp] type=transport protocol=tcp bind=0.0.0.0 external_media_address=xx.xx.xx.xx [transport_simple_udp] type=transport protocol=udp bind=0.0.0.0 external_media_address=xx.xx.xx.xx [aor_proxy_udp] type=aor contact=sip:proxy.somedomain.com qualify_frequency=30 ; AOR to dial endpoints with TCP/webrtc [aor_proxy_tcp] type=aor contact=sip:proxy.somedomain.com\;transport=tcp qualify_frequency=30 [sip-standard] type=endpoint context=default disallow=all allow=ulaw allow=alaw aors=aor_proxy_udp direct_media=no rtp_symmetric=yes media_address=xx.xx.xx.xx force_rport=yes media_encryption=no ice_support=no media_use_received_transport=yes timers=no [sip-webrtc] type=endpoint context=default allow=ulaw allow=alaw aors=aor_proxy_tcp direct_media=no rtp_symmetric=yes media_address=xx.xx.xx.xx force_rport=yes use_avpf=yes ice_support=yes media_encryption=dtls dtls_setup=actpass dtls_verify=fingerprint dtls_cert_file=/etc/ssl/star.somedomain.com/cert.pem dtls_private_key=/etc/ssl/star.somedomain.com/key.pem media_use_received_transport=yes timers=no timers_sess_expires=90 timers_min_se=90 {noformat} I though this is good information for the ticket :) *note* this new coredumps files were with 13.9.1 version |