[Home]

Summary:ASTERISK-26043: core: Crash when notifying threadpool listener
Reporter:Javier Riveros (goseeped)Labels:
Date Opened:2016-05-19 15:03:22Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:Core/General
Versions:13.8.0 13.9.1 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Attachments:( 0) asterisk_core_crash.zip
( 1) full_new.log
( 2) gdb_new.txt
Description:Im getting ramdom crashs in my asterisk boxes, at this point i try to dig what type of call make it happen but i don't have any clue what is happening here, i collect the proper core dumps and logs in other to determine what is going here,

I can say that 99% of my calls are to ARI/stasis app.

you will see a coredumps like this :

{noformat}
#0  0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0)
   at threadpool.c:185
#0  0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0)
   at threadpool.c:185
#1  0x00000000005eb506 in queued_active_thread_idle (data=0x7fa2c41d3148)
   at threadpool.c:241
#2  0x00000000005e411b in ast_taskprocessor_execute (tps=0x2a40ed8)
   at taskprocessor.c:850
#3  0x00000000005e2803 in default_tps_processing_function (data=0x2a40e58)
   at taskprocessor.c:183
#4  0x00000000005f8fab in dummy_start (data=0x2a40720) at utils.c:1237
#5  0x00007fa3108b3182 in start_thread (arg=0x7fa30c17d700)
   at pthread_create.c:312
#6  0x00007fa30fb7347d in clone ()
   at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#0  0x00000000005eb3e4 in threadpool_send_state_changed (pool=0x7fa2c40393b0)
   at threadpool.c:185
       active_size = 1246756870
       idle_size = 825112882
#1  0x00000000005eb506 in queued_active_thread_idle (data=0x7fa2c41d3148)
   at threadpool.c:241
       pair = 0x7fa2c41d3148
#2  0x00000000005e411b in ast_taskprocessor_execute (tps=0x2a40ed8)
   at taskprocessor.c:850
       local = {local_data = 0x24500000020, data = 0x2a3b280}
       t = 0x7fa2c4210f60
       size = 6170346
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#3  0x00000000005e2803 in default_tps_processing_function (data=0x2a40e58)
   at taskprocessor.c:183
       listener = 0x2a40e58
       tps = 0x2a40ed8
       pvt = 0x2a3b270
       sem_value = 202886864
       res = 0
       __PRETTY_FUNCTION__ = "default_tps_processing_function"
#4  0x00000000005f8fab in dummy_start (data=0x2a40720) at utils.c:1237
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {
               140338259285760, -2854846373934348535, 0, 0, 140338259286464,
               140338259285760, -2854846373942737143, 2821387285705049865},
             __mask_was_saved = 0}}, __pad = {0x7fa30c17cef0, 0x0, 0x0, 0x0}}
       __cancel_routine = 0x450d23 <ast_unregister_thread>
       __cancel_arg = 0x7fa30c17d700
       __not_first_call = 0
       ret = 0x0
       a = {start_routine = 0x5e276d <default_tps_processing_function>,
         data = 0x2a40e58,
         name = 0x2a3c300 "default_tps_processing_function started at [  200] taskprocessor.c default_listener_start()"}
#5  0x00007fa3108b3182 in start_thread (arg=0x7fa30c17d700)
   at pthread_create.c:312
       __res = <optimized out>
       pd = 0x7fa30c17d700
       now = <optimized out>
       unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140338259285760,
{noformat}

*note:* sometimes the crash doesn't dump core files.
Comments:By: Asterisk Team (asteriskteam) 2016-05-19 15:03:23.185-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Javier Riveros (goseeped) 2016-05-19 15:05:39.100-0500

This are the core dumps files and the full log.

By: Richard Mudgett (rmudgett) 2016-05-19 18:01:03.429-0500

Core files are *useless* except to the machine that generated them (or an exact clone of the machine which is unlikely anyone else will have).

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Javier Riveros (goseeped) 2016-05-19 18:42:35.651-0500

Sorry i have an script that collect the files and by error send the whole folder, hope there is something interesting in gdb.txt file though !

By: Rusty Newton (rnewton) 2016-05-24 09:45:56.629-0500

Thanks for all the debug.

Can you additionally provide the dialplan in use as well as a description of how calls flow on the system?

We typically need as much detail about the system as possible in scenarios like this.

How many channels are up when this issue typically occurs?

By: Javier Riveros (goseeped) 2016-05-24 10:30:25.106-0500

Normally our calls has the following path.

{noformat}
Carrier -> SIP Gateway -> Asterisk ( IVR) "Find agent" -> SIP gateway -> WebRTC endpoint.
{noformat}

So call comes from carrier go to our gateway then it hit the the stasis app it push the call into a AGI (where the IVR logic lives for now) then once it find an agent stasis app create a new call leg and dials a WebRTC agent.

Once the call is stablished sometimes the agent put on hold the caller and perform a outbound call and put everybody  on the bridge this is  made through an API.

{noformat}
Asterisk ( stablished bridge) +1XXXXXXX -> SIP gateway -> carrier ; then bridge the call with the agent
{noformat}

So in this process of managing the bridge seems like the crash happens.

*This is how our dialplan looks like*
{code}
[general]
static=yes
writeprotect=no
clearglobalvars=no

[entry_point]
exten => 100,1,Agi(agi://127.0.0.1/entry_point)

[pre_bridge]
exten => s,1,Agi(agi://127.0.0.1/prebridge)
same => n,Return()

[predial]
exten => _X.,1,Set(CHANNEL(hangup_handler_push)=hangup_handler,s,1())
same => n,Agi(agi://127.0.0.1/predial)
same => n,Return()

[page]
exten => _X.,1,Agi(agi://127.0.0.1/local_page)
same => n,Return()

[hangup_handler]
exten => s,1,Agi(agi://127.0.0.1/hangup_handler)
same => n,Return()

;The following packX monster variable is there since ARI does’t allow to pass more than one var at a time to the stasis app so thats the packX :)

[default]
exten => _[+a-z0-9].,1,Set(packX=X-callID=${PJSIP_HEADER(read,X-callID)}X,X-customerID=${PJSIP_HEADER(read,X-customerID)}X,X-instigatorID=${PJSIP_HEADER(read,X-instigatorID)}X,X-carrierID=${PJSIP_HEADER(read,X-carrierID)}X,X-endpointAddress=${PJSIP_HEADER(read,X-endpointAddress)}X,)
same => n,Set(CallID=${PJSIP_HEADER(read,X-callID)})
same => n,Stasis(switchapp)
same => n,Stasis(${PJSIP_HEADER(read,X-callID)})
same => n,Hangup()

exten => SPY,1,Agi(agi://127.0.0.1/Dialplan)
same => n,Hangup()

[outboundwait]
exten => _[+*a-z0-9@].,1,wait(2)
same => n,hangup()

[agentwait]
exten => _[+a-z0-9].,1,answer();
same => n,Set(loopcount=${EXTEN});
same => n,Set(loopcount=${MATH(${loopcount}-1)})
same => n,noop(${EXTEN});
same => n,GotoIf($[${loopcount} < 1]?hup)
same => n,playback(beep)
same => n,wait(2)
same => n,Goto(agentwait,s,3)
exten => _[+a-z0-9].,100(hup),hangup()
{code}

Typically it happens few channels no more than ~20 channels .

By: Rusty Newton (rnewton) 2016-05-25 18:06:00.416-0500

Thanks for the additional information!

By: Javier Riveros (goseeped) 2016-06-01 11:26:29.232-0500

Hey Guys I get more clear Full/coredumps {{full_new.log and gdb_new.txt}} logs this has the SIP traces so is more clear to see the what type of call make it crash, it was a PSTN -> WebRTC call the call was setup fine but at the time the call is hangup by the Caller and leaving the bridge it is when the {{threadpool_send_state_changed}} make asterisk crash.

This is how i have the pjsip.conf file just in case.

{noformat}

[transport_simple_tcp]
type=transport
protocol=tcp
bind=0.0.0.0
external_media_address=xx.xx.xx.xx

[transport_simple_udp]
type=transport
protocol=udp
bind=0.0.0.0
external_media_address=xx.xx.xx.xx

[aor_proxy_udp]
type=aor
contact=sip:proxy.somedomain.com
qualify_frequency=30

; AOR to dial endpoints with TCP/webrtc
[aor_proxy_tcp]
type=aor
contact=sip:proxy.somedomain.com\;transport=tcp
qualify_frequency=30

[sip-standard]
type=endpoint
context=default
disallow=all
allow=ulaw
allow=alaw
aors=aor_proxy_udp
direct_media=no
rtp_symmetric=yes
media_address=xx.xx.xx.xx
force_rport=yes
media_encryption=no
ice_support=no
media_use_received_transport=yes
timers=no

[sip-webrtc]
type=endpoint
context=default
allow=ulaw
allow=alaw
aors=aor_proxy_tcp
direct_media=no
rtp_symmetric=yes
media_address=xx.xx.xx.xx
force_rport=yes
use_avpf=yes
ice_support=yes
media_encryption=dtls
dtls_setup=actpass
dtls_verify=fingerprint
dtls_cert_file=/etc/ssl/star.somedomain.com/cert.pem
dtls_private_key=/etc/ssl/star.somedomain.com/key.pem
media_use_received_transport=yes
timers=no
timers_sess_expires=90
timers_min_se=90
{noformat}

I though this is good information for the ticket :)

*note* this new coredumps files were with 13.9.1 version