ASTERISK-25941: chan_pjsip: Crash on an immediate SIP final response

[Home]

Summary: ASTERISK-25941: chan_pjsip: Crash on an immediate SIP final response

Reporter: Javier Riveros (goseeped) Labels:

Date Opened: 2016-04-19 15:04:17 Date Closed: 2016-05-20 18:37:17

Priority: Major Regression?

Status: Closed/Complete Components: Resources/res_ari Resources/res_pjsip

Versions: 13.8.0 13.8.1 13.8.2 13.9.0 Frequency of
Occurrence Frequent

Related
Issues:
is duplicated by ASTERISK-26079 Regular Segfaults of Asterisk w/ARI

Environment: Attachments: ( 0) backtrace.txt
( 1) dialog_find_refinement.diff
( 2) full_after_patch_1.zip
( 3) full_debug
( 4) gdb_after_patch_1.txt
( 5) gdb.txt
( 6) kern.log
( 7) session_end_response.diff

Description: *Description*
I start experience constant asterisk crashes on our pro asterisk boxes seems like when the box is a little busy (+10 calls) asterisk is more susceptible.

So the issue seems to be for outbound/inbound calls, i see this more frequent in outbound calls. Doing a outbound call into the stasis app and get a fast busy signal causes race condition opening the door to crash occur.

1) outbound call endpoint -> asterisk -> trunk.
2) trunk (486 Busy Here) -> asterisk (crash).

You will see a segfault like this :
{noformat}
2016-04-19 18:41:30.783 | info | kernel[-] |[436624.469629] asterisk[3737]: segfault at c0 ip 00000000004ce35b sp 00007f00ee7446d0 error 4 in asterisk[400000+2bc000]
{noformat}

And a coredumps like this:

{noformat}
#0 0x00000000004ce35b in ast_channel_dialed_causes_add (chan=0x0, cause_code=0x7f96ba940860, datalen=109) at channel_internal_api.c:1448
ao2_cause_code = 0x0
#1 0x00000000004b7423 in ast_channel_hangupcause_hash_set (chan=0x0, cause_code=0x7f96ba940860, datalen=109) at channel.c:4356
causevar = '\000' <repeats 16 times>, "@\t\224\272\226\177\000\000X\264\f\354\226\177\000\000p\a\224\272\226\177\000\000\236\341J\000\000\000\000\000\240\a\224\272\226\177\000\000h\316\f\354\226\177\000\000@\b\224\272\226\177\000\000\212\347J\000\000\000\000\000m\000\000\000\000\000\000\000`\b\224\272\226\177\000\000\260\a\224\272!\000\000\000h\316\f\354\226\177\000\000\004\000\000\000\000\000\000\000!", '\000' <repeats 23 times>, "m", '\000' <repeats 39 times>, "`\b\224\272\226\177", '\000' <repeats 59 times>...
__PRETTY_FUNCTION__ = "ast_channel_hangupcause_hash_set"
#2 0x00007f96c35293cf in chan_pjsip_incoming_response (session=0x7f96ec0cb458, rdata=<optimized out>) at chan_pjsip.c:2319
status = {code = 486, reason = {ptr = <optimized out>, slen = 9}}
cause_code = 0x7f96ba940860
data_size = 109
__PRETTY_FUNCTION__ = "chan_pjsip_incoming_response"
#3 0x00007f96e33cf846 in handle_incoming_response (type=<optimized out>, response_priority=AST_SIP_SESSION_AFTER_MEDIA, rdata=0x7f97049443d8,
session=0x7f96ec0cb458) at res_pjsip_session.c:2323
supplement = 0x7f96ec0ccab0
status = {code = <optimized out>, reason = {ptr = 0x7f9704945740 "Busy HereV\224\004\227\177", slen = 9}}
#4 handle_incoming (session=session@entry=0x7f96ec0cb458, rdata=0x7f97049443d8, response_priority=response_priority@entry=AST_SIP_SESSION_AFTER_MEDIA,
type=<optimized out>) at res_pjsip_session.c:2337
No locals.
#5 0x00007f96e33d15e8 in session_inv_on_tsx_state_changed (inv=0x7f96ec0cb218, tsx=0x7f9718010868, e=0x7f96ba940ad0) at res_pjsip_session.c:2512
cb = 0x0
session = 0x7f96ec0cb458
tdata = 0x7f96ba940ad0
__PRETTY_FUNCTION__ = "session_inv_on_tsx_state_changed"
#6 0x00007f972448b2d4 in mod_inv_on_tsx_state (tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip-ua/sip_inv.c:699
dlg = 0x7f9718048e08
inv = 0x7f96ec0cb218
#7 0x00007f97244d0fb0 in pjsip_dlg_on_tsx_state (dlg=0x7f9718048e08, tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip/sip_dialog.c:2013
i = 2
#8 0x00007f97244d17c0 in mod_ua_on_tsx_state (tsx=0x7f9718010868, e=0x7f96ba940ad0) at ../src/pjsip/sip_ua_layer.c:178
dlg = 0x7f9718048e08
#9 0x00007f97244ca70c in tsx_set_state (tsx=0x7f9718010868, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG,
event_src=0x7f97049443d8) at ../src/pjsip/sip_transaction.c:1213
e = {prev = 0x7f9704004ef0, next = 0x7f97040050c8, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7f97049443d8}, tsx_state = {src = {
rdata = 0x7f97049443d8, tdata = 0x7f97049443d8, timer = 0x7f97049443d8, status = 76825560, data = 0x7f97049443d8},
tsx = 0x7f9718010868, prev_state = 3, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7f97049443d8}, tx_error = {tdata = 0x7f97049443d8,
tsx = 0x7f9718010868}, rx_msg = {rdata = 0x7f97049443d8}, user = {user1 = 0x7f97049443d8, user2 = 0x7f9718010868, user3 = 0x300000003,
user4 = 0x7f97049443d8}}}
prev_state = PJSIP_TSX_STATE_PROCEEDING

....
{noformat}

*To reproduce*
I was able to reproduce this easy i just do the following. with two asterisk boxes one acting as a carrier/trunk "Asterisk A" and another with my basic stasis app "Asterisk B".

1) on "Asterisk A" i just create a simple dialplan that response 486 Busy or another 4XX code, this simulating our carrier given a fast 486 busy signal.

{code}
exten => _[+][1][7]X.,1,Noop("test")
same => n,Busy()
same => n,Hangup()

exten => _[+][1][8]X.,1,Noop("test")
same => n,Dial(SIP/1002) ; this endpoint isn't register
same => n,Hangup()
{code}

2) on "Asterisk B" in other to add a little of load i just stress the box a little bit until load average were like 2 "normal busy load average" {{sudo stress -d 2 --hdd-bytes 512M}}

3) on "Asterisk B" once the the box is busy i just make couple of outbound calls and asterisk crash, "this simulate an outbound call to a carrier that response a fast busy signal"

There is some external ways i'm workaround this but please asterisk don't let you crash in fast SIP ERROR codes :).

Attached are the coredumps and debug logs anything else is needed let me know.

Comments: By: Asterisk Team (asteriskteam) 2016-04-19 15:04:18.558-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Hugo Castro de Deco (hugodeco) 2016-05-03 17:21:53.188-0500

i'm having the same problem on 13.9.0 rc1

By: Hugo Castro de Deco (hugodeco) 2016-05-10 07:32:52.198-0500

Updating to 13.9.0.rc2 and the problem continues.
By: Mark Michelson (mmichelson) 2016-05-11 12:51:59.892-0500

I am attaching session_end_response.diff to the issue.

I believe the problem here is that there is a race condition caused by the fact that the chan_pjsip channel driver is told that the session has ended, and then is later given the 486 response to handle. This change makes it so that the 486 gets handled first, then the session is ended. Please test this and see if you have the same problem.
By: Javier Riveros (goseeped) 2016-05-16 13:53:04.264-0500

Hey Mark thanks for taking a look to this i just try this diff with asterisk 13.9.1 and i was able to reproduce again and get a crash.

{noformat}
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Function session_inv_on_state_changed called on event TSX_STATE
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The state change pertains to the endpoint 'sip-standard(PJSIP/sip-standard-00000177)'
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The inv session still has an invite_tsx (0x7fcb9805a5a8)
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: There is no transaction involved in this state change
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The current inv state is DISCONNCTD
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Source of transaction state change is RX_MSG
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Received response
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Response is 486 Busy Here
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Function session_inv_on_tsx_state_changed called on event TSX_STATE
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The state change pertains to the endpoint 'sip-standard(PJSIP/sip-standard-00000177)'
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The inv session still has an invite_tsx (0x7fcb9805a5a8)
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The UAC INVITE transaction involved in this state change is 0x7fcb9805a5a8
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The current transaction state is Completed
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The transaction state change event is RX_MSG
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: The current inv state is DISCONNCTD
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Received response
[May 16 17:57:47] DEBUG[26658] res_pjsip_session.c: Response is 486 Busy Here
[May 16 17:57:47] DEBUG[2577] channel.c: Hanging up channel 'PJSIP/sip-standard-00000177'
[May 16 17:58:22] Asterisk 13.9.1 built by root @ ip-172-19-4-82 on a x86_64 running Linux on 2016-05-16 16:04:27 UTC
{noformat}

I will attach the coredumps and full logs you will se something like this.

{code}
#0 0x00007fcba5f22cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fcba5f260d8 in __GI_abort () at abort.c:89
#2 0x00007fcba5f1bb86 in __assert_fail_base (fmt=0x7fcba606c830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fcba6d30a5c "mutex->__data.__owner == 0", file=file@entry=0x7fcba6d30a28 "../nptl/pthread_mutex_lock.c",
line=line@entry=116, function=function@entry=0x7fcba6d30b60 <__PRETTY_FUNCTION__.8500> "__pthread_mutex_lock") at assert.c:92
#3 0x00007fcba5f1bc32 in __GI___assert_fail (assertion=0x7fcba6d30a5c "mutex->__data.__owner == 0", file=0x7fcba6d30a28 "../nptl/pthread_mutex_lock.c",
line=116, function=0x7fcba6d30b60 <__PRETTY_FUNCTION__.8500> "__pthread_mutex_lock") at assert.c:101
#4 0x00007fcba6d2852f in __GI___pthread_mutex_lock (mutex=0xfefefefefefefe00) at ../nptl/pthread_mutex_lock.c:116
#5 0x0000000000539d25 in __ast_pthread_mutex_lock (filename=0x6324eb "channel.c", lineno=1036,
func=0x6379e0 <__PRETTY_FUNCTION__.18225> "__ast_queue_frame", mutex_name=0x6336f0 "chan", t=0x7fcb94d36780) at lock.c:315
#6 0x000000000045d089 in __ao2_lock (user_data=0x7fcb94d367d8, lock_how=AO2_LOCK_REQ_MUTEX, file=0x6324eb "channel.c",
func=0x6379e0 <__PRETTY_FUNCTION__.18225> "__ast_queue_frame", line=1036, var=0x6336f0 "chan") at astobj2.c:167
#7 0x00000000004ae8df in __ast_queue_frame (chan=0x7fcb94d367d8, fin=0x7fcb680d47d0, head=0, after=0x0) at channel.c:1036
#8 0x00000000004af0e8 in ast_queue_frame (chan=0x7fcb94d367d8, fin=0x7fcb680d47d0) at channel.c:1145
#9 0x00000000004af6d4 in ast_queue_control_data (chan=0x7fcb94d367d8, control=AST_CONTROL_PVT_CAUSE_CODE, data=0x7fcb680d48a0, datalen=109)
at channel.c:1242
#10 0x00007fcb587a9c20 in chan_pjsip_incoming_response (session=0x7fcb9c671608, rdata=<optimized out>) at chan_pjsip.c:2319
#11 0x00007fcb69656af0 in handle_incoming_response (type=<optimized out>, response_priority=AST_SIP_SESSION_AFTER_MEDIA, rdata=0x7fcb9821d538,
session=0x7fcb9c671608) at res_pjsip_session.c:2385
#12 handle_incoming (session=session@entry=0x7fcb9c671608, rdata=0x7fcb9821d538, response_priority=response_priority@entry=AST_SIP_SESSION_AFTER_MEDIA,
type=<optimized out>) at res_pjsip_session.c:2402
#13 0x00007fcb69658788 in session_inv_on_tsx_state_changed (inv=0x7fcb9c462af8, tsx=0x7fcb9805a5a8, e=0x7fcb680d4b00) at res_pjsip_session.c:2569
{code}

By: Mark Michelson (mmichelson) 2016-05-17 09:22:45.203-0500

Thanks for the new information. I will get to work and try to figure out what has gone wrong here.
By: Javier Riveros (goseeped) 2016-05-17 10:09:16.705-0500

Cool, anything else you guys need here let me know !
By: Mark Michelson (mmichelson) 2016-05-17 17:49:06.855-0500

I figure that an update is in order here. After investigating the logs some more, I made a realization about how the threads were interacting. Specifically, you'll typically see something like this when you receive a 486 response:

{noformat}
[May 16 17:59:04] DEBUG[2845] res_pjsip/pjsip_distributor.c: No dialog serializer for response (null). Using request transaction as basis
[May 16 17:59:04] DEBUG[2845] res_pjsip/pjsip_distributor.c: Found serializer pjsip/session/sip-standard-000000de on transaction tsx0x7faf0c282538
{noformat}

However, sometimes that second message is not there. Looking into things some more, I found that when that second message was missing, it was because the thread that sent the outgoing INVITE request is not the one that we expect it to be. The result is that the thread processing the 486 response can compete with another thread that is destroying the channel. This causes the crash.

Interestingly, looking at the timing of this, it appears that you will get the INVITE sent in an unexpected thread exactly every 20 seconds. It appears the reason for this is because of PJSIP's asynchronous DNS resolution. The INVITE does not actually get sent in the thread that we think because our attempt to send the INVITE results in an async DNS request. When the DNS result returns, a callback is called on a separate thread and the INVITE is sent there. My guess regarding the 20 second rule is that there is some place where DNS results are being cached on the system. Every 20 seconds, a new DNS query has to be made.

The reason I was unable to reproduce this myself is because I was sending my INVITE requests to a direct IP address instead of involving DNS. My goal now is to determine the proper set of steps to ensure that there are no competing threads in this situation. I hope to have a patch ready tomorrow at some point.
By: Mark Michelson (mmichelson) 2016-05-18 12:25:40.091-0500

I have attached dialog_find_refinement.diff

This is a different approach to fixing the issue. The idea here is to ensure that the 486 and the subsequent hangup are handled by the same thread. I feel confident that this will fix your issue. Please test and let me know if it works for you.
By: Javier Riveros (goseeped) 2016-05-19 12:29:31.221-0500

@Mark thank you now i was unable to make it crash i make like ~3 hours test, so seems like the patch work :)
By: Mark Michelson (mmichelson) 2016-05-20 09:55:12.880-0500

Excellent. The change is now up for review: https://gerrit.asterisk.org/#/c/2881/