[Home]

Summary:ASTERISK-27037: Segfault in pjsip
Reporter:Martin Tomec (matesstar)Labels:
Date Opened:2017-06-07 06:16:27Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:pjproject/pjsip
Versions:13.15.1 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-28504 Asterisk is crashing too frequently whenever a large number of PJSIP AOR are trying to register on asterisk.
is related toASTERISK-28689 res_pjsip: Crash when locking group lock when sending stateful response
Environment:CentOS 7.3Attachments:( 0) backtrace.txt
( 1) endpoint.txt
( 2) pjproject_buildopts.txt
( 3) pjproject_log.txt
Description:We were experiencing frequent asterisk crash in production enviroment. Maybe it was caused by some mallformed webrtc registration. For now we canĀ“t reproduce the issue, so I send only backtrace from coredump. In logs I failed to find any common line for all crashes, only this is suspicious:
{noformat}
[Jun  7 12:10:55] ERROR[21634] res_pjsip.c: Error 171060 'Unsupported transport (PJSIP_EUNSUPTRANSPORT)' sending OPTIONS request to endpoint webrtc_kaplan
{noformat}
Please let mi know if I can get more from coredump or log
Comments:By: Asterisk Team (asteriskteam) 2017-06-07 06:16:28.242-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Marek Cervenka (cervajs) 2017-06-08 08:01:32.314-0500

backtrace from 13.16.0

By: Marek Cervenka (cervajs) 2017-06-08 10:08:45.629-0500

pjsip endpoint configuration for webrtc

By: Martin Tomec (matesstar) 2017-06-08 11:22:55.208-0500

It seems that the {{pj_grp_lock_t}} structure is corupted/destroyed
{noformat}
print *glock
$2 = {base = {lock_object = 0x7fd6d40b7c88, acquire = 0x7fd720136b30 <grp_lock_acquire>, tryacquire = 0x7fd720136a80 <grp_lock_tryacquire>, release = 0x7fd720136c00 <grp_lock_release>, destroy = 0x7fd7201367d0 <grp_lock_destroy>},
 pool = 0x0, ref_cnt = 0x7fd6d40b7d18, own_lock = 0x7fd6d40b7d70, owner = 0x0, owner_cnt = 0, lock_list = {prev = 0x7fd6e00c61d8, next = 0x7fd6e00c61d8, prio = 0, lock = 0x0}, destroy_list = {prev = 0x7fd6fc00e958,
   next = 0x7fd6fc00e958, comp = 0x0, handler = 0x0}}
(gdb) print &(glock->lock_list)
$3 = (grp_lock_item *) 0x7fd6d40b7cd8
(gdb)  print *(glock->ref_cnt)
$4 = {mutex = 0x7fd6d40b7d28, value = 0}
{noformat}
{{glock->lock_list.next}} should point to 0x7fd6d40b7cd8, not 0x7fd6e00c61d8 and {{glock->pool}} shouldnt be NULL.

By: Marek Cervenka (cervajs) 2017-06-09 16:17:50.690-0500

i found similar issue ASTERISK-26675

By: Richard Mudgett (rmudgett) 2017-06-09 16:41:48.267-0500

The crash is happening inside pjproject.  Are you using bundled pjproject?  If not you really should because bundled applies patches to known problems with pjproject that have affected others when using Asterisk.

By: Marek Cervenka (cervajs) 2017-06-10 09:35:11.774-0500

using bundled
pjproject show buildopts in attachment

By: Marek Cervenka (cervajs) 2017-06-10 09:35:45.081-0500

pjproject buildopts

By: Marek Cervenka (cervajs) 2017-06-10 09:38:08.755-0500

found another interesting discussion
http://lists.pjsip.org/pipermail/pjsip_lists.pjsip.org/2017-March/019864.html




By: Marek Cervenka (cervajs) 2017-06-10 09:41:23.036-0500

i dont know if its relevant for our problem but we are using realtime pjsip backend in mysql via ODBC  
(mysql is located in amazon AWS. latency between asterisk VPS and amazon RDS is 8ms)

By: Marek Cervenka (cervajs) 2017-06-10 16:15:27.323-0500

dou you think its good idea compile with PJ_DEBUG or PJ_GRP_LOCK_DEBUG ? is this debug from pjsip logged somewhere via asterisk logging?
how can i specify this params via asterisk configure?
tried like this --with-pjproject-bundled PJPROJECT_CFLAGS='-DPJ_HAS_IPV6=0' but this not works

By: Joshua C. Colp (jcolp) 2017-06-12 09:06:37.552-0500

I don't think this is a problem really in PJSIP itself, but may be in our websocket transport implementation.

What is the Asterisk console output when this happens? Is the Websocket still active? (I don't see it in the backtrace but some threads don't have data - do you have DONT_OPTIMIZE set?)

By: Marek Cervenka (cervajs) 2017-06-12 10:32:23.715-0500

i have DONT_OPTIMIZE but i had mistake in BETTER_BACKTRACES (missing T)
today made new test

segfault almost after asterisk start. last lines from log
[Jun 12 16:31:26] NOTICE[9496]: res_pjsip/pjsip_distributor.c:526 log_failed_request: Request 'REGISTER' from '<sip:undefined@asterisk_pbx>' failed for '127.0.0.1:32832' (callid: 0jq6mf3514eu7of0mphatk) - No matching endpoint found
[Jun 12 16:31:27] NOTICE[9503]: res_pjsip/pjsip_distributor.c:526 log_failed_request: Request 'REGISTER' from '<sip:undefined@asterisk_pbx>' failed for '127.0.0.1:32902' (callid: fpg9ofdbt8b4adv0ug478r) - No matching endpoint found
[Jun 12 16:31:28] NOTICE[9483]: res_pjsip/pjsip_distributor.c:526 log_failed_request: Request 'REGISTER' from '<sip:undefined@asterisk_pbx>' failed for '127.0.0.1:32906' (callid: pbos5hkj3n07rminm18kgl) - No matching endpoint found
[Jun 12 16:31:28] NOTICE[9497]: res_pjsip/pjsip_distributor.c:526 log_failed_request: Request 'REGISTER' from '<sip:undefined@asterisk_pbx>' failed for '127.0.0.1:32854' (callid: 2thlu0hpnr90nmfe1atl04) - No matching endpoint found
[Jun 12 16:31:28] NOTICE[9513]: res_pjsip/pjsip_distributor.c:526 log_failed_request: Request 'REGISTER' from '<sip:undefined@asterisk_pbx>' failed for '127.0.0.1:32978' (callid: klf3fh6nfpcrmvva9qi8kq) - No matching endpoint found

how can i check status of websocket?



By: Marek Cervenka (cervajs) 2017-06-12 10:51:08.024-0500

for the record

when building rpm. pjproject flags works only this way
{noformat}
export CFLAGS="-DPJ_GRP_LOCK_DEBUG=1 -DPJ_LOG_MAX_SIZE=4500 ${ARCHFLAGS} %{optflags}"
%configure --without-tds --without-portaudio --with-pjproject-bundled
{noformat}

but with PJ_GRP_LOCK_DEBUG=1 it segfaults immediatelly


By: Marek Cervenka (cervajs) 2017-06-13 06:04:16.965-0500

debug log from today. core dump is still the same

[Jun 13 12:45:19] DEBUG[8388][C-00000000] audiohook.c: Write factory 0x7fe4b408b888 was pretty quick last time, waiting for them.
[Jun 13 12:45:19] DEBUG[8093] res_odbc.c: Reusing ODBC handle 0x2042610 from class 'configDb'
[Jun 13 12:45:19] DEBUG[8093] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_domain_aliases WHERE id = ?
[Jun 13 12:45:19] DEBUG[8239] pjproject:                sip_endpoint.c Processing incoming message: Request msg REGISTER/cseq=4 (rdata0x7fe4bc022e48)
[Jun 13 12:45:19] DEBUG[8239] res_pjsip/pjsip_distributor.c: Could not find matching transaction for Request msg REGISTER/cseq=4 (rdata0x7fe4bc022e48)
[Jun 13 12:45:19] DEBUG[8239] res_pjsip/pjsip_distributor.c: Calculated serializer pjsip/distributor-00000064 to use for Request msg REGISTER/cseq=4 (rdata0x7fe4bc022e48)
[Jun 13 12:45:19] DEBUG[8239] pjproject:                sip_endpoint.c Distributing rdata to modules: Request msg REGISTER/cseq=4 (rdata0x7fe4b40bd5f8)
[Jun 13 12:45:19] DEBUG[8093] res_config_odbc.c: Parameter 1 ('id') = '1.1.1.1'
[Jun 13 12:45:19] DEBUG[8388][C-00000000] audiohook.c: Write factory 0x7fe4b408b888 was pretty quick last time, waiting for them.
[Jun 13 12:45:19] DEBUG[8093] res_odbc.c: Releasing ODBC handle 0x2042610 into pool
[Jun 13 12:45:19] DEBUG[8093] pjproject:             tsx0x7fe4bc0631a8 .Sending Response msg 200/REGISTER/cseq=2 (tdta0x7fe4bc1e5aa8) in state Destroyed
[Jun 13 12:45:24] Asterisk 13.16.0 built by root @ 003e23b29654 on a x86_64 running Linux on 2017-06-12 14:14:39 UTC


By: Marek Cervenka (cervajs) 2017-06-13 06:36:22.603-0500

pjproject debug log

By: Marek Cervenka (cervajs) 2017-06-13 07:21:22.533-0500

for the webrtc part we are using jssip 3.0.7
found https://github.com/versatica/JsSIP/issues/414 (released in 3.0.13) which can be the reason?


By: Martin Tomec (matesstar) 2017-06-13 10:27:06.802-0500

Acording to log, asterisk function {{ast_sip_send_stateful_response}} sends response to pjsip transaction which is in state destroyed. We have tried a workaround in start of function {{pjsip_tsx_send_msg}}:
{{if (tsx->state == PJSIP_TSX_STATE_DESTROYED) return PJ_EGONE;}}
And it seems to solve the issue. But I am not sure what is correct way to solve it...

By: Martin Tomec (matesstar) 2017-06-16 08:51:25.263-0500

The workaround is not enougth, because the transaction is probably fully deleted. Today we had another segfault with obviously corrupted tsx:
{noformat}
(gdb) print  *((pjsip_transaction*) 0x7fdc180e24c8)
$9 = {pool = 0x7fdc0fc690a8, tsx_user = 0x7fdc0fc690a8,
 endpt = 0x6466377830676c64, terminating = 808989027, grp_lock = 0x38,
 mutex_b = 0x0,
...
{noformat}

By: Marek Cervenka (cervajs) 2017-06-27 04:40:48.515-0500

the same segfault today
hypothesis about source problem in jssip 3.0.7 is gone




By: seanchann.zhou (seanchann) 2018-10-12 22:54:16.502-0500

same segfault with our, how to fix it?

By: Joshua C. Colp (jcolp) 2018-10-14 09:31:00.568-0500

[~seanchann] This issue is open and unresolved, if there was a fix for it then it would have gone up for review already.

By: Martin Tomec (matesstar) 2018-11-07 04:23:52.163-0600

Now we use version 13.23.1 and it is without segfaults. It seems that version upgrade, or some configuration change helped. For now I am unable to replicate the issue. Which version do you use?