
Summary: ASTERISK-29535: Segmentation fault in libasteriskpj.so.2
Reporter: Daniel Bonazzi (dbonazzi)
Labels: patch
Date Opened: 2021-07-29 08:30:05
Date Closed: 2022-03-30 15:13:02
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: pjproject/pjsip
Versions: 18.5.1
Frequency of Occurrence: Constant
Related Issues:
    is duplicated by ASTERISK-29919 CRASH ASTERISK
    is related to ASTERISK-29411 Crash in pjsip_msg_find_hdr_by_name
Environment: Linux 34104asterisk 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Attachments:
( 0) core.34104asterisk-2021-07-27T09-16-30-0400-brief.txt
( 1) core.34104asterisk-2021-07-27T09-16-30-0400-full.txt
( 2) core.34104asterisk-2021-07-27T09-16-30-0400-info.txt
( 3) core.34104asterisk-2021-07-27T09-16-30-0400-locks.txt
( 4) core.34104asterisk-2021-07-27T09-16-30-0400-thread1.txt
( 5) use_tdata_pool.diff
( 6) valgrind_2021-08-05_08_09_18
( 7) valgrind_2021-08-05_08_10_15
( 8) valgrind_2021-08-05_14_48_49
( 9) valgrind_2021-08-06_13_46_56
(10) valgrind_2021-08-06_13_52_27
Description: I've been seeing segmentation faults in Asterisk with no apparent cause.

This is what shows on the system logs:
{noformat}
Jul 27 09:16:30 34104asterisk kernel: asterisk[6556]: segfault at 0 ip 00007fcc4298678f sp 00007fcacf8ab938 error 6 in libasteriskpj.so.2[7fcc42871000+168000]
Jul 27 09:16:30 34104asterisk asterisk[223321]: /usr/sbin/safe_asterisk: line 171: 223349 Segmentation fault      (core dumped) nice -n $PRIORITY "${ASTSBINDIR}/asterisk" -f ${CLIARGS} ${ASTARGS} > /dev/${TTY} 2>&1 < /dev/${TTY}
{noformat}

Checking the logs of our system, which connects to Asterisk via ARI, against the Asterisk logs, we could see some cases where it crashed after calls to the ARI API were made in a certain order, such as:
{noformat}
POST /ari/channels/1627495070.126/snoop?app=stasis-&spy=both&whisper=none'

POST /ari/channels/1627495070.126/moh

DELETE /ari/channels/1627495070.126

POST /ari/channels/1627495070.126/moh
{noformat}
The crashes do not happen every time those requests reach Asterisk out of order, but every time it crashed I could see this pattern.
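
For anyone trying to reproduce this, here is a minimal sketch of replaying that request sequence with libcurl. The host, port, credentials, Stasis app name, and channel id below are placeholders, not values from our system:
{noformat}
/* Sketch of replaying the ARI request sequence above (placeholder values). */
#include <curl/curl.h>
#include <stdio.h>

static void ari_request(CURL *curl, const char *method, const char *path)
{
	char url[256];

	snprintf(url, sizeof(url), "http://127.0.0.1:8088%s", path);
	curl_easy_setopt(curl, CURLOPT_URL, url);
	curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, method); /* POST/DELETE, no body */
	curl_easy_setopt(curl, CURLOPT_USERPWD, "ariuser:aripass"); /* placeholder credentials */

	if (curl_easy_perform(curl) != CURLE_OK) {
		fprintf(stderr, "%s %s failed\n", method, path);
	}
}

int main(void)
{
	CURL *curl;
	const char *chan = "1627495070.126"; /* placeholder channel id */
	char path[256];

	curl_global_init(CURL_GLOBAL_DEFAULT);
	curl = curl_easy_init();
	if (!curl) {
		return 1;
	}

	/* Same ordering as in the logs: snoop, moh, delete, then moh again. */
	snprintf(path, sizeof(path), "/ari/channels/%s/snoop?app=myapp&spy=both&whisper=none", chan);
	ari_request(curl, "POST", path);
	snprintf(path, sizeof(path), "/ari/channels/%s/moh", chan);
	ari_request(curl, "POST", path);
	snprintf(path, sizeof(path), "/ari/channels/%s", chan);
	ari_request(curl, "DELETE", path);
	snprintf(path, sizeof(path), "/ari/channels/%s/moh", chan);
	ari_request(curl, "POST", path);

	curl_easy_cleanup(curl);
	curl_global_cleanup();
	return 0;
}
{noformat}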
Comments:
By: Asterisk Team (asteriskteam) 2021-07-29 08:30:10.390-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Kevin Harwell (kharwell) 2021-07-29 16:20:28.538-0500

Crashing in the same area as ASTERISK-29411, so probably related.

By: Allan Rossi Lisboa (allanlisboa) 2021-08-09 08:04:55.751-0500

The error in this second valgrind log (valgrind_2021-08-05_08_10_15) is easy to reproduce with our stack.

I don't know if it directly relates to the SEGFAULT.

It seems to be a concurrency issue with something changing the value of "rtp->themssrc_valid".

From what I gathered, it could be coming from "__rtp_sendto", where the lock on "instance" is released:

{noformat}
ao2_ref(ice, +1);
if (instance == transport) {
	ao2_unlock(instance);
}
status = pj_ice_sess_send_data(ice->real_ice, component, temp, len);
ao2_ref(ice, -1);
if (instance == transport) {
	ao2_lock(instance);
}
{noformat}
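
To illustrate the suspected window, here is a minimal standalone sketch (not Asterisk code; the struct and function names are made up): one thread drops the lock around the blocking send while another thread changes a field on the same shared structure.
{noformat}
/* Hypothetical illustration of the race window described above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct fake_rtp {
	pthread_mutex_t lock;
	unsigned int themssrc_valid; /* field suspected of changing mid-send */
};

static struct fake_rtp rtp = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void *sender(void *arg)
{
	unsigned int before;

	(void) arg;
	pthread_mutex_lock(&rtp.lock);
	before = rtp.themssrc_valid;
	/* Lock released around the blocking send, as in __rtp_sendto */
	pthread_mutex_unlock(&rtp.lock);
	usleep(1000); /* stands in for pj_ice_sess_send_data() */
	pthread_mutex_lock(&rtp.lock);
	if (rtp.themssrc_valid != before) {
		printf("value changed while the lock was released\n");
	}
	pthread_mutex_unlock(&rtp.lock);
	return NULL;
}

static void *updater(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&rtp.lock);
	rtp.themssrc_valid = 1; /* e.g. SSRC learned from an incoming packet */
	pthread_mutex_unlock(&rtp.lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, sender, NULL);
	pthread_create(&b, NULL, updater, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}
{noformat}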


By: Allan Rossi Lisboa (allanlisboa) 2021-08-09 08:09:57.564-0500

I haven't had a chance yet to investigate this valgrind log (valgrind_2021-08-05_14_48_49), starting at the problem in "Thread 19".

This log contains the errors in the previous 2 logs.

By: Allan Rossi Lisboa (allanlisboa) 2021-08-09 08:17:00.922-0500

In "valgrind_2021-08-06_13_46_56" and "valgrind_2021-08-06_13_52_27" we were running Asterisk on a machine with 16 cores, following a hunch that the problem was caused by concurrency, and we got these different error messages.

We also tried a patch for the first error in PJSIP, which just checks whether the source and destination of the copy are the same address and skips the copy in that case.
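
A minimal sketch of that kind of guard (hypothetical names, not the actual patch) would be:
{noformat}
#include <string.h>

/* Hypothetical guard: only copy when the destination differs from the
 * source, to avoid copying a buffer onto itself. */
static void copy_if_different(char *dst, const char *src, size_t len)
{
	if (dst != src) {
		memcpy(dst, src, len);
	}
}
{noformat}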

By: Allan Rossi Lisboa (allanlisboa) 2021-08-19 06:56:23.345-0500

We've been able to narrow down the cause of this problem a bit.

This is our PJSIP configuration:

{noformat}
[clickproxytrunk]
type=aor
contact=sip:<my hostname>:5060

[clickproxytrunk]
type=endpoint
context=acproxycontext
rtp_symmetric=yes
force_rport=yes
rewrite_contact=yes
disallow=all
allow=ulaw
aors=clickproxytrunk

[clickproxytrunk]
type=identify
endpoint=clickproxytrunk
match=<my hostname>
{noformat}


This is the flow of what we identified as relevant:
1) We start the call through ARI
2) Asterisk correctly receives the POST
3) ARI's web socket receives information about channel created
4) 30 seconds pass and the web socket receives channel destroyed with cause 18, AST_CAUSE_NO_USER_RESPONSE
4.1) Some of our calls have 25s timeout (most have 60s). Those that have 25s timeout returned with a different cause but caused the same problem.
5) PJSIP starts creating the INVITE request and breaks when reading a header. Here we noticed that, although it consistently breaks at the same header, other data is already corrupted. The core dump we debugged showed the enum that holds the SIP header type with incorrect values (very high integers), and strings that were not properly terminated or had incorrect lengths, resulting in garbage output.

Another thing we noticed in that last step was that the "msg" being processed had the IP for <my hostname> from PJSIP's configuration. On a hunch that DNS resolution was taking too long, and that PJSIP kept running even after ARI had already reported the channel as destroyed, we changed that PJSIP configuration to <my valid public IP>, and the problem, which had been happening ~5 times a day, went away.

Before acting on that hunch we checked the code path that the ARI requests go through and found this comment in app_dial.c: "XXX this code is highly suspicious, as it essentially overwrites the outgoing channel without properly deleting it."

That raised some eyebrows and gave more confidence that the timing between these actions was the cause of the problem.

tl;dr: A channel created through ARI that takes too long for PJSIP to process (for instance, when DNS resolution is slow) can have its data corrupted or freed while PJSIP is still allowed to process it, potentially causing a SEGFAULT.
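
For reference, the workaround mentioned above amounts to replacing the hostname in the AOR contact with the already-resolved address; 203.0.113.10 below is only a placeholder for <my valid public IP>:
{noformat}
[clickproxytrunk]
type=aor
contact=sip:203.0.113.10:5060
{noformat}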

By: N A (InterLinked) 2022-02-10 06:22:08.722-0600

This backtrace from ASTERISK-29846 appears to be related to this issue: https://issues.asterisk.org/jira/secure/attachment/61194/core-asterisk-2022-02-10T00-46-55Z-full.txt

By: Kevin Harwell (kharwell) 2022-02-28 13:05:13.064-0600

Using CentOS?

If possible next time it crashes please use the {{ast_coredumper}} script to collect the full core file and libraries. Here's an example of the command to run after a crash:
{noformat}
$ /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps --no-default-search <path_to_coredump>
{noformat}

Then post a link where the files can be downloaded from. If you don't want to post it publicly please send that information to asteriskteam@digium.com with the issue number in the subject. Note, by sending the information privately you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/

Thanks!

By: Kevin Harwell (kharwell) 2022-03-09 10:49:25.287-0600

I have still been unable to replicate the issue, but after some investigation I found a suspect area in the code. Please apply the attached [^use_tdata_pool.diff] patch and let me know if it fixes the issue for you.

By: Friendly Automation (friendly-automation) 2022-03-30 15:13:03.854-0500

Change 18253 merged by Friendly Automation:
res_pjsip_header_funcs: wrong pool used tdata headers

[https://gerrit.asterisk.org/c/asterisk/+/18253|https://gerrit.asterisk.org/c/asterisk/+/18253]

By: Friendly Automation (friendly-automation) 2022-03-30 15:13:55.985-0500

Change 18273 merged by Friendly Automation:
res_pjsip_header_funcs: wrong pool used tdata headers

[https://gerrit.asterisk.org/c/asterisk/+/18273|https://gerrit.asterisk.org/c/asterisk/+/18273]

By: Friendly Automation (friendly-automation) 2022-03-30 15:15:52.393-0500

Change 18272 merged by Kevin Harwell:
res_pjsip_header_funcs: wrong pool used tdata headers

[https://gerrit.asterisk.org/c/asterisk/+/18272|https://gerrit.asterisk.org/c/asterisk/+/18272]

By: Friendly Automation (friendly-automation) 2022-03-30 16:09:30.722-0500

Change 18274 merged by Kevin Harwell:
res_pjsip_header_funcs: wrong pool used tdata headers

[https://gerrit.asterisk.org/c/asterisk/+/18274|https://gerrit.asterisk.org/c/asterisk/+/18274]