
Summary: ASTERISK-25275: A11 SIGSEGV from pjnath check_cached_response (ast_rtcp_read -> pj_stun_session_on_rx_pkt)
Reporter: Dade Brandon (dade)
Labels:
Date Opened: 2015-07-22 13:01:45
Date Closed: 2017-12-18 05:31:57.000-0600
Priority: Major
Regression? No
Status: Closed/Complete
Components:
Versions: 11.18.0
Frequency of Occurrence: Frequent
Related Issues:
is duplicated by ASTERISK-25853 segfault in libpjnath.so.2
is duplicated by ASTERISK-25401 Segmentation-fault crash within pjnath
is duplicated by ASTERISK-25617 Asterisk 11 segfaults in pj_stun_session_on_rx_pkt
is duplicated by ASTERISK-25658 Random segmentation fault for asterisk webrtc
is duplicated by ASTERISK-25671 Asterisk often gets a SIGSEGV, Segmentation fault
is duplicated by ASTERISK-25699 Segfault in check_cached_response
is duplicated by ASTERISK-25871 Asterisk deadlock when using confbridge
is related to ASTERISK-25274 A11 SIGSEGV 'Double free or corruption' in backtrace from pj_pool_release (sip_destroy -> pj_ice_sess_destroy)
Environment: Ubuntu 14.04.2; Linux 3.13.0-24-generic SMP; Intel E3-1231; OpenSSL 1.0.1f-1ubuntu2.15 (Jun 11 2015; most recent available); libsrtp0 / libsrtp0-dev 1.4.5~20130609~dfsg-1
Attachments:
( 0) 2-1-phx-crash-jul23-510PST-backtrace.txt
( 1) 2-1-phx-crash-jul23-510PST-debuglog.p.txt.gz
( 2) 6-2-phx-crash_jul_22_1043am_backtrace.txt
( 3) 6-2-phx-crash_jul_22_1043am.p.txt.gz
( 4) atlas-backtrace-july22_2015.txt
( 5) backtrace.2015-12-23_1412.txt
( 6) commit_log.txt
( 7) crash_asterisk_13.tar.gz
( 8) debug5.log
( 9) fenrir-debug_more-aug17c.zip
(10) fenrir-debug-aug17.zip
(11) fullbt-aug17c.txt
Description: This may be a duplicate of my other just-created issue, ASTERISK-25274. However, since the backtrace has a different signal point, I am following previous instructions to create separate issues.

We have the patch from ASTERISK-25103 applied to trunk 11, along with a few custom patches (mostly just debug messages). The following crash occurs infrequently (1-5 times per week, usually batched together and on the same server(s)). Based on the pattern, I imagine that a remote factor, such as a slow peer, influences whether or not the crash occurs.

A full backtrace and debug log will be attached shortly after this issue is created. Here is a snippet from the top of the backtrace, to assist in reviewing the issue:

{noformat}
Program terminated with signal SIGSEGV, Segmentation fault.
#0 check_cached_response
#1 pj_stun_session_on_rx_pkt ()
#2 pj_ice_sess_on_rx_pkt ()
#3 __rtp_recvfrom
(instance=0xvalidptr, buf=0xvalidptr, size=8192, flags=0, sa=validptr, rtcp=1)
{noformat}
Comments:

By: Asterisk Team (asteriskteam) 2015-07-22 13:01:46.643-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2015-07-23 17:39:40.040-0500

You say you are using "trunk 11" in your recent issues. Can you post the last commit in your Git 11 branch repo for both of the issues? That way we'll know exactly which revision of 11 you are using.

By: Rusty Newton (rnewton) 2015-07-24 18:51:24.033-0500

Dade, can you post new logs the next time the issue occurs, including a SIP trace (sip set debug on)?

By: Asterisk Team (asteriskteam) 2015-08-15 12:00:22.260-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened, please create a new issue instead. If the new issue is related to this one, a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Nicole McIntosh (atna99) 2015-08-17 17:38:09.098-0500

New debug log added, this time with SIP debug enabled.

Also attached is the full backtrace.

By: Nicole McIntosh (atna99) 2015-08-18 15:37:06.514-0500

This is a longer and better version of the previous debug log. This contains the sip trace information for the call that was happening when the crash occurred.

Useful identifiers:
Call-ID: k1lpln5covu6200bqra9
Call Code: C-00005a40

By: Michael Balen (aeinstein) 2015-12-23 08:25:49.062-0600

This error drives me crazy. I get it several times a day. It also affects all Asterisk versions (11, 13), including the latest git version (GIT-master-f0b4375M) with the latest pjproject 2.4.5-svn. Clients are Chrome 47 with wss, SDES encryption, and wss over stunnel. By the way, the problem is the same with DTLS, and with or without stunnel.

By: Dade Brandon (dade) 2015-12-23 13:10:45.241-0600

You can work around this 100% in Asterisk-11 by putting "cache_res=PJ_FALSE;" in res/pjproject/pjnath/src/pjnath/stun_session.c at the top of the function pj_stun_session_send_message (i.e. right before the PJ_ASSERT_RETURN macro call, but after "pj_status_t status;").

That prevents the corruption of cached_response_list, because nothing is ever added to that list structure. The corruption occurs because the timer that is created to remove items from the cache doesn't lock. This will also likely solve all of your deadlocks, since sometimes the corruption causes a null pointer dereference, and other times check_cached_response gets stuck in its loop, because "t" will never be == &sess->cached_response_list.
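
As an illustration of the failure mode described above, here is a minimal sketch of a circular list with a sentinel head. It is not pjproject's code: `struct node`, `list_init`, `list_push_back`, `list_find`, and the `max_steps` guard are all hypothetical names made up for this example. With an intact list the walk always ends back at the sentinel; once a link is corrupted, the walk either hits a NULL pointer or never reaches the sentinel again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached-response entry; not pjproject's type. */
struct node {
    struct node *prev;
    struct node *next;
    int key;
};

/* Turn the sentinel into an empty circular list (it points at itself). */
static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
    head->key = -1;
}

/* Link n in just before the sentinel, i.e. at the tail of the list. */
static void list_push_back(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk from the first entry until we are back at the sentinel.
 * max_steps and *corrupt are guards added for this sketch only: without
 * them, a NULL link would crash and a cycle that skips the sentinel
 * would loop forever.
 */
static struct node *list_find(struct node *head, int key,
                              size_t max_steps, int *corrupt)
{
    struct node *t = head->next;
    size_t steps = 0;

    assert(head != NULL && corrupt != NULL);
    *corrupt = 0;
    while (t != head) {
        if (t == NULL || steps++ >= max_steps) {
            *corrupt = 1;   /* broken link or endless cycle detected */
            return NULL;
        }
        if (t->key == key)
            return t;
        t = t->next;
    }
    return NULL;            /* reached the sentinel: key is not cached */
}
```

The step bound is only a diagnostic aid for the sketch; it detects a corrupted list but does not explain or prevent the corruption itself.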

Asterisk 11 includes an older version of pjproject. -Newer pjproject appears to have solved this via a new group lock mechanism & an added lock in the timer that evicts from the cached_response_list. I've found it to be difficult to backport the group lock mechanism, although one of my co-workers has been working on it, and will probably be able to contribute that work back to the 11 trunk after the holidays- _(text deleted because of reports that newer pjproject is affected by the same bug and is resolved by the same patch, so the list corruption must be occurring for a reason other than the unlocked timer I mentioned)_.

I've reviewed the use of the cached_response_list, and it appears to be trivial, other than for protocol compliance. Keep in mind that if you update pjproject in Asterisk-11, it has to be done in a manner which statically links the libs, the same way Asterisk-11 already does. That makes it pretty difficult to do. I did it as a test before coming up with the cache_res=PJ_FALSE workaround, and had difficulty because even after fixing the non-backwards-compatible function calls in res_rtp_asterisk.c, we had new crashes (we didn't bother to debug them, since that implied an unknown amount of non-backwards-compatible changes to how Asterisk-11 would need to work with pjproject). I am just saying this because I'm guessing there's at least a chance that you'd installed the updated pjproject as a shared lib, and Asterisk-11 was not using it.

By: Michael Balen (aeinstein) 2015-12-23 15:37:04.112-0600

OK, I will try it, but for your interest: I have patched pjproject 2.4.5 into asterisk-11.20 by changing the session constructor call in res_rtp_asterisk to pass NULL for the group lock, as in Asterisk-13 (pj_ice_sess_create(&stun_config, NULL, PJ_ICE_SESS_ROLE_UNKNOWN, 2, &ast_rtp_ice_sess_cb, &ufrag, &passwd, NULL, &rtp->ice)). Same result.
I have also used the latest svn version of pjproject in Asterisk-13; you can look at my debug file. I've quickly checked the timer callback and it seems you are right: on_cache_timeout does not lock. That means the problem will also exist in the current pjproject svn version. Thanks for your help. If your workaround works, you have made my day. Happy Christmas.

By: Marius Cojocariu (marius) 2016-01-05 03:38:36.476-0600

Hi,

I'm getting the same crash using the v13 branch of Asterisk (release 13.5.0). I can confirm that upgrading to 13.6.~ does not solve this issue.

I tried to apply the workaround listed here but cannot find pjnath in the res folder. I tried to edit the pjnath in the pjproject lib that is linked (version 2.4.5) but cannot find the pj_stun_session_send_message function.

I have attached the results for bt, bt full, thread apply all bt gdb commands in crash_asterisk_13.tar.gz as suggested in my own issue which was marked as a duplicate.

I'm waiting for asterisk to crash again and will attach a debug log with pjsip set logger on and also the backtraces again.

By: Michael Balen (aeinstein) 2016-01-05 03:44:13.778-0600

I can confirm that the workaround does its job.
In Asterisk-13, pjproject is separate. Path: pjproject/pjnath/src/pjnath
Asterisk-11 path: asterisk-11.20.0/res/pjproject/pjnath/src/pjnath
The correct name of the function is pj_stun_session_send_msg in stun_session.c
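
For readers who want to see the effect of the workaround in isolation, here is a minimal, self-contained model of it. It is not the pjproject source: `struct sketch_session`, `sketch_send_msg`, `SKETCH_FALSE`, and the `workaround` switch are hypothetical stand-ins, and the real function takes more parameters. The sketch only assumes what the comments above state, namely that the send function receives a `cache_res` flag and that forcing it to false at the top means nothing is ever added to the response cache.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FALSE 0      /* stands in for PJ_FALSE in this sketch */

/* Hypothetical session: only counts responses that would be cached. */
struct sketch_session {
    size_t cached_count;
};

/*
 * Hypothetical stand-in for the send function. The `workaround`
 * parameter exists only so both behaviours can be compared; in the
 * workaround described above the assignment is unconditional.
 */
static int sketch_send_msg(struct sketch_session *sess, int cache_res,
                           int workaround)
{
    if (workaround)
        cache_res = SKETCH_FALSE;   /* override the caller's request */

    assert(sess != NULL);           /* plays the role of the argument check */

    if (cache_res)
        sess->cached_count++;       /* would append to the cache list */

    return 0;                       /* success */
}
```

With the override in place the cache stays empty, so a lookup over it has nothing to walk. The cost is that retransmitted requests are no longer answered from the cache.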


By: Marius Cojocariu (marius) 2016-01-05 05:03:27.319-0600

OK, I applied the workaround and am now waiting to see whether Asterisk crashes again. I will come back with the results.

By: Richard Odekerken (richard_o) 2016-01-06 02:51:15.263-0600

We can confirm that the patch provided by Dade Brandon gets rid of the crashes. Thank you Dade!

By: Marius Cojocariu (marius) 2016-01-08 08:19:55.273-0600

I can also confirm that the workaround does its job. Thank you!

By: Andreas Krüger (woopstar) 2016-01-12 03:17:34.110-0600

Confirmed to work here too.

Should this be added as an issue to Trac on pjsip?

By: Abhay Gupta (agupta) 2016-02-10 06:47:10.991-0600

OK, I will also try this and see if it solves the issue.

By: Vasilii Rogin (roginvs) 2016-02-23 07:23:07.827-0600

We had problems with Dade Brandon's patch: some customers completely lost the ability to make calls because of missing or one-way audio, so we had to revert it.
Maybe this workaround has to be used with some specific configuration options?

By: Thomas Guebels (tguescaux) 2016-02-29 04:06:33.939-0600

Hi,

It seems the crash could be fixed by this change in pjproject: https://trac.pjsip.org/repos/changeset/5233
See also: https://trac.pjsip.org/repos/ticket/1903

By: Joshua C. Colp (jcolp) 2017-12-18 05:31:57.620-0600

This was fixed upstream, and since we now use system PJSIP or bundled PJSIP in Asterisk, this is no longer applicable.