
Summary: ASTERISK-27210: Getting segfault in res_pjsip.so and libasteriskpj.so.2
Reporter: Jeppe Ryskov Larsen (ryskov)
Labels:
Date Opened: 2017-08-21 04:36:25
Date Closed: 2020-01-14 11:14:07.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: 14.5.0 14.6.2
Frequency of Occurrence: One Time
Related Issues:
Environment: Ubuntu 16.04.2 LTS
Attachments: ( 0) 20171027_asterisk-ASTERISK-27210-results.tar.gz
( 1) asterisk-ASTERISK-27210-results.tar.gz
Description: Suddenly we got these 4 segfaults in a relatively short timespan.

{code}
Aug 21 10:38:28 osl1-voip-cluster01-asterisk05 kernel: [28594.890030] asterisk[27799]: segfault at 1e0 ip 00007f341b163500 sp 00007f34331a1a08 error 4 in res_pjsip.so[7f341b151000+49000]
Aug 21 10:40:43 osl1-voip-cluster01-asterisk05 kernel: [28729.766816] asterisk[5930]: segfault at 1e0 ip 00007fe2552bb500 sp 00007fe0779b1a08 error 4 in res_pjsip.so[7fe2552a9000+49000]
Aug 21 10:41:52 osl1-voip-cluster01-asterisk05 kernel: [28799.083666] asterisk[7569]: segfault at 18 ip 00007f7828c780f8 sp 00007f7722f21940 error 4 in libasteriskpj.so.2[7f7828b6f000+15a000]
Aug 21 10:42:44 osl1-voip-cluster01-asterisk05 kernel: [28850.946232] asterisk[9174]: segfault at 90 ip 00007f8c9fd630e8 sp 00007f8ba1d18940 error 4 in libasteriskpj.so.2[7f8c9fc5a000+15a000]
{code}
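
A rough way to compare kernel lines like these across restarts is to subtract the mapping base (the first number in the trailing brackets) from the `ip` value, which gives an offset inside the module. The sketch below does only that arithmetic; the helper name `segfault_module_offset` is hypothetical and not part of Asterisk or pjproject. Under this calculation both res_pjsip.so crashes above land at offset 0x12500, which suggests they hit the same instruction.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract "ip" and the "[base+size]" mapping from one
 * kernel segfault line and store ip - base in *offset.
 * Returns 0 on success, -1 if the line cannot be parsed or ip lies
 * outside the reported mapping (for example when ip is "(null)").
 */
int segfault_module_offset(const char *line, uint64_t *offset)
{
    uint64_t ip, base, size;
    const char *ip_field;
    const char *map_field;

    assert(line != NULL && offset != NULL);

    ip_field = strstr(line, " ip ");   /* instruction pointer field */
    map_field = strrchr(line, '[');    /* last bracket holds base+size */
    if (ip_field == NULL || map_field == NULL)
        return -1;

    if (sscanf(ip_field, " ip %" SCNx64, &ip) != 1)
        return -1;
    if (sscanf(map_field, "[%" SCNx64 "+%" SCNx64 "]", &base, &size) != 2)
        return -1;

    if (ip < base || ip - base >= size)
        return -1;

    *offset = ip - base;
    return 0;
}
```

On a build with debug symbols, the resulting offset could then be passed to something like `addr2line -e res_pjsip.so 0x12500`, assuming the bracketed base is where the start of the library file is mapped.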

After investigating the circumstances during that timespan, I saw no behaviour out of the ordinary. No changes have been made in the last months, and this has never occurred before.

Sadly, this is running in our production system, where we have debug turned off, so I cannot provide a backtrace. I was hoping someone has seen something similar before, or can point us in the right direction for collecting more information so we can debug this further.
Comments:

By: Asterisk Team (asteriskteam) 2017-08-21 04:36:26.101-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2017-08-22 07:42:29.384-0500

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that lead to the crash.
3. All configuration information necessary to reproduce the crash.

Thanks!

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

There's not enough information here to suggest what could be happening; we really do need that backtrace.

By: Asterisk Team (asteriskteam) 2017-09-05 12:00:01.795-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Andreas Krüger (woopstar) 2017-10-25 03:01:39.803-0500

So, it happened again, and this time we had Asterisk running compiled with both DONT_OPTIMIZE and BETTER_BACKTRACES, at version 14.6.2.

{code}
root@osl1-voip-cluster01-asterisk04:/tmp# grep "segfault" /var/log/syslog
Oct 25 09:17:06 osl1-voip-cluster01-asterisk04 kernel: [25182.454189] asterisk[30679]: segfault at 18 ip 00007fda0f679142 sp 00007fd80345e840 error 4 in libasteriskpj.so.2[7fda0f542000+18b000]
Oct 25 09:18:22 osl1-voip-cluster01-asterisk04 kernel: [25257.949956] asterisk[31947]: segfault at 0 ip           (null) sp 00007f56b45a5818 error 14 in asterisk[400000+313000]
Oct 25 09:19:32 osl1-voip-cluster01-asterisk04 kernel: [25328.246808] asterisk[1309]: segfault at 1e0 ip 00007f21f2e915a0 sp 00007f21f31d6828 error 4 in res_pjsip.so[7f21f2e7f000+49000]
Oct 25 09:42:21 osl1-voip-cluster01-asterisk04 kernel: [26697.588256] asterisk[17275]: segfault at 90 ip 00007f8c5a3b0134 sp 00007f8a0745e840 error 4 in libasteriskpj.so.2[7f8c5a279000+18b000]
Oct 25 09:43:34 osl1-voip-cluster01-asterisk04 kernel: [26770.113511] asterisk[19141]: segfault at 90 ip 00007fc41d7a0134 sp 00007fc2d6272840 error 4 in libasteriskpj.so.2[7fc41d669000+18b000]
{code}
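
For reading the `error` field in lines like these: on x86-64 the kernel prints the page-fault error code in hex, so `error 14` means 0x14. The sketch below decodes the architectural bits; the struct and function names are hypothetical and exist only for illustration. Under this decoding, `error 4` is a user-mode read of a page that is not mapped (which fits the small fault addresses 18, 90 and 1e0, typical of reading a member through a NULL pointer), and `error 14` is a user-mode instruction fetch, which fits the line where `ip` is (null).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the decoded x86 page-fault error code bits. */
struct pf_error {
    bool protection;   /* bit 0: set = protection violation, clear = page not present */
    bool write;        /* bit 1: set = write access, clear = read access */
    bool user;         /* bit 2: set = fault happened in user mode */
    bool reserved;     /* bit 3: reserved bit set in a page-table entry */
    bool instr_fetch;  /* bit 4: set = fault on an instruction fetch */
};

/* Decode the value printed after "error" (already converted from hex). */
struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection  = (code & 0x01UL) != 0;
    e.write       = (code & 0x02UL) != 0;
    e.user        = (code & 0x04UL) != 0;
    e.reserved    = (code & 0x08UL) != 0;
    e.instr_fetch = (code & 0x10UL) != 0;
    return e;
}

/* Self-check against the two codes that appear in the log lines above. */
void check_logged_codes(void)
{
    struct pf_error read_fault = decode_pf_error(0x4);
    struct pf_error fetch_fault = decode_pf_error(0x14);

    assert(read_fault.user && !read_fault.write && !read_fault.protection);
    assert(fetch_fault.user && fetch_fault.instr_fetch);
}
```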

Respective coredumps:

{code}
Oct 25 09:17:06 -> core.asterisk.1508915826
Oct 25 09:18:22 -> core.asterisk.1508915902
Oct 25 09:19:32 -> core.asterisk.1508915972
Oct 25 09:42:21 -> core.asterisk.1508917341
Oct 25 09:43:34 -> core.asterisk.1508917414
{code}

I have attached a tarball with all the core dumps, which have been processed by the following command:
{code}
/var/lib/asterisk/scripts/ast_coredumper --tarball-uniqueid=ASTERISK-27210 --tarball-results
{code}

By: Andreas Krüger (woopstar) 2017-10-25 03:02:10.518-0500

Core dump of segfault for ASTERISK-27210

By: Jeppe Ryskov Larsen (ryskov) 2017-10-25 03:06:13.386-0500

Commenting to reopen after attachment of backtrace etc.

By: Asterisk Team (asteriskteam) 2017-10-25 03:06:13.583-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Kevin Harwell (kharwell) 2017-10-25 18:13:37.171-0500

I only see the one crash in core.asterisk.1508915902-full.txt
{noformat}
Thread 1 (Thread 0x7f21f31d8700 (LWP 1309)):
#0  ast_sip_failover_request (tdata=0x0) at res_pjsip.c:3821
       via = <optimized out>
#1  0x00007f21f355b751 in check_request_status (inv=inv@entry=0x7f21cc00a4a8, e=0x7f21f31d69d0) at res_pjsip_session.c:2589
       session = <optimized out>
       tsx = 0x42059a8
#2  0x00007f21f355b823 in session_inv_on_state_changed (inv=0x7f21cc00a4a8, e=0x7f21f31d69d0) at res_pjsip_session.c:2673
       session = 0x7f21cc043888
       type = <optimized out>
       __PRETTY_FUNCTION__ = "session_inv_on_state_changed"
{noformat}
It appears the transaction's last_tx (the message kept for re-transmissions) is missing. A debug log would be helpful, but I understand if that is hard to obtain. Is there anything in the log at all around the time of the crash? Errors/Warnings?

By: Andreas Krüger (woopstar) 2017-10-27 03:07:22.927-0500

I think you are looking at the "wrong" dump. The first dump (the first crash so to speak) is core.asterisk.1508915826.

It just happened today again. I will attach the log. Thread 1 in the dump today seems to be "equal" to the Thread 1 dump from core.asterisk.1508915826

We suspect it seems to happen when a contact registers.

By: Andreas Krüger (woopstar) 2017-10-27 03:10:39.116-0500

New coredump from when the server segfaults

By: Andreas Krüger (woopstar) 2017-10-27 04:15:04.038-0500

We suspect there is a locking issue in PJSIP.

From what we can see (please correct us if we are wrong), Asterisk is crashing at ../src/pj/lock.c:290, in the function grp_lock_acquire().

The function is as follows:

{code}
static pj_status_t grp_lock_acquire(LOCK_OBJ *p)
{
   pj_grp_lock_t *glock = (pj_grp_lock_t*)p;
   grp_lock_item *lck;

   pj_assert(pj_atomic_get(glock->ref_cnt) > 0);

   lck = glock->lock_list.next;
   while (lck != &glock->lock_list) {

       pj_lock_acquire(lck->lock);
       lck = lck->next;

   }
   grp_lock_set_owner_thread(glock);
   pj_grp_lock_add_ref(glock);
   return PJ_SUCCESS;
}
{code}

According to the dump:

{code}
glock = 0x7f1100042df8
lck = 0x0
{code}

So lck is NULL.

It looks like it then calls pj_lock_acquire(lck->lock) while lck is NULL, and dereferencing lck->lock would cause the segfault?

So the question is: why would lck = glock->lock_list.next return NULL? Is that ever valid?
Is a NULL check simply missing in this function?
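
To illustrate the question, here is a minimal sketch with a simplified stand-in for the list. It assumes the lock list is a circular list with a sentinel head, which is what the `while (lck != &glock->lock_list)` loop above implies. The type and function names are hypothetical and are not pjlib's. In such a list, `next` on a live head is never NULL, because an empty list points back at itself, so a NULL `next` hints that the memory was zeroed or reused after the object was destroyed, rather than that a NULL check is missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a lock list entry; not pjlib's real definition. */
struct lock_item {
    struct lock_item *prev;
    struct lock_item *next;
    int lock_id;
};

/* An empty circular list: the sentinel head points at itself. */
void list_init(struct lock_item *head)
{
    head->prev = head;
    head->next = head;
}

void list_push_back(struct lock_item *head, struct lock_item *node)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/*
 * Walk the list the way the loop in grp_lock_acquire() does.
 * Returns the number of entries, or -1 if a NULL link is found,
 * which cannot happen while the list is alive and intact.
 */
int list_walk(const struct lock_item *head)
{
    int count = 0;
    const struct lock_item *it;

    assert(head != NULL);
    it = head->next;
    while (it != head) {
        if (it == NULL)
            return -1;   /* the real code would dereference NULL here */
        count++;
        it = it->next;
    }
    return count;
}

/* Simulate the owning memory being cleared or reused after destruction. */
void simulate_destroy(struct lock_item *head)
{
    memset(head, 0, sizeof(*head));
}
```

In this model, adding a NULL check to the walk would only stop the crash at this one spot; another thread touching the same destroyed object could still fail somewhere else.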

By: Andreas Krüger (woopstar) 2017-10-27 04:43:17.180-0500

Could be related to ASTERISK-26675 ?

By: Kevin Harwell (kharwell) 2017-10-27 15:41:51.085-0500

So yes, the crashes basically show that the object, or parent object, in question no longer exists (has been freed/released). Which thread tries to use the released object first determines where it crashes, which is why the crashes show up in different places.
{quote}Could be related to ASTERISK-26675 ?{quote}
The one crash does look similar to the one posted on that issue; however, it could still be a separate problem. The reporter on that issue stated that using bundled fixed the issue for them. Are you using pjproject-bundled with Asterisk?

By: Andreas Krüger (woopstar) 2017-10-28 01:40:23.252-0500

Yes we are using the bundled version.

We compile unixODBC-2.3.4 and mysql-connector-odbc-5.3.8-linux-ubuntu16.04-x86-64bit

Then we configure Asterisk with ./configure --with-crypto --with-ssl --with-srtp --with-cap --with-pjproject-bundled

where we enable and disable modules etc before we compile.

We also disabled the Sorcery cache to make sure it was not a cache related problem

By: Andreas Krüger (woopstar) 2017-10-28 01:48:02.401-0500

I am wondering what we can do to find the issue, as the random segfaults are a major problem in our production environment at the moment. What can we provide?

By: Kevin Harwell (kharwell) 2017-10-30 15:59:04.919-0500

A couple of things to try:

As this is possibly a problem in pjproject/pjsip itself, you could try upgrading to the latest release of Asterisk, 15.1.0. New versions of Asterisk were just released, and both 13.18.0 and 15.1.0 have upgraded bundled to use PJPROJECT 2.7. It's possible something changed between pjproject 2.6 and 2.7 that fixed the problem.

I hesitate to recommend this but, barring the above, you could try using the patch that's on ASTERISK-26675. I hesitate because in briefly looking at the patch it appears that while it may stop the crash from happening it doesn't address the reason why it is crashing in the first place.

Even if that patch works though, it'd just be a workaround until we know what the underlying problem is. Without debug or some kind of logging information it's going to be hard to move forward on this issue.

By: Asterisk Team (asteriskteam) 2017-11-15 12:00:01.023-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Jeppe Ryskov Larsen (ryskov) 2017-12-05 06:49:02.067-0600

Thanks for the tips. We tried upgrading to 15.1.3, but it seems like it still uses PJPROJECT 2.6, so not much to gain there.

By: Asterisk Team (asteriskteam) 2017-12-05 06:49:02.427-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Richard Mudgett (rmudgett) 2017-12-05 07:35:06.025-0600

Yeah.  v15.1.x just missed the pjproject upgrade.  It will be in the v15.2.0 release.  You could try the tip of the 15 branch in git to get the newer bundled pjproject (2.7.1).

By: Andreas Krüger (woopstar) 2017-12-05 07:41:57.847-0600

Should we upgrade the libSRTP also then? I saw the commit for libSRTP 2.0.

Current install has:

{code}
# dpkg -l | grep srtp
ii  libsrtp0                            1.4.5~20130609~dfsg-1                      amd64        Secure RTP (SRTP) and UST Reference Implementations - shared library
ii  libsrtp0-dev                        1.4.5~20130609~dfsg-1                      amd64        Secure RTP (SRTP) and UST Reference Implementations - development files
{code}

By: Richard Mudgett (rmudgett) 2017-12-05 08:12:04.285-0600

The crash you report is dealing with SIP messaging and the use of a pjproject structure after it is destroyed; not RTP traffic.

Are you using SRTP for your media?
See https://wiki.asterisk.org/wiki/display/AST/libsrtp about supported libsrtp versions.

By: Andreas Krüger (woopstar) 2017-12-05 09:34:45.724-0600

Yes I know. I was just wondering if it could be related too.

Reading that page, it states that after Nov. 27 only libsrtp 1.5.4 and above is supported, while 2.0 is not yet fully supported but is under test.

Looking through Debian, Ubuntu, etc., libsrtp is currently at version 1.4.5 in stable releases. The guide states that you should use install_prereq to install dependencies. By default that script tries to install version 1.4.5 from aptitude.
There seems to be a version mismatch here?

By: Richard Mudgett (rmudgett) 2017-12-05 10:57:57.660-0600

Sigh.  The install_prereq script is a useful user contributed helper script.  It is a very simple script which simply installs needed packages.  It does not know nor care which version your system is going to install from packages.  The package maintainers for your distribution may even have backported some of the critical fixes to libsrtp for their "older" version of libsrtp.  If your system actually doesn't have a libsrtp package installed, the script can get an unpackaged version when you use the install-unpackaged option.  The unpackaged version the script installs was updated to 2.0 early last month by a community member.

By: Joshua C. Colp (jcolp) 2017-12-05 12:49:10.218-0600

If this does occur after trying the new version of PJSIP the console output with as much detail as possible before the crash would also be useful.

By: Andreas Krüger (woopstar) 2017-12-05 13:12:45.562-0600

@Richard

The wiki just mentions the install_prereq script all over the guides, so I think people would think it is the "right" way to prepare the server before you compile Asterisk.
I do agree that the distros often backport the fixes, I'm simply just stating it is kinda misleading and the only thing we want to do is to be "up to date" and not be using old faulty versions of any library.
I did see the script got updated with libSRTP 2.0 but was just curious as the wiki stated it was yet not ready for production.
We are almost about to contribute with a complete production ready Asterisk running inside Docker, so this was simply just useful for us.

@Joshua
We will test PJSIP 2.7 when it is released within the 15 branch and then see if the issue is resolved.


By: Corey Farrell (coreyfarrell) 2017-12-05 14:01:06.029-0600

If you are using debian your libsrtp-1.4.5 has at least some backported fixes.  See debian security advisories.
https://security-tracker.debian.org/tracker/CVE-2013-2139
https://security-tracker.debian.org/tracker/CVE-2015-6360

It's absolutely possible other issues have been fixed by 1.5.4 which debian has not backported, I'm just pointing out that debian is not ignoring libsrtp issues.

By: Asterisk Team (asteriskteam) 2017-12-20 12:00:01.952-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines