
Summary: ASTERISK-26310: Crash occurs every 24 - 48 hours with backtrace log showing fault related to pjsip hash
Reporter: Gaston Mendez (gastonxpander)
Labels:
Date Opened: 2016-08-20 13:41:25
Date Closed: 2017-12-20 06:07:53.000-0600
Priority: Critical
Regression?:
Status: Closed/Complete
Components: pjproject/pjsip
Versions: 13.10.0, 13.11.0
Frequency of Occurrence: Frequent
Related Issues:
duplicates ASTERISK-25439 Segfault in find_entry () from /usr/lib/libpj.so.2 (dns_resolver, qualify_contact)
is related to ASTERISK-26344 Asterisk 13.11.0 + PJSIP crash
Environment: Asterisk 13.10.0 running on fully updated Centos 7 linux 64bit. We also have a second backtrace showing the same ../src/pj/hash.c:181 in the (gdb) bt output from a second asterisk server running Asterisk 13.11.0-rc1 so we think we are crashing the same way across the 2 latest versions of asterisk 13.
Attachments:
( 0) asterisk_full_08-29-2016-0924a.txt
( 1) asterisk_full_08302016_0620p.txt
( 2) asterisk_full_08302016_0624p.txt
( 3) backtrace_08302016_0620p.txt
( 4) backtrace_08302016_0624p.txt
( 5) backtrace13-10-0.txt
( 6) backtrace13-10-0-on-08-29-2016-0930a.txt
( 7) backtrace13-11-0.txt
( 8) full_log_13-10-0.txt
( 9) full_log_13-11-0.txt
(10) modules.conf.txt
(11) pjsip.conf.txt
(12) rtp.conf.txt
(13) udptl.conf.txt
Description: We are trying to put an Asterisk 13 server into production, and this is our first time using pjsip. Once we reach a loaded beta of about 20 active calls, we experience crashes that are unpredictable, with no visible error and no commonality between crashes. They are not load dependent: we have seen crashes at low points during the day with literally 1 - 2 active calls running. The only certainty is that, under the steady everyday load of a 2-week beta, the server crashes at least every 48 hours, and more often about every 24 hours. It crashes with no visible error or complaint in the Asterisk messages or full logs, which are very clean and quiet. The core dump points to line 181 of ../src/pj/hash.c, and the only known commonality between crashes is that we have at least 2 backtraces from 2 different servers citing this same line of code in the backtrace (gdb bt), like this:
{noformat}
#0  find_entry (lower=0, entry_buf=0x0, hval=0x7f52cc5412cc, val=0x0, keylen=258, key=0x7f52cc541310, ht=<optimized out>, pool=0x0) at ../src/pj/hash.c:181
{noformat}
{noformat}
181 if (entry->hash==hash && entry->keylen==keylen &&
{noformat}
It seems we are triggering some instability in pjsip/Asterisk. We are not doing anything outside the norm of what we did on older versions of Asterisk. Asterisk logs no errors at any time and, other than this roughly once-a-day crash, Asterisk 13 runs very cleanly and performs well with no other complaint. We have reason to believe we have triggered an Asterisk/pjsip bug. There are no exact steps to trigger it; it seems it can happen as long as there is at least 1 active call. It has happened about once every 24-48 hours over a span of 2 weeks, so the only way to 'reproduce' it is to wait 48 hours, as we have been doing. We have multiple backtraces and are attaching 2 that show the exact same source file and line number. As stated in the environment section, we are crashing on 2 servers; the second is an identical, fully yum-updated 64-bit CentOS 7 Linux machine running Asterisk 13.11.0-rc1. We will attach everything we have from both servers to this bug report, and we hope we can stabilize the system asap.
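
For readers unfamiliar with the code around the faulting line: the comparison quoted at hash.c:181 has the shape of a chained hash-table lookup. The following is a minimal C sketch of that general pattern, not the pjlib source. The names `entry_t`, `table_t`, `hash_bytes`, `insert`, and `lookup` are hypothetical and chosen only for illustration. It shows why a lookup of this kind depends on every chain pointer being valid: each candidate entry is dereferenced before it is compared, so a stale or corrupted pointer in a chain would fault on exactly that kind of line. Whether that is what happened here is not established by the backtrace alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration only; NOT the pjlib definitions. */
typedef struct entry {
    struct entry *next;    /* next entry in the same bucket chain */
    const void   *key;     /* key bytes, owned by the caller */
    uint32_t      hash;    /* cached hash of the key */
    size_t        keylen;  /* key length in bytes */
    void         *value;
} entry_t;

typedef struct {
    entry_t **table;       /* array of bucket heads */
    uint32_t  rows_mask;   /* bucket count minus one (count is a power of two) */
} table_t;

/* Simple multiplicative byte hash; an arbitrary choice for this sketch. */
static uint32_t hash_bytes(const void *key, size_t keylen)
{
    const unsigned char *p = key;
    uint32_t h = 0;
    for (size_t i = 0; i < keylen; i++)
        h = h * 33u + p[i];
    return h;
}

/* Link a caller-allocated entry at the head of its bucket chain. */
static void insert(table_t *ht, entry_t *e, const void *key, size_t keylen,
                   void *value)
{
    assert(ht != NULL && ht->table != NULL && e != NULL);
    e->key = key;
    e->keylen = keylen;
    e->hash = hash_bytes(key, keylen);
    e->value = value;
    entry_t **head = &ht->table[e->hash & ht->rows_mask];
    e->next = *head;
    *head = e;
}

/* Walk one bucket chain and return the matching entry, or NULL. */
static entry_t *lookup(const table_t *ht, const void *key, size_t keylen)
{
    assert(ht != NULL && ht->table != NULL);
    uint32_t hash = hash_bytes(key, keylen);

    for (entry_t *e = ht->table[hash & ht->rows_mask]; e != NULL; e = e->next) {
        /* `e` is dereferenced before anything is known about it. If `e`
         * pointed at freed or overwritten memory, the fault would occur
         * on this comparison. */
        if (e->hash == hash && e->keylen == keylen &&
            memcmp(e->key, key, keylen) == 0)
            return e;
    }
    return NULL;
}
```

In this sketch the table never frees entries, so the chains stay valid. The point is only that the lookup itself has no way to detect a bad chain pointer; it relies on whoever modifies the table doing so safely.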
Comments:

By: Asterisk Team (asteriskteam) 2016-08-20 13:41:26.532-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Gaston Mendez (gastonxpander) 2016-08-20 14:00:06.973-0500

Asterisk 13.10.0 backtrace

By: Gaston Mendez (gastonxpander) 2016-08-20 14:06:55.037-0500

Asterisk 13.11.0 backtrace showing same pjsip source code line number.

By: Gaston Mendez (gastonxpander) 2016-08-20 14:19:28.886-0500

Asterisk 13.10.0 verbose full log the minute of the crash. No correlation to anything in asterisk dialplan/function/module found.

By: Gaston Mendez (gastonxpander) 2016-08-20 14:26:03.500-0500

Asterisk 13.11.0 full log with debug on during the minute of the crash. No correlation or error found in asterisk logs.

By: Gaston Mendez (gastonxpander) 2016-08-20 14:28:03.803-0500

pjsip.conf which is the same on both asterisk servers that are crashing

By: Gaston Mendez (gastonxpander) 2016-08-20 14:53:04.697-0500

Core configuration files shared by both crashing asterisk servers.

By: Gaston Mendez (gastonxpander) 2016-08-22 10:04:39.287-0500

Added pjsip.conf txt file with public IP addresses masked.

By: Rusty Newton (rnewton) 2016-08-24 09:58:30.846-0500

Can you recompile with DONT_OPTIMIZE and BETTER_BACKTRACES compiler flags to help us get better data on your next dumps?  I see a lot of optimizations in the traces.

By: Gaston Mendez (gastonxpander) 2016-08-25 10:13:11.815-0500

So is there absolutely nothing you can tell from this documentation? No clues at all?

We will recompile, but since we do have live action on these servers, we will recompile over the weekend, and then will "wait" for a crash next week starting Monday.

Thank you.

By: Gaston Mendez (gastonxpander) 2016-08-29 10:09:10.976-0500

Backtrace showing asterisk/pjsip crash with DONT_OPTIMIZE and BETTER_BACKTRACES

By: Gaston Mendez (gastonxpander) 2016-08-29 10:57:45.316-0500

Attached asterisk full log of the seconds leading into the crash.

By: Rusty Newton (rnewton) 2016-08-30 19:29:11.089-0500

Once a developer is able to investigate further then we'll have additional information. Thanks for everything you have provided so far.

By: George Joseph (gjoseph) 2016-08-30 19:53:01.224-0500

Are you using the bundled pjproject or an external version?  If external, what version and can you try with the bundled version?


By: Gaston Mendez (gastonxpander) 2016-08-30 20:02:23.163-0500

This is the bundled pjproject. The crash was originally occurring on a server where pjproject was compiled and installed separately. I built new servers with the bundled pjproject hoping that would fix it. Once I saw that it did not, I opened this case.

By: Gaston Mendez (gastonxpander) 2016-09-03 21:39:09.906-0500

The server continues to crash almost daily, with gdb showing a pjsip file in the bt command output. Attached is another backtrace from 8/30 at 6:24 PM Eastern showing the same line 181 of the pjsip hash.c as the original backtraces in this case. Also attached is the full log for the minute preceding the crash and restart, again showing just pjsip performing regular dials and no error messages.

By: Gaston Mendez (gastonxpander) 2016-09-03 21:48:23.517-0500

Attached is another crash, which occurred within 4 minutes of the previous one. For this crash the (gdb) bt command shows a different file name and line number than the previous backtraces. Hopefully this provides some additional evidence. It shows the following frame in the pjsip libraries: #1  0x00007f62abf283a1 in grp_lock_acquire (p=0x7f6228060b18) at ../src/pj/lock.c:290

So again it is pjsip being the only correlation. Also attached is the asterisk full log of the minute prior again showing mostly dials for pjsip to handle, and no errors. Asterisk is crashing almost daily, and the server is massively _under_ utilized, with very light load. Other than these sudden crashes, all of our dialplans and feature sets run as expected and the server is nice and performant until it crashes. What can we do? This is blocking us from fully switching to asterisk 13 with pjsip. Thanks in advance.

By: George Joseph (gjoseph) 2016-09-12 15:15:08.711-0500

Gaston,

Between 13.11.0-rc1 and now we updated pjproject to 2.5.5 and fixed a few more bugs there.  Can you upgrade to 13.11.2 and see if you still have the issue?

Keep the DONT_OPTIMIZE and BETTER_BACKTRACES flags just in case.
You don't need COMPILE_DOUBLE.

george


By: Gaston Mendez (gastonxpander) 2016-09-14 13:46:40.087-0500

George,

Thank you for finally updating the case. Can you let me know what you think is happening? Do you know where the failure is occurring? This being my first asterisk case and the fact we had such good documentation coupled with the frequent crashing, I was hoping for more information. This really tells us almost nothing from the perspective of the Asterisk team. It almost seems like well you guys really don't know. Which is okay. Again first experience here.

We definitely couldn't wait 2 weeks for an update on this, so we did move off of Asterisk 13 already. We already moved to Asterisk 14-beta2 as the PJSIP dns resolver code looks really bad in the backtraces so we'd prefer to bypass the dns resolver completely and use asterisk 14 with unbound configured as that is where we already determined the crash to be happening. We have only been on 14 since Friday so it is not quite long enough to say it's good to go. We will hold on 14 for at least 2 more weeks. And if it doesn't crash again I don't believe we'll have any reason to go back to 13 ever again. If it does crash again, we will try 13.11.2 as you recommended, but based on our intel we think 14's got the real fix for us.

By: George Joseph (gjoseph) 2016-09-14 15:12:01.825-0500

You are entirely correct when you say "It almost seems like well you guys really don't know.". We've seen failures there before, but that code is very complex, as you've seen, and hard for us to debug. Since we didn't write that part, we only see it when there's an issue. That's why we asked the pjsip team for the ability to use an external resolver and used that new capability in Asterisk 14. I hesitated to recommend Asterisk 14 because it's still a release candidate, but if it's working, great. I'll leave this issue open for now and keep an eye on it.


By: Gaston Mendez (gastonxpander) 2016-09-14 15:22:35.241-0500

George,

Thanks for that. That information does help. It is good to know where your recommendation came from, and that you had thought about going to v14 to bypass this code, which jibes with why we thought moving to v14 was a reasonable move. The plan remains the same: I will stay on 14.0 beta2 for ~ 2 weeks and ideally report that the crash is gone, which would further point to the pjsip DNS resolution. Thanks again.

By: George Joseph (gjoseph) 2016-11-02 08:35:23.513-0500

Hi Gaston,

Just checking to see if 14 has resolved the issue for you. If you need to stay on 13, we did submit DNS-related patches to the pjproject team that might help. We've included them in the bundled pjproject currently in the 13 Git branch, and they should be released in Asterisk 13.13.


By: Joshua C. Colp (jcolp) 2017-12-20 06:07:53.469-0600

Tracking the 13 issue under ASTERISK-25439.

By: Gaston Mendez (gastonxpander) 2017-12-22 08:42:38.495-0600

Hello, yes, upgrading to 14 did in fact instantly resolve the issue. You rock, thank you!

By: Asterisk Team (asteriskteam) 2017-12-22 08:42:38.804-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.