[Home]

Summary:ASTERISK-26669: PJSIP Segfault 13.13.1 (Bundled PJSIP)
Reporter:Nic Colledge (nic)Labels:
Date Opened:2016-12-22 04:41:58.000-0600Date Closed:2017-02-21 18:59:47.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:pjproject/pjsip
Versions:13.13.1 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-26738 Frequent segfaults since activation of DNS SRV, in pjsip_auth_clt_reinit_req at /pjsip/sip_auth_client.c, and pj_atomic_inc_and_get at pj/os_core_unix.c
is related toASTERISK-26675 PJSIP Segmentation Fault grp_lock_acquire
Environment:Ubuntu 16.04.1 LTS, Linux 4.4.0-53-genericAttachments:( 0) backtrace.30388.ast3.1486008781-0413.txt
( 1) backtrace.30588.ast3.1486010460-0441.txt
( 2) backtrace.30719.ast3.1486015498-0604.txt
( 3) backtrace.36281.ast3.1486130000-1353.txt
( 4) backtrace.36368.ast3.1486140553-1649.txt
( 5) backtrace.core.asterisk.1107.ast3.1486474718-1338.txt
( 6) backtrace.core.asterisk.1215.ast3.1486495028-1917.txt
( 7) backtrace.core.asterisk.3079.ast3.1486566672-1511.txt
( 8) backtrace.core.asterisk.6222.ast3.1486609431-0303.txt
( 9) backtrace-1482444465.txt
(10) backtrace-1482482153.txt
(11) backtrace-1482483920.txt
(12) backtrace-1482496599.txt
(13) backtrace-1482497243.txt
(14) backtrace-1482498503.txt
(15) debug_log_0303.txt
(16) debug_log_0413.txt
(17) debug_log_0441.txt
(18) debug_log_0604.txt
(19) debug_log_1338.txt
(20) debug_log_1353.txt
(21) debug_log_1511.txt
(22) debug_log_1649.txt
(23) debug_log_1917.txt
(24) jira_asterisk_26669_v13_fix_srv.patch
(25) jira_asterisk_26669_v13_pjproject_logging.patch
(26) jira_asterisk_26669_v13_tdata_trace.patch
(27) jira_asterisk_26669_v13_test.patch
Description:Hi guys,

Been getting an intermittent segfault with asterisk 13.13.1 since we upgraded. Its a low-load server but happens maybe a few times a day.
We jumped a few versions from 13.10 so not sure if this is a last-release regression or not.

We have another higher-load server (with the same setup) that has not yet crashed in the same way.

I will attach backtraces; the fault its self comes from pj_atomic_dec_and_get from /usr/lib/libasteriskpj.so
Comments:By: Asterisk Team (asteriskteam) 2016-12-22 04:41:59.528-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Nic Colledge (nic) 2016-12-22 04:45:19.642-0600

Backtraces

By: Rusty Newton (rnewton) 2016-12-22 15:05:30.005-0600

[~nic] , thanks for the report! Unfortunately there is no symbol table info for many of the frames.

This is resolved for bundled PJPROJECT in the fix here https://gerrit.asterisk.org/#/c/4557/

You can wait for the next release to get more traces, or make a patch from that review to try out now. You could also recompile with the Git head of the 13 branch (probably easiest) or compile pjproject separately with the appropriate flags! So many choices.

Either way, make sure to compile Asterisk with DONT_OPTIMIZE and BETTER_BACKTRACES.

Thanks!

By: Rusty Newton (rnewton) 2016-12-22 15:12:35.669-0600

Oh, and we could use a debug log to match the crash. Follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Nic Colledge (nic) 2016-12-22 17:01:44.270-0600

Ok cool, I ran the patch against 13.13.1 and got "lucky" with a crash happening right away. I'm not 100% sure this is the same issue though looking at the backtrace but attaching anyway.

By: Nic Colledge (nic) 2016-12-22 17:07:03.011-0600

Backtrace

By: Nic Colledge (nic) 2016-12-22 17:17:14.039-0600

.. also theres not too much standing out at the end of the debug log when the crash happened. Just verbose level messages about endpoints becoming reachable.
Don't really want to attach it publicly unless I have to as its full of IP addresses and user/peer names etc.




By: Nic Colledge (nic) 2016-12-23 04:19:06.697-0600

Added more backtraces.

By: Nic Colledge (nic) 2016-12-23 08:37:35.700-0600

More backtraces.

backtrace-1482496599.txt - Cannot access memory at address 0x746e>) at /usr/src/asterisk-13.13.1/include/asterisk/strings.h:66

backtrace-1482498503.txt - __GI___pthread_mutex_lock (mutex=0x7461440071655343) at ../nptl/pthread_mutex_lock.c:67

backtrace-1482497243.txt - __GI___pthread_mutex_lock (mutex=0x6c6f68772d706973) at ../nptl/pthread_mutex_lock.c:67

By: Joshua C. Colp (jcolp) 2017-02-01 08:04:14.463-0600

What exactly are you running for PJSIP now? Has this happened again? Do you have a debug log like previously asked for?

By: Nic Colledge (nic) 2017-02-01 14:58:51.435-0600

Hi Joshua,

Still running the same version (13.13.1) with the patch above (https://gerrit.asterisk.org/#/c/4557/) to fix the backtraces in bundled pjsip. Still getting the issues. Ill clean everything out now and pull a fresh debug log tomorrow when it happens again. The debug logs I saw before didn't have anything in them standing-out at the time the crash happened (or any other time) but I'll get some tomorrow and remove any sensitive addresses / account names if you think it will help.

Thanks,
Nic.


By: Joshua C. Colp (jcolp) 2017-02-01 15:01:12.867-0600

I'd also suggest pulling in a fix for out of dialog requests[1] which is currently up for review. It is in the same area.

[1] https://gerrit.asterisk.org/#/c/4846/

By: Nic Colledge (nic) 2017-02-02 10:55:23.337-0600

Applied both patches mentioned in the thread and took some backtraces from early this morning.

Use the last four digits (time HHmm) to tie debug log to backtrace (comments also in log file).

backtrace.30388.ast3.1486008781-0413 - in pj_atomic_dec_and_get (atomic_var=0x7f9500000007) at ../src/pj/os_core_unix.c:962

backtrace.30588.ast3.1486010460-0441 - in pj_atomic_dec_and_get (atomic_var=0x7f530000fb06) at ../src/pj/os_core_unix.c:962

backtrace.30719.ast3.1486015498-0604 in pj_lock_acquire (lock=0xffff800800000053) at ../src/pj/lock.c:180

By: Joshua C. Colp (jcolp) 2017-02-02 12:22:27.170-0600

The logs show a bit more info but do not include the debug level needed to understand the full flow of what happened. Can you follow the wiki page that [~rnewton] linked? It would give useful information.

By: Nic Colledge (nic) 2017-02-04 12:36:07.463-0600

backtrace.36368.ast3.1486140553-1649.txt - pj_atomic_dec_and_get (atomic_var=0x7f3400000005) at ../src/pj/os_core_unix.c:962

backtrace.36281.ast3.1486130000-1353.txt - pop_freelist (ht=0x291ee00) at ../src/pj/timer.c:149 (a new one - not seen this before??)

Also attached trimmed debug logs from around the time of the crash (full files are far too large) - let me know if you need more info.

Thanks,
Nic.

By: Richard Mudgett (rmudgett) 2017-02-06 17:18:36.754-0600

[^jira_asterisk_26669_v13_test.patch] - This patch updates some debug messages and records the serializer sending messages earlier where possible.  DNS resolution causes the outgoing serializer to not get recorded to handle responses.

What kind of addresses is DNS looking up? IPv4, IPv6, or both?

The last two crashes are reminiscent of ASTERISK-26344 where the tdata was destroyed early.  It's like the pjproject patches to fix that issue aren't being applied.  You can ensure that they get applied by doing the following:
{noformat}
cd third-party/pjproject/
make distclean
cd ../..
./configure --with-pjproject-bundled <- or whatever command you use to configure for bundled pjproject
{noformat}

By: Nic Colledge (nic) 2017-02-06 18:14:46.570-0600

Hi Richard,

Thanks for your reply. I applied the additional patch and ran through the distclean steps for pjsip. Ill report back with more traces etc. when it happens again.

It should just be IPv4 for provider DNS lookups.

Not sure if its important but one provider has both A and SRV records (_sip._udp.provideraddr.com). The other providers only have A records.
No AAAA (IPv6) addresses are returned for DNS lookups to the provider addresses. All were checked using dig on the box having the issues.

Thanks,
Nic.

By: Nic Colledge (nic) 2017-02-08 17:24:05.202-0600

More backtraces hopefully with patches applied :-)

As before numbers on the end of filenames match debug log to backtraces.

debug_log_1511 -> backtrace.core.asterisk.3079.ast3.1486566672-1511
debug_log_1338 -> backtrace.core.asterisk.1107.ast3.1486474718-1338
debug_log_1917 -> backtrace.core.asterisk.1215.ast3.1486495028-1917

Thanks,
Nic.

By: Richard Mudgett (rmudgett) 2017-02-08 17:39:24.701-0600

Could you increase the logging timestamp resolution displayed?  You can change the format in the {{logger.conf}} file:
{noformat}
[general]
;
; Customize the display of debug message time stamps
; this example is the ISO 8601 date format (yyyy-mm-dd HH:MM:SS)
;
; see strftime(3) Linux manual for format specifiers.  Note that there is also
; a fractional second parameter which may be used in this field.  Use %1q
; for tenths, %2q for hundredths, etc.
;
;dateformat=%F %T       ; ISO 8601 date format
;dateformat=%F %T.%3q   ; with milliseconds
;dateformat=%b %e %T    ; Original Asterisk format (default)
; Default Asterisk dateformat with milliseconds
dateformat=%b %e %T.%3q
{noformat}

Also have you changed the timer_t1 and timer_b settings in pjsip.conf?
{noformat}
;==========================SYSTEM SECTION OPTIONS=========================
;[system]
;  SYNOPSIS: Options that apply to the SIP stack as well as other system-wide settings
;timer_t1=500   ; Set transaction timer T1 value milliseconds (default: "500")
;timer_b=32000  ; Set transaction timer B value milliseconds (default: "32000")
{noformat}


By: Nic Colledge (nic) 2017-02-08 18:10:11.257-0600

Hi Richard,

I have updated the logger config with milliseconds and will report back when it happens again.

The timers have not been changed in pjsip.conf. they still show as 500 and 32000 respectively when running "pjsip show settings"

Thanks,
Nic.

By: Nic Colledge (nic) 2017-02-09 14:21:23.168-0600

Higher-resolution log times uploaded
backtrace.core.asterisk.6222.ast3.1486609431-0303.txt -> debug_log_0303.txt

By: Richard Mudgett (rmudgett) 2017-02-09 18:45:57.614-0600

[^jira_asterisk_26669_v13_tdata_trace.patch] - This patch adds some tdata ref tracing messages to pjproject.  It should be added in addition to the other patches you have installed.

The problem is somewhere in pjproject and it appears to be an extra unref of a tdata OPTIONS request message.  It has something to do with DNS resolution and how the messages are sent.  How it is triggered I still don't know and I haven't been able to reproduce it.

To apply:
# In the Asterisk source root directory {{patch -p1 -i jira_asterisk_26669_v13_tdata_trace.patch}}
# Use the project distclean method described in an earlier comment.
# Recompile and install.

To collect the needed debug data in addition to how you have collected the previous backtraces and Asterisk debug data enable {{pjproject set log level 5}}.  Optionally, you can see {{pjproject.conf.sample}} for an option to start the logging immediately on Asterisk start.  Adding the pjproject logging output will allow following tdata buffer ref changes and the DNS resolution steps.

The last six or so backtraces have all been in the same place.

Thanks

By: Nic Colledge (nic) 2017-02-10 07:33:56.490-0600

Is that option / config section in 13.13.1? I can see it on the master-branch pjproject.conf.sample but not 13.13.1. Same with the console command.

Thanks,
Nic.

By: Richard Mudgett (rmudgett) 2017-02-10 11:23:44.708-0600

[^jira_asterisk_26669_v13_pjproject_logging.patch] - This enables log level 5 in pjproject and will output the higher debugging levels when Asterisk "core set debug" is at least 1.

Sorry.  I forgot how new "pjproject set log level" is.  It will be available in v13.14.0 (Release candidate is available now).

Apply this patch in addition to the others.  This patch needs the pjproject distclean method performed so the change in the {{config_site.h}} file will be applied.

By: Nic Colledge (nic) 2017-02-12 10:15:05.751-0600

Ok all done.

I think the crux of the issue is as follows:

The providers SRV records include multiple servers except it only returns a sub-set of the servers in response to a DNS request. Say there are eight servers in total, only six will (randomly) be returned per lookup. When the SRV is returned A and AAAA lookups are performed on each of the names returned. When this happens its possible that you will get cache-hits for all but one of the addresses. It appears to crash shortly after getting the DNS response for the last "new" uncached address.

I will shortly email you the latest un-redacted debug log.

I also have a packet capture of the DNS lookups around the time of the crash if this is useful (but there is nothing special about the responses).

Thanks,
Nic.

By: Richard Mudgett (rmudgett) 2017-02-13 17:26:17.339-0600

[^jira_asterisk_26669_v13_fix_srv.patch] - This should fix the DNS SRV resolution crash.  The additional records in the DNS SRV query response were being counted toward an already resolved query but pjproject was going ahead with a real DNS query anyway.  This resulted in the resolution callback being called early and possibly several times.  The extra calls cause the tdata ref to be released too many times and get destroyed early.

This is a patch to pjproject so the distclean steps above are needed.  After applying the patch.

By: Nic Colledge (nic) 2017-02-14 17:45:14.936-0600

Hi Richard,

24-hours uptime since the patch was applied. Ill keep an eye on it but hopefully that means its fixed :-)

Thanks,
Nic.

By: Friendly Automation (friendly-automation) 2017-02-21 18:59:49.063-0600

Change 5032 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5032|https://gerrit.asterisk.org/5032]

By: Friendly Automation (friendly-automation) 2017-02-21 20:02:11.093-0600

Change 5035 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5035|https://gerrit.asterisk.org/5035]

By: Friendly Automation (friendly-automation) 2017-02-21 20:15:27.496-0600

Change 5036 merged by zuul:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5036|https://gerrit.asterisk.org/5036]

By: Friendly Automation (friendly-automation) 2017-02-21 20:24:21.959-0600

Change 5037 merged by zuul:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5037|https://gerrit.asterisk.org/5037]

By: Friendly Automation (friendly-automation) 2017-02-21 21:18:25.445-0600

Change 5028 merged by zuul:
res_pjsip: Record the serializer earlier on the tdata.

[https://gerrit.asterisk.org/5028|https://gerrit.asterisk.org/5028]

By: Friendly Automation (friendly-automation) 2017-02-21 21:18:29.130-0600

Change 5029 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5029|https://gerrit.asterisk.org/5029]

By: Friendly Automation (friendly-automation) 2017-02-21 22:39:54.025-0600

Change 5038 merged by zuul:
res_pjsip: Record the serializer earlier on the tdata.

[https://gerrit.asterisk.org/5038|https://gerrit.asterisk.org/5038]

By: Friendly Automation (friendly-automation) 2017-02-21 22:39:58.643-0600

Change 5039 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5039|https://gerrit.asterisk.org/5039]

By: Friendly Automation (friendly-automation) 2017-02-21 22:40:01.480-0600

Change 5040 merged by zuul:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5040|https://gerrit.asterisk.org/5040]

By: Friendly Automation (friendly-automation) 2017-02-21 22:40:06.322-0600

Change 5041 merged by zuul:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5041|https://gerrit.asterisk.org/5041]

By: Friendly Automation (friendly-automation) 2017-02-22 04:49:51.850-0600

Change 5030 merged by Joshua Colp:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5030|https://gerrit.asterisk.org/5030]

By: Friendly Automation (friendly-automation) 2017-02-22 04:49:56.079-0600

Change 5031 merged by Joshua Colp:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5031|https://gerrit.asterisk.org/5031]

By: Friendly Automation (friendly-automation) 2017-02-22 05:54:56.385-0600

Change 5033 merged by Joshua Colp:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5033|https://gerrit.asterisk.org/5033]

By: Friendly Automation (friendly-automation) 2017-02-22 05:55:00.273-0600

Change 5034 merged by Joshua Colp:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5034|https://gerrit.asterisk.org/5034]