[Home]

Summary:ASTERISK-26738: Frequent segfaults since activation of DNS SRV, in pjsip_auth_clt_reinit_req at /pjsip/sip_auth_client.c, and pj_atomic_inc_and_get at pj/os_core_unix.c
Reporter:Michael Maier (micha)Labels:
Date Opened:2017-01-20 15:27:49.000-0600Date Closed:2017-02-21 18:59:44.000-0600
Priority:CriticalRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip
Versions:13.13.1 Frequency of
Occurrence
Frequent
Related
Issues:
duplicatesASTERISK-26669 PJSIP Segfault 13.13.1 (Bundled PJSIP)
Environment:Centos 6 64bit FreePBX 13.0.190.7 IPv4 and IPv6 dual core processorAttachments:( 0) backtrace-2017-01-20T21.21.25.txt
( 1) backtrace-2017-01-20T23.21.00+0100.txt
( 2) backtrace-2017-01-21T05.19.00+0100.txt
( 3) backtrace-2017-01-21T12.43.36+0100.txt
( 4) backtrace-2017-02-14T21.29.39+0100.txt
( 5) debug-2017-01-21T12.43.36+0100.txt
Description:After switching to DNS SRV w/ 5 parallel trunks I'm getting frequent segfaults like the attached one. Backtrace is built on base of patched Schmooze RPM source package:
- Patch to get libasteriskpj.so.2 symbols
- Patch from ASTERISK-26675 (segfault has been seen already w/o this patch - but it doesn't obviously fix this problem)

Parallel running tcpdump shows, that REGISTER is sent at the time of the crash.
In detail:
- The DNS SRV entry has two hostnames (see ASTERISK-26735). All the time before the crash, ipaddress 217.0.23.108 is used for REGISTER.
- At the moment of the crash, 2(!!) REGISTERS are sent nearly at the same time (0.01s difference) for the same trunk! The first to 217.0.23.140(!), the second to the usual 217.0.23.108. The answer from 217.0.23.108 comes first (200 ok), the answer from 217.0.23.140 comes 0.09s later and is 401 unauthorized. That's the end. Now the crash appears.
Comments:By: Asterisk Team (asteriskteam) 2017-01-20 15:27:50.364-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Michael Maier (micha) 2017-01-20 15:31:30.377-0600

Backtrace of segfault

By: Rusty Newton (rnewton) 2017-01-20 17:09:17.579-0600

We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Michael Maier (micha) 2017-01-21 00:32:44.526-0600

2 new backtraces. Segfault is always only with Telekom (I have two more trunk providers). For Telekom, there are 3 trunks (3 different numbers). Telekom is the only provider using more than one ipaddress and with completely different resolutions between SRV and not SRV entry.

It turns out, that crash always occurs, if for one and the same trunk 2 REGISTERS are created at nearly the same time (about 0.0004s difference), regardless if the destination of the 2 REGISTERS is the same or not.

Relevant parameters are:
- SRV with more than one entries.
- may be more than one trunk to the same provider with different numbers.
- trunk names are telekomPJSIP, telekomPJSIP-316, telekomPJSIP-317

I setup a debug trace. Hopefully this doesn't flood the disk (there isn't much space). Therefore I hope I can provide more information the next time.


By: Michael Maier (micha) 2017-01-21 07:50:58.103-0600

This backtrace now contains at the end the relevant debug output. Hope this helps (it's rather small but there wasn't any more at the relevant time). The machine was completely idle at the time of the crash. This is true for each provided trace here.

As usual, the REGISTER packet was sent out a second time for the same trunk after just 0.000731s. The unauthorized answer can be seen only in tcpdump 0.09s after the second REGISTER package - asterisk already crashed at the arrival time of the answer package or it wasn't written to file anymore. It is unauthorized because asterisk switched target address from 217.0.23.108 to 23.140.
That's the answer package as seen by tcpdump:
{noformat}
SIP/2.0 401 Unauthorized 010350348
Via: SIP/2.0/UDP 46.91.112.37:5061;received=46.91.112.37;rport=5061;branch=z9hG4bKPj8258aae8-02a1-49a2-a356-a44706ed6eee
To: <sip:+4911112222222@tel.t-online.de>;tag=h7g4Esbg_d2bfd3770a30b062f0c6cd3f8fb450c4
From: <sip:+4911112222222@tel.t-online.de>;tag=2a385a16-37d3-4c03-89a9-0387642326ce
Call-ID: 41fb9b70-43f1-42c0-8588-605a4b2bff2e
CSeq: 59123 REGISTER
Service-Route: <sip:217.0.23.140:5060;transport=udp;lr>
WWW-Authenticate: Digest realm="tel.t-online.de",nonce="B9CC18C96B43535800000000614B2493",stale=true,algorithm=MD5,qop="auth"
Content-Length: 0
{noformat}
I could see a few OPTIONS packages sent to 23.140 long before and one more REGISTER to 23.140 since the last asterisk restart at 5:19 today. I don't think this is intended behavior. OPTIONS should always go to the same target as the REGISTER.

Error message in /var/log/messages was:
{noformat}
kernel: traps: asterisk[24916] general protection ip:7fdfdf93d4b2 sp:7fdf9052f5f0 error:0 in libasteriskpj.so.2[7fdfdf8a0000+16d000]
{noformat}
The machine is a multihomed machine (3 physical interfaces, 1 tap device, one bridge and one ppp0 device). ppp0 and eth0 both have IPv6 global addresses, too.


By: Michael Maier (micha) 2017-01-23 05:22:00.494-0600

Does the provided information help you to find the problem? Do you need any more? Please ask if you need any more! This problem really must be fixed, because asterisk / pjsip / dns SRV with different entries is completely unusable as long as it frequently crashes.

By: Rusty Newton (rnewton) 2017-01-23 08:55:06.287-0600

Can you attach the debug in separate files, but make it clear which crash they relate to? We have to make this easy for developers to look at and typically the debug is attached separately.

By: Rusty Newton (rnewton) 2017-01-23 08:57:48.056-0600

In addition, please note your exact pjproject version.

By: Michael Maier (micha) 2017-01-23 10:05:25.618-0600

Own builds are based on asterisk13-13.13.1-1.shmz65.1.150.src.rpm (rpm -i src.rpm, rpmbuild -ba asterisk13.spec after modifying spec file to meet the demand of building traceable binaries and after applying mentioned patches) including patch ASTERISK-26675 and patch and "[PATCH] pjproject_bundled:  Fix missing inclusion of symbols" to get symbols of libasteriskpj.

By: Michael Maier (micha) 2017-01-23 10:11:27.050-0600

Splitted debug- and backtrace- 2017-01-21T12.43.36+0100.txt to 2 separated files. Removed file containing both debug and backtrace in one file.

By: Rusty Newton (rnewton) 2017-01-23 12:45:37.828-0600

Thanks!

By: Richard Mudgett (rmudgett) 2017-02-14 14:24:21.565-0600

The underlying cause of this crash is the same as ASTERISK-26669.  The crash here is because of an outgoing REGISTER while the crash in ASTERISK-26669 is because of an outgoing OPTIONS qualify ping.  Both are because of DNS SRV lookups.  I have put a patch on ASTERISK-26669 which should fix the issue.

By: Michael Maier (micha) 2017-02-15 10:45:12.253-0600

Backtrace after crash during reload.

By: Michael Maier (micha) 2017-02-15 10:45:36.654-0600

I applied provided patch in ASTERISK-26669. During a reload via FreePBX, I got the attached crash  backtrace-2017-02-14T21.29.39+0100.txt. This reload was necessary to switch back from SRV to normal DNS handling, because SRV handling is substantially broken even after this patch.

Why?

Given is the first DNS SRV lookup after retart of Asterisk, which provides hostname A and hostname B. Asterisk decides to use IP-address of hostname A.
Some time / hours later, another SRV lookup is done. DNS provides hostname B and hostname A (same hostnames as before, but vice versa). Now, Asterisk decides to use IP-address of hostname B - but only for Register. Options and Invite use the old IP-address A - which doesn't work (they are rejected by ISP because of missing Register).

By: Richard Mudgett (rmudgett) 2017-02-15 16:50:33.358-0600

The backtraces aren't enough to find the issue.  I will need debug logs showing what is happening with the DNS lookups and the corresponding tdata that was involved in the crash.  Using Asterisk 13.13.1 please apply all four patches on ASTERISK-26669 and the patch "[PATCH] pjproject_bundled: Fix missing inclusion of symbols" to get symbols of libasteriskpj.  Do not apply the patch from ASTERISK-26675 as that patch tries to cover up the problem after it has happened.

Since most of the patches on ASTERISK-26669 are to pjproject you must use bundled pjproject for them to work.  The patches are applied when the pjproject source files are unzipped.  To ensure that they actually get applied do this from the Asterisk source root directory:
{noformat}
cd third-party/pjproject
make distclean
cd ../..
./configure --with-pjproject-bundled --with-externals-cache=/home/rmudgett/cache_pjproject <-- Or whatever command you use to configure for bundled pjproject
make && sudo make install
{noformat}

Thanks

By: Michael Maier (micha) 2017-02-16 11:36:45.771-0600

I have to apologize! I made a mistake by applying https://issues.asterisk.org/jira/secure/attachment/55030/55030_jira_asterisk_26669_v13_fix_srv.patch , which was just partly applied (and unfortunately didn't throw any error or warning neither during patching nor during compiling). After it has been applied correctly, things seem to work completely fine as expected. I will run it a couple of days and report back at Sunday..

By: Michael Maier (micha) 2017-02-19 12:54:33.611-0600

I'ts working just perfectly. Thanks for this patch!

By: Friendly Automation (friendly-automation) 2017-02-21 18:59:45.326-0600

Change 5032 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5032|https://gerrit.asterisk.org/5032]

By: Friendly Automation (friendly-automation) 2017-02-21 20:02:09.744-0600

Change 5035 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5035|https://gerrit.asterisk.org/5035]

By: Friendly Automation (friendly-automation) 2017-02-21 20:15:29.138-0600

Change 5036 merged by zuul:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5036|https://gerrit.asterisk.org/5036]

By: Friendly Automation (friendly-automation) 2017-02-21 20:24:23.424-0600

Change 5037 merged by zuul:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5037|https://gerrit.asterisk.org/5037]

By: Friendly Automation (friendly-automation) 2017-02-21 21:18:26.667-0600

Change 5028 merged by zuul:
res_pjsip: Record the serializer earlier on the tdata.

[https://gerrit.asterisk.org/5028|https://gerrit.asterisk.org/5028]

By: Friendly Automation (friendly-automation) 2017-02-21 21:18:30.574-0600

Change 5029 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5029|https://gerrit.asterisk.org/5029]

By: Friendly Automation (friendly-automation) 2017-02-21 22:39:52.411-0600

Change 5038 merged by zuul:
res_pjsip: Record the serializer earlier on the tdata.

[https://gerrit.asterisk.org/5038|https://gerrit.asterisk.org/5038]

By: Friendly Automation (friendly-automation) 2017-02-21 22:39:56.655-0600

Change 5039 merged by zuul:
pjsip_distributor.c: Update some debug messages to get transaction name.

[https://gerrit.asterisk.org/5039|https://gerrit.asterisk.org/5039]

By: Friendly Automation (friendly-automation) 2017-02-21 22:40:03.294-0600

Change 5040 merged by zuul:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5040|https://gerrit.asterisk.org/5040]

By: Friendly Automation (friendly-automation) 2017-02-21 22:40:08.182-0600

Change 5041 merged by zuul:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5041|https://gerrit.asterisk.org/5041]

By: Friendly Automation (friendly-automation) 2017-02-22 04:49:49.623-0600

Change 5030 merged by Joshua Colp:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5030|https://gerrit.asterisk.org/5030]

By: Friendly Automation (friendly-automation) 2017-02-22 04:49:54.073-0600

Change 5031 merged by Joshua Colp:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5031|https://gerrit.asterisk.org/5031]

By: Friendly Automation (friendly-automation) 2017-02-22 05:54:57.844-0600

Change 5033 merged by Joshua Colp:
pjproject: Increase SENDER_WIDTH column size for 64-bit system logs.

[https://gerrit.asterisk.org/5033|https://gerrit.asterisk.org/5033]

By: Friendly Automation (friendly-automation) 2017-02-22 05:54:59.027-0600

Change 5034 merged by Joshua Colp:
pjproject: Fixes to resolve DNS SRV crashes.

[https://gerrit.asterisk.org/5034|https://gerrit.asterisk.org/5034]