Summary: ASTERISK-27930: res_pjsip: PJSIP TCP Segfault.
Reporter: Anna (modema)
Labels: pjsip
Date Opened: 2018-06-21 01:28:21
Date Closed: 2018-10-17 04:41:59
Priority: Major
Regression?:
Status: Closed/Complete
Components: Channels/chan_pjsip
Versions: 13.19.0 13.19.1 13.19.2 13.20.0 13.21.0 13.21.1
Frequency of Occurrence: Occasional
Related Issues:
Environment: Virtual box on VMWare CentOS Linux release 7.4.1708 (Core) Linux 3.10.0-693.5.2.el7.x86_64 CPU: 4 cores, Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz RAM: 6GB
Attachments:
( 0) core-2018-06-21T08-28-01+0300-brief.txt
( 1) core-2018-06-21T08-28-01+0300-full.txt
( 2) core-2018-06-21T08-28-01+0300-locks.txt
( 3) core-2018-06-21T08-28-01+0300-thread1.txt
( 4) full_log.txt
( 5) pjsip_users.conf
( 6) pjsip.conf
Description:
Hello,

I experience random segfault crashes on an Asterisk 13 setup with pjsip. The crashes happen for no apparent reason: sometimes twice a day, sometimes once in 2 weeks. I have 2 servers with the same configuration and both segfault from time to time. I can't reproduce the crash and don't know the cause. Also, the endpoints that were registered before a crash don't come back and register on their own; I need to manually reboot/re-register all phones.
I created a topic on the Asterisk community forum and was advised to update my version, since there is some issue with PJSIP using the TCP protocol, and to submit an issue here if that didn't help.
https://community.asterisk.org/t/segmentation-fault-asterisk-13-19-0/74289/9

My Asterisk is built from source with these options:
PBX Core settings
-----------------
 Version:                     13.21.0
 Build Options:               DONT_OPTIMIZE, DEBUG_THREADS, BETTER_BACKTRACES, BUILD_NATIVE, OPTIONAL_API
 Maximum calls:               10000 (Current 4)
 Maximum open file handles:   10000
 Root console verbosity:      4
 Current console verbosity:   4
 Debug level:                 0
 Maximum load average:        10.000000
 Minimum free memory:         0 MB
 Startup time:                08:28:10
 Last reload time:            08:28:10
 System:                      Linux/3.10.0-693.5.2.el7.x86_64 built by root on x86_64 2018-05-31 11:24:50 UTC
 System name:                
 Entity ID:                   MY_MAC
 PBX UUID:                    592b015e-a035-4b57-9bbe-8d8f785d663d
 Default language:            en
 Language prefix:             Enabled
 User name and group:         asterisk/asterisk
 Executable includes:         Disabled
 Transcode via SLIN:          Enabled
 Transmit silence during rec: Disabled
 Generic PLC:                 Enabled
 Generic PLC on equal codecs: Disabled
 Min DTMF duration::          80
 Cache media frames:          Enabled
 RTP dynamic payload types:   96-127

* Subsystems
 -------------
 Manager (AMI):               Enabled
 Web Manager (AMI/HTTP):      Disabled
 Call data records:           Enabled
 Realtime Architecture (ARA): Enabled

I have about 125 endpoints with static and dynamic AORs; I attached my pjsip config files. The endpoints use TCP transport and the trunks use UDP.
I also attached backtraces from the core dump created on the crash.
Comments:
By: Asterisk Team (asteriskteam) 2018-06-21 01:28:22.945-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2018-06-21 04:57:31.454-0500

Just to confirm - you configured Asterisk with --with-pjproject-bundled?

Can you also please attach the Asterisk console log? Where are the endpoints? Are they local?

By: Anna (modema) 2018-06-21 06:07:51.014-0500

Hello Joshua,

Yes, it is configured with --with-pjproject-bundled. There are 2 types of endpoints: local (phones and softclients using TCP, with local IPs in the same VLAN as the PBX) and external (provider trunks using UDP, with public IPs). The PBX itself is behind NAT, but an ALG (Cisco SIP inspect) handles all the NAT work.
Which console logs do you mean, the full log before the crash? I didn't have debug enabled, but I was connected to the console when it happened. Should I submit the console output, and for how long before the crash?

By: Joshua C. Colp (jcolp) 2018-06-21 06:17:39.711-0500

As much log as possible for before it crashed, to see if there is anything that sticks out.

By: Anna (modema) 2018-06-22 00:13:54.161-0500

I attached full_log for about 10 minutes before crash. I replaced some IP and other security info with names in {}. Hope that helps.

By: George Joseph (gjoseph) 2018-06-27 08:04:56.293-0500

It looks like there was a race condition between a phone doing a re-register and Asterisk trying to send an OPTIONS message to the phone. To help us reproduce the issue, could you run {{ast_coredumper --tarball-coredumps --no-default-search <coredump>}} where {{<coredump>}} is the actual core file that you generated the files in the attachments from? The tarball will contain the actual coredump and the compiled Asterisk code that was in use at the time, so it'll be quite large. You can then post it to Dropbox or another file sharing service and email the link to asteriskteam@asterisk.org, mentioning this issue. This way you won't be exposing any sensitive information to the public.

BTW, Thanks for supplying the detailed problem description.  We usually don't get this level of information without prompting the reporter several times. :)




By: Anna (modema) 2018-06-28 10:52:07.946-0500

George,

Thank you for your reply. I created the tarball, but when I sent it I received this error by email:
A message that you sent could not be delivered to one or more of
its recipients. This is a permanent error. The following address(es)
failed: asteriskteam@asterisk.org:
SMTP error from remote server for RCPT TO command, host: mail.digium.com (216.207.245.2) reason: 550 Unrouteable address

Is this a problem with my mail or with Asterisk's?

By: Anna (modema) 2018-06-28 13:54:05.623-0500

I sent it to asteriskteam@digium.com, as this address was mentioned on the main tracker page, and it seems to have worked.
I also updated both of my servers to version 13.21.1, and my second server crashed again today, so the issue is still present in the latest release. Please let me know if you need those backtraces as well.

By: Kevin Harwell (kharwell) 2018-07-02 13:06:38.181-0500

[~modema] Thanks we received your email.

By: Richard Mudgett (rmudgett) 2018-07-12 17:04:36.997-0500

Your backtrace is showing a crash in the transport code of PJPROJECT when Asterisk was trying to send an endpoint qualify OPTIONS ping.  The just released Asterisk 13.22.0 is the first release that has the OPTIONS ping code rewrite for ASTERISK-26806.  The new release may help with your crashing issue.
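
For context, the qualify OPTIONS ping mentioned above is enabled per AOR in pjsip.conf. The fragment below is only a minimal sketch and is not taken from the reporter's attached configuration: the section name and the values are hypothetical, while {{qualify_frequency}} and {{qualify_timeout}} are standard AOR options.

```ini
; Hypothetical AOR used only for illustration; not from the attached pjsip.conf.
[1001]
type=aor
max_contacts=2          ; allow up to two dynamically registered contacts
qualify_frequency=60    ; send an OPTIONS ping to each contact every 60 seconds (0 disables qualify)
qualify_timeout=3.0     ; seconds to wait for a reply before the contact is considered unreachable
```

With {{qualify_frequency}} set to a non-zero value, Asterisk periodically sends OPTIONS requests to the AOR's contacts, which is the code path shown in the backtrace.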

By: Anna (modema) 2018-07-13 02:01:23.378-0500

Thank you, Richard. I will update my servers and monitor them for crashes on the new release.

By: Anna (modema) 2018-07-16 04:35:19.999-0500

I have had no crashes so far, but my endpoints still lose registration from time to time and after an Asterisk restart. I refer to this part of ASTERISK-26806:
{quote}
2. If static contacts are used or a bunch of endpoints have registered and asterisk is \[re]started/\[re]loaded there is a significant slowdown within the options code. This actually blocks until it completes, which as the number of registered contacts increases so does the block/wait time. This needs to be fixed as well.
{quote}

If I manually restart Asterisk, about 70% of my dynamic contacts don't come back, so I have to reboot the phones. Only after a reboot do my contacts get updated and register again.

By: Joshua C. Colp (jcolp) 2018-07-16 04:37:42.356-0500

What version are you referring to? The new OPTIONS code was tested with 3000 endpoints and suffered no extreme blocking/wait time like the previous one.

By: Anna (modema) 2018-07-16 12:27:48.352-0500

Joshua, I am referring to version 13.22.0. I don't know whether it is blocking or whether this is the same issue, but after a restart of Asterisk (either "core restart now" or via the init script) only about 5-7 endpoints out of 40 re-register correctly. There might be an issue with the phones (Cisco ones, different models), but I don't have this issue on version 11.21.2 using chan_sip with TCP and the same firmware. However, this fails only for dynamic contacts; static ones get their OPTIONS and go to the reachable state just fine.
Just to clarify: I have 1 static (IP) contact and 1-2 dynamic contacts for each endpoint.

By: Joshua C. Colp (jcolp) 2018-07-16 14:38:32.198-0500

That would be unrelated to the OPTIONS work. It depends on configuration what would happen. If rewrite_contact is on then any existing TCP connections would be dropped, and we'd have to wait for the remote endpoint to re-establish it.
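
A minimal sketch of the kind of setup being discussed, assuming a TCP transport, an endpoint with {{rewrite_contact}} enabled, and an AOR holding one static contact plus room for dynamic ones. All section names and the address are hypothetical placeholders, not values from the attached pjsip.conf or pjsip_users.conf.

```ini
; Hypothetical names and addresses, for illustration only.
[transport-tcp]
type=transport
protocol=tcp
bind=0.0.0.0

[1001]
type=endpoint
transport=transport-tcp
aors=1001
rewrite_contact=yes     ; registered contacts are rewritten to the source address/port of the request

[1001]
type=aor
contact=sip:1001@192.0.2.10:5060   ; static contact, always present
max_contacts=2                     ; additional contacts added dynamically via REGISTER
```

In a layout like this, a dynamic contact registered over TCP with {{rewrite_contact=yes}} depends on the connection the phone opened, so after an Asterisk restart it stays unusable until the phone reconnects and registers again, whereas the static contact is defined in the configuration itself.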

By: Richard Mudgett (rmudgett) 2018-08-02 17:53:20.625-0500

Have there been any crashes concerning this issue since the upgrade to 13.22.0?

Leave it in Wait-For-Feedback if there haven't been any.  The issue will automatically close in two weeks.  If any happen after that, the issue will automatically reopen if you comment on the issue.  We'll need new backtraces for the new Asterisk version if it happens.

By: Anna (modema) 2018-10-17 04:31:49.952-0500

Hello Richard,

No, there have been no crashes or segfaults since updating to 13.22.0. The issue seems to be resolved. Thanks a lot for your help!