[Home]

Summary:ASTERISK-27463: Asterisk fails randomly
Reporter:Donat Zenichev (donat)Labels:crash pjsip
Date Opened:2017-12-07 03:54:02.000-0600Date Closed:2018-02-21 17:02:00.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:
Versions:15.1.3 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-27488 core: If frame with unnegotiated format is read crash will occur
Environment:PBX Core settings: Version: 15.1.3 Build Options: DONT_OPTIMIZE, COMPILE_DOUBLE, BUILD_NATIVE, OPTIONAL_API System: Linux/4.9.0-3-amd64 built by root on x86_64 Maximum open file handles: 65535 User name and group: asterisk/asterisk Machine status: virtual machine (proxmox VM, not container). Debian 9.1 Attachments:( 0) core-brief_08122017.txt
( 1) core-brief.txt
( 2) core-full_08122017.txt
( 3) core-full.txt
( 4) core-locks_08122017.txt
( 5) core-locks.txt
( 6) core-thread1_08122017.txt
( 7) core-thread1.txt
( 8) pacemaker_log_aster_crash_07122017.txt
Description:Hi.

I have an issue with periodical crashes of asterisk.
There are about 500-600 registered user agents on it.
Max loading is about 30-45 calls per sec.

Asterisk is managed by pacemaker daemon.
Schema:
Master (active) <-corosync-> Slave (stand-by)
Pacemaker monitors asterisk process and when it crashes - tries to restart it.

There is no problems when I'm working with asterisk version 11.23.0
I moved from 11.23.0 to 13.8.3 LTS
And then I noticed that from time to time asterisk crashes, pacemaker notices that asterisk is not running and restarts it.
And so on during the day (about 1-2 crashes per hour).

But there is nothing strange in asterisk's full log with verbosity set to 4.
So that, it looks like asterisk works fine and then abruptly crashes, without any hints on the problem.

I found nothing strange in system logs of debian machine nor I found in pacemaker log, just rows that said - "asterisk is not running, ok lets restart it"
Logs are attached as url, so you can take a look.

Before asterisk crashes there are DTMF signals received.
Release note from 2017-09-13 21:31 says:
"The telephony DTMF events are not exchanged with a codec. As a result when RFC2833/RFC4733 sent digits you would crash if "core set debug 1" is enabled, the DTMF digits would always get passed to the core even though the local native RTP bridge is active, and the DTMF digits would go out using the wrong SSRC id."
So I decided to move from 13.8.3 LTS to 15.1.3.

But actually it changes nothing, it's still crashing sometimes.

The interesting thing is, that I have one more cluster running on 13.8.3 LTS with the same asterisk/host configurations and the same pacemaker version, where it works fine, without crashes.
But the difference between them is loading, the problematic asterisk is more loaded.

I added more resource to problematic asterisk (host) now is - 8CPUs 2.5Gz and 8GB ram. Resources loading is not more than 10% (cpu/ram).

Asterisk resource described as systemd service:
[Unit]
Description=Asterisk PBX and telephony daemon.
After=network.target
[Service]
Type=simple
PIDFile=/var/run/asterisk/asterisk.pid
ExecStart=/usr/sbin/asterisk -g -C /etc/asterisk/asterisk.conf
ExecStop=/usr/sbin/asterisk -rx 'core stop now'
ExecReload=/usr/sbin/asterisk -rx 'core reload'
WorkingDirectory=/var/lib/asterisk
Environment=HOME=/var/lib/asterisk
Restart=always
RestartSec=10s
TimeoutStartSec=30
TimeoutStopSec=15
LimitNOFILE=65535
LimitNPROC=65535
[Install]
WantedBy=multi-user.target

Peculiar properties in asterisk.conf:
maxfiles = 65535
runuser = asterisk
rungroup = asterisk

/etc/security/limits.conf
asterisk hard nofile 65535
asterisk soft nofile 65535
asterisk hard nproc  65535
asterisk soft nproc  65535

Asterisk starts with permissions of "asterisk" user/group.
asterisk user is owner of all working directories:
/var/lib/asterisk
/usr/lib/asterisk
/var/spool/asterisk
/var/log/asterisk
/etc/asterisk

So the main question is - why id doesn't crash with older version 11.23.0?

I have core dumps, I attached them in URL.
Comments:By: Asterisk Team (asteriskteam) 2017-12-07 03:54:03.517-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Benjamin Keith Ford (bford) 2017-12-07 13:45:00.579-0600

Hey [~donat], just took a look at the logs and it looks like Asterisk may be trying to reference a null stream, hence the crash. I'll take another look through the logs to verify that this is the case. Also, could you please attach the logs as text documents to the issue itself instead of a google drive URL? It will make it easier to reference in the future!

By: Joshua C. Colp (jcolp) 2017-12-07 13:47:35.305-0600

[~donat] It would also be useful to have the Asterisk log and a description of what the system is doing. For example - is video in use?

By: Donat Zenichev (donat) 2017-12-08 01:08:12.430-0600

Thanks for your answer.

Actually we don't use video codecs such as H.263 / H.264 / H.265
We don't have video streams at all, just an audio.

As I've already mentioned, asterisk log had nothing at all before daemon crashed.
Just rows with usual logging of calls. No errors, warnings etc.
I will get the full log and will attach it here, after next failure.


Attachment with coredump txt files.

By: Donat Zenichev (donat) 2017-12-08 03:30:16.290-0600

Coredump updates since 08 12 2017


By: Donat Zenichev (donat) 2017-12-11 04:59:15.806-0600

Asterisk log before the crash.

[Dec 11 12:49:44] DTMF[24627][C-00001064] channel.c: DTMF begin '1' received on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:44] DTMF[24627][C-00001064] channel.c: DTMF begin passthrough '1' on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:44] DTMF[24627][C-00001064] channel.c: DTMF end '1' received on SIP/vpbx_main_kiev-000020da, duration 165 ms
[Dec 11 12:49:44] DTMF[24627][C-00001064] channel.c: DTMF end accepted with begin '1' on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:44] DTMF[24627][C-00001064] channel.c: DTMF end passthrough '1' on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF begin '0' received on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF begin passthrough '0' on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF end '0' received on SIP/vpbx_main_kiev-000020da, duration 65 ms
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF end accepted with begin '0' on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF end '0' detected to have actual duration 79 on the wire, emulation will be triggered on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF end '0' has duration 79 but want minimum 80, emulating on SIP/vpbx_main_kiev-000020da
[Dec 11 12:49:45] DTMF[24627][C-00001064] channel.c: DTMF end emulation of '0' queued on SIP/vpbx_main_kiev-000020da
...

11 seconds later asterisk restarted:
[Dec 11 12:49:56] Asterisk 15.1.3 built by root @ virtual2-Master on a x86_64 running Linux on 2017-12-05 15:31:48 UTC

Does it look clearly? I mean DTMF looks as usual, but I can't understand why log always has DTMF signals before the crash, interesting coincidence.

By: Donat Zenichev (donat) 2017-12-19 07:36:40.849-0600

Temporarily moved back to 11.23.0 ver.

Now it's working fine, but not so nice idea I think, because of so old version and absence of pjsip module.

By: Joshua C. Colp (jcolp) 2017-12-22 07:15:14.297-0600

The 15 crash is being tracked at ASTERISK-27488.

By: Joshua C. Colp (jcolp) 2017-12-22 07:16:48.126-0600

This issue has become quite confusing with all of the version changes you've done. The 15 issue itself is being tracked on a different issue. If you have a problem with 13 can you please provide the information for that.

By: Asterisk Team (asteriskteam) 2018-01-05 12:00:01.627-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines