[Home]

Summary:ASTERISK-27885: Calls to SIP phones fail to establish, debug log and tcpdump suggest a packet backlog
Reporter:Aragon Gouveia (acrg)Labels:pjsip
Date Opened:2018-05-30 09:07:45Date Closed:2018-06-14 07:11:24
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:13.17.2 13.20.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Debian Stretch: Linux asterisk 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3 (2017-12-03) x86_64 GNU/LinuxAttachments:( 0) core-show-settings.txt
( 1) core-show-sysinfo.txt
( 2) debug.txt
( 3) sip-show-settings.txt
( 4) tcpdump.txt
Description:The PBX I operate receives calls from a telco via a DAHDI channel, and these typically all ring to SIP phones. At random times, a call will ring to a SIP phone, but answering the phone fails to establish the call. The ringing stops, but audio is just dead.

I've been running Asterisk with debugging enabled on one SIP extension, while simultaneously running tcpdump in the background to capture the actual network communication. Every time it happens, tcpdump shows the phone's SIP responses reaching the Asterisk server, but the debug log shows the same packets being processed much later than the tcpdump timestamps. By the time Asterisk has processed the phone's response packets, the caller has already hungup or some other timeout has already occurred (depending on which packets get backlogged).

Unfortunately I have no way to reliably reproduce this. It can take weeks and hundreds of successful calls to manifest. The PBX was recently upgraded from Debian Wheezy + Asterisk 1.8.10.0, and since then this problem started occurring (ie. since Asterisk 13 and Debian Stretch).

Attaching files with more details:

core-show-settings.txt - CLI output, "core show settings"
core-show-sysinfo.txt - CLI output, "core show sysinfo"
sip-show-settings.txt - CLI output, "sip show settings"
debug.txt - Debug log, notice the silence between 08:30:07 and 08:30:15
tcpdump.txt - Notice the delay in processing packets received between 08:30:10 and 08:30:17.
Comments:By: Asterisk Team (asteriskteam) 2018-05-30 09:07:46.144-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Richard Mudgett (rmudgett) 2018-05-30 13:05:13.178-0500

* You may be having a DNS hiccup during the backlog time.
* The debug information \[1] level you have captured isn't high enough to infer what may be happening.
* You are using chan_sip which is entirely community supported so response times will reflect that.
* You have marked this issue as a regression.  A regression from what version?

\[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Aragon Gouveia (acrg) 2018-05-30 15:31:44.473-0500

bq. The debug information [1] level you have captured isn't high enough to infer what may be happening.

I have run "core set verbose 5" and "core set debug 5" and reloaded the logger module. SIP debug is enabled for only one peer - enabling for all would create a lot of noise. I hope this is sufficient?

bq. You have marked this issue as a regression. A regression from what version?

1.8.10.0


By: Aragon Gouveia (acrg) 2018-05-30 15:33:45.318-0500

bq. You may be having a DNS hiccup during the backlog time.

Does chan_sip perform DNS resolution synchronously? Does chan_pjsip do so asynchronously?

Is there a way to enable DNS resolution debugging?

Thank you!

By: Richard Mudgett (rmudgett) 2018-05-31 14:46:11.666-0500

* SIP by its nature is reliant on DNS so a DNS disruption is going to affect SIP.  chan_sip is mostly synchronous in design so a DNS disruption can affect all calls controlled by chan_sip.  chan_pjsip is more multithreaded in design so a disruption can have less impact.
* There is no specific option to enable DNS debugging.  There are debug messages that will come out if the logging level is high enough.
* Marking this issue as a regression is a stretch since you have practically rebuilt your server from scratch.

By: Richard Mudgett (rmudgett) 2018-05-31 14:51:11.484-0500

I'm going to leave this issue in "waiting for feedback" until more information is available.  If it auto closes in a couple weeks, you can reopen the issue when you have more information by commenting on the issue.

By: Aragon Gouveia (acrg) 2018-06-14 05:33:54.336-0500

I've now enabled much more debugging and had it happen again. The problem seems to be due to a "sip reload" blocking chan_sip or at least having a much higher priority than other tasks. Asterisk takes about 10 seconds to process a SIP reload with our SIP config, and during this time a large backlog of unprocessed network traffic accumulates. When it finishes, all the backlog gets handled, including many SIP timeouts.

This wasn't a problem with 1.8. I can't see any evidence of DNS lookups being the cause, and our SIP config does not have any DNS names in it (every peer entry has host=dynamic).

I'm not sure if you want to consider this a bug or a regression, so I'll leave it up to you. From my side I'm just going to try eliminate or reduce the frequency of our SIP reloads, and for our next upgrade we'll migrate to chan_pjsip, which I'm hoping doesn't have the same problem.

Thank you!

By: Richard Mudgett (rmudgett) 2018-06-14 07:11:25.009-0500

Reloading a system while it is busy is a disruption as the system has to figure out its new configuration.  Closing as the disruption is user caused.