
Summary: ASTERISK-18930: Asterisk stops responding to SIP devices if it loses Internet Access (DNS)
Reporter: M. Anderson (adv99)
Date Opened: 2011-11-29 10:08:11.000-0600
Date Closed: 2012-01-28 11:51:15.000-0600
Priority: Major
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions: 1.8.7.1
Frequency of Occurrence: Constant
Related Issues:
is duplicated by ASTERISK-21378 chan_sip completely blocks on DNS lookups
is related to ASTERISK-17722 SIP SRV lookups for registration discard the port when dnsmgr disabled (the default)
Environment: FreePBX Distro/PBX In A Flash
Description: If Asterisk loses internet connectivity or DNS, it stops responding to all SIP devices and trunks, and all extensions lose connectivity. This bug has apparently been around since Asterisk 1.4, persisted through 1.6, and remains in 1.8.
Comments:

By: M. Anderson (adv99) 2011-11-30 17:28:29.555-0600

The problem occurs even if SRV lookups are disabled.

By: Leif Madsen (lmadsen) 2011-12-01 14:15:24.779-0600

It would be ideal to get more information about this issue. I'd be curious to see console logs with debug level logging just before the network disconnect, and just after it is re-established.

A SIP trace may also be useful to determine what is going on.

I'd also like to see a 'core show locks' output after the network is removed, and after it is re-established.

A backtrace from the running process may also be useful.

What happens if you enable dnsmgr?

By: Leif Madsen (lmadsen) 2011-12-01 14:15:36.360-0600

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install).  This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!



By: Leif Madsen (lmadsen) 2011-12-01 14:15:49.489-0600

Thank you for your bug report. In order to move your issue forward, we require a backtrace[1] from the core file produced after the crash. Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

After enabling, reproduce the crash, and then execute the backtrace[1] instructions. When complete, attach that file to this issue report.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace



By: M. Anderson (adv99) 2011-12-01 14:25:23.096-0600

I'm afraid that I lack sufficient expertise to do the things that you've asked me to do. I use Asterisk via a distribution, i.e. the FreePBX Distro and PBX In A Flash. The problem is easily reproducible on both of them, and has been reported on distros including Elastix, FreePBX, AsteriskNOW, PBX In A Flash, and the FreePBX Distro.

Just install AsteriskNOW, the FreePBX Distro, or PBX In A Flash (you can do so on a VMware virtual machine for ease of testing), set up a SIP trunk to a provider such as VOIP.ms, Callcentric, or Flowroute using their domain name rather than their IP address, configure an extension, register your phone, and then pull the plug on your internet connection.

Here are some links showing users reporting the problems as long ago as Asterisk 1.4:

Trixbox:
http://fonality.com/trixbox/forums/trixbox-forums/help/internet-down-cant-dial-other-extensions

PBX In A Flash:
http://pbxinaflash.com/forum/showthread.php?t=245

FreePBX:
http://www.freepbx.org/support/documentation/howtos/how-to-install-bind-so-that-sip-extensions-continue-to-work-when-intern

I can confirm that the problem still occurs with Asterisk 1.8.7.1. The only way to avoid it is to turn SRV lookup off, and then register all of your SIP trunks using IP addresses rather than domain names.

When the problem occurs, Asterisk does not crash, it merely gets bogged down with DNS lookups.  When you restore internet, everything starts working again.

By: David Woolley (davidw) 2011-12-02 06:29:57.838-0600

My suspicion is that he is breaking reverse DNS when he loses the internet connection.  If so, the question would be whether Asterisk can safely assume that reverse DNS isn't needed.  If not, it becomes a support issue, in terms of his providing fallback reverse DNS.  If so, it might still be a feature request.

By: M. Anderson (adv99) 2011-12-02 10:19:57.405-0600

I've been told by others in the FreePBX community that Digium has done something to address this bug since 1.6.  When I experienced the failure on 1.8.7.1, I had SRV Lookups enabled.  I have not tested 1.8.7.1 with SRV Lookup disabled.  Rather, I assumed that this was the same problem that I had had with Asterisk 1.4 and 1.6.  When I used those versions, I had SRV lookups disabled.

Based upon the new information I've received from the FreePBX community, it may be that Digium has corrected the original bug, and that this is a new bug (or a remnant of the old one) that only affects systems using SRV lookup.

For that reason, it is possible that Mr. Madsen's 12/1/11 comment indicating that this bug is related to SRV lookup is correct.  When I have time (hopefully today), I'll test again with SRV lookups disabled and see if the problem persists.

By: M. Anderson (adv99) 2011-12-02 22:43:21.925-0600

Okay, I had a chance to test it again.

Definitely still happens even with SRV Lookup disabled.

Even stranger, it still happens when I place all of the domain names used in my trunk settings in my hosts file. In the past, it did not happen when those names were resolved through my hosts file.



By: M. Anderson (adv99) 2011-12-02 23:37:47.771-0600

I did some more investigation.  

The problem occurs whenever a domain name is used in the host= line and registration lines of a SIP trunk.  If you lose internet, within 30-60 seconds, Asterisk will stop responding.

The only solution that I can find that actually works now is to define all SIP trunks using IP addresses instead of domain names.

My difficulty in finding this problem relates to the fact that it doesn't always occur instantly.  It can take up to 60 seconds before Asterisk stops responding, and even then, it will sometimes respond intermittently.

It remains the case that once internet connectivity is restored, Asterisk begins working normally again.

Here's how to reproduce the problem:

Set up Asterisk to register to a provider such as VOIP.ms or Callcentric.com, using their domain names.

Set up an extension to register to Asterisk.

Pull the plug on your internet connection.

Wait for some amount of time (usually 5 to 10 minutes) and your phone will show no service. Alternatively, pick up your phone and try calling the Time of Day feature code or even your own extension. For some period of time, your calls will go through; eventually, however, within 30 to 60 seconds, they'll stop going through.


By: Matt Jordan (mjordan) 2011-12-15 11:51:41.881-0600

Asterisk uses synchronous hostname resolution when it performs a DNS lookup for a peer's hostname. As such, if chan_sip has to resolve a hostname and the DNS server is not available, the lookup can block the SIP do_monitor thread until the request times out. How long the call blocks depends on the system Asterisk is running on, but it can be more than several seconds. Obviously, if a large number of peers have to be resolved and Asterisk enters this state, it will become unresponsive to SIP traffic on a local intranet.
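The effect on a single monitor thread can be shown with a minimal Python model. This is a toy simulation, not chan_sip code: the task names, the three trunk peers, and the 10-second lookup timeout are hypothetical values chosen for illustration (the real timeout depends on the resolver configuration). Work items run strictly one after another, as they would on one thread.

```python
from typing import List, Tuple


def run_single_threaded(tasks: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Run tasks one after another on a single simulated thread.

    Each task is (name, duration_in_seconds). Returns (name, completion_time)
    pairs, showing how long each task had to wait behind the ones before it.
    """
    clock = 0.0
    completed = []
    for name, duration in tasks:
        clock += duration  # the thread can do nothing else while this task runs
        completed.append((name, clock))
    return completed


# Hypothetical numbers: three trunk peers whose hostnames cannot be resolved,
# each lookup blocking for 10 s before failing, followed by a REGISTER from a
# phone on the LAN that would otherwise take about 1 ms to process.
LOOKUP_TIMEOUT_S = 10.0
work = [(f"resolve trunk{i}.example.com", LOOKUP_TIMEOUT_S) for i in range(1, 4)]
work.append(("handle LAN REGISTER", 0.001))

for name, done_at in run_single_threaded(work):
    print(f"{done_at:8.3f} s  {name}")
```

In this model the LAN REGISTER completes at about 30 s instead of about 1 ms, even though nothing is wrong with the LAN: it simply waits behind lookups unrelated to it. A stall of that length is consistent with phones timing out and showing no service, as described in the reproduction steps above.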

At this time, we are not planning to implement a DNS cache or asynchronous DNS lookups in Asterisk. The best solution is to instead install and configure a local DNS cache on the system that Asterisk runs on - there are many very good ones available in all major Linux distributions (and I would imagine the same to be true for other Operating Systems). In the case where internet connectivity is lost, this should prevent long hostname resolution times, as Asterisk will still hit the local DNS cache rather than timing out. In Asterisk versions 1.8 and greater, you may also find that using the dnsmgr feature (which periodically refreshes DNS information on a separate background thread) will keep chan_sip from becoming unresponsive. Without a local DNS cache, however, you may simply be trading which thread is blocked for a long period of time - so this is not a solution in and of itself.
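The dnsmgr trade-off can be sketched as follows. This is a minimal Python model under stated assumptions, not Asterisk code: `BackgroundRefresher`, `make_fake_resolver`, the 300-second refresh interval, and the hostnames and addresses are all hypothetical. A background thread keeps a last-known-good address for each hostname, and the SIP path only reads that cache, so it never waits on the resolver.

```python
import threading
from typing import Callable, Dict, Iterable, Optional


class BackgroundRefresher:
    """Toy model of a dnsmgr-style refresher (hypothetical; not Asterisk's API).

    A background thread re-resolves each hostname every `interval_s` seconds
    and stores the last successful answer. The SIP-handling path only reads
    the stored answer, so it never waits on the resolver itself.
    """

    def __init__(self, resolve: Callable[[str], str], interval_s: float = 300.0):
        self._resolve = resolve  # may block or raise OSError when DNS is down
        self._interval_s = interval_s
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def refresh_once(self, hostname: str) -> None:
        """Resolve one hostname; on failure keep the last known good address."""
        try:
            address = self._resolve(hostname)  # any blocking happens here, off the SIP path
        except OSError:  # socket.gaierror is a subclass of OSError
            return
        with self._lock:
            self._cache[hostname] = address

    def lookup(self, hostname: str) -> Optional[str]:
        """Non-blocking read used by the SIP path; None if never resolved."""
        with self._lock:
            return self._cache.get(hostname)

    def start(self, hostnames: Iterable[str]) -> None:
        names = list(hostnames)

        def loop() -> None:
            while not self._stop.is_set():
                for name in names:
                    self.refresh_once(name)
                self._stop.wait(self._interval_s)  # sleep, but wake early on stop()

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def make_fake_resolver():
    """Hypothetical resolver whose upstream can be switched off, like a lost WAN link."""
    state = {"online": True}

    def resolve(hostname: str) -> str:
        if not state["online"]:
            raise OSError("DNS server unreachable")
        return "192.0.2.10"  # documentation address (RFC 5737)

    return state, resolve


state, resolve = make_fake_resolver()
refresher = BackgroundRefresher(resolve)
refresher.refresh_once("sip.example.com")   # internet up: address cached
state["online"] = False
refresher.refresh_once("sip.example.com")   # internet down: lookup fails
print(refresher.lookup("sip.example.com"))  # still the cached 192.0.2.10
```

Note that the refresher thread itself still waits out the full timeout on each failing lookup; it simply no longer holds up SIP traffic. That is the "trading which thread is blocked" caveat above, and it is why a local caching resolver is still recommended alongside dnsmgr.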

If you can, please implement a local DNS cache on the system experiencing this behavior and retest, and let us know if this prevents the complete loss of SIP functionality when the DNS provider is no longer available (you're obviously going to lose some SIP functionality :-) )

By: Paul Belanger (pabelanger) 2012-01-28 11:50:58.200-0600

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested.  Further information can be found at http://www.asterisk.org/developers/bug-guidelines