Summary:                  ASTERISK-25251: getifaddrs() blocks infinitely in PJSIP
Reporter:                 Gergely Dömsödi (doome)
Labels:
Date Opened:              2015-07-15 07:36:54
Date Closed:              2015-11-20 10:02:43.000-0600
Priority:                 Major
Regression?:
Status:                   Closed/Complete
Components:               Resources/res_pjsip
Versions:                 13.1.1, 13.3.2, 13.4.0
Frequency of Occurrence:  Frequent
Related Issues:
Environment:              Fedora 22
Attachments:              (0) bt.txt
                          (1) bt.txt
                          (2) bt-nodns.txt
                          (3) fd.txt
                          (4) locks.txt
                          (5) locks-nodns.txt
                          (6) taskprocessors.txt
                          (7) threads.txt
Description: Using PJSIP, after about 10-20 minutes of SIP traffic between two PJSIP peers, a deadlock occurs and Asterisk cannot serve SIP traffic anymore. Incoming INVITEs can be seen with tcpdump, but they do not even show up in the pjsip log.

The issue is reproducible with the "vanilla" asterisk package in Fedora 22 (13.1.1), with the package currently in updates-testing (13.3.2), and with a custom-built 13.4.0. I tried PJSIP 2.3 from Fedora and a custom-built 2.4; both were affected.

While in the deadlock, PJSIP CLI commands continue to work ({{pjsip show *}}), but Asterisk cannot be stopped: {{core stop now}} just returns and nothing happens. Only SIGKILL can pull Asterisk out of this state.

The attached outputs and backtraces are from the custom-built 13.4.0 with PJSIP 2.4.
Comments:

By: Asterisk Team (asteriskteam) 2015-07-15 07:36:55.226-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Gergely Dömsödi (doome) 2015-07-15 07:39:46.504-0500

Attached: {{core show locks}}, {{core show taskprocessors}}, {{core show threads}}, and {{core show fd}} output, plus the backtrace.
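
For reference, I gathered these roughly like this ({{core show locks}} needs a build with {{DEBUG_THREADS}}, and {{core show fd}} one with {{DEBUG_FD_LEAKS}}):

{noformat}
asterisk -rx "core show locks"          > locks.txt
asterisk -rx "core show taskprocessors" > taskprocessors.txt
asterisk -rx "core show threads"        > threads.txt
asterisk -rx "core show fd"             > fd.txt
gdb -batch -ex "thread apply all bt full" -p $(pidof asterisk) > bt.txt
{noformat}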

By: Richard Mudgett (rmudgett) 2015-07-15 10:45:56.302-0500

This is not a classical deadlock. Looking at the [^locks.txt] and [^bt.txt] files, the thread holding the lock is stuck trying to resolve a DNS query. Your DNS server looks to be down or otherwise inaccessible.

By: Gergely Dömsödi (doome) 2015-07-15 13:01:21.825-0500

When I was looking at [^bt.txt] I suspected the same, so I verified that the DNS servers are configured correctly; they are, and normally both of them resolve queries. However, since they are our ISP's servers, one could of course be faulty somehow, e.g. not responding to a particular query because of rate limiting. Still, Asterisk should handle this condition far more gracefully, as there are plenty of flaky DNS servers out in the wild.

That said, I have read somewhere that {{getaddrinfo()}} and/or {{gethostbyname()}} in glibc is far from perfect, and Fedora is massively [patching|https://fedoraproject.org/wiki/Features/FixNetworkNameResolution] them, but unfortunately I don't have the resources to decide who should fix this.
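
Part of the difficulty is that {{getaddrinfo()}} exposes no timeout parameter, so a single unresponsive resolver wedges the calling thread for as long as glibc's internal retries take. A minimal sketch of the usual application-side workaround (not Asterisk's actual code; the host name and the 5-second budget are made up) resolves in a detached thread and abandons the wait after a deadline:

{code}
/* Sketch only -- not Asterisk's code. getaddrinfo() has no timeout
 * parameter, so a hung resolver blocks the caller indefinitely.
 * Workaround pattern: resolve in a detached thread, wait with a deadline.
 * Build with: cc -pthread sketch.c */
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static int done;
static struct addrinfo *result;

static void *resolver(void *arg)
{
    struct addrinfo hints, *res = NULL;
    (void)arg;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    /* Hypothetical peer; on a dead resolver this can block for minutes. */
    getaddrinfo("sip.example.com", "5060", &hints, &res);
    pthread_mutex_lock(&lock);
    result = res;
    done = 1;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct timespec deadline;
    int finished;

    pthread_create(&tid, NULL, resolver, NULL);
    pthread_detach(tid);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 5;                  /* made-up 5 second budget */

    pthread_mutex_lock(&lock);
    while (!done) {
        if (pthread_cond_timedwait(&done_cond, &lock, &deadline)) {
            printf("lookup timed out, giving up\n");
            break;                         /* abandon the detached thread */
        }
    }
    finished = done;
    pthread_mutex_unlock(&lock);

    if (finished && result)
        freeaddrinfo(result);
    return 0;
}
{code}

The cost of this pattern is that a timed-out lookup leaks the abandoned thread, since cancelling a thread inside a resolver call is generally unsafe.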

I am going to try Asterisk without any DNS resolver configured tomorrow and see whether that fixes the problem. If there is anything else I can help with, please let me know.

By: Gergely Dömsödi (doome) 2015-07-16 01:48:18.861-0500

Tried again with no DNS servers configured, and the lock-up happened after 20 minutes. Outputs attached: [Backtrace|^bt-nodns.txt] and {{[core show locks|^locks-nodns.txt]}}

By: Gergely Dömsödi (doome) 2015-07-16 02:56:56.555-0500

Since I noticed in [^bt-nodns.txt] that Asterisk was trying to resolve the local hostname, I realized it was not listed in {{/etc/hosts}}. I added it, but unfortunately the result was the same; the lock-up happened after ~20 minutes.
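
For reference, the entry I added was of this shape (hostname here is a placeholder, not the real one):

{noformat}
127.0.1.1   pbx.example.com pbx
{noformat}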

By: Richard Mudgett (rmudgett) 2015-07-16 11:15:29.158-0500

I think you are still having DNS resolution issues. The [^locks-nodns.txt] file shows two threads waiting on different channel locks. If you find the threads holding the locks being waited on and then look them up in the [^bt-nodns.txt] backtrace file, you see that their stacks are uninformative: there are only two frames on each stack, with question marks for names.

I think that in the time between capturing the locks file and the backtrace file, the pending DNS query timed out and that specific blockage was eliminated. The slowdown then moved to another DNS query involving other threads.

By: Gergely Dömsödi (doome) 2015-07-20 01:15:26.906-0500

I think I managed to solve the problem, but I still can't really decide whether it was an application or a kernel bug.

Since you wrote that Asterisk was still locking up in a DNS query, I updated {{nsswitch.conf}} so that host lookups only used {{files}} and {{dns}} (I disabled {{myhostname}} and {{mymachines}}), while still leaving no DNS servers configured. The issue seemed to disappear, but after some use there were still locked threads appearing, blocked indefinitely in a {{recvmsg()}} called from {{netlink_request()}} and ultimately from {{getifaddrs()}}. When I realized this, I googled a bit for "infinite block getifaddrs()" and the like, and found several other bug reports involving multithreaded applications and {{getifaddrs()}}.
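
The {{hosts}} line in {{nsswitch.conf}} ended up roughly like this (a sketch; the stock Fedora line also lists the modules I disabled):

{noformat}
hosts: files dns
{noformat}

A minimal stress sketch of the pattern those reports describe just hammers {{getifaddrs()}} from several threads until one wedges in {{recvmsg()}} (thread count arbitrary; this is not the exact reproduction):

{code}
/* Sketch: call getifaddrs() concurrently until a thread hangs.
 * Build with -pthread; run under a watchdog, it loops forever by design. */
#include <ifaddrs.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8   /* arbitrary */

static void *worker(void *arg)
{
    long id = (long)arg;
    for (unsigned long n = 0;; n++) {
        struct ifaddrs *ifap;
        /* When the bug hits, this call never returns: it is blocked
         * in recvmsg() on the netlink socket, as in the backtraces. */
        if (getifaddrs(&ifap) == 0)
            freeifaddrs(ifap);
        if (n % 10000 == 0)
            printf("thread %ld: %lu iterations\n", id, n);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);   /* never reached when a thread wedges */
    return 0;
}
{code}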

Next, I updated the kernel from 4.0.4 to 4.0.7, and the locked threads were gone; even after I enabled DNS queries again, Asterisk kept operating normally.

By: Rusty Newton (rnewton) 2015-07-21 15:43:20.370-0500

Strange... well, I'm going to close this out, as I don't think we have enough to go on here to point to any possible Asterisk bug. For now we'll presume it was a kernel issue, since the upgrade has resolved it so far.

If you have issues in the future or if more people report this issue then we can re-open and look into it further. In that case we would need a more narrow, step-by-step guide for reproduction so that we could examine it in a lab environment.

By: Gergely Dömsödi (doome) 2015-10-16 07:06:41.286-0500

Unfortunately, the issue seems to reappear with a more recent kernel (4.1.10). I have not been able to pinpoint the exact kernel version in which it broke again, because I jumped straight to this one. I am trying to narrow it down, though.

So to summarize: 4.0.4 is bad, 4.0.7 is good, and 4.1.10 is bad again.

By: Asterisk Team (asteriskteam) 2015-10-16 07:06:42.175-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Gergely Dömsödi (doome) 2015-10-16 07:10:06.952-0500

Asterisk 13.6.0 is still affected; I tried both 13.3.2 (which was in the original bug report) and the newest release. The same thing happened.

By: Gergely Dömsödi (doome) 2015-10-21 03:58:17.649-0500

Sorry, false alarm. While trying to pinpoint the exact kernel version where it went wrong, I installed several kernels on the machine, and the issue didn't reappear. Even after reinstalling 4.1.10, the issue still did not show up.

Closing the issue again...

By: Florian Weimer (fweimer) 2015-11-19 07:20:56.112-0600

We have a report of a similar issue, on Debian wheezy with Linux 4.1.13:

 http://marc.info/?l=linux-netdev&m=144777158701417&w=2

We have not been able to pinpoint the cause. We have seen such hangs on netlink sockets when something else in the process reads the kernel's netlink response before it is processed by the intended recipient, but we are not sure whether that is what is happening here.
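
For context, {{getifaddrs()}} on Linux issues a netlink dump request and reads replies until {{NLMSG_DONE}}. The stripped-down sketch below (illustrative only, not the actual glibc code, which also dumps links and binds the socket) marks where the observed hang would sit:

{code}
/* Illustrative sketch of the netlink dump that getifaddrs() performs. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0) { perror("socket"); return 1; }

    struct {
        struct nlmsghdr nlh;
        struct ifaddrmsg ifa;
    } req;
    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
    req.nlh.nlmsg_type = RTM_GETADDR;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.nlh.nlmsg_seq = 1;
    req.ifa.ifa_family = AF_UNSPEC;
    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) { perror("send"); return 1; }

    char buf[8192];
    for (;;) {
        /* If the dump reply is consumed by someone else in the process,
         * or never arrives, this recv() blocks forever -- the state
         * seen in the attached backtraces. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0) { perror("recv"); return 1; }
        int len = (int)n;
        for (struct nlmsghdr *nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            if (nh->nlmsg_type == NLMSG_DONE) { close(fd); return 0; }
            if (nh->nlmsg_type == NLMSG_ERROR) { close(fd); return 1; }
            /* RTM_NEWADDR payloads would be parsed into ifaddrs here. */
        }
    }
}
{code}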

Gergely, what is your exact glibc version?


By: Gergely Dömsödi (doome) 2015-11-20 06:29:58.482-0600

It's Fedora 22's default 2.21.


By: Asterisk Team (asteriskteam) 2015-11-20 06:29:59.854-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.