[Home]

Summary:ASTERISK-28171: res_resolver_unbound: DNS issue when under load, can't make outgoing calls
Reporter:Cyril Ramière (Cyril.r)Labels:pjsip
Date Opened:2018-11-19 08:43:05.000-0600Date Closed:2019-11-06 10:52:41.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_resolver_unbound
Versions:15.4.0 16.0.0 16.6.1 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Linux, Ubuntu 16.04 (kernel 4.4.0-1060-aws)Attachments:( 0) asterisk.zip
( 1) console.log
( 2) fd.txt
( 3) locks.txt
( 4) taskprocessor.txt
( 5) threads.txt
Description:Hello,

We have an issue with Asterisk, let me try to explain the setup and the problem:

h5. Setup

An instance on AWS (C5.2XLARGE) : 8CPU & 16GB RAM.
Ubuntu 16.04, kernel 4.4.
Tried with Amazon Linux 2 AMI and got same issue.
Only asterisk installed (compile from src, tested on v15.4.0 & 16.0).
Using PJSIP (bundled).

h5. The problem

Problem is with outgoing calls (from US to our trunk), our trunk is set with an URL (so it uses DNS, that's very important).

When Asterisk is processing hundreds of calls (400, 500), and we try to make an outbound call using our trunk, it doesn't work.

We checked the configuration and everything seems fine, we replaced the instance & recompiled asterisk, same thing.
We tested on the last Asterisk 16.0, same thing.

We are running out of ideas, the instance is really big and can handle easily the load, if we use an IP instead of the DNS for our trunk, it works, so problem seems DNS related.

We checked everything we can, there are no DNS issues on the instance, resolution is absolutely fine.

I'm attaching the configuration that we use, and the outputs (console).

There are some messages related to the unbound library.
Ulimits are fine.
Seems that realtime is not part of the issue, for the sake of simplicity I removed the realtime configuration and kept a light configuration.

h5. Steps to reproduce
Start asterisk
Use SIPP to call extension 1000 with 400/500 calls and wait for the calls to be handled.
Try to make an outgoing call through the "outgoing" endpoint

Please tell me if you want more information/logs.
Comments:By: Asterisk Team (asteriskteam) 2018-11-19 08:43:08.103-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Cyril Ramière (Cyril.r) 2018-11-19 08:43:59.813-0600

Output when trying to make an outgoing call

By: Cyril Ramière (Cyril.r) 2018-11-19 08:44:15.382-0600

output of taskprocessor

By: Cyril Ramière (Cyril.r) 2018-11-19 08:46:00.482-0600

asterisk configuration files

By: Cyril Ramière (Cyril.r) 2018-11-19 10:21:28.529-0600

Hello,

I recompiled asterisk with debug flags to get locks & fd information.

I went down on the number of calls to 200 because the server was overloading (performance hit is huge with debug) and I was able to reproduce the issue & gather those informations, hope you can use it.

Files attached

Best regards.

By: Cyril Ramière (Cyril.r) 2018-11-19 10:22:46.970-0600

info gathered with debug flags enabled.
Note: less calls (~200) due to server overload because of the debug flags.

By: Cyril Ramière (Cyril.r) 2019-01-02 03:25:42.479-0600

Hello,

Wish you a Happy New Year!

I wanted to know if there is any progress on this case?

Can I do something to help? (provide more logs/details?)

Best regards.

By: Joshua C. Colp (jcolp) 2019-01-02 05:13:15.477-0600

This issue has been accepted and is in queue. There is no timeframe on when it will get looked into. There is also no need to ask for an update, as any updates will be posted on the issue itself.

By: Sean Bright (seanbright) 2019-01-04 08:58:59.104-0600

What version of {{unbound}} is installed on your system?

By: Cyril Ramière (Cyril.r) 2019-01-04 09:36:32.493-0600

Hello Sean,

We have those packages installed:

{quote}
libunbound-dev:amd64 | 1.5.8-1ubuntu1 | amd64
libunbound2:amd64 | 1.5.8-1ubuntu1 | amd64
{quote}



By: Cyril Ramière (Cyril.r) 2019-10-23 03:27:26.412-0500

Edit:
Sadly it affects latest release 16.6.1 (same setup).

By: Sean Bright (seanbright) 2019-10-23 08:07:19.260-0500

OK. So here is what is going on...

{{libunbound2 1.5.8-1ubuntu1.1}} on Xenial _does not_ use libevent. You can confirm this by running {{apt show libunbound2}} and see that the {{Depends:}} line does not contain {{libevent}}. So this means that {{libunbound2}} will instead use an internal {{libevent}}-like implementation called {{mini_event}} which they describe as:

bq. fake libevent implementation. Less broad in functionality, and only supports select(2).

So I did a bit of digging in to the source code of the {{libunbound2}} v1.5.8 package and determined that:

* The only way that {{event_add}} can fail (as indicated in the logs) is due to this statement: {{if(ev->ev_fd != \-1 && ev->ev_fd >= ev->ev_base->capfd)}}
* ...and {{ev->ev_base->capfd}} is set to {{MAX_FDS}} which is hard-coded as {{1024}}

This means that once file descriptor {{1024}} is allocated (anywhere in the asterisk process, not just in {{libunbound2}}'s code), {{event_add}} will start to fail with the error you are seeing.

A couple options:

# Build and install {{libunbound2}} from source, making sure to specify {{--with-libevent}} when calling its {{configure}}
# Upgrade to 18.04 Bionic or better (the {{libunbound2}} package in Bionic uses {{libevent}})

In either case, this is not an Asterisk bug. Sorry that it took so long to determine that.

By: Cyril Ramière (Cyril.r) 2019-11-06 09:50:38.273-0600

Hi Sean,

Well, wouldn't saw that coming :)

A big thanks for the provided workaround, so we got rid of ubuntu's libevent, compile & install a new one then compile & install libunbound2 and everything seems to work as expected.

That was really a viscious issue... "fake libevent implementation" with a hard coded limit... really? ^^

Glad it was fixed in next ubuntu releases.

Maybe a quick note on the install guide for our fellows relying on Ubuntu 16.04LTS would be a nice attention since this issue is not generating any error messages while discarding calls.

Again, thanks for your help :)

By: Asterisk Team (asteriskteam) 2019-11-06 09:50:38.796-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Sean Bright (seanbright) 2019-11-06 10:52:41.808-0600

Feel free to add a page to the [Asterisk Wiki|https://wiki.asterisk.org/wiki/display/AST/Home] if you feel that would be beneficial to other users.

Hopefully others with the same problem will find this issue with the workaround if they run into the same thing.

-I've also [filed an issue with the unbound project|https://github.com/NLnetLabs/unbound/issues/105] because this appears to be a bug and not a limitation.-

On further review - this appears to be a limitation of select and not of libunbound.