[Home]

Summary:ASTERISK-26903: Listening TCP/TLS sockets stop when temporarily out of open files
Reporter:Walter Doekes (wdoekes)Labels:
Date Opened:2017-03-29 07:01:44Date Closed:2017-04-11 20:13:54
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:13.14.0 14.3.0 GIT Frequency of
Occurrence
Related
Issues:
is caused byASTERISK-24728 tcptls: Bad file descriptor error when reloading chan_sip
Environment:Attachments:
Description:Just now a misconfigured Asterisk lost its AMI socket because it ran out of open files.

Asterisk uses ast_tcptls_server_root for several TCP/TLS listening sockets. If an accept() there fails, the entire listening thread just stops without more than a WARNING. (Okay, ERROR on Asterisk 13+.)

Example:
{noformat}
[2017-03-29 06:34:50] ERROR[6513] cel_custom.c: Unable to re-open master file /var/log/asterisk/cel-custom/full.csv : Too many open files
[2017-03-29 06:34:50] WARNING[6519] tcptls.c: Accept failed: Too many open files
[2017-03-29 06:34:50] ERROR[6513] cel_custom.c: Unable to re-open master file /var/log/asterisk/cel-custom/full.csv : Too many open files
{noformat}

At 06:34:50 AMI was trying to accept() an incoming connection. It failed here:
{code}
void *ast_tcptls_server_root(void *data)
{
// ...
       for (;;) {
// ...
               fd = ast_accept(desc->accept_fd, &addr);
               if (fd < 0) {
                       if ((errno != EAGAIN) && (errno != EWOULDBLOCK) && (errno != EINTR) && (errno != ECONNABORTED)) {
                               ast_log(LOG_WARNING, "Accept failed: %s\n", strerror(errno));
                               break;
// ...
       return NULL;
}
{code}

That is, with just that WARNING (ERROR), the listening thread dies.

There is no cleanup when the thread ends, so it keeps listening, but no one is accept()ing any connections. Because the OS takes care of the TCP handshake, it appears as though Asterisk has hung before it can do a write. (You can connect to the port, but nothing happens.) But the problem is even earlier.

This function is used here:
{noformat}
asterisk-rw-13.git$ wgrep . accept_fn.*ast_tcptls_server_root
./main/manager.c: .accept_fn = ast_tcptls_server_root, /* thread doing the accept() */
./main/manager.c: .accept_fn = ast_tcptls_server_root, /* thread doing the accept() */
./main/http.c: .accept_fn = ast_tcptls_server_root,
./main/http.c: .accept_fn = ast_tcptls_server_root,
./channels/chan_sip.c: .accept_fn = ast_tcptls_server_root,
./channels/chan_sip.c: .accept_fn = ast_tcptls_server_root,
{noformat}

(once for TCP, once for SSL)

In a 2014 commit that has made it to Asterisk 13+, the message has been changed from WARNING to ERROR.
{noformat}
commit 7c276f9fef945b644566533ddbcb72a2ec8ff821
Author: Olle Johansson <oej@edvina.net>
Date:   Sun Apr 27 19:29:27 2014 +0000

   tcptls.c : Log errors as ERROR, not warning or something else.
{noformat}

But there still is no indication that the thread has ended.

Suggestions for improvement:
- add another ERROR before {{return NULL}} that says the thread has ended (prematurely)
- or, don't end the thread just because a single accept() failed and stay in the for-loop instead; that would make Asterisk more resilient against temporary problems
Comments:By: Asterisk Team (asteriskteam) 2017-03-29 07:01:45.292-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Sean Bright (seanbright) 2017-03-29 08:15:18.529-0500

It's easy enough to add {{EMFILE}} to that error check which would solve this particular problem. But the listener thread dying without cleaning things up is a bit ugly.

By: Rusty Newton (rnewton) 2017-03-31 15:43:47.123-0500

Walter what versions does this affect?

By: Walter Doekes (wdoekes) 2017-04-02 04:22:32.451-0500

All recent versions: 13, 14, master

If I diff ast_tcptls_server_root between 11 and master, I see only the WARNING->ERROR change as relevant change.

The listening socket (accept_fd) is first closed when ast_tcptls_server_stop() is called, which generally is only on module/usage shutdown or if the config changes from enabled to disabled. In the mean time, a thread stopped because of this problem would sit in zombie state because no one is actively monitoring its state.

Ah, doing some git blamage, I find that the break/exit was introduced first in 2015.
{noformat}
commit c7591ef6bc6ad4c4e1c7f6a66de78b6ff70dc913
Author: Kevin Harwell <kharwell@digium.com>
Date:   Tue Jan 27 22:58:44 2015 +0000

   tcptls: Bad file descriptor error when reloading chan_sip
...
   valid. This is probably happening because unloading of chan_sip is not atomic.
   That however is outside the scope of this patch. This patch simply stops the
   logging of multiple occurrences of that message.
...
   ASTERISK-24728
{noformat}

By: Friendly Automation (friendly-automation) 2017-04-11 20:13:55.618-0500

Change 5448 merged by zuul:
tcptls.c: Cleanup TCP/TLS listener thread on abnormal exit.

[https://gerrit.asterisk.org/5448|https://gerrit.asterisk.org/5448]

By: Friendly Automation (friendly-automation) 2017-04-11 20:14:01.547-0500

Change 5449 merged by zuul:
tcptls.c: Cleanup TCP/TLS listener thread on abnormal exit.

[https://gerrit.asterisk.org/5449|https://gerrit.asterisk.org/5449]

By: Friendly Automation (friendly-automation) 2017-04-12 04:55:30.907-0500

Change 5447 merged by Joshua Colp:
tcptls.c: Cleanup TCP/TLS listener thread on abnormal exit.

[https://gerrit.asterisk.org/5447|https://gerrit.asterisk.org/5447]