[Home]

Summary:ASTERISK-19277: [patch]endlessly repeating error: "poll failed: Bad file descriptor"
Reporter:Barry Chern (bchern)Labels:
Date Opened:2012-01-31 03:06:18.000-0600Date Closed:2015-07-02 07:55:16
Priority:MinorRegression?
Status:Closed/CompleteComponents:Resources/res_timing_kqueue Resources/res_timing_timerfd
Versions:1.8.9.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:OS X 10.5.8 on PPC Mac (G5)Attachments:( 0) attnxfer.sh
( 1) attnxfer-alice.xml
( 2) attnxfer-bob-from-alice-to-charlie.xml
( 3) attnxfer-bob-to-charlie.xml
( 4) attnxfer-charlie.xml
( 5) attnxfer-extensions.conf
( 6) attnxfer-sip.conf
( 7) issueA19277-fix_closing_of_fd_0.patch
( 8) onePointEightIssueLog
Description:Attempting to move to 1.8 from a long-time working, not very complex, 1.6.x installation. Started with a download of the last release (1.8.8), then tried latest builds via SVN twice. Latest attempt with Asterisk SVN-branch-1.8-r353175.

After apparently executing all of the steps of a dialplan without a problem, and hanging up, there is one error that seems to be incorrect, saying the csv couldn't be written to a readonly database, but the csv data HAS been written when I check on it. there follows a fast stream of these errors (after a slight pause) ad infinitum until I either cancel or the process kills itself by crashing:

*CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
*CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor
*[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor

etc.etc. etc. (edited for brevity)
Comments:By: Matt Jordan (mjordan) 2012-01-31 08:24:48.752-0600

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Barry Chern (bchern) 2012-02-03 01:17:00.512-0600

This is the log. Note that I cancelled the process because it showed no signs of ending the repeat of the error or crashing, and I had to delete many of the repeats at the end to make the file under 10 MB.

By: Barry Chern (bchern) 2012-02-03 01:18:42.365-0600

I have uploaded the log, with comment. In addition, note that the error happened after receiving my test call and recording voicemail.

By: Matt Jordan (mjordan) 2012-02-09 14:45:04.740-0600

Although I'm not convinced its causing your problem, the error statement that the database is read-only is coming from the lower level SQLite3 driver - so it, at least, thinks there is a problem.  Whether or not the write is actually succeeding in spite of that error is entirely up to the driver.

You may want to attempt to perform an INSERT into the CDR table under the same user permissions as what Asterisk executes under to see if it reports a similar problem.

By: Barry Chern (bchern) 2012-02-18 05:35:44.476-0600

I doubted the database error caused the fatal "poll failed" error just because there seemed to be such a long pause between them. but, do we have any idea what the "poll" is, what is being polled, and why it is so persistent in re-trying it?

By: laszlovl (lvl) 2012-02-18 10:51:10.010-0600

Are you by any chance using the asterisk HTTP manager interface?

By: Barry Chern (bchern) 2012-02-19 14:43:27.915-0600

no, just CLI

By: Walter Doekes (wdoekes) 2012-02-20 03:34:05.988-0600

> do we have any idea what the "poll" is, what is being polled, and why it is so persistent in re-trying it?

ast_el_read_char is reading characters from standard input and ast_consock, i.e. the input on the command line and the asterisk.ctl unix socket. They're both not expected to disappear., so it's probably not handled in the prettiest of ways. As far as I can see, a remote console only breaks if a write to stdout (STDOUT_FILENO==1) fails:
{code}
               if (!ebuf && write(1, "", 1) < 0)
                       break;
{code}
So that would have it continue indefinitely.

The question: why are the descriptors bad/gone?

By: Walter Doekes (wdoekes) 2015-06-30 05:27:31.082-0500

Okay, so this apparently still exists (in Asterisk GIT-11-96bbcf4M).

I've found a way to reproduce it too:
- ulimit -n 128  # to speed it up
- asterisk -c
- start a bunch of calls so we run out of FDs
- press a key in the console
- result: a flood of CLI> prompts, and the console is unusable

At that point we arrive here (this is not a patch, so you shall not remove it -- and if it we're, I hereby grant you permission to use it):
{noformat}
--- a/main/asterisk.c
+++ b/main/asterisk.c
@@ -2530,8 +2530,21 @@ static int ast_el_read_char(EditLine *editline, char *cp)
                       fprintf(stderr, "poll failed: %s\n", strerror(errno));
                       break;
               }
+              
+               /* revents can contains POLLIN, but also POLLERR, POLLHUP and POLLNVAL.
+                * If the process is out of file descriptors, we get POLLNVAL. At this
+                * point, the console is busted. */
+               if ((fds[0].revents & POLLNVAL) == POLLNVAL ||
+                               (max > 1 && (fds[1].revents & POLLNVAL) == POLLNVAL)) {
+                       fprintf(stderr, "Poll returned POLLNVAL, are you out of file descriptors?\n");
+                       fflush(stderr);
+                       usleep(500000);
+                       return 0;
+               }

-               if (!ast_opt_exec && fds[1].revents) {
+               if (max > 1 && fds[1].revents) {
                       num_read = read(STDIN_FILENO, cp, 1);
                       if (num_read < 1) {
                               break;
{noformat}

Unfortunetely, that isn't enough. At this point the console is busted.

So, it would seem that STDIN_FILENO is close()d by someone. Except, my GDB cannot find out who.

I tried this:

{noformat}
$ sudo gdb `which asterisk`
...
(gdb) run -c
...
...........................................................................................................................................................Asterisk Ready.
[New Thread 0x7fffa70f4700 (LWP 10377)]
*CLI>
{noformat}

At this point, the FDs look like this:

{noformat}
# ls /proc/`pidof asterisk`/fd -lv
total 0
lrwx------ 1 root root 64 jun 30 12:18 0 -> /dev/pts/37
lrwx------ 1 root root 64 jun 30 12:18 1 -> /dev/pts/37
lrwx------ 1 root root 64 jun 30 12:18 2 -> /dev/pts/37
lr-x------ 1 root root 64 jun 30 12:18 3 -> /dev/urandom
lrwx------ 1 root root 64 jun 30 12:18 4 -> socket:[1511689]
lrwx------ 1 root root 64 jun 30 12:18 5 -> /var/lib/asterisk/astdb.sqlite3
...
{noformat}

Add some breakpoints:
{noformat}
*CLI>
Program received signal SIGINT, Interrupt.
0x00007ffff78ff12d in poll () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) break close
Breakpoint 1 at 0x7ffff67a63f0: close. (2 locations)
(gdb) cond 1 $rbx == 0

(gdb) break dup2
Breakpoint 2 at 0x7ffff78fdfe0: file ../sysdeps/unix/syscall-template.S, line 81.
(gdb) cond 2 $rcx == 0

(gdb) break dup3
Breakpoint 3 at 0x7ffff78fe010: file ../sysdeps/unix/syscall-template.S, line 81.
(gdb) cond 3 $rcx == 0

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <MULTIPLE>        
stop only if $rbx == 0
1.1                         y     0x00007ffff67a63f0 ../sysdeps/unix/syscall-template.S:81
1.2                         y     0x00007ffff78fdf50 ../sysdeps/unix/syscall-template.S:81
3       breakpoint     keep y   0x00007ffff78fdfe0 ../sysdeps/unix/syscall-template.S:81
stop only if $rcx == 0
3       breakpoint     keep y   0x00007ffff78fe010 ../sysdeps/unix/syscall-template.S:81
stop only if $rcx == 0
{noformat}

Carry on and start flooding the thing:
{noformat}
(gdb) cont
Continuing.
...
[New Thread 0x7fffa6ffc700 (LWP 10529)]
Poll returned POLLNVAL, are you out of file descriptors?
[Thread 0x7fffa6ffc700 (LWP 10529) exited]

Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  close () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000042c7f7 in ast_ouraddrfor (them=them@entry=0x17b78d8, us=us@entry=0x17b7ba8) at acl.c:972
#2  0x00007fffb9907c5c in ast_sip_ouraddrfor (them=0x17b78d8, us=us@entry=0x17b7ba8, p=0x17b3178) at chan_sip.c:3801
#3  0x00007fffb9934f8f in sip_alloc (callid=callid@entry=0x179b0b2 "163-10403@127.0.0.1", addr=addr@entry=0x7fffa73dbd20, useglobal_nat=useglobal_nat@entry=1, intended_method=intended_method@entry=5,
   req=req@entry=0x7fffa73db2b0, logger_callid=logger_callid@entry=0x1513df0) at chan_sip.c:8584
...
(gdb) up
#1  0x000000000042c7f7 in ast_ouraddrfor (them=them@entry=0x17b78d8, us=us@entry=0x17b7ba8) at acl.c:972
972 close(s);
(gdb) print s
$2 = 0
(gdb) list 960
955 if ((s = socket(ast_sockaddr_is_ipv6(them) ? AF_INET6 : AF_INET,
956 SOCK_DGRAM, 0)) < 0) {
957 ast_log(LOG_ERROR, "Cannot create socket\n");
958 return -1;
...
{noformat}

And indeed, fd 0 has been replaced. It has been (silently) closed by someone/something, and then a socket() call got hold of it.

{noformat}
# ls /proc/`pidof asterisk`/fd -lv
total 0
lrwx------ 1 root root 64 jun 30 12:18 0 -> socket:[1513038]
lrwx------ 1 root root 64 jun 30 12:18 1 -> /dev/pts/37
lrwx------ 1 root root 64 jun 30 12:18 2 -> /dev/pts/37
lr-x------ 1 root root 64 jun 30 12:18 3 -> /dev/urandom
lrwx------ 1 root root 64 jun 30 12:18 4 -> socket:[1511689]
lrwx------ 1 root root 64 jun 30 12:18 5 -> /var/lib/asterisk/astdb.sqlite3
...
{noformat}

Weird no? I would have expected a breakpoint on either close, dup2 or dup3 the first time around. I cannot figure out how this happens.

By: Walter Doekes (wdoekes) 2015-06-30 06:53:46.124-0500

For the record, I was testing something else and ran into this while running the attnxfer.sh test. See the attached files starting with {{attnxfer}}.

Note that I'm not doing any (explicit) SQLite stuff.

All files except sip.conf and extensions.conf are from {{make samples}}:
{noformat}
$ ls /etc/asterisk/ -1
asterisk.conf
attnxfer-alice.xml
attnxfer-bob-from-alice-to-charlie.xml
attnxfer-bob-to-charlie.xml
attnxfer-charlie.xml
attnxfer.sh
extensions.conf
modules.conf
sip.conf
{noformat}

By: Walter Doekes (wdoekes) 2015-07-01 04:35:12.774-0500

Aha!

So, I shouldn't have been looking at {{%rbx}} but at {{%rdi}} -- which I noticed when debugging ASTERISK-25220. Now it all became clear. The issue is in {{res_timing_timerfd.c}} (and kqueue).

See attached patch: {{issueA19277-fix_closing_of_fd_0.patch}}