[Home]

Summary:ASTERISK-18453: manager.c: HTTP Manager, fdopen failed: Bad file descriptor!
Reporter:Jaco Kroon (jkroon)Labels:
Date Opened:2011-09-07 11:20:03Date Closed:2012-03-05 12:52:19.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/ManagerInterface
Versions:1.8.5.0 Frequency of
Occurrence
Frequent
Related
Issues:
is the original version of this clone:ASTERISK-18700 chan_sip.c and tcptls.c - possible double close of file descriptor
is related toASTERISK-18223 using http manager commands causes global file descriptor instability, crashing Asterisk
Environment:Attachments:( 0) sip_loop.sh
( 1) valgrind.txt
Description:Hi,

I get the following warning in my asterisk/messages log file:

manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

Using GDB I managed to trace this to manager.c, line 5590, as a result of a failure on line 5588.  The appropriate code snippet:

   s.session = session;
   s.fd = mkstemp(template);   /* create a temporary file for command output */
   unlink(template);
   if (s.fd <= -1) {
       ast_http_error(ser, 500, "Server Error", "Internal Server Error (mkstemp failed)\n");
       goto generic_callback_out;
   }
   s.f = fdopen(s.fd, "w+");
   if (!s.f) {
       ast_log(LOG_WARNING, "HTTP Manager, fdopen failed: %s!\n", strerror(errno));
       ast_http_error(ser, 500, "Server Error", "Internal Server Error (fdopen failed)\n");
       close(s.fd);
       goto generic_callback_out;
   }

Honestly, I cannot see how that should fail, and I've had others look at it as well.  It makes no sense.

When compiling with DONT_OPTIMIZE this seems to go away, so instead simply compiled with -ggdb.  I can reproduce this by logging in via the HTTP interface and then cycling between two commands:

Action: Command
Command: reload

Action: SIPPeers

99% of the time it dies on the SIPPeers one, but sometimes on reload too.

I've got some php code that loops using curl and a cookie jar.  Only a single http request at a time.

[Edit: Matt Jordan]
The apparent solution was provided by Rosen:

Hi guys,

I was digging into this problem for a couple of days.
I did manage to get rid of that error message:
[Oct 2 11:24:16] WARNING[90943] manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

The problem was solved for us, by removing
close(s->fd);
from process_output.
manager.c Revision: 331315
on line 5510

"The fdopen() function associates a stream with the existing file descriptor, fildes. The mode of the stream must be compatible with the mode of the file descriptor. When the stream is closed via fclose(3), fildes is closed also."
Comments:By: Jaco Kroon (jkroon) 2011-09-07 11:30:07.059-0500

Running under valgrind seemingly hides the problem.  I've left the loop running for around a minute under valgrind now, not a single failure, running standalone or under gdb fails within seconds..

The gdb backtrace:

#0  generic_http_callback (ser=0x7fffec563518, method=AST_HTTP_GET, format=FORMAT_RAW, remote_address=<value optimized out>, get_params=0xe70da0, headers=0xd784e0, uri=<value optimized out>) at manager.c:5590
#1  0x00000000004e70c3 in rawman_http_callback (ser=0x7fffec563518, urih=<value optimized out>, uri=<value optimized out>, method=AST_HTTP_GET, get_params=0xe70da0, headers=0xd784e0) at manager.c:6046
#2  0x00000000004cd2f4 in handle_uri (data=0x7fffec563518) at http.c:744
#3  httpd_helper_thread (data=0x7fffec563518) at http.c:932
#4  0x000000000053dd0d in handle_tcptls_connection (data=0x7fffec563518) at tcptls.c:233
#5  0x0000000000548b2b in dummy_start (data=<value optimized out>) at utils.c:1004
#6  0x00007ffff6a52d4c in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff751ac6d in clone () from /lib64/libc.so.6

I've made a core dump of the state, so can pull out additional info.

Also, tried to reproduce with DONT_OPTIMIZE, and this too fails.

Feels like a race condition of sorts - I just don't see HOW.

By: Leif Madsen (lmadsen) 2011-09-13 12:22:48.343-0500

With valgrind it'll often cause the issue to "go away" but the output can still be useful to tracking down the problem. Please provide.

By: Jaco Kroon (jkroon) 2011-09-14 01:54:07.974-0500

valgrind doesn't output anything except during startup.  Thus why I didn't provide any valgrind output.  If there is any specific valgrind setup you'd like me to run under, or ever different compile options, or any specific things that I can provide/try, please let me know.

As I said - this makes no sense to me - there is no way except for cross-thread stack corruption as far as I can tel that this can possibly happen.

By: Leif Madsen (lmadsen) 2011-09-20 10:05:11.828-0500

Your backtrace appears to contain memory corruption and we require valgrind output in order to move this issue forward. Please see https://wiki.asterisk.org/wiki/display/AST/Valgrind for more information about how to produce debugging information. Thanks!



By: Jaco Kroon (jkroon) 2011-09-20 12:41:18.752-0500

Simple valgrind output up to where the SIPPeers comparison detects a lowered number of SIP peers (indicating a failed sip reload as part of the reload action).

By: Jaco Kroon (jkroon) 2011-09-20 12:49:22.749-0500

shell script (variation of my php production version) that I use to trigger the issue.  Under valgrind this seems to only ever cause some loss of SIP peers.  Under not valgrind it triggers an error on fdopen as described above.

I do recall some output from valgrind on one run that moaned about something in the cli processing code ... could this by any chance be triggered by me using the "reload" from cli aliases instead of "module reload"?

By: Rosen Iliev (rokoboko) 2011-10-03 16:12:27.955-0500

Hi guys,

I was digging into this problem for a couple of days.
I did manage to get rid of that error message:
[Oct 2 11:24:16] WARNING[90943] manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

The problem was solved for us, by removing
close(s->fd);
from process_output.
manager.c Revision: 331315
on line 5510

"The fdopen() function associates a stream with the existing file descriptor, fildes. The mode of the stream must be compatible with the mode of the file descriptor. When the stream is closed via fclose(3), fildes is closed also."

By: Erik Wallin (erikw@pushtalk.se) 2011-10-04 09:15:43.078-0500

Excellent work!

That looks like a strong candidate. Definitely a bug.

It explains why this is difficult to trigger. It would have to be a race between two threads.

I looked through the rest of the asterisk source for similar situations. It seems that this has been thought of before. In some instances there is a comment on fclose(f) that it also closes fd.

I found two more suspicious fclose/close pairs.

channels/chan_sip.c line 2784 (1.8.7.0):

               if (tcptls_session->f) {
                       fclose(tcptls_session->f);
                       tcptls_session->f = NULL;
               }
               if (tcptls_session->fd != -1) {
                       close(tcptls_session->fd);
                       tcptls_session->fd = -1;
               }

main/tcptls.c line 206 (1.8.7.0):
                                               close(tcptls_session->fd);
                                               fclose(tcptls_session->f);

If I interpret include/asterisk/tcptls.h correctly f is a stream connected to fd. If SSL is enabled, there is also some encryption going on in the stream buffering code. I would therefore guess the stream functions are overloaded with asterisk specific ones.

This would lead to a similar problem if the overloaded fclose has the same semantics as regular stream IO fclose. That is - if it closes the underlying fd when the stream is closed.

The situation in main/tcptls.c is actually even worse, since the fd is closed before the user space buffer is flushed.

It might be better to use fileno() than having separate member for the file descriptor and the stream pointer. The fd is alread a member of the stream struct.  Why duplicate it? Performance? fileno() locks the stream struct when it's accessed. But then there's fileno_unlocked(). Still that's a function call, so it is a bit of overhead.

By: Rosen Iliev (rokoboko) 2011-10-09 21:34:12.768-0500

Hi guys,

Just to confirm, after removing close(s->fd); from line 5510 (manager.c Revision: 331315)
It has been 6 days now, and I have no more Bad file descriptor! messages, nor res_timing_dahdi or chan_sip messages. Before on busy hours my error log use to grow like crazy, filled up with res_timming_dahdi errors and chan_sip errors.

Regards,

Rosen

By: Jaco Kroon (jkroon) 2011-10-10 05:09:24.010-0500

Wouldn't that cause a potential file description leak in the case where s->f uses a different underlying fd than s->fd or is this guaranteed?

Also, the fclose here would still close fd, and the close call would simply return EBADFD not true?

Or are we suspect of a race?  Ie, the fclose closes the fd and then the this timing closes the fd opened by mkstemp on line 5586 before the fdopen on line 5592 gets to execute?  This would make a lot of sense if this is the case.

I've updated my merged install and am busy running the test case that breaks it for me, will revert later if I can still reproduce.  At least at the time of writing I've gone through 500 iterations so far without any of the problems I've noticed previously.  This is really good news.

These problems were:

* Lost SIP extensions during reload cycle.
* timer-fd issues during call pressure whilst reloading (unconfirmed, this was dodgy).
* SIP reload lockups

I did get one case in my loop that gave me a blank set of SIP peers after iteration 1383, after this running sip show peers on the server restored everything, so one of two things I can think of:

1.  Race between replacing the lists where sip show peers gets an empty list (this can theoretically be a call request too).
2.  Reload takes too long and the management interface times out after waiting for longer than 500ms and then reports failure resulting (I've seen similar things with asterisk -rx "" style commands when I still used that).

Those are the only two scenarios I can think of, and in this case can work around it.  Note that in this case it indicates that the connection from rawman to myself got closed without any payload, I haven't checked for http headers being received back ...

For now I've set my test loop to ignore blanc responses and look specifically for 500 errors, or incorrectly number of SIP peers (ignoring the zero case).

By: Jaco Kroon (jkroon) 2011-10-10 05:23:02.577-0500

Ok, it looks like the SIP peers is in fact a race.  Adjusted my loop to do the following:

1.  retrieve list of SIP peers.
2.  if we got any, check if the count is the same as the previous iteration, if not, report it.
3.  issue a reload.

With 50 peers I trigger a warning of changed number of sip peers in under 50 iterations for a few times now, also got a double-free bug reported by glibc (incomplete stack trace but confirmed chan_sip):


blackpearl*CLI> *** glibc detected *** /usr/sbin/asterisk: double free or corruption (!prev): 0x0000000002107620 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x72985)[0x7fe2ac4ab985]
/lib64/libc.so.6(cfree+0x6c)[0x7fe2ac4b08bc]
/usr/lib64/asterisk/modules/chan_sip.so(+0x1abdd)[0x7fe28a95bbdd]
/usr/lib64/asterisk/modules/chan_sip.so(+0x1ae3c)[0x7fe28a95be3c]
/usr/sbin/asterisk[0x4e66c5]
/usr/sbin/asterisk[0x4e82a7]
/usr/sbin/asterisk[0x4e8c33]
/usr/sbin/asterisk[0x4cf684]
/usr/sbin/asterisk[0x53f5ed]
/usr/sbin/asterisk[0x54a42b]
/lib64/libpthread.so.0(+0x6d4c)[0x7fe2aba47d4c]
/lib64/libc.so.6(clone+0x6d)[0x7fe2ac50fc6d]

The frequency is similar to:

Wrong number of SIP peers (got 49, expected 50), 18 times in 112 iterations.

However, I suspect the underlying bug for the fdopen (and probably quite a few other seemingly random bugs" has been sorted out be the proposed fix.

I'll see if I can get valgrind output on the double-free issue and log a different ticket.

I would recommend that someone find all calls to close and fclose and audit that they actually don't fail (at least output a log message that we can use to track similar issues).  In the usual case such failures won't cause problems, the cases where they silently succeed on the wrong fd is the ones causing the problems, however, in the case where they do no harm we'll get a log message which will indicate that we have a problem.  This can also explain quite a number of other issues we've seen, potentially including the timerfd lockups!

I would recommend the patch be merged soonest possible, at least into trunk.

By: Jonathan Rose (jrose) 2011-11-30 16:06:30.947-0600

A number of file descriptor changes have been made around chan_sip and tcptls that might possibly maybe perhaps have had an effect here as well. If it isn't too much trouble, could you check to see if this bug is still occurring in 1.8/10/trunk releases from svn?

By: Jaco Kroon (jkroon) 2011-12-01 02:05:49.750-0600

Hi,

The patch from Rosen already sorted out the reported problem.  Am not able to reproduce the other issues reliably, permitting the provided patch has been merged this issue can be closed IMHO.

Jaoc