ASTERISK-18453

manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

    Details

    • Type: Bug
    • Status: Closed
    • Severity: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.5.0
    • Target Release Version/s: 1.8.12.0, 10.4.0
    • Component/s: Core/ManagerInterface
    • Security Level: None
    • Labels: None
    • Frequency of Occurrence: Frequent

      Description

      Hi,

      I get the following warning in my asterisk/messages log file:

      manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

      Using GDB I managed to trace this to manager.c, line 5590, as a result of a failure on line 5588. The appropriate code snippet:

      s.session = session;
      s.fd = mkstemp(template);    /* create a temporary file for command output */
      unlink(template);
      if (s.fd <= -1) {
          ast_http_error(ser, 500, "Server Error", "Internal Server Error (mkstemp failed)\n");
          goto generic_callback_out;
      }

      s.f = fdopen(s.fd, "w+");
      if (!s.f) {
          ast_log(LOG_WARNING, "HTTP Manager, fdopen failed: %s!\n", strerror(errno));
          ast_http_error(ser, 500, "Server Error", "Internal Server Error (fdopen failed)\n");
          close(s.fd);
          goto generic_callback_out;
      }

      Honestly, I cannot see how that should fail, and I've had others look at it as well. It makes no sense.

      When compiling with DONT_OPTIMIZE this seems to go away, so instead I simply compiled with -ggdb. I can reproduce this by logging in via the HTTP interface and then cycling between two commands:

      Action: Command
      Command: reload

      Action: SIPPeers

      99% of the time it dies on the SIPPeers one, but sometimes on reload too.

      I've got some PHP code that loops using curl and a cookie jar, with only a single HTTP request in flight at a time.

      [Edit: Matt Jordan]
      The apparent solution was provided by Rosen:

      Hi guys,

      I was digging into this problem for a couple of days.
      I did manage to get rid of that error message:
      [Oct 2 11:24:16] WARNING[90943] manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

      The problem was solved for us by removing

      close(s->fd);

      from process_output() (manager.c revision 331315, line 5510).

      "The fdopen() function associates a stream with the existing file descriptor, fildes. The mode of the stream must be compatible with the mode of the file descriptor. When the stream is closed via fclose(3), fildes is closed also."

        Attachments

      1. sip_loop.sh (0.8 kB, Jaco Kroon)
      2. valgrind.txt (16 kB, Jaco Kroon)

          Activity

          Rosen Iliev added a comment -

          Hi guys,

           Just to confirm: after removing close(s->fd); from line 5510 (manager.c revision 331315), it has been 6 days now and I have no more "Bad file descriptor!" messages, nor res_timing_dahdi or chan_sip messages. Before, during busy hours, my error log used to grow like crazy, filled up with res_timing_dahdi and chan_sip errors.

          Regards,

          Rosen

          Jaco Kroon added a comment -

           Wouldn't that cause a potential file descriptor leak in the case where s->f uses a different underlying fd than s->fd, or is this guaranteed?

           Also, the fclose() here would still close the fd, and the close() call would then simply fail with EBADF, would it not?

           Or do we suspect a race? I.e., the fclose() closes the fd, and with unlucky timing the redundant close() then closes the fd that another request's mkstemp() on line 5586 has just opened, before its fdopen() on line 5592 gets to execute? This would make a lot of sense if that is what is happening.

           I've updated my merged install and am busy running the test case that breaks it for me; I will report back later if I can still reproduce. At the time of writing I've gone through 500 iterations so far without any of the problems I noticed previously. This is really good news.

          These problems were:

          • Lost SIP extensions during reload cycle.
          • timer-fd issues during call pressure whilst reloading (unconfirmed, this was dodgy).
          • SIP reload lockups

           I did get one case in my loop that gave me a blank set of SIP peers, after iteration 1383; running sip show peers on the server afterwards restored everything. There are two things I can think of:

           1. A race between replacing the lists, where sip show peers gets an empty list (this can theoretically happen to a call request too).
           2. The reload takes too long, and the management interface times out after waiting for longer than 500 ms and then reports failure (I've seen similar things with asterisk -rx "" style commands when I still used those).

           Those are the only two scenarios I can think of, and in this case I can work around it. Note that this case indicates that the connection from rawman to my client got closed without any payload; I haven't checked whether HTTP headers were received back ...

           For now I've set my test loop to ignore blank responses and look specifically for 500 errors or an incorrect number of SIP peers (ignoring the zero case).

          Jaco Kroon added a comment -

           Ok, it looks like the SIP peers problem is in fact a race. I adjusted my loop to do the following:

           1. Retrieve the list of SIP peers.
           2. If we got any, check whether the count is the same as in the previous iteration; if not, report it.
           3. Issue a reload (a rough sketch of such a loop follows below).

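           [Editor's note] For reference, a rough standalone rendition of that loop in C with libcurl; the original test used PHP with curl and a cookie jar, and the host, port, credentials, and PeerEntry-counting heuristic below are illustrative assumptions, not taken from the attached sip_loop.sh:

           #include <stdio.h>
           #include <stdlib.h>
           #include <string.h>
           #include <curl/curl.h>

           struct buf { char *data; size_t len; };

           /* libcurl write callback: accumulate the response body in b. */
           static size_t collect(char *ptr, size_t size, size_t nmemb, void *userdata)
           {
               struct buf *b = userdata;
               size_t n = size * nmemb;
               char *grown = realloc(b->data, b->len + n + 1);
               if (!grown)
                   return 0;                /* abort the transfer on OOM */
               b->data = grown;
               memcpy(b->data + b->len, ptr, n);
               b->len += n;
               b->data[b->len] = '\0';
               return n;
           }

           /* Fetch url into b and return the HTTP status code (0 on error). */
           static long fetch(CURL *c, const char *url, struct buf *b)
           {
               long code = 0;
               b->len = 0;
               if (b->data)
                   b->data[0] = '\0';
               curl_easy_setopt(c, CURLOPT_URL, url);
               curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, collect);
               curl_easy_setopt(c, CURLOPT_WRITEDATA, b);
               if (curl_easy_perform(c) == CURLE_OK)
                   curl_easy_getinfo(c, CURLINFO_RESPONSE_CODE, &code);
               return code;
           }

           static int count_peers(const char *body)
           {
               int n = 0;                   /* one "Event: PeerEntry" per peer */
               while (body && (body = strstr(body, "Event: PeerEntry"))) {
                   n++;
                   body += 16;
               }
               return n;
           }

           int main(void)
           {
               struct buf b = { NULL, 0 };
               int prev = -1;

               curl_global_init(CURL_GLOBAL_DEFAULT);
               CURL *c = curl_easy_init();
               curl_easy_setopt(c, CURLOPT_COOKIEFILE, "");   /* in-memory cookie jar */

               /* Log in once; libcurl keeps the AJAM session cookie. */
               fetch(c, "http://localhost:8088/rawman?action=Login"
                        "&username=admin&secret=mysecret", &b);

               for (int i = 1; i <= 2000; i++) {
                   /* 1. retrieve the list of SIP peers */
                   long code = fetch(c, "http://localhost:8088/rawman?action=SIPPeers", &b);
                   int n = count_peers(b.data);
                   /* 2. compare the count against the previous iteration */
                   if (code == 500)
                       printf("iteration %d: HTTP 500\n", i);
                   else if (n > 0 && prev > 0 && n != prev)
                       printf("iteration %d: got %d peers, expected %d\n", i, n, prev);
                   if (n > 0)
                       prev = n;
                   /* 3. issue a reload */
                   fetch(c, "http://localhost:8088/rawman?action=Command&command=reload", &b);
               }
               curl_easy_cleanup(c);
               curl_global_cleanup();
               free(b.data);
               return 0;
           }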
           With 50 peers I have now triggered the changed-number-of-SIP-peers warning within 50 iterations several times; I also got a double-free bug reported by glibc (incomplete stack trace, but confirmed to be chan_sip):

          blackpearl*CLI> *** glibc detected *** /usr/sbin/asterisk: double free or corruption (!prev): 0x0000000002107620 ***
          ======= Backtrace: =========
          /lib64/libc.so.6(+0x72985)[0x7fe2ac4ab985]
          /lib64/libc.so.6(cfree+0x6c)[0x7fe2ac4b08bc]
          /usr/lib64/asterisk/modules/chan_sip.so(+0x1abdd)[0x7fe28a95bbdd]
          /usr/lib64/asterisk/modules/chan_sip.so(+0x1ae3c)[0x7fe28a95be3c]
          /usr/sbin/asterisk[0x4e66c5]
          /usr/sbin/asterisk[0x4e82a7]
          /usr/sbin/asterisk[0x4e8c33]
          /usr/sbin/asterisk[0x4cf684]
          /usr/sbin/asterisk[0x53f5ed]
          /usr/sbin/asterisk[0x54a42b]
          /lib64/libpthread.so.0(+0x6d4c)[0x7fe2aba47d4c]
          /lib64/libc.so.6(clone+0x6d)[0x7fe2ac50fc6d]

          The frequency is similar to:

          Wrong number of SIP peers (got 49, expected 50), 18 times in 112 iterations.

           However, I suspect the underlying bug behind the fdopen failure (and probably quite a few other seemingly random bugs) has been sorted out by the proposed fix.

          I'll see if I can get valgrind output on the double-free issue and log a different ticket.

           I would recommend that someone find all calls to close() and fclose() and audit that they actually don't fail, or at least output a log message that we can use to track similar issues. In the usual case such failures won't cause problems; it is the cases where the calls silently succeed on the wrong fd that cause the trouble, and in the harmless cases we would at least get a log message indicating that we have a problem. This could also explain quite a number of other issues we've seen, potentially including the timerfd lockups!
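           [Editor's note] For illustration, a minimal wrapper of the kind suggested. checked_close() is hypothetical, not an Asterisk API; it is shown standalone with fprintf(), where Asterisk itself would log through ast_log(LOG_WARNING, ...):

           #include <errno.h>
           #include <stdio.h>
           #include <string.h>
           #include <unistd.h>

           /* Like close(), but noisy on failure, so that double closes show up
              in the logs instead of silently hitting the wrong descriptor. */
           static int checked_close(int fd, const char *where)
           {
               if (close(fd) < 0) {
                   /* EBADF here usually points at a double close elsewhere. */
                   fprintf(stderr, "%s: close(%d) failed: %s\n",
                           where, fd, strerror(errno));
                   return -1;
               }
               return 0;
           }

           Calling checked_close(s.fd, "process_output") instead of close(s.fd) would have flagged this bug the first time it fired.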

           I would recommend the patch be merged as soon as possible, at least into trunk.

          Jonathan Rose added a comment -

           A number of file descriptor changes have been made around chan_sip and tcptls that may well have had an effect here too. If it isn't too much trouble, could you check whether this bug still occurs in the 1.8/10/trunk releases from SVN?

          Jaco Kroon added a comment -

          Hi,

           The patch from Rosen already sorted out the reported problem. I am not able to reproduce the other issues reliably; provided the patch has been merged, this issue can be closed, IMHO.

           Jaco


            People

            • Watchers: 6
