[Home]

Summary:ASTERISK-20032: Call to timerfd_gettime() error: Bad file descriptor
Reporter:Jeremiah Gowdy (jgowdy)Labels:
Date Opened:2012-06-21 15:21:24Date Closed:2015-03-14 09:58:51
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Resources/res_timing_timerfd
Versions:10.5.1 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:openSUSE 11.4 x86_64 on Dell M610 bladeAttachments:( 0) ASTERISK-20032-bt.txt
( 1) jgowdy-timerfd-6-22-2012.diff
Description:I have seen this issue previously on previous 10.x releases.  After running call load against our ExternalIVR based application for a random amount of time, the log will flood this:

{noformat}
[Jun 21 12:39:18] VERBOSE[30608] pbx.c:     -- Executing [s@originate:1] Answer("Local/s@originate-dc89;2", "") in new stack
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] VERBOSE[30609] pbx.c:     -- Executing [9007862194338@dial:1] Goto("Local/s@originate-dc89;1", "dial-sansay,5400117862194338,1") in new stack
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] VERBOSE[30609] pbx.c:     -- Goto (dial-sansay,5400117862194338,1)
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] VERBOSE[30609] pbx.c:     -- Executing [5400117862194338@dial-sansay:1] Dial("Local/s@originate-dc89;1", "SIP/5400117862194338@sansay,,G(post^s^1)") in new stack
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
[Jun 21 12:39:18] ERROR[30608] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor
{noformat}

The log only seems to contain a single thread that is stuck in this state.  Other calls on other channels seem to continue to function.  I am working to provide better context surrounding when this error occurs.

---

I've changed the timer interface to make the timer ack function(s) in all of the timer implementations return 0 on success / -1 on failure, and I've changed all the places that call ast_timer_ack check that return code and return error themselves if it fails.  This stops the streaming of the timer errors and allowed me to determine which call is getting the error:

{noformat}
[Jun 22 10:46:09] VERBOSE[20352] pbx.c:     -- Executing [s@originate:1] Answer("Local/s@originate-3722;2", "") in new stack
[Jun 22 10:46:09] ERROR[20352] res_timing_timerfd.c: Call to timerfd_gettime() using handle 257 error: Bad file descriptor
[Jun 22 10:46:09] ERROR[20352] channel.c: Timer failed in ast_read
[Jun 22 10:46:09] VERBOSE[20352] pbx.c:   == Spawn extension (originate, s, 1) exited non-zero on 'Local/s@originate-3722;2'
{noformat}

It seems that the issue happens when channel.c calls ast_timer_ack from ast_read
Comments:By: Matt Jordan (mjordan) 2012-06-21 15:26:35.903-0500

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Does it only happen when you're interacting with app_externalivr?  Can you provide your extensions.conf as well?

By: Jeremiah Gowdy (jgowdy) 2012-06-22 15:45:27.386-0500

While I haven't isolated how the bad timerfd is getting used, this patch changes the timer interface to return a result code, which allows callers to exit their operation rather than looping on the bad file handle indefinitely.  The patch includes changing all callers to check the result and log an error if it happens while also attempting to properly exit the method.  The patch also includes efforts to set the fd to -1 and the timer pointer to NULL any time they are closed/freed.  The result of this patch is, while I have seen the error reproduced once, it happened once and that operation/call was ended without potentially taking down the entire Asterisk service with streaming errors due to looping on a bad file pointer and pounding the error log.

By: Jeremiah Gowdy (jgowdy) 2012-06-25 13:46:29.075-0500

Can we look at getting this patch applied, reducing the severity of the issue drastically?

By: Matt Jordan (mjordan) 2012-06-25 13:51:42.848-0500

Its in the queue - please be patient :-)

By: Jeremiah Gowdy (jgowdy) 2012-06-25 13:53:14.748-0500

I was just making sure it wasn't stalled waiting for me to click send back or something.  Thanks.  :)

By: Dan Cropp (daninmadison) 2012-09-18 08:08:36.897-0500

We have seen this issue as well.  It appears there hasn't been a debug sent in yet.

I will retrieve the trunk and try to replicate this.  From what I remember, we see it around 20K-25K outbound calls (local) to ourselves (20K-25K inbound local calls).  If I can replicate it, I will follow the wiki information for collecting a complete debug.

By: Matt Jordan (mjordan) 2012-11-05 17:12:21.055-0600

I've committed your timer cleanup patch to the various release branches and trunk.  I didn't, however, close this issue as I don't think the cleanup patch addresses the root cause of your problem - namely, whatever happened to the file descriptor on the Local channel in the first place.

If we get a dialplan and a scenario that lets us reproduce the initial problem, let me know and we'll take a look at figuring out what's going on.

By: Matt Hamilton (mistral9999) 2012-11-29 09:27:52.564-0600

This issue seems to be similar to the following error we get from res_timing_dahdi.c. It happens in the ast_read loop which eventually crashes Asterisk.

[Mar  5 12:11:04] ERROR[24524] res_timing_dahdi.c: Failed to configure DAHDI timing fd for 0 sample timer ticks

We have verified that this is not due to running out of file descriptors, and I think in this case, too, something happens to the file descriptor as suggested.

By: Matt Hamilton (mistral9999) 2012-11-30 10:03:31.548-0600

Addendum to my previous comment:

This error originates from ast_timer_set_rate in ast_read (AST_TIMING_EVENT_EXPIRED) in  channel.c.

Unlike ast_timer_ack, ast_timer_set_rate returns an error (-1), but this is not checked. However, the recent changes to ast_timer_ack in the patch should prevent the code reaching there.

So far we haven't been able to pinpoint why/when the timer fd goes bad (assuming that's the case). We get this error (and crash) every couple of days - it's random, might happen under very light load. Having said that, it started happening less after some changes to our dialplan which resulted in less musiconhold activity. Of course, might be just a coincidence...






By: Jamuel Starkey (jamuel) 2012-12-21 09:51:39.372-0600

Seeing segfaults in SendFAX when running against asterisk 1.8.20.0-rc1.  Will upload BT.

{code}
Program terminated with signal 11, Segmentation fault.
#0  0x0819340e in ast_timer_set_rate (handle=0x0, rate=50) at timing.c:167
167             res = handle->holder->iface->timer_set_rate(handle->fd, rate);
{code}


By: Jamuel Starkey (jamuel) 2012-12-21 09:52:45.585-0600

BT for segfault in ast_timer_set_rate.

By: Matt Jordan (mjordan) 2012-12-21 11:05:05.750-0600

Jamuel:

So there's something very odd here. The backtrace indicates that the timer handle is NULL, which only occurs when a fax session has been disposed of. Otherwise, the timer handle should be created on successful completion of {{spandsp_fax_new}}, and should never be changed from there.

The odd thing is that where the crash happens - {{spandsp_fax_start}} - can't be called unless {{spandsp_fax_new}} completed succesfully. {{res_fax}} bails out if it can't create the fax session, so I'm not sure how we've gotten into a state where {{spandsp_fax_new}} has a timer handle that's NULL. A few things to check:

# Are there any modifications to the system's {{res_fax}} or {{res_fax_spandsp}}?
# If not, are there any errors in the log leading up to the error?

If both of those are "No", then I'll probably make a debug patch to try and figure out how we got into a started fax session with no timer.

By: Jamuel Starkey (jamuel) 2012-12-21 11:35:46.212-0600

Matt: As indicated on IRC the answer is "No" to both #1 and #2.  I'm happy to test your patch as soon as you have something.

By: Paul Belanger (pabelanger) 2013-03-01 22:37:29.396-0600

I'm able to join the party now too.  Having this issue also, but have not applied said patch from above.  Will be doing so shortly and report back any additional errors.

By: Matt Jordan (mjordan) 2015-03-14 09:58:45.798-0500

Since this patch was committed in 1.8+ and no one has been able to reproduce the fax issue (which was ancillary to the FD timer issue reported here), I'm going to close this out as Fixed.

If someone reproduces the fax issue in a supported branch of Asterisk, please open a new issue. Thanks!