[Home]

Summary:ASTERISK-17867: [patch] Random entire asterisk deadlock when use builtin_atxfer when use res_timing_timerfd module
Reporter:Igor Nikolaev (microlana)Labels:
Date Opened:2011-05-16 12:55:47Date Closed:2011-09-22 06:59:04
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_timing_timerfd
Versions:1.8.4 Frequency of
Occurrence
Related
Issues:
is related toASTERISK-17151 Deadlock - Asterisk stops processing sip calls after a few calls.
is related toASTERISK-17407 [patch] Seems Like ast_read / timerfd_timer_ack Causes lockups and resource drain
is related toASTERISK-17512 Asterisk-deadlocks - followup to #18497
Environment:Attachments:( 0) gdb-asterisk-deadlock1.txt
( 1) gdb-asterisk-deadlock2.txt
( 2) gdb-asterisk-deadlock3.txt
( 3) res_timing_timerfd.c.diff
( 4) res_timing_timerfd.full.diff
( 5) res_timing_timerfd.optimized.diff
( 6) res_timinig_timerfd_read_deadlock.patch
Description:Random entire asterisk deadlock when use builtin_atxfer when use res_timing_timerfd module.

In this case impossible to create new channels and/or destruct existing channels.


****** ADDITIONAL INFORMATION ******

Channels containers is locked into (example 2 and 3) __ao2_callback() at
astobj2.c:769, while it waiting to lock some channel, which already locked by __ast_read() at channel.c:3661, while this channel is into infinity read loop on disarmed timerfd timer.
Comments:By: Igor Nikolaev (microlana) 2011-05-16 12:59:15

To avoid this deadlock don't call a read() function into disarmed timer.

Patch for this solution is attached into diff file

By: Igor Nikolaev (microlana) 2011-05-16 13:02:35

<inline patch removed by lmadsen>



By: Leif Madsen (lmadsen) 2011-05-16 14:53:00

You can't provide patches inline like that. I've removed it. Please wait for your license to be approved. Thanks!

By: Lott Caskey (lottc) 2011-05-16 22:41:53

I have a situation matching this description.  I am eager to see if your patch works.  Willing to test as soon as license is submitted.

By: Igor Nikolaev (microlana) 2011-05-17 00:12:41

While pending license you may disable or reorder res_timing_timerfd and use res_timing_dahdi timing source instead (not tested by me).

By: Gregory Hinton Nietsky (irroot) 2011-05-17 03:36:50

ASTERISK-17407 is related

By: Mark Murawski (kobaz) 2011-05-24 01:22:02

Patch is working well for me so far.

I'm getting the disarmed timer warning quite often with one of my unit tests.  I'll break it down into a small bit of dialplan so we can figure out what causes the problem in the first place.

By: Mark Murawski (kobaz) 2011-05-24 01:31:56

By working well, I mean asterisk isn't crashing.  But something is broken in handling a channel hangup and channels stick around longer than they should.

And the stuck channels doesn't happen with dahdi timing.



By: zvision (zvision) 2011-05-26 08:23:59

I still see a small window in the proposed patch, where the timer can be disarmed between a call to the timerfd_gettime and the read functions from another thread, as there is no serialization during timerfd_timer_ack.

By: Michel Verbraak (astmiv) 2011-06-07 02:09:20.894-0500

Update patch which has right debug text and uses locking to prevent changes when reading.

By: zvision (zvision) 2011-06-07 09:59:35.505-0500

I've attached another patch proposals. Both are similar to the existing ones, but use ast_poll before read to avoid a read block. The optimized version does not use container lookups and timer locks in order to make timer_ack call as fast as possible. Diffs are made from the 1.6.2 SVN

By: zvision (zvision) 2011-06-08 08:35:57.967-0500

From my observations, using timerfd_gettime call and checking for no remaining time can give false warnings on a heavily loaded system. Sometimes timerfd_gettime returns 0 in the timerspec for already expired timers, if no read call has been made yet. If we use ao2_find to find the timer, the best way would be to check the saved_timer values for a disarmed timer.

By: Mark Murawski (kobaz) 2011-06-27 10:56:29.086-0500

Out of the patches, which patch is proposed to be the 'official' patch?

By: Gregory Hinton Nietsky (irroot) 2011-06-28 06:38:10.105-0500

been a few hours running the patch from RB with log messages commented out no ill effects so far ...

By: David Vossel (dvossel) 2011-06-29 14:05:16.836-0500

https://reviewboard.asterisk.org/r/1255/ Fixed this.

r325673 in 1.8


By: zvision (zvision) 2011-06-30 04:27:36.635-0500

I think the committed patch has two issues. The first one is not unreferencing a timer object after ao2_find, causing its reference count to increase each time timer_ack is called. The second one is a missing lock for the timer. It can happend that the timer will get disarmed after the check for disarmed condition and before the read call. Can anyone confirm I am right? Should I report it as a new issue?

By: Michel Verbraak (astmiv) 2011-06-30 10:02:13.935-0500

If they use my patch "res_timinig_timerfd_read_deadlock.patch" Then both comments of zvision have been resolved. But this patch was not committed.

By: Mark Murawski (kobaz) 2011-06-30 17:34:41.535-0500

I'll do testing with res_timinig_timerfd_read_deadlock.patch.

Other testers.  Please test with this patch and report your results.


By: KevinH (kevinh001) 2011-06-30 19:54:57.645-0500

I just upgraded to the 1.8.5-rc1 for testing and the dial plan code below causes loop on "Wait", have to kill the Asterisk process to terminate the loop.  1.8.4.3 works fine.

{quote}
[Jun 30 13:08:00] VERBOSE[10836] pbx.c:   == Starting DAHDI/1-1 at from-pstn,,1 failed so falling back to exten 's'
[Jun 30 13:08:00] VERBOSE[10836] pbx.c:     -- Executing [s@from-pstn:1] Answer("DAHDI/1-1", "") in new stack
[Jun 30 13:08:00] VERBOSE[10836] pbx.c:     -- Executing [s@from-pstn:2] Wait("DAHDI/1-1", "0.5") in new stack
[Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd.
[Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd.
[Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd.
[Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd.
[Jun 30 13:08:00] DEBUG[10836] res_timing_timerfd.c: Reading attempt on idle timerfd.
just keeps going....
{quote}

By: David Vossel (dvossel) 2011-07-06 10:31:40.961-0500

It has been discovered that the proposed fix for timerfd caused a serious regression and has been reverted.

By: Leif Madsen (lmadsen) 2011-09-22 06:59:04.157-0500

res_timing_timerfd should now be fixed in Asterisk 1.8.7.0. Please test the current release candidate, and if you continue to have problems with res_timing_timerfd, please open a new issue. Thanks!