
Summary: ASTERISK-18101: Asterisk 1.8 Deadlock in app_queue
Reporter: Paul Rolfe (prolfe)
Labels:
Date Opened: 2011-07-08 01:26:33
Date Closed: 2011-09-15 10:46:41
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Applications/app_queue
Versions: 1.8.4, 1.8.5.0, 1.8.6.0
Frequency of Occurrence: Constant
Related Issues:
must be completed before resolving ASTERISK-18499 Asterisk 1.8.8.0 Release Blockers
is duplicated by ASTERISK-18487 Daily deadlock issue
Environment: Debian Squeeze - Linux pabx 2.6.35.11 #1 SMP Tue Feb 8 14:35:57 WST 2011 x86_64 GNU/Linux
Attachments:
( 0) 1139_backtrace_threads.txt
( 1) 1139_core_show_locks.txt
( 2) 12740_backtrace_threads.txt
( 3) 12740_core_show_locks.txt
( 4) 13994_backtrace_threads.txt
( 5) 13994_core_show_locks.txt
( 6) 14276_backtrace_threads.txt
( 7) 14276_core_show_locks.txt
( 8) 25292_backtrace_threads.txt
( 9) 25292_core_show_locks.txt
(10) 28050_backtrace_threads.txt
(11) 28050_core_show_locks.txt
(12) 32179_backtrace_threads.txt
(13) 32179_core_show_locks.txt
(14) 4619_backtrace_threads.txt
(15) 4619_core_show_locks.txt
(16) 4670_backtrace_threads.txt
(17) 4670_core_show_locks.txt
(18) 7646_backtrace_threads.txt
(19) 7646_core_show_locks.txt
(20) app_queue_0830.patch
(21) bugASTERISK-18101.patch
(22) bugASTERISK-18101-1.8.patch
Description: Asterisk 1.8 SVN is deadlocking at random intervals.

We have switched from timerfd timing to DAHDI; the issue still occurs.

The machine is a production PABX.  We run approx 30 queues with about 60 agents logged in, and a large number of AMI manager connections from 2 webservers.  Calls are frequently transferred between queues and agents, as well as having several channel spy sessions initiated from the webservers.

Prior to this we were running 1.4.40 and had no deadlocking issues.

We have been through versions 1.8.3, 1.8.4, 1.8.4.2, 1.8.4.3, 1.8.4.4 and all have exhibited similar deadlocking issues.

This seems to occur on average about twice a day under normal load.  On weekends and evenings, when load is significantly lower, the problem does not appear to occur.

Comments:

By: Paul Rolfe (prolfe) 2011-07-12 22:29:26.954-0500

Had a further lockup today.

core show locks and backtrace attached.


By: Paul Rolfe (prolfe) 2011-07-13 02:15:33.407-0500

And another just now.  

Debug info attached.



By: Paul Rolfe (prolfe) 2011-07-13 22:21:46.490-0500

The system got very busy today, and consequently we had 4 deadlocks in a 30 minute period.

I have attached the debug files for all.

We're going to have to revert to 1.4 until this issue is resolved.



By: Paul Rolfe (prolfe) 2011-07-28 02:01:58.009-0500

Further lockup today.

Recent SVN updates seem to have reduced the frequency.


By: Paul Rolfe (prolfe) 2011-08-03 22:32:22.845-0500

The frequency of this issue seems to have reduced overall; however, it still occurs during periods of high load.

Is this actively being looked at?


By: Gregory Hinton Nietsky (irroot) 2011-08-05 03:25:09.100-0500

The issue has mostly gone away since last week's small patch and a discussion on IRC, but we are still seeing channels occasionally being orphaned.

By: Paul Rolfe (prolfe) 2011-08-05 03:48:06.584-0500

Great,

Thanks.

Had another one yesterday, but unfortunately I wasn't around to capture debug info, and Asterisk was just restarted.



By: Paul Rolfe (prolfe) 2011-08-17 21:57:27.724-0500

The bulk of this issue seems to have been resolved.

We are currently running revision 330704 from SVN, and have a current uptime in production of 1 week, 6 days, 19 hours, 49 minutes, 13 seconds.

This is more than 3 times our previous best uptime in production.


By: Paul Rolfe (prolfe) 2011-08-25 00:35:22.333-0500

Update:

We've had two lockups in the past two weeks.  These have been the longest uptimes between deadlocks since rolling out 1.8.

I've added the 'core show locks' and backtraces for these two lockups.

Today's lockup was on SVN revision 332432.

Hope it's all helping.


By: Paul Rolfe (prolfe) 2011-08-25 00:37:19.559-0500

4670 - Lockup on 25 Aug.
4619 - Lockup on 19 Aug.

By: Richard Mudgett (rmudgett) 2011-08-25 10:22:01.291-0500

The last two 'core show locks' files indicate a lock-ordering problem between the queues container lock, the channels container lock, and a channel lock.
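
For anyone less familiar with lock-ordering problems, here is a minimal, self-contained illustration of the pattern that 'core show locks' is pointing at. This is plain pthreads with hypothetical names, not Asterisk code: two threads taking the same two locks in opposite order will eventually block on each other forever.

{noformat}
/* Minimal lock-order deadlock illustration (plain pthreads, hypothetical
 * names; not Asterisk code). Thread A takes "queues" then "channel";
 * thread B takes them in the opposite order, so each can end up waiting
 * on a lock the other already holds. */
#include <pthread.h>

static pthread_mutex_t queues_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for the queues container lock */
static pthread_mutex_t chan_lock   = PTHREAD_MUTEX_INITIALIZER; /* stands in for a channel lock */

static void *thread_a(void *unused)
{
	pthread_mutex_lock(&queues_lock);   /* order: queues -> channel */
	pthread_mutex_lock(&chan_lock);
	/* ... work ... */
	pthread_mutex_unlock(&chan_lock);
	pthread_mutex_unlock(&queues_lock);
	return NULL;
}

static void *thread_b(void *unused)
{
	pthread_mutex_lock(&chan_lock);     /* order: channel -> queues (reversed) */
	pthread_mutex_lock(&queues_lock);
	/* ... work ... */
	pthread_mutex_unlock(&queues_lock);
	pthread_mutex_unlock(&chan_lock);
	return NULL;
}
{noformat}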

By: Gregory Hinton Nietsky (irroot) 2011-08-30 05:01:37.657-0500

Based on the latest deadlocks, here is a patch (a rough sketch of the idea follows the list below):

1) Lock queues in update_queue before iterating.
2) The queues lock is held while calling a channel in some cases and probably should not be.
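
Roughly, point (1) amounts to something like the following. This is a simplified sketch with illustrative names, not the actual patch:

{noformat}
/* Simplified sketch of point (1): hold the queues container lock while
 * iterating so queues cannot be unlinked underneath us, and lock each
 * queue only while touching it. Per point (2), no channel lock is taken
 * anywhere inside this region. Names are illustrative, not the patch. */
struct ao2_iterator it;
struct call_queue *q;

ao2_lock(queues);
it = ao2_iterator_init(queues, 0);
while ((q = ao2_iterator_next(&it))) {
	ao2_lock(q);
	/* ... update per-queue state ... */
	ao2_unlock(q);
	ao2_ref(q, -1);
}
ao2_iterator_destroy(&it);
ao2_unlock(queues);
{noformat}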

By: Gregory Hinton Nietsky (irroot) 2011-08-30 08:48:48.704-0500

Fixed a stupid mistake in the previous patch.

By: Gregory Hinton Nietsky (irroot) 2011-09-05 02:23:10.595-0500

OK, there is a new patch on the review board that cleans up the locking; please try it out and give feedback.

By: Paul Rolfe (prolfe) 2011-09-05 19:49:51.977-0500

I have compiled and installed this patch into our production system.

Determining success may be difficult as the mean time between deadlocks has increased over the last month, and I don't have a method to reproduce the problem at will.

Thanks.


By: Gregory Hinton Nietsky (irroot) 2011-09-06 00:51:34.099-0500

Thanks Paul. Indeed, 1.8 is far more stable, and the key is the "do no harm" principle: as long as this patch does no harm (i.e. does not cause additional, more frequent deadlocks), hopefully the mean time between deadlocks will increase further and any remaining deadlocks will not be related to this issue.

Please post any further information here so we can attend to it.

By: Kristijan Vrban (vrban) 2011-09-09 04:11:01.148-0500

Perhaps I had the same issue today. All calls into the queues just stopped, but Asterisk itself was still running. All other calls that were not going into a queue were fine.

The last output when a call went into a queue:

{noformat}
   -- Executing [*770002-K1386@main-functions:3] ExecIf("SIP/K138617D9D-00001215", "1?Queue(K1386-0002,,,,10)") in new stack
[Sep  9 10:44:06] DEBUG[9823]: app_queue.c:5873 queue_exec: NO QUEUE_PRIO variable found. Using default.
[Sep  9 10:44:06] DEBUG[9823]: app_queue.c:5923 queue_exec: queue: K1386-0002, options: , url: , announce: , expires: 1315557856, priority: 0
{noformat}

And then nothing.

I am trying the patch now.

By: Gregory Hinton Nietsky (irroot) 2011-09-10 14:15:47.187-0500

Commit 331774 resolves one of these problems:

r331774 | mnicholson | 2011-08-12 21:01:27 +0200 (Fri, 12 Aug 2011) | 11 lines

Unlock the channel before calling update_queue.

Holding the channel lock when calling update_queue which attempts to lock the
queue lock can cause a deadlock. This deadlock involves the following chain:

1. hold chan lock -> wait queue lock
2. hold queue lock -> wait agent list lock
3. hold agent list lock -> wait chan list lock
4. hold chan list lock -> wait chan lock
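
In code terms, the fix is simply to release the channel lock before the call that goes for the queue lock. The following is a simplified sketch of that pattern; the argument list and variable names are illustrative, not the literal diff:

{noformat}
/* Simplified sketch of the r331774 pattern: never hold the channel lock
 * while update_queue() takes the queue lock, so link 1 of the chain above
 * is broken. Names and arguments are illustrative. */
ast_channel_lock(qe->chan);
/* ... read whatever channel state is still needed ... */
ast_channel_unlock(qe->chan);            /* drop the chan lock first */

/* Safe now: this thread holds no channel lock while waiting for the
 * queue lock inside update_queue(). */
update_queue(qe->parent, member, callcompletedinsl, time(NULL) - callstart);
{noformat}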


By: Gregory Hinton Nietsky (irroot) 2011-09-10 14:37:14.309-0500

It appears that once the queues lock is held, no channels must be locked. There is currently no reason in app_queue to ever lock the queues container outside of the ao2 API.

I am working on a comprehensive patch to implement this. The patch I previously posted should circumvent the problems reported in this bug, but it is not complete or correct.
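
As an example of what "stay inside the ao2 API" means in practice (illustrative callback and field names; this is not the posted patch):

{noformat}
/* Illustrative only: let ao2_callback() manage the container lock instead
 * of wrapping a manual traversal in ao2_lock(queues)/ao2_unlock(queues). */
static int mark_dead_cb(void *obj, void *arg, int flags)
{
	struct call_queue *q = obj;

	ao2_lock(q);
	q->dead = 1;          /* per-queue work under the queue's own lock only */
	ao2_unlock(q);
	return 0;
}

/* ... somewhere in a reload path ... */
ao2_callback(queues, OBJ_NODATA, mark_dead_cb, NULL);
{noformat}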

By: Gregory Hinton Nietsky (irroot) 2011-09-10 15:09:27.991-0500

New review patch posted; this should be a blocker for 1.8.8.

By: Gregory Hinton Nietsky (irroot) 2011-09-14 09:26:14.246-0500

New and hopefully last patch.

By: Jason Legault (jlegault) 2011-09-14 12:50:18.807-0500

Tried the new patch with a fresh 1.8.7.0-rc1 and got the following when compiling:

app_queue.c: In function ‘reload_queues’:
app_queue.c:6784: error: ‘OBJ_NOLOCK’ undeclared (first use in this function)
app_queue.c:6784: error: (Each undeclared identifier is reported only once
app_queue.c:6784: error: for each function it appears in.)
make[1]: *** [app_queue.o] Error 1
make: *** [apps] Error 2



By: Gregory Hinton Nietsky (irroot) 2011-09-14 13:19:16.123-0500

Oops, my bad; this is a version 10 compatibility issue. I'm working on it.

By: Gregory Hinton Nietsky (irroot) 2011-09-14 13:37:58.350-0500

OK, here is a 1.8 version of the patch; the 10/trunk version takes advantage of OBJ_NOLOCK, which is not available in 1.8.
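
For anyone comparing the two patch versions, the difference boils down to something like the following sketch (illustrative only; 'tmpq' stands for a stack call_queue used purely as a lookup key):

{noformat}
/* Asterisk 10 / trunk: the caller may hold the container lock across
 * several operations and tell ao2_find() not to re-lock it. */
ao2_lock(queues);
q = ao2_find(queues, &tmpq, OBJ_POINTER | OBJ_NOLOCK);
/* ... more work while the container stays locked ... */
ao2_unlock(queues);

/* Asterisk 1.8: OBJ_NOLOCK does not exist, so the caller must not pre-lock
 * the container; ao2_find() takes and releases the lock internally. */
q = ao2_find(queues, &tmpq, OBJ_POINTER);
{noformat}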