[Home]

Summary:ASTERISK-19285: [regression] Deadlock in asterisk 1.8.9.0 (possible chan_agent and queues interaction)
Reporter:Alex Villacís Lasso (a_villacis)Labels:regression
Date Opened:2012-01-31 11:53:42.000-0600Date Closed:2012-02-03 15:29:01.000-0600
Priority:BlockerRegression?Yes
Status:Closed/CompleteComponents:Applications/app_queue Channels/chan_agent
Versions:1.8.9.0 Frequency of
Occurrence
Frequent
Related
Issues:
must be completed before resolvingASTERISK-19128 Asterisk 1.8.10 Blockers
must be completed before resolvingASTERISK-19129 Asterisk 10.2.0 Blockers
is caused byASTERISK-18092 asterisk segfault libpthread-2.9.so
Environment:Linux dialer3850.elastix.com 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux 32 GB RAMAttachments:( 0) asterisk-1.8.9.0-bug19285-fix-action-agents-deadlock.patch
( 1) backtrace-threads.txt
( 2) core-show-locks.txt
( 3) deadlock_agents3.diff
( 4) jrose_19285.diff
Description:The system is a x86_64 machine that is being used as a callcenter. The agents log in via the AgentLogin application, and each Agent/XXXX channel is assigned to one or more queues. A custom separate process generates calls into the queues for the agents to answer. The calls all go out through a SIP trunk, and all of the agent extensions are SIP. After an hour or so, asterisk deadlocks. Any attempt to run "agent show" or "agent show online" through the console hangs. Also, AMI events seem to stop. However, the users seem to be still connected, only they do not receive calls anymore (the custom process waits forever for the Originate response). The deadlock is apparently spontaneous - there is no explicit action taken by the administrator that seems to induce the issue.
Comments:By: Alex Villacís Lasso (a_villacis) 2012-01-31 11:55:04.650-0600

Report of "core show locks" on deadlock.

By: Alex Villacís Lasso (a_villacis) 2012-01-31 11:55:52.614-0600

gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk `pidof asterisk` > /tmp/backtrace-threads.txt

By: Alex Villacís Lasso (a_villacis) 2012-01-31 12:20:39.870-0600

Found the loop:

Thread ID: 0x42826940
Holds: 0x2aaadc1354f0 0x2aaab97ac790 0x2aaadc17ef90
Waits: 0x50f1b90

Thread ID: 0x418c6940
Holds: 0x50f1b90
Waits: 0x2aaadc17ef90

All the other threads are innocent bystanders.

By: Alex Villacís Lasso (a_villacis) 2012-01-31 14:40:17.199-0600

I have marked this bug as a regression because the patch that is supposed to fix ASTERISK-18092 is probably the one that introduced this bug.

By: Alex Villacís Lasso (a_villacis) 2012-01-31 15:07:28.864-0600

This is what I believe is happening:

The user is running a script that periodically invokes the AMI action "Agents", which is handled by action_agents() in channels/chan_agent.c:1499. This function traverses the agent list, and for each one first takes a lock on struct agent_pvt *p (chan_agent.c:1516), then attempts to take a lock on p->owner (a channel of type Agent, I think) at chan_agent.c:1534, in order to check whether this is a bridged channel. This second lock is the one that is introduced by the patch that "fixes" ASTERISK-18092.

Meanwhile, in another thread, some frames need to be written to the Agent/xxxx channel, at ast_write() in main/channel.c:4767 . In channel.c:4774, a lock is taken on the channel (which happens to be the one at p->owner), and then the tech-specific write method is invoked at channel.c:5032. For Agent channels, this method is agent_write() at channels/chan_agent.c:691. This method extracts tech_pvt from the channel (which happens to be the one picked up in the other thread at line 1516), then attempts to take a lock on it. Therefore, a deadlock.

I was about to perform what amounts to a revert of the fix for ASTERISK-18092, but then I looked in the CHANGELOG and realized this other issue. Also, I found an inconsistency in the handling of the Agents action as compared to the commands "agent show" and "agent show online". If the thread holds a lock on agent_pvt, and should then take a lock on agent_pvt->owner, then agents_show() at line 1702 and agents_show_online() at line 1771 must have the same lock taken in order to be consistent with the "fix" for action_agents(). On the other hand, I originally decided to revert the change in action_agents() in order to make the function consistent with agents_show() and agents_show_online() which do not take the lock.

By: Alex Villacís Lasso (a_villacis) 2012-01-31 16:01:11.304-0600

Patch to add deadlock avoidance on action agents, agent show, agent show online as done elsewhere in chan_agent.c.

By: Jonathan Rose (jrose) 2012-02-01 09:02:48.539-0600

https://wiki.asterisk.org/wiki/display/AST/Locking+in+Asterisk

"As a side note, if you ever find yourself in the position where you are designing a program and think using deadlock avoidance is a quick solution for a concurrency problem you run into, it is not. Running into a deadlock almost certainly means your design is flawed. I'd go as far as to say if you ever find yourself designing a system with multiple locks held at the same time, your design is flawed. Seriously, anyone who reads this remember this, and say it to yourself every time you create a new mutex."
– David Vossel

The real answer here is probably to unify the locking order.

By: Jonathan Rose (jrose) 2012-02-01 14:22:26.423-0600

https://reviewboard.asterisk.org/r/1708/

I posted a different patch to reviewboard.  I'm also attaching it to the issue, if you could report whether or not this works, that would be very helpful.

By: Jonathan Rose (jrose) 2012-02-02 11:56:52.543-0600

Updated to current version of patch on reviewboard.  Please test this instead.