[Home]

Summary:ASTERISK-17323: manager_park can deadlock with ast_channel_free, for channel 1 of the park operation (channel list v channel lock)
Reporter:David Woolley (davidw)Labels:
Date Opened:2011-01-31 10:48:59.000-0600Date Closed:2014-03-05 07:03:32.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:
Description:manager_park (in features.c) does ast_get_channel_by_name_locked on both channels that it is given.  This locks the list of channels during its processing, meaning that the manager thread ends up locking channel 1, and then the list of channels.

ast_channel_free write locks the list of channels, then temporarily locks the channel it is freeing, meaning it applies lock in the reverse order!

A real deadlock has been observed with manager_park waiting for channel list and ast_channel_free waiting for the channel.

****** ADDITIONAL INFORMATION ******

This was was noticed on 1.6.1.0, but the source code is unchanged in this area in 1.6.2 branch.  There has been significant restructuring of channel locking in trunk and I haven't analysed that to see if it fixes the problem.

The locking logic in ast_channel_free was introduced in revision 219136, to fix ASTERISK-14305.

Abstract from channel.c (ast_channel_free), with blame information:

219194 mnicholson               AST_RWLIST_WRLOCK(&channels);
219194 mnicholson               if (!AST_RWLIST_REMOVE(&channels, chan, chan_lis
t)) {
219194 mnicholson                       ast_debug(1, "Unable to find channel in list to free. Assuming it has already been done.\n");
219194 mnicholson               }
219194 mnicholson               /* Lock and unlock the channel just to be sure nobody has it locked still
219194 mnicholson                  due to a reference retrieved from the channel list. */
219194 mnicholson               ast_channel_lock(chan);
219194 mnicholson               ast_channel_unlock(chan);

Abstract from features.c (manager_park):
12163   tilghman       ch1 = ast_get_channel_by_name_locked(channel);
12163   tilghman       if (!ch1) {
12163   tilghman               snprintf(buf, sizeof(buf), "Channel does not exist: %s", channel);
12163   tilghman               astman_send_error(s, m, buf);
12163   tilghman               return 0;
12163   tilghman       }
12163   tilghman
12163   tilghman       ch2 = ast_get_channel_by_name_locked(channel2);
Comments:By: David Woolley (davidw) 2011-02-01 12:48:37.000-0600

We made this reproducible by strategically placed calls to sleep.  We then tried an avoidance technique, as we didn't want the channel2 functionality of manager_park (and it was ineffective anyway, which is another story), so we could force it to be passed into the main parking logic as NULL.

Unfortunately, we then hit the same deadlock combination, when masq_park_call called ast_channel_alloc.  I doubt there is a simple avoidance technique for this second deadlock point.

I don't have time to do the detailed analysis and verify that this is still true in 1.6.2, until tomorrow.

By: David Woolley (davidw) 2011-02-02 04:48:28.000-0600

I cannot see any reason why ast_channel_free should retain the lock on the channels list after it has unlinked the channel from it!  Is there one?

By: David Woolley (davidw) 2011-02-02 04:55:30.000-0600

ASTERISK-14305 only represents an indentation change at the point of deadlock, however it also make a change, in ast_hangup, that makes it less likely that the deadlocking code will be exercised, although, if exercised, that code is still unsafe, because of locking orders.



By: David Woolley (davidw) 2011-02-02 05:37:11.000-0600

The other approach I can see is to do what manager.c does for redirect, and to chicken out if check_hangup indicates the channel has gone.  That feels safe in this case, but if the channel list unlock is too late, fixing that is a more general solution.

By: David Woolley (davidw) 2011-02-02 05:37:15.000-0600

The other approach I can see is to do what manager.c does for redirect, and to chicken out if check_hangup indicates the channel has gone.  That feels safe in this case, but if the channel list unlock is too late, fixing that is a more general solution.

By: David Woolley (davidw) 2011-02-02 05:53:52.000-0600

The channel_free side of the problem seems to be have been introduced by revision 22632.  Unfortunately the log only says "auto-merge", even on trunk, so I cannot match it with an issue.

By: David Woolley (davidw) 2011-02-02 10:34:47.000-0600

Updated note 131382, as ASTERISK-14305 does seem to mitigate the problem in common cases.

It repeats the code that removes the channel from the channels list, but does it earlier in the process, and does it in a way where it releases the channel list lock promptly.



By: David Woolley (davidw) 2011-02-14 12:48:17.000-0600

Because this seems to be fixed in the most common case by an existing fix, I think this definitely should be downgraded to "minor".

Also, as that fix seems to cover the case we are encountering, we will not have the resources to follow this one through.

However, either there still is a locking order problem, or there is unreachable code, so I'd say there is still a bug here.  I think I would suggest considering suspending the issue.

By: Matt Jordan (mjordan) 2014-03-05 07:03:27.545-0600

Hey David -

So, I know there wasn't much feedback on this issue, but looking at the current Asterisk 1.8 code for {{manager_park}}, this should not longer be an issue. {{ast_channel_get_by_name}} no longer has a locking order inversion with {{ast_channel_free}}. I'm going to go ahead and close this out as "Fixed" in some release of 1.8 and later.

I know that obviously doesn't help the system that this was reported on, but since those versions didn't have a lot of the architectural changes that were made in 1.8, it was very difficult to fix it correctly in those versions.

Matt