Summary: | ASTERISK-17323: manager_park can deadlock with ast_channel_free, for channel 1 of the park operation (channel list v channel lock) | ||
Reporter: | David Woolley (davidw) | Labels: | |
Date Opened: | 2011-01-31 10:48:59.000-0600 | Date Closed: | 2014-03-05 07:03:32.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Channels/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ||
Description: | manager_park (in features.c) does ast_get_channel_by_name_locked on both channels that it is given. This locks the list of channels during its processing, meaning that the manager thread ends up locking channel 1, and then the list of channels. ast_channel_free write locks the list of channels, then temporarily locks the channel it is freeing, meaning it applies lock in the reverse order! A real deadlock has been observed with manager_park waiting for channel list and ast_channel_free waiting for the channel. ****** ADDITIONAL INFORMATION ****** This was was noticed on 1.6.1.0, but the source code is unchanged in this area in 1.6.2 branch. There has been significant restructuring of channel locking in trunk and I haven't analysed that to see if it fixes the problem. The locking logic in ast_channel_free was introduced in revision 219136, to fix ASTERISK-14305. Abstract from channel.c (ast_channel_free), with blame information: 219194 mnicholson AST_RWLIST_WRLOCK(&channels); 219194 mnicholson if (!AST_RWLIST_REMOVE(&channels, chan, chan_lis t)) { 219194 mnicholson ast_debug(1, "Unable to find channel in list to free. Assuming it has already been done.\n"); 219194 mnicholson } 219194 mnicholson /* Lock and unlock the channel just to be sure nobody has it locked still 219194 mnicholson due to a reference retrieved from the channel list. */ 219194 mnicholson ast_channel_lock(chan); 219194 mnicholson ast_channel_unlock(chan); Abstract from features.c (manager_park): 12163 tilghman ch1 = ast_get_channel_by_name_locked(channel); 12163 tilghman if (!ch1) { 12163 tilghman snprintf(buf, sizeof(buf), "Channel does not exist: %s", channel); 12163 tilghman astman_send_error(s, m, buf); 12163 tilghman return 0; 12163 tilghman } 12163 tilghman 12163 tilghman ch2 = ast_get_channel_by_name_locked(channel2); | ||
Comments: | By: David Woolley (davidw) 2011-02-01 12:48:37.000-0600 We made this reproducible by strategically placed calls to sleep. We then tried an avoidance technique, as we didn't want the channel2 functionality of manager_park (and it was ineffective anyway, which is another story), so we could force it to be passed into the main parking logic as NULL. Unfortunately, we then hit the same deadlock combination, when masq_park_call called ast_channel_alloc. I doubt there is a simple avoidance technique for this second deadlock point. I don't have time to do the detailed analysis and verify that this is still true in 1.6.2, until tomorrow. By: David Woolley (davidw) 2011-02-02 04:48:28.000-0600 I cannot see any reason why ast_channel_free should retain the lock on the channels list after it has unlinked the channel from it! Is there one? By: David Woolley (davidw) 2011-02-02 04:55:30.000-0600 ASTERISK-14305 only represents an indentation change at the point of deadlock, however it also make a change, in ast_hangup, that makes it less likely that the deadlocking code will be exercised, although, if exercised, that code is still unsafe, because of locking orders. By: David Woolley (davidw) 2011-02-02 05:37:11.000-0600 The other approach I can see is to do what manager.c does for redirect, and to chicken out if check_hangup indicates the channel has gone. That feels safe in this case, but if the channel list unlock is too late, fixing that is a more general solution. By: David Woolley (davidw) 2011-02-02 05:37:15.000-0600 The other approach I can see is to do what manager.c does for redirect, and to chicken out if check_hangup indicates the channel has gone. That feels safe in this case, but if the channel list unlock is too late, fixing that is a more general solution. By: David Woolley (davidw) 2011-02-02 05:53:52.000-0600 The channel_free side of the problem seems to be have been introduced by revision 22632. Unfortunately the log only says "auto-merge", even on trunk, so I cannot match it with an issue. By: David Woolley (davidw) 2011-02-02 10:34:47.000-0600 Updated note 131382, as ASTERISK-14305 does seem to mitigate the problem in common cases. It repeats the code that removes the channel from the channels list, but does it earlier in the process, and does it in a way where it releases the channel list lock promptly. By: David Woolley (davidw) 2011-02-14 12:48:17.000-0600 Because this seems to be fixed in the most common case by an existing fix, I think this definitely should be downgraded to "minor". Also, as that fix seems to cover the case we are encountering, we will not have the resources to follow this one through. However, either there still is a locking order problem, or there is unreachable code, so I'd say there is still a bug here. I think I would suggest considering suspending the issue. By: Matt Jordan (mjordan) 2014-03-05 07:03:27.545-0600 Hey David - So, I know there wasn't much feedback on this issue, but looking at the current Asterisk 1.8 code for {{manager_park}}, this should not longer be an issue. {{ast_channel_get_by_name}} no longer has a locking order inversion with {{ast_channel_free}}. I'm going to go ahead and close this out as "Fixed" in some release of 1.8 and later. I know that obviously doesn't help the system that this was reported on, but since those versions didn't have a lot of the architectural changes that were made in 1.8, it was very difficult to fix it correctly in those versions. Matt |