
Summary: ASTERISK-23554: [patch] deadlock on forced disconnect of DAHDI PRI span
Reporter: Tzafrir Cohen (tzafrir)
Labels: patch
Date Opened: 2014-03-27 15:38:52
Date Closed:
Priority: Major
Regression?:
Status: Open/New
Components: Channels/chan_dahdi
Versions: SVN, 12.1.1, 13.18.4
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
(0) pri_destroy_span_nolock.patch
(1) pri_destroy_span_prilist.patch
(2) pri_destroy_span_threads.patch
(3) sigpri_handle_enodev_1.patch
(4) sigpri_handle_enodev.patch
(5) test.sh
Description: There is a lock-order deadlock between iflock (the chan_dahdi lock protecting the interfaces list) and the pri->lock of the affected span.

Path 1:
DAHDI_EVENT_REMOVED received on a B channel. Later on:
do_monitor
dahdi_destroy_channel_range (takes iflock)
destroy_dahdi_pvt
dahdi_unlink_pri_pvt (tries to take pri->lock)

Path 2:
pri_dchannel (takes pri->lock)
sig_pri_handle_dchan_exception
my_handle_dchan_exception (handling DAHDI_EVENT_REMOVED)
pri_destroy_span
dahdi_destroy_channel_range (tries to take iflock)

It seems that this is easy to reproduce, as DAHDI will emit a DAHDI_EVENT_REMOVED in response to any ioctl on a channel in a device that has been removed.
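
The two paths above take the same pair of locks in opposite order. The following is a minimal sketch of that inversion, not chan_dahdi code: plain pthread mutexes stand in for iflock and pri->lock, and the function name lock_order_inversion_detected is made up for illustration. It replays the interleaving on a single thread and uses pthread_mutex_trylock for the second acquisition, so it reports the conflict instead of hanging.

```c
#include <errno.h>
#include <pthread.h>

/* Stand-ins for chan_dahdi's iflock and one span's pri->lock (assumption:
 * default, non-recursive mutexes). */
static pthread_mutex_t iflock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t pri_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Each path has already taken its first lock; each then attempts its second.
 * Returns 1 when both second acquisitions fail with EBUSY, i.e. neither
 * path could make progress if it blocked instead.
 */
int lock_order_inversion_detected(void)
{
    int path1_blocked, path2_blocked;

    pthread_mutex_lock(&iflock);   /* Path 1: dahdi_destroy_channel_range */
    pthread_mutex_lock(&pri_lock); /* Path 2: pri_dchannel */

    /* Path 1: dahdi_unlink_pri_pvt wants pri->lock. */
    path1_blocked = (pthread_mutex_trylock(&pri_lock) == EBUSY);
    /* Path 2: dahdi_destroy_channel_range wants iflock. */
    path2_blocked = (pthread_mutex_trylock(&iflock) == EBUSY);

    pthread_mutex_unlock(&pri_lock);
    pthread_mutex_unlock(&iflock);
    return path1_blocked && path2_blocked;
}
```

With blocking locks in two real threads, the same interleaving leaves each thread waiting forever for the lock the other one holds.
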
Comments:

By: Tzafrir Cohen (tzafrir) 2014-03-30 08:17:32.983-0500

pri_destroy_span_nolock.patch: an initial workaround/fix. It mostly solves the original issue, but I ran into another one:

[2014-03-30 14:07:27] ERROR[14206]: lock.c:444 __ast_pthread_mutex_unlock: sig_pri.c line 7618 (pri_dchannel): mutex '&pri->lock' freed more times than we've locked!
[2014-03-30 14:07:27] ERROR[14206]: lock.c:475 __ast_pthread_mutex_unlock: sig_pri.c line 7618 (pri_dchannel): Error releasing mutex: Operation not permitted

That is: this is not the right place to kill the thread.

By: Tzafrir Cohen (tzafrir) 2014-03-30 13:55:29.228-0500

pri_destroy_span_threads.patch: this patch mostly fixes the thread issue by handling the case of running in the PRI thread separately.

Initial tests still show some deadlocks under stress: with the right timing I eventually get an endless stream of the following:

ERROR[24349]: chan_dahdi.c:13690 dahdi_pri_error: PRI Span: 1 Read on 63 failed: No such device

This comes from libpri: pri_check_event calls the read function. read returns -ENODEV and the read function returns 0 (as it ignores errors), which means that pri_check_event returns a NULL event. In this case pri_dchannel() simply moves on to the next iteration.
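
The endless error flow follows from the loop being unable to tell "no event" apart from "device gone". Below is a minimal sketch of one way to keep the two apart. Every name in it (span_read_fn, poll_span, run_dchannel_loop and the two test readers) is hypothetical; none of this is libpri or sig_pri API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical reader: returns bytes read, 0 for no data, or -1 with errno set. */
typedef int (*span_read_fn)(void *buf, size_t len);

enum poll_result { POLL_EVENT, POLL_NO_EVENT, POLL_DEVICE_GONE };

/* Reports ENODEV separately instead of folding every failure into "no event". */
enum poll_result poll_span(span_read_fn read_fn)
{
    char buf[64];
    int n = read_fn(buf, sizeof(buf));

    if (n > 0)
        return POLL_EVENT;
    if (n < 0 && errno == ENODEV)
        return POLL_DEVICE_GONE;
    return POLL_NO_EVENT;
}

/* Polls up to max_iterations times; returns how many polls ran before stopping. */
int run_dchannel_loop(span_read_fn read_fn, int max_iterations)
{
    int i;

    for (i = 0; i < max_iterations; i++) {
        if (poll_span(read_fn) == POLL_DEVICE_GONE)
            return i + 1; /* leave the loop rather than spin on a dead device */
    }
    return i;
}

/* Test reader that behaves like a removed device. */
int removed_device_read(void *buf, size_t len)
{
    (void)buf;
    (void)len;
    errno = ENODEV;
    return -1;
}

/* Test reader that behaves like an idle but healthy device. */
int idle_device_read(void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}
```

With removed_device_read the loop stops after the first poll; with idle_device_read it runs all of its iterations, as the existing loop does today.
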

By: Tzafrir Cohen (tzafrir) 2014-03-31 13:06:18.450-0500

pri_destroy_span_prilist.patch: a slightly different approach, which destroys the span in the monitor thread:

* A global list: doomed_pris
* On DAHDI_EVENT_REMOVED to the span: add it to the list.
* The monitor thread checks the list periodically, and destroys all the spans there.

This mostly gets rid of the problem I mentioned in the previous comment, but I still managed to reproduce it.
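
The deferred-destruction idea in the list above can be sketched roughly as follows. This is an illustration only and is not taken from the attached patch: the node layout, doomed_lock, doom_span, reap_doomed_spans and tally_destroy are all assumed names; only doomed_pris comes from the comment.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical node type; the patch's actual layout is not shown in this report. */
struct doomed_pri {
    int span;
    struct doomed_pri *next;
};

static struct doomed_pri *doomed_pris; /* global list of spans awaiting destruction */
static pthread_mutex_t doomed_lock = PTHREAD_MUTEX_INITIALIZER;

/* Event path: only records the span and destroys nothing. Returns 0 on success. */
int doom_span(int span)
{
    struct doomed_pri *node = malloc(sizeof(*node));

    if (!node)
        return -1;
    node->span = span;
    pthread_mutex_lock(&doomed_lock);
    node->next = doomed_pris;
    doomed_pris = node;
    pthread_mutex_unlock(&doomed_lock);
    return 0;
}

/* Monitor thread, run periodically. Returns the number of spans destroyed. */
int reap_doomed_spans(void (*destroy)(int span))
{
    struct doomed_pri *list, *next;
    int count = 0;

    /* Detach the whole list, then drop the lock before destroying anything. */
    pthread_mutex_lock(&doomed_lock);
    list = doomed_pris;
    doomed_pris = NULL;
    pthread_mutex_unlock(&doomed_lock);

    for (; list; list = next) {
        next = list->next;
        destroy(list->span);
        free(list);
        count++;
    }
    return count;
}

/* Example callback: adds up the span numbers it was asked to destroy. */
static int destroyed_sum;
void tally_destroy(int span)
{
    destroyed_sum += span;
}
```

Because the event path never destroys anything itself, it never needs iflock while pri->lock is held. A real implementation would also have to handle the same span being queued more than once, which this sketch does not.
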

By: Rusty Newton (rnewton) 2014-04-02 18:44:36.762-0500

Thanks, is this one already on reviewboard?

By: Tzafrir Cohen (tzafrir) 2014-04-28 14:51:47.995-0500

The missing second part: handling ENODEV from libpri in sigpri.

Is sigpri allowed to use code from chan_dahdi (I guess not), or should I expose it by adding yet another method?

By: Richard Mudgett (rmudgett) 2014-04-28 15:45:36.700-0500

sig_pri cannot use anything in chan_dahdi: it is a lower-level module intended to be statically linked with a parent channel driver, which may not be chan_dahdi.


By: Tzafrir Cohen (tzafrir) 2014-04-30 14:41:37.514-0500

sigpri_handle_enodev_1.patch: a newer version of that patch. It adds a new member to the sigpri callbacks: destroy_later(span).
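
A callback member lets the lower module request teardown without linking against the parent driver. The following is a minimal sketch of that pattern and not the patch itself: struct sig_callbacks is a trimmed-down, hypothetical table, and sig_request_destroy and parent_destroy_later are assumed names. Only destroy_later(span) is taken from the comment above.

```c
#include <stddef.h>

/* Hypothetical callback table; the real sigpri callback structure has many more members. */
struct sig_callbacks {
    void (*destroy_later)(int span); /* optional: parent schedules span teardown */
};

/*
 * Lower-layer side: knows only the callback table, never the parent driver.
 * Returns 1 if the parent was asked to destroy the span, 0 otherwise.
 */
int sig_request_destroy(const struct sig_callbacks *cb, int span)
{
    if (cb && cb->destroy_later) {
        cb->destroy_later(span);
        return 1;
    }
    return 0;
}

/* Parent-driver side: here it only records the request. */
static int last_scheduled_span = -1;
void parent_destroy_later(int span)
{
    last_scheduled_span = span;
}

/* A parent that implements the callback, and one that leaves it unset. */
const struct sig_callbacks parent_callbacks = { .destroy_later = parent_destroy_later };
const struct sig_callbacks empty_callbacks = { .destroy_later = NULL };
```

Treating the member as optional means a parent driver that does not set it keeps its current behavior.
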

By: Tzafrir Cohen (tzafrir) 2014-05-19 08:52:55.303-0500

Attached a test script (test.sh) to help reproduce the issue.