Summary:      ASTERISK-23554: [patch] deadlock on forced disconnect of DAHDI PRI span
Reporter:     Tzafrir Cohen (tzafrir)
Labels:       patch
Date Opened:  2014-03-27 15:38:52
Date Closed:
Priority:     Major
Regression?:
Status:       Open/New
Components:   Channels/chan_dahdi
Versions:     SVN 12.1.1, 13.18.4
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:  ( 0) pri_destroy_span_nolock.patch
              ( 1) pri_destroy_span_prilist.patch
              ( 2) pri_destroy_span_threads.patch
              ( 3) sigpri_handle_enodev_1.patch
              ( 4) sigpri_handle_enodev.patch
              ( 5) test.sh
Description:

There is a deadlock between two locks: iflock (the chan_dahdi lock protecting the interface list) and pri->lock of the specific span.
Path 1: DAHDI_EVENT_REMOVED received on a B channel. Later on:

do_monitor
  dahdi_destroy_channel_range (takes iflock)
    destroy_dahdi_pvt
      dahdi_unlink_pri_pvt (tries to take pri->lock)

Path 2:

pri_dchannel (takes pri->lock)
  sig_pri_handle_dchan_exception
    my_handle_dchan_exception (handling DAHDI_EVENT_REMOVED)
      pri_destroy_span
        dahdi_destroy_channel_range (tries to take iflock)

It seems that this is easy to reproduce, as DAHDI will emit a DAHDI_EVENT_REMOVED in response to any ioctl on a channel of a device that has been removed.
Comments:

By: Tzafrir Cohen (tzafrir) 2014-03-30 08:17:32.983-0500

pri_destroy_span_nolock.patch: initial workaround / fix. This mostly solves the initial issue. However, I ran into another issue:

[2014-03-30 14:07:27] ERROR[14206]: lock.c:444 __ast_pthread_mutex_unlock: sig_pri.c line 7618 (pri_dchannel): mutex '&pri->lock' freed more times than we've locked!
[2014-03-30 14:07:27] ERROR[14206]: lock.c:475 __ast_pthread_mutex_unlock: sig_pri.c line 7618 (pri_dchannel): Error releasing mutex: Operation not permitted

That is: this is not the right place to kill the thread.

By: Tzafrir Cohen (tzafrir) 2014-03-30 13:55:29.228-0500

pri_destroy_span_threads.patch: this patch mostly fixes the threads issue by adding a separate case for running inside the PRI thread. Initial tests still show some deadlocks under stress: with the proper timing I eventually get an endless flow of the following:

ERROR[24349]: chan_dahdi.c:13690 dahdi_pri_error: PRI Span: 1 Read on 63 failed: No such device

This comes from libpri: pri_check_event calls the read function. read returns -ENODEV, and the read function returns 0 (as it ignores errors), which means that pri_check_event returns a NULL event. pri_dchannel() will just move on to the next iteration in this case.

By: Tzafrir Cohen (tzafrir) 2014-03-31 13:06:18.450-0500

pri_destroy_span_prilist.patch: a slightly different approach: destroy the span in the monitor thread:

* A global list: doomed_pris.
* On DAHDI_EVENT_REMOVED for the span: add it to the list.
* The monitor thread checks the list periodically and destroys all the spans on it.

This mostly gets rid of the problem I mentioned in the previous comment, but I still managed to reproduce it.

By: Rusty Newton (rnewton) 2014-04-02 18:44:36.762-0500

Thanks, is this one already on reviewboard?

By: Tzafrir Cohen (tzafrir) 2014-04-28 14:51:47.995-0500

The missing second part: handling ENODEV from libpri in sigpri.
Is sigpri allowed to use code from chan_dahdi (I guess not), or should I expose it by adding yet another method?

By: Richard Mudgett (rmudgett) 2014-04-28 15:45:36.700-0500

sig_pri cannot use anything in chan_dahdi, as it is a lower-level module intended to be statically linked with a parent channel driver, which may not be chan_dahdi.

By: Tzafrir Cohen (tzafrir) 2014-04-30 14:41:37.514-0500

sigpri_handle_enodev_1.patch: a newer version of that patch. Adds a new member to the sig_pri callbacks: destroy_later(span).

By: Tzafrir Cohen (tzafrir) 2014-05-19 08:52:55.303-0500

Attached a test script to help reproduce the issue.