[Home]

Summary:ASTERISK-16961: [patch] hint state changes deadlock/race
Reporter:Steve Davies (one47)Labels:
Date Opened:2010-11-15 12:19:13.000-0600Date Closed:2011-03-28 05:47:06
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/PBX
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) ao2.patch.deadlock.txt
( 1) bug18310.patch
( 2) bug18310-ao2.patch
( 3) deadlock_backoff.patch
( 4) deadlock-1.8.0-20101117.txt
( 5) gdb.txt
( 6) locks.txt
( 7) statecbs_ao2.mk2.patch
Description:VERY similar to ASTERISK-16834, but this is a different deadlock path, so I have raised a separate report.

Thread 1: taskprocessor  -> handle_statechange
   Lock order = conlock, hints, hint, pvt
Thread 2: chan_sip.c:  handle_request_do -> handle_response_notify
   Lock order = pvt, conlock

Thread2 only gets into the conlock if STATECHANGEQUEUE is true, ie. there are rapid state changes happening on a subscribed hint.

****** ADDITIONAL INFORMATION ******

I tried to find a way to un-tangle this mess. The "core" issue as it looks to me is that the chan_sip state-notify code uses ast_get_hint to check:
     /* If none of the hinted devices are registered, we will
      * override notification and show no availability.
      */
To undo this race condition, chan_sip should not need to fetch this data - It should be provided in the state data being passed around. Under what circumstances it that test actually necessary. Surely we will receive state == AST_EXTENSION_UNAVAILABLE if no devices are AVAILABLE?

Attaching relevant gdb and core show locks output
Comments:By: Stefan Schmidt (schmidts) 2010-11-15 15:07:08.000-0600

locks like the same lock but in a different way to me.
i will try to bring my patch to 1.6.2 tomorrow cause its much faster and needs less locks at all.

By: Steve Davies (one47) 2010-11-16 08:21:13.000-0600

Hi, I assume you are referring to the patch here:
https://reviewboard.asterisk.org/r/1003/diff/5/

I do not think that catches this partcular deadlock. When chan_sip receives a packet, it immediately uses find_call() to lock the appropriate "pvt" lock, and in this circumstance (STATECHANGEQUEUE set), it uses ast_get_hint(), which requires the "conlock" lock.

The taskprocessor/handle_statechange code path always locks the "conlock" lock first, and the "pvt" lock is acquired inside the state-change callback, which is outside of the domain of your patch.

In fact, I believe there are a couple of places where chan_sip calls ast_get_hint() with a "pvt" lock held (eg. handling a hint subscribe request) that could potentially cause the same issue.

By: Stefan Schmidt (schmidts) 2010-11-16 08:32:27.000-0600

you are right for the patch ;)

ast_get_hint or in detail ast_hint_extension is another possible lock place but we could change this to use the hint container here too. there is IMO no need for the conlock anymore cause we could use the refcounter value of hints for this.

i will try to write a patch for this.

best regards

stefan

By: Steve Davies (one47) 2010-11-16 09:43:14.000-0600

I think the conlock is necessary in ast_hint_extension because this ends up calling pbx_find_extension.

To solve this deadlock, I think it will be necessary to update the callback functions to return -1 if they cannot get a lock. handle_statechange can then release (conlocks, hints, hint) locks, usleep(1), grab locks and try again.

I am experimenting with this now and will post a patch if it seems alright.

By: Matt Kemner (mkemner) 2010-11-17 00:55:33.000-0600

I have been chasing this exact same fault for weeks, on 1.8

I was running 1.8.0, and as of two days ago I upgraded to SVN-branch-1.8-r294989M to try and resolve this issue - at the same time I enabled thread debugging support.  I just had it deadlock again, and have uploaded the output to 'core show locks' - which shows the exact same situation as described by one47.

Thread 1
holds ast_rdlock_contexts &conlock
holds handle_statechange hints
holds handle_statechange hint
 waiting for cb_extensionstate
  Locked Here: chan_sip.c line 7428 (find_call)
Thread 2
holds handle_request_do &netlock
holds find_call sip_pvt_ptr
 waiting for ast_rdlock_contexts &conlock
  Locked Here: pbx.c line 9911 (ast_rdlock_contexts)

By: Stefan Schmidt (schmidts) 2010-11-17 01:27:53.000-0600

what i can see from the code is that the function ast_get_hint_extension is called from 2 different locations in the code:
1) inside pbx.c when initialising hints by reading them from the config file. They still should use the pbx_get_extension code part.
2) outside pbx.c like chan_sip.c which want to match a hint for statechange. they use ast_get_hint to search the right exten.

IMO the lock only happens for the second type of functions and for this functions we could use the container cause the hints are allready there.
I have allready tried this but i had some problems so it took some time to fix it.

By: Steve Davies (one47) 2010-11-17 04:44:07.000-0600

Hi Stefan,

As far as I know, the data returned by ast_get_hint is only stored in the dialplan, it is not copied into the hint container or hint objects, so it is still necessary to use pbx_get_extension to get the list of hinted/monitored devices attached to the extension/extension pattern.

I am attaching a patch which adds the back-off retry behaviour to the handle_statechange code. This has survived our load-test overnight. It is pretty crude, so perhaps a better programmer will be able to add some finesse to the issue :)

By: Stefan Schmidt (schmidts) 2010-11-17 04:50:40.000-0600

one47 you are right but a hint obj has a direct link back to its extension so
e = hint->exten would be what we search.

i will take a look at your patch and also do some load tests ;)

By: Matt Kemner (mkemner) 2010-11-22 23:16:48.000-0600

The deadlock_backoff.patch has worked around the issue for me too, asterisk has now processed more than 600 calls over the last 5 days.  Previously I'd be lucky to get 100 calls (one or two days) before deadlock.

By: Steve Davies (one47) 2010-11-23 04:59:27.000-0600

I have now put tens of thousands of calls over the last week through our test rig with no lockups. Normally our test rig lasted about 4 hours before a lockup.

By: Jeff Peeler (jpeeler) 2010-12-10 17:15:55.000-0600

I believe the patch posted here is the best approach, but to be fully safe all the callback calls needs to be checked. I have done that and would like it to be tested.

By: Stefan Schmidt (schmidts) 2010-12-11 01:19:54.000-0600

i am gonna try this patch against 1.8 to see the performance impact on this. As you know handle_statechange has a great impact on call handling performance so waiting there could maybe cause a regression.

IMHO the best way of really solving this would be moving the callbacks itself to ao2 objects so we dont need the context lock anymore.

By: Jeff Peeler (jpeeler) 2010-12-16 17:58:16.000-0600

I'm sorry, I didn't mean really best approach. Stefan what were your findings with the performance? I'll go ahead and start looking at what you suggested.



By: Ronald Chan (loloski) 2010-12-18 20:41:18.000-0600

we running this patch for sometime, this fix the problem for us. We are going to continue to use this patch as a stop gap measure until a right patch appears. thanks for one47 for posting this patch you are truly amazing :D

By: Steve Davies (one47) 2011-01-05 10:37:16.000-0600

It seems that this patch is okay, but that it does not go far enough...

1) Performance - This patch is no worse performance wise that the unpatched code - The deadlock avoidance is called so rarely that it will have an effect tending towards zero.

2) Additional issue. In chan_sip.c:cb_extensionstate(), there is a potential call to chan_sip.c:transmit_state_notify()

Within transmit_state_notify(), if "notifycid" is enabled, then there is a call into channel.c:ast_channel_search_locked(), which is a deadlock waiting to happen, particularly if called with 4 other locks held as in this case.

ast_channel_search_locked() has no deadlock avoidance code, so things lock hard.

The only way I have found to cause the above to be a problem is to drop call-files which generate a Local/ channel - The nature of the Local/ channel means that it grabs a channel lock, and then shortly afterwards needs the context-lock - This is the reverse of what handle_statechange does, so deadlocks...

3) Way forward - Did anyone (stefan?) manage to patch handle_statechange() so that it no longer needs the context-lock? That certainly looks like a promising way forward.

By: Steve Davies (one47) 2011-01-05 11:11:38.000-0600

This might be a bit simplistic, but can someone check I am not being utterly stupid:

At present, I believe that the context-lock is being used to protect the 'statecbs' list. 'statecbs' is managed in main/pbx.c, and is not referenced anywhere else. Everywhere in main/pbx.c that 'statecbs' is referenced, the hints-lock is held.

Is the hints lock therefore not sufficient to protect 'statecbs' so that the context-lock is only held when the context-lock is specifically needed???

I believe that the possible exception to this is where the dialplan is being reloaded, and hints potentially created or destroyed. This will be done from a thread that holds the context-lock while trying to grab and release the hints lock; I cannot see that being a problem, but could be wrong...

Thoughts please?

By: Jeff Peeler (jpeeler) 2011-01-05 18:14:02.000-0600

I've done as schmidts suggested and converted the callbacks to be ao2 objects. If anybody could test the ao2 patch I posted that would be great.

By: Steve Davies (one47) 2011-01-06 04:17:46.000-0600

Looks good from here at first read - I will be compiling and load-testing this today. Many thanks.

Only one comment immediately: Extraneous whitespace in statecbs_cmp().

By: Steve Davies (one47) 2011-01-06 07:06:13.000-0600

I am seeing some kind of corruption with the newer patch - I have not yet identified the source, but it seems that the CallBack data that is passed into chan_sip.c:cb_extensionstate() is not a sip_pvt*

The result is:

[Jan  6 12:41:37] ERROR[11547]: /usr/src/asterisk-1.6.2.15/include/asterisk/lock.h:619 __ast_pthread_mutex_lock: chan_sip.c line 13045 (cb_extensionstate): Error obtaining mutex: Invalid argument
Program received signal SIGSEGV, Segmentation fault.

I am still looking for any obvious reason for this. Not found it yet.

By: Steve Davies (one47) 2011-01-06 11:12:03.000-0600

I have updated jpeeler's patch, and will attach it in a moment. It makes the following changes, which now compile, run, and do not crash or deadlock here (yet)

1) find_hint_by_cb_id() - Do not use OBJ_NODATA in the ao2_find call or NULL is always returned. This results in old callbacks never being deleted, and invalid pointers soon get referenced.

2) ast_merge_contexts_and_delete() - The callbacks are temporarily moved into a state_list, and then restored to the hints. The patch as provided copied instead of moving, and also left a free() of an ao2 object in the code. The updated patch is my attempt to fix this - Hope it looks right!

I will leave it soaking under high load overnight.

By: Steve Davies (one47) 2011-01-07 05:40:50.000-0600

Soaked as follows:

5 calls per second (each one answered for 3 seconds) across 60 devices, and 110 hints with about 200 BLF subscriptions active.

Continuous for 18 hours and counting without a deadlock yet.

Also able to make repeated calls through Local/ channel without it locking up.

By: Ronald Chan (loloski) 2011-01-07 08:03:28.000-0600

can this committed to 1.8 svn branch?, since our environment is now in 1.8, thanks

By: Jeff Peeler (jpeeler) 2011-01-07 14:37:46.000-0600

one47, thanks for testing and fixing my silly mistakes! I'll go ahead and post the patch on reviewboard now.

By: Matt Kemner (mkemner) 2011-01-18 03:30:05.000-0600

the statecbs_ao2.mk2.patch does not work on 1.8.2 - although it patches (all hunks succeeded, at various offsets) at compile time I get:

pbx.c: In function 'hints_data_provider_get':
pbx.c:9844: error: request for member 'first' in something not a structure or union

line 9844 being AST_LIST_TRAVERSE(&hint->callbacks, watcher, entry)

I know the patch is for 1.6 not 1.8 but just wanted to give some feedback - this same bug exists in 1.8 so as loloski said it would be nice to have a fix there too :) - am using the deadlock_backoff.patch for now.

By: Jeff Peeler (jpeeler) 2011-01-18 10:58:33.000-0600

I haven't tried merging to 1.8 yet, but you could simply replace that AST_LIST_TRAVERSE block with: watchers = ao2_container_count(hint->callbacks);
The plan is to merge this fix to all branches up from 1.6.

By: Digium Subversion (svnbot) 2011-01-18 14:13:54.000-0600

Repository: asterisk
Revision: 302265

U   branches/1.6.2/main/pbx.c

------------------------------------------------------------------------
r302265 | jpeeler | 2011-01-18 14:13:53 -0600 (Tue, 18 Jan 2011) | 27 lines

Convert device state callbacks to ao2 objects to fix a deadlock in chan_sip.

Lock scenario presented here:
Thread 1
holds ast_rdlock_contexts &conlock
holds handle_statechange hints
holds handle_statechange hint
 waiting for cb_extensionstate
  Locked Here: chan_sip.c line 7428 (find_call)
Thread 2
holds handle_request_do &netlock
holds find_call sip_pvt_ptr
 waiting for ast_rdlock_contexts &conlock
  Locked Here: pbx.c line 9911 (ast_rdlock_contexts)

Chan_sip has an established locking order of locking the sip_pvt and then
getting the context lock. So the as stated by the summary, the operations in
thread 2 have been modified to no longer require the context lock.

(closes issue ASTERISK-16961)
Reported by: one47
Patches:
     statecbs_ao2.mk2.patch uploaded by one47 (license 23),
     modified by me

Review: https://reviewboard.asterisk.org/r/1072/

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302265

By: Digium Subversion (svnbot) 2011-01-18 14:19:59.000-0600

Repository: asterisk
Revision: 302266

_U  branches/1.8/
U   branches/1.8/main/pbx.c

------------------------------------------------------------------------
r302266 | jpeeler | 2011-01-18 14:19:58 -0600 (Tue, 18 Jan 2011) | 34 lines

Merged revisions 302265 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

........
 r302265 | jpeeler | 2011-01-18 14:13:52 -0600 (Tue, 18 Jan 2011) | 27 lines
 
 Convert device state callbacks to ao2 objects to fix a deadlock in chan_sip.
 
 Lock scenario presented here:
 Thread 1
  holds ast_rdlock_contexts &conlock
  holds handle_statechange hints
  holds handle_statechange hint
   waiting for cb_extensionstate
    Locked Here: chan_sip.c line 7428 (find_call)
 Thread 2
  holds handle_request_do &netlock
  holds find_call sip_pvt_ptr
   waiting for ast_rdlock_contexts &conlock
    Locked Here: pbx.c line 9911 (ast_rdlock_contexts)
 
 Chan_sip has an established locking order of locking the sip_pvt and then
 getting the context lock. So the as stated by the summary, the operations in
 thread 2 have been modified to no longer require the context lock.
 
 (closes issue ASTERISK-16961)
 Reported by: one47
 Patches:
       statecbs_ao2.mk2.patch uploaded by one47 (license 23),
       modified by me
 
 Review: https://reviewboard.asterisk.org/r/1072/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302266

By: Digium Subversion (svnbot) 2011-01-18 14:41:00.000-0600

Repository: asterisk
Revision: 302270

_U  trunk/
U   trunk/main/pbx.c

------------------------------------------------------------------------
r302270 | jpeeler | 2011-01-18 14:41:00 -0600 (Tue, 18 Jan 2011) | 41 lines

Merged revisions 302266 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

................
 r302266 | jpeeler | 2011-01-18 14:19:57 -0600 (Tue, 18 Jan 2011) | 34 lines
 
 Merged revisions 302265 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.6.2
 
 ........
   r302265 | jpeeler | 2011-01-18 14:13:52 -0600 (Tue, 18 Jan 2011) | 27 lines
   
   Convert device state callbacks to ao2 objects to fix a deadlock in chan_sip.
   
   Lock scenario presented here:
   Thread 1
    holds ast_rdlock_contexts &conlock
    holds handle_statechange hints
    holds handle_statechange hint
     waiting for cb_extensionstate
      Locked Here: chan_sip.c line 7428 (find_call)
   Thread 2
    holds handle_request_do &netlock
    holds find_call sip_pvt_ptr
     waiting for ast_rdlock_contexts &conlock
      Locked Here: pbx.c line 9911 (ast_rdlock_contexts)
   
   Chan_sip has an established locking order of locking the sip_pvt and then
   getting the context lock. So the as stated by the summary, the operations in
   thread 2 have been modified to no longer require the context lock.
   
   (closes issue ASTERISK-16961)
   Reported by: one47
   Patches:
         statecbs_ao2.mk2.patch uploaded by one47 (license 23),
         modified by me
   
   Review: https://reviewboard.asterisk.org/r/1072/
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302270