[Home]

Summary:ASTERISK-17956: [patch] Call pickup deadlock in ast_do_masquerade
Reporter:Steve Davies (one47)Labels:
Date Opened:2011-06-03 04:03:24Date Closed:2011-07-26 14:11:22
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/Channels
Versions:1.6.2.18 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan-pickup-deadlock-fix
( 1) CoreShowLocks-1.6.2-r321210.txt
( 2) deadlock-1.6.patch
( 3) deadlock-1.6.patch2
( 4) masq_locks.diff
Description:Using *8 call pickup can readily cause a deadlock where ast_do_masquerade() has a channel locked, when calling ast_channel_free() - ast_channel_free requests the channels (channel list) lock, which breaks locking order requirements and caused a deadlock.

****** ADDITIONAL INFORMATION ******

We have a script placing calls to virtual (pjsua) SIP endpoints
so we repeatedly make calls that ring for 5 seconds, then are answered
for 10 seconds. No more than about 20 calls at a time - We just do
this so that we have a constant list of ringing calls available.

Dialling *8 a number of times soon causes a deadlock. We've also seen
it with one inbound call to a queue in Ringall mode to 5 handsets. A
single *8 killed everything.

An example of the relevant locks is here:


=== Thread ID: -1218679920 (tps_processing_function started at [  451] taskprocessor.c ast_taskprocessor_get())
=== ---> Lock #0 (pbx.c): MUTEX 3911 handle_statechange hints 0xa1473b8 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk(_ao2_lock+0x2ec) [0x8085a9c]
       /usr/sbin/asterisk [0x814d15f]
       /usr/sbin/asterisk [0x81a9f0f]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== ---> Lock #1 (pbx.c): MUTEX 3912 handle_statechange hint 0xd21fd88 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk(_ao2_lock+0x2ec) [0x8085a9c]
       /usr/sbin/asterisk [0x814d17c]
       /usr/sbin/asterisk [0x81a9f0f]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== ---> Lock #2 (chan_sip.c): MUTEX 13469 cb_extensionstate p 0xb0c46918 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk(_ao2_lock+0x2ec) [0x8085a9c]
       /usr/lib/asterisk/modules/chan_sip.so [0xb38b13a1]
       /usr/sbin/asterisk [0x814d272]
       /usr/sbin/asterisk [0x81a9f0f]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== ---> Lock #3 (channel.c): RDLOCK 1355 ast_channel_search_locked &(&channels)->lock 0x8222308 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk(ast_channel_search_locked+0x695) [0x80b8525]
       /usr/lib/asterisk/modules/chan_sip.so [0xb38b0d1d]
       /usr/lib/asterisk/modules/chan_sip.so [0xb38b1581]
       /usr/sbin/asterisk [0x814d272]
       /usr/sbin/asterisk [0x81a9f0f]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== ---> Waiting for Lock #4 (channel.c): MUTEX 1357 ast_channel_search_locked &c->lock_dont_use 0xdb48380 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/lib/asterisk/modules/chan_sip.so [0xb38ebb19]
       /usr/sbin/asterisk(ast_sched_runq+0x6df) [0x8194f6f]
       /usr/lib/asterisk/modules/chan_sip.so [0xb38fa9a1]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== --- ---> Locked Here: channel.c line 2195 (ast_waitfor_nandfds)


=== -------------------------------------------------------------------
===
=== Thread ID: -1324696688 (pbx_thread           started at [ 4676] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2195 ast_waitfor_nandfds &c[x]->lock_dont_use 0xdb48380 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk(ast_waitfor_nandfds+0x639) [0x80b8ce9]
       /usr/sbin/asterisk(ast_waitfor_n+0x1f) [0x80b9daf]
       /usr/lib/asterisk/modules/app_dial.so [0xb6f30d9e]
       /usr/lib/asterisk/modules/app_dial.so [0xb6f35e95]
       /usr/sbin/asterisk(pbx_exec+0xdc) [0x8150fbc]
       /usr/sbin/asterisk [0x815e84b]
       /usr/sbin/asterisk [0x8167389]
       /usr/sbin/asterisk [0x8168770]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]
=== ---> Waiting for Lock #1 (channel.c): WRLOCK 1500 ast_channel_free &(&channels)->lock 0x8222308 (1)
       /usr/sbin/asterisk(ast_bt_get_addresses+0x12) [0x812b3b2]
       /usr/sbin/asterisk [0x80b6915]
       /usr/sbin/asterisk(__ast_channel_alloc+0x3d) [0x80b6e6d]
       /usr/lib/asterisk/modules/chan_iax2.so [0xb39a4da8]
       /usr/lib/asterisk/modules/chan_iax2.so [0xb39b2627]
       /usr/lib/asterisk/modules/chan_iax2.so [0xb39b4dc7]
       /usr/sbin/asterisk [0x81bdc12]
       /lib/i686/cmov/libpthread.so.0 [0xb7b3d4c0]
       /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb7d716de]


Comments:By: Steve Davies (one47) 2011-06-03 04:10:13

I have attached a patch, which I am quite sure is wrong! It surely cannot be safe to unlock the original channel during a masquerade, even if we re-lock it immediately.

Can it?

I would prefer to be able to schedule the destruction of the channel, but there does not seem to be any generic scheduler running that we can hook into.

By: Gregory Hinton Nietsky (irroot) 2011-06-03 04:46:26

mr davies the patch i put up should help directly for the lock path above.

By: Steve Davies (one47) 2011-06-03 04:53:05

I considered that patch, but it is not quite so simple.

You are unlocking the channels list, which means that 'c' might not be valid any more. It might have been removed from the channel list or modified while you were unlocked.

IMHO If you unlock channels, you must start the search from the beginning when you re-lock.

By: Alec Davis (alecdavis) 2011-06-03 04:55:42

Using SVN-branch-1.6.2-r321210M

deadlock CoreShowLocks-1.6.2-r321210.txt

Setup, 2 internal calls to SIP C, don't answer SIP C:
DAHDI A  ---> SIP C
SIP B  ---> SIP C

Pickup at exactly the same time.
SIP D *8 pickup
SIP E *8 pickup

I'm able to get a deadlock by 2 callers (DAHDI A and SIP B) ringing same extension (SIP C), and another 2 (SIP D and E) attempt to pick up (SIP C) the ringing extension.

This is not that repeatable. Wish it was.



By: Gregory Hinton Nietsky (irroot) 2011-06-03 05:28:09

as the locking order is wrong we need to rectify it the latest patch should restore order to the universe ...

By: Steve Davies (one47) 2011-06-03 06:40:47

Sorry to keep being so negative, but deadlock-1.6.patch2 still has a problem - It will be a million times better I am sure, but...

If another thread comes along and starts waiting on &channels lock, then it has a chance to jump in between the unlock:
 AST_RWLIST_UNLOCK(&channels);
and
 ast_channel_free(clonechan);

If you are going to unlock origchan, can you just do it as per masq_locks.diff, which actually ensures we have a legal locking order when calling ast_channel_free().

Thoughts? I have now load-tested masq_locks.diff, and verified that unlocking/relocking origchan should not cause any massively obvious issues. There is only one obvious code path that has 2 channels locked when calling ast_do_masquerade(), and that will not be masquerading a zombie channel.

The sequence that causes this deadlock to happen or not happen is as follows:

No deadlock:
  1) SIP calls magic_pickup which sets-up a masq.
  2) Masq happens here in another thread, cannot deadlock.
  3) SIP calls hangup on the ringing channel. CDR handled, channel freed.

With deadlock:
  1) SIP calls magic_pickup which sets-up a masq.
  2) SIP calls hangup on the ringing channel, it gets marked zombie
     because it is about to be masq'ed.
  3) Masq happens here in another thread, channel freed, might deadlock.

The above sequence also causes 2 different CDR write paths. The zombie record never has its CDRs written, which fortunately should always be correct!

By: Leif Madsen (lmadsen) 2011-06-03 08:23:01

If you guys can get this fixed before we close the 1.6.2 branch entirely (and it's only still open because of one last blocking issue) then that's great, but I don't think this is going to get considered a blocker for 1.6.2. Thanks!

By: Steve Davies (one47) 2011-06-03 09:24:06

chan-pickup-deadlock-fix is simply a tidied up version of masq_locks.diff. It is the actual patch that I am load-testing with at present.

By: Gregory Hinton Nietsky (irroot) 2011-06-03 10:20:19

one47 thank you for the feedback its most appreciated dont think there is a ideal solution really and what you have may be closest.

By: Steve Davies (one47) 2011-06-03 11:20:44

No probs Greg - Fortunately, this seems to be a rare occurrence, and the code-paths that cause this issue are few and far between. Sadly, rare occurrences have a horrible habit of happening far too often :(

I have a load test running on this fix, and will leave it going over the weekend.

By: Alec Davis (alecdavis) 2011-06-03 14:43:23

chan-pickup-deadlock-fix doesn't break anything for me.
But as we've found, the deadlock isn't repeatable either.

By: Steve Davies (one47) 2011-06-04 04:37:43

Looking at where the deadlock occurs, I suspect that it is a combination of lots of BLF subscriptions with using *8 that causes it.

By: Alec Davis (alecdavis) 2011-06-06 02:24:55.612-0500

Both 1.6.2.18 and 1.6.2.svn are deadlocked at ast_channel_free which in both revisions has the locking order wrong, Greg's patch appears to be the correct patch. BLF hints arn't involved in my deadlock, although are configured.

1.6.2.18 deadlock
=== Thread ID: -1218679920 (tps_processing_function started at [ 451] taskprocessor.c ast_taskprocessor_get())
=== ---> Lock #0 (pbx.c): MUTEX 3911 handle_statechange hints 0xa1473b8 (1)
=== ---> Lock #1 (pbx.c): MUTEX 3912 handle_statechange hint 0xd21fd88 (1)
=== ---> Lock #2 (chan_sip.c): MUTEX 13469 cb_extensionstate p 0xb0c46918 (1)
=== --> Lock #3 (channel.c): RDLOCK 1355 ast_channel_search_locked &(&channels)>lock 0x8222308 (1)
=== --> Waiting for Lock #4 (channel.c): MUTEX 1357 ast_channel_search_locked &c>lock_dont_use 0xdb48380 (1)

=== -------------------------------------------------------------------
===
=== Thread ID: -1324696688 (pbx_thread started at [ 4676] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2195 ast_waitfor_nandfds &c[x]->lock_dont_use 0xdb48380 (1)
=== --> Waiting for Lock #1 (channel.c): WRLOCK 1500 ast_channel_free &(&channels)>lock 0x8222308 (1)

1.6.2.svn deadlock
=== Thread ID: 0xaef87b70 (sip_pickup_thread    started at [19583] chan_sip.c sip_pickup())
=== ---> Lock #0 (channel.c): RDLOCK 1355 ast_channel_search_locked &(&channels)->lock 0x81fdd88 (1)
=== ---> Waiting for Lock #1 (channel.c): MUTEX 1357 ast_channel_search_locked (channel lock) 0x9f9672c (1)
=== --- ---> Locked Here: channel.c line 2188 (ast_waitfor_nandfds)
===
=== Thread ID: 0xaefffb70 (ss_thread            started at [ 9806] chan_dahdi.c handle_init_event())
=== ---> Lock #0 (channel.c): MUTEX 2188 ast_waitfor_nandfds (channel lock) 0x9f9672c (1)
=== ---> Waiting for Lock #1 (channel.c): WRLOCK 1500 ast_channel_free &(&channels)->lock 0x81fdd88 (1)
=== -------------------------------------------------------------------


By: Steve Davies (one47) 2011-06-06 04:18:36.362-0500

Alec, can you clarify which patch you used? You mention "Greg's patch" above, and "chan-pickup-deadlock-fix" before that.

By: Alec Davis (alecdavis) 2011-06-06 06:18:29.294-0500

Steve, masq_locks.diff (which is a cleaned up chan-pickup-deadlock-fix) is the actual one I meant.

How did your testing go with PJSUA?



By: Steve Davies (one47) 2011-06-06 07:09:26.402-0500

The most recent of the patches, "chan-pickup-deadlock-fix" was tested all weekend through our normal load-test. We also manually simulated multiple rapid *8 calls at the start and end of the test, and cannot reproduce the deadlock with the patch applied.

We plan on running with this patch unless we find a better suggestion :)

By: Russell Bryant (russell) 2011-07-26 14:11:15.644-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions