
Summary: ASTERISK-18136: r319652 causes deadlock with REFER
Reporter: Walter Doekes (wdoekes)
Labels:
Date Opened: 2011-07-14 04:06:13
Date Closed: 2011-08-26 14:37:55
Priority: Major
Regression?: Yes
Status: Closed/Complete
Components: Core/Channels
Versions: 1.6.2.19
Frequency of Occurrence: Occasional
Related Issues: duplicates ASTERISK-18082 Deadlock of SIP or segfault when doing REFERs
Environment:
Attachments:
Description:

Recently I upgraded some machines to 1.6.2.19. I skipped 1.6.2.18 because of "the Cisco bug" (Via header misinterpretation).

After this upgrade, I've twice seen a deadlock in SIP processing. The Asterisk core still seems to respond, but no SIP traffic is read. In netstat the Recv-Q grows indefinitely.

In both cases, the last traffic seen was a SIP REFER conversation in which this happened:
{code}
client       asterisk
REFER  --->
      <---  202
      <---  NOTIFY
      --->  200
BYE    --->
BYE    --->
BYE    --->
(that last BYE is never handled)
{code}

After reverting to 1.6.2.17, these hangs haven't occurred again.

Examining a core dump yielded the following information:
* The thread handling the original REFER (thread 16) was in handle_request_do -> handle_incoming -> handle_request_refer -> local_attended_transfer -> ast_indicate -> ast_indicate_data, blocked at ast_channel_lock(chan).
* In 1.6.2.19, r319652 was added, as quoted below. Notable are the extra ast_indicate() calls.
* Most other threads seem to be waiting on the mutex (netlock) that the REFER-thread holds in handle_request_do.
* It looked like the REFER-thread was waiting on a thread that was doing deadlock avoidance in sip_hangup (thread 50); a simplified illustration of this kind of lock cycle follows below.
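
For readers less familiar with this failure mode, here is a minimal stand-alone illustration of a lock-ordering cycle: generic pthread code, not Asterisk code, deliberately reduced to two threads and two locks (the real wait chain above involves more threads plus the netlock, channel locks and sip_pvt locks, and the names below are purely illustrative).
{code}
/* Generic two-thread lock-ordering deadlock, reduced from the multi-thread
 * wait chain described above.  Not Asterisk code; names are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread_one(void *arg)       /* takes A, then wants B */
{
        (void) arg;
        pthread_mutex_lock(&lock_a);
        usleep(100000);                  /* let the other thread grab lock_b */
        pthread_mutex_lock(&lock_b);     /* blocks forever once thread_two holds B */
        puts("thread one got both locks");
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *thread_two(void *arg)       /* takes B, then wants A */
{
        (void) arg;
        pthread_mutex_lock(&lock_b);
        usleep(100000);
        pthread_mutex_lock(&lock_a);     /* blocks forever: thread_one holds A */
        puts("thread two got both locks");
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread_one, NULL);
        pthread_create(&t2, NULL, thread_two, NULL);
        pthread_join(t1, NULL);          /* never returns: both threads are stuck */
        pthread_join(t2, NULL);
        return 0;
}
{code}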

r319652 in 1.6.2.19:
{code}
--- channels/chan_sip.c (revision 319652)
+++ channels/chan_sip.c (revision 319653)
@@ -21209,6 +21209,10 @@
 		ast_debug(1, "SIP attended transfer: Unlocking channel %s\n", targetcall_pvt->owner->name);
 		ast_channel_unlock(targetcall_pvt->owner);
 	}
+	ast_indicate(target.chan1, AST_CONTROL_UNHOLD); // <-- the ast_indicate the REFER-thread hangs on
+	if (target.chan2) {
+		ast_indicate(target.chan2, AST_CONTROL_UNHOLD);
+	}
 }
 if (targetcall_pvt)
 	ao2_t_ref(targetcall_pvt, -1, "drop targetcall_pvt");
{code}
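
For context (my reading of the backtraces further down, not an authoritative description of channel.c): ast_indicate() is a thin wrapper around ast_indicate_data(), which takes the channel lock before delivering the indication, so each of the added calls can block for as long as some other thread holds that channel's lock. A rough sketch of the call path:
{code}
/* Rough sketch, reconstructed from the backtraces below; not a verbatim
 * copy of channel.c. */
int ast_indicate(struct ast_channel *chan, int condition)
{
        return ast_indicate_data(chan, condition, NULL, 0);
}

int ast_indicate_data(struct ast_channel *chan, int _condition,
                      const void *data, size_t datalen)
{
        ast_channel_lock(chan);  /* <-- thread 16 blocks here on target.chan1 */
        /* ... deliver the control frame (AST_CONTROL_UNHOLD for the calls
         * added in r319652) ... */
        ast_channel_unlock(chan);
        return 0;
}
{code}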

Other threads waiting in handle_request_do:
{code}
       /* Process request, with netlock held, and with usual deadlock avoidance */
       for (lockretry = 10; lockretry > 0; lockretry--) {
               ast_mutex_lock(&netlock);
{code}

The thread the REFER-thread was waiting on (thread 50):
{code}
                               /* We need to get the lock on bridge because ast_rtp_set_vars will attempt
                                * to lock the bridge. This may get hairy...
                                */
                               while (bridge && ast_channel_trylock(bridge)) {
                                       sip_pvt_unlock(p);
                                       do {
                                               /* Use oldowner since p->owner is already NULL */
                                               CHANNEL_DEADLOCK_AVOIDANCE(oldowner);
                                       } while (sip_pvt_trylock(p));
                                       bridge = ast_bridged_channel(oldowner);
                               }
{code}
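
For context, CHANNEL_DEADLOCK_AVOIDANCE() is roughly (paraphrasing include/asterisk/lock.h, ignoring the lock-debugging variants) an unlock/sleep/relock cycle, which is why the backtrace below shows this thread sitting in usleep()/nanosleep() inside sip_hangup:
{code}
/* Paraphrase of CHANNEL_DEADLOCK_AVOIDANCE() from include/asterisk/lock.h
 * (debug builds add logging): drop the channel lock, yield briefly, and
 * take it again, so the thread spins here until the conflicting lock
 * holder goes away (or, as in this case, forever). */
#define CHANNEL_DEADLOCK_AVOIDANCE(chan) \
        do { \
                ast_channel_unlock(chan); \
                usleep(1); \
                ast_channel_lock(chan); \
        } while (0)
{code}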

Backtrace of that last thread:
{code}
#0  0x00007f26577bcf51 in nanosleep () from /lib/libc.so.6
#1  0x00007f26577f1bb4 in usleep () from /lib/libc.so.6
#2  0x00007f265121b057 in sip_hangup (ast=0x7f26404256f8) at chan_sip.c:6388
#3  0x00000000004690aa in ast_do_masquerade (original=0x7f26380bd328) at channel.c:4892
#4  0x0000000000461243 in __ast_read (chan=0x7f26380bd328, dropaudio=0) at channel.c:2777
#5  0x00000000004635d6 in ast_read (chan=0x7f26380bd328) at channel.c:3272
#6  0x0000000000500b09 in bridge_p2p_loop (c0=0x7f26380bd328, c1=0x32a0968, p0=0x7f26385b20b8, p1=0x7f264c4160c8, timeoutms=-1, flags=0, fo=0x7f26442917f8, rc=0x7f26442917f0, pvt0=0x7f26383f9570, pvt1=0x7f264c5df580) at rtp.c:4343
#7  0x0000000000501aa5 in ast_rtp_bridge (c0=0x7f26380bd328, c1=0x32a0968, flags=0, fo=0x7f26442917f8, rc=0x7f26442917f0, timeoutms=-1) at rtp.c:4600
#8  0x000000000046c530 in ast_channel_bridge (c0=0x7f26380bd328, c1=0x32a0968, config=0x7f2644291b00, fo=0x7f26442917f8, rc=0x7f26442917f0) at channel.c:5594
#9  0x000000000049cccc in ast_bridge_call (chan=0x7f26380bd328, peer=0x32a0968, config=0x7f2644291b00) at features.c:3042
#10 0x00007f265354e055 in dial_exec_full (chan=0x7f26380bd328, data=0x7f2644294d20, peerflags=0x7f2644292770, continue_exec=0x0) at app_dial.c:2315
#11 0x00007f265354e933 in dial_exec (chan=0x7f26380bd328, data=0x7f2644294d20) at app_dial.c:2402
...
{code}

The oldowner was already locked by another thread (thread 94), which, as far as I can tell, was in turn instantiated by the REFER-thread. Its backtrace is:
{code}
#2  0x00007f2656fb2111 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x000000000045a009 in ast_mutex_lock (pmutex=0x2b6afa0) at /usr/src/asterisk-1.6.2.19-osvpi/include/asterisk/lock.h:1746
#4  0x00000000004636b3 in ast_indicate_data (chan=0x2b6ae58, _condition=20, data=0x0, datalen=0) at channel.c:3331
#5  0x0000000000463622 in ast_indicate (chan=0x2b6ae58, condition=20) at channel.c:3282
#6  0x000000000046caa5 in ast_channel_bridge (c0=0x7f26404256f8, c1=0x2b6ae58, config=0x7f2644861b00, fo=0x7f2644861818, rc=0x7f2644861810) at channel.c:5664
#7  0x000000000049cccc in ast_bridge_call (chan=0x7f26404256f8, peer=0x2b6ae58, config=0x7f2644861b00) at features.c:3042
#8  0x00007f265354e055 in dial_exec_full (chan=0x7f26404256f8, data=0x7f2644864d20, peerflags=0x7f2644862770, continue_exec=0x0) at app_dial.c:2315
#9  0x00007f265354e933 in dial_exec (chan=0x7f26404256f8, data=0x7f2644864d20) at app_dial.c:2402
#10 0x00000000004d29b5 in pbx_exec (c=0x7f26404256f8, app=0x2a12488, data=0x7f2644864d20) at pbx.c:1359
#11 0x00000000004db361 in pbx_extension_helper (c=0x7f26404256f8, con=0x0, ...
#12 0x00000000004dcbc1 in ast_spawn_extension (c=0x7f26404256f8, context=0x7f2640425ad0 ...
#13 0x00000000004dd3e4 in __ast_pbx_run (c=0x7f26404256f8, args=0x0) at pbx.c:4322
{code}
I cannot tell who holds that lock, but chan=0x2b6ae58 happens to be the same address as 'chan1' in thread 16 and 'bridge' in thread 50.

I'm reluctant to provide a full backtrace because it may contain sensitive information, but I'll gladly answer any questions.

In 1.8.x, only the last three lines of r319652 were added, so if this is a problem in 1.8.x too, it might be an older issue there (the first of those lines has existed since r231436, from 2009). And yes, I know that 1.6.2.x is closed; I just wonder what bug r319652 is supposed to fix and whether locking issues were considered when including it in 1.6.2.19.


Regards,
Walter Doekes
OSSO B.V.
Comments:

By: Stefan Schmidt (schmidts) 2011-07-18 02:40:55.330-0500

I hit the same problem with 1.6.2.19 :(

By: Michael Cramer (micc) 2011-07-19 01:37:04.574-0500

I think I've also seen this. I'd sure like to know as well.

By: leearcher (leearcher) 2011-07-21 04:14:46.870-0500

Can this be fixed in 1.6.2.19.1?

By: Walter Doekes (wdoekes) 2011-08-16 02:12:15.706-0500

I wonder if a backport of the fix for #18082 might fix this problem as well. It does shuffle some locking around in handle_request_refer. (r331867 in the 1.8 branch.)

By: Paul Belanger (pabelanger) 2011-08-26 14:37:41.884-0500

This was fixed in Asterisk 1.8; you'll have to backport the patch from it, since 1.6.2 only receives security fixes now.

---
Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.  After testing with Asterisk 1.8, if you find this problem has not been resolved, please open a new issue against Asterisk 1.8.



By: vbcrlfuser (vbcrlfuser) 2011-10-11 21:20:39.002-0500

Just went from Asterisk 1.4.25 to Asterisk 1.6.2.19 because we hit issues with Asterisk 1.8 straight away (app_queue stopped writing queue_log, and since we run a call center we didn't want to chance it). Things have been stable except for an occasional deadlock that matches the backtraces reported here.

We don't run SNOM phones or use Local channels, which is what the commit comments say those ast_indicate lines were targeted at. So I removed them, recompiled, and tested that attended transfers still work with our SIP trunk and Polycom configuration. It worked fine on our 1.4.25 without them, so I expected it would on 1.6.2.19 too.
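
(For reference, the removal described here amounts to locally reverting the r319652 hunk quoted in the description; a rough sketch, with approximate context:)
{code}
--- channels/chan_sip.c (1.6.2.19, with r319652)
+++ channels/chan_sip.c (local revert described above)
@@ ... @@
 		ast_channel_unlock(targetcall_pvt->owner);
 	}
-	ast_indicate(target.chan1, AST_CONTROL_UNHOLD);
-	if (target.chan2) {
-		ast_indicate(target.chan2, AST_CONTROL_UNHOLD);
-	}
 }
 if (targetcall_pvt)
{code}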

We cannot reproduce this precisely because it requires an attended transfer and perfectly unlucky timing. Will report back whether this clears the deadlock so others may do the same. We are in our heavy call-volume season and didn't feel up to sorting through all the lock code in 1.8 to do the backport.



By: vbcrlfuser (vbcrlfuser) 2011-10-19 11:53:15.400-0500

Reporting back that this *should* have happened again by now and hasn't. We process several thousand phone calls a day and will for the next month or two. Will report back if it deadlocks again but am confident it should not.