
Summary: ASTERISK-17046: [patch] Deadlock on SIP blind transfer (REFER)
Reporter: Jonathan Thurman (jthurman)
Labels:
Date Opened: 2010-11-30 18:25:57.000-0600
Date Closed: 2011-02-08 13:25:01.000-0600
Priority: Major
Regression? No
Status: Closed/Complete
Components: Channels/chan_sip/Transfers
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) 18403-kkm-locks-1.txt
( 1) 20101130-backtrace.txt
( 2) 20101130-debug.log
( 3) 20101130-locks.txt
( 4) 20101130-threads.txt
( 5) 20101216-backtrace-threads.txt
( 6) 20101216-core-show-locks.txt
( 7) 20101219-blind_deadlock.diff
( 8) 20101221-blind_deadlock.diff
( 9) 20101221-sip_deadlock.diff
(10) 20110103-blind_deadlock.diff
(11) another_core_show_locks.txt
(12) issue18403.patch
Description: All SIP:
- A calls B
- B blind-transfers A to C (Using REFER)
- Asterisk stops responding to SIP requests (Deadlocked)


****** ADDITIONAL INFORMATION ******

CentOS 5.5

I had the issue on 1.8.0, so I updated to the latest branch/1.8.  Same host does not have this issue with 1.6.2.13.  I will attach the threads, locks, and sip debug of this failure.  I had to kill asterisk (-6) and kept the core so I can pull a backtrace if that would help.
Comments:

By: Stefan Schmidt (schmidts) 2010-12-01 03:13:09.000-0600

Do you have a backtrace for this? If yes, I am really interested in what's in there.
Thanks

By: se (se) 2010-12-01 10:24:43.000-0600

Seeing the same thing on our system. I am not able to trigger it using blind transfer on Bria, but blind transfer from a Grandstream or Thomson deadlocks SIP every time here. Running realtime SIP as well.



By: Jonathan Thurman (jthurman) 2010-12-01 10:36:09.000-0600

I didn't have time last night to create the backtrace, but it is uploaded now.

I can reproduce this every time using two Polycom 331s and a Cisco 7941.  The role of the phones (A, B, or C) does not matter.

ODBC Realtime is used for Extensions and SIP Accounts (rtcachefriends=yes).  The configuration/database was copied from our production 1.6.2.13 server.

CentOS 5.5 64-bit.

By: viniciusfontes (viniciusfontes) 2010-12-10 06:32:10.000-0600

Also having the same issue. Had to revert to 1.6.2 since it was a production environment. I attached the output of my "core show locks".

By: Jonathan Thurman (jthurman) 2010-12-16 16:00:21.000-0600

This issue still exists on 1.8.2-rc1.  I uploaded the latest "core show locks" and thread backtrace as suggested by the debugging documentation.

By: Kirill Katsnelson (kkm) 2010-12-19 14:06:04.000-0600

I cannot reproduce yet, but we are using REFER a lot, so I am very concerned.

Copying Russell Bryant's response on asterisk-dev:

<<<
I just took a look.  The "core show locks" output is what is most helpful to me in seeing where the problem is.

This problem is one of the more common deadlock situations that come up in Asterisk.  There is a lock associated with the ast_channel as well as a lock for the private data in the channel driver (sip_pvt in this case).  Based on Asterisk locking rules, it is only safe to lock the ast_channel first, and the sip_pvt second.  Otherwise, a deadlock avoidance technique of some kind must be used.

The bug is in the do_monitor thread in the "core show locks" output.  In that thread, you can see these locks:

   1) netlock (irrelevant to this bug)
   2) sip_pvt (variable i in sip_new)
   3) channel <--- Blocking while waiting for this lock

The bug here is that the code is trying to lock the ast_channel while holding the sip_pvt lock.  There are a few different solutions, and which one is most appropriate requires more in depth investigation of the code path in question.

1) Unlock the sip_pvt before doing anything that is going to try to grab the channel lock.

2) If you must hold both the sip_pvt and the channel lock at the same time, you can either:

2.a) lock the channel first, then the pvt (not usually an option, actually).

2.b) Acquire both locks using a deadlock avoidance loop like this:

   lock(sip_pvt);
   while (trylock(ast_channel)) {
       unlock(sip_pvt);
       sched_yield();
       lock(sip_pvt);
   }

   ... do whatever ...

   unlock(ast_channel);
   unlock(sip_pvt);
>>>

By: Jonathan Thurman (jthurman) 2010-12-19 16:59:38.000-0600

From tracing the code, this seems to be caused by the new call parking implementation; or at least, that section of code holds another lock that doesn't look necessary.

The attached patch resolves the issue for me, but I am not certain that it would be the final solution. I have asked the dev list to take a look and see if the channel needs to be passed for any reason, and if so I'll try and keep digging.

By: Jonathan Thurman (jthurman) 2010-12-20 16:52:16.000-0600

As I expected, Russell stated that patch 20101219-blind_deadlock.diff will not be merged.  I wasn't very happy with the way it avoided the issue anyway.

This issue also disappears when using static extension configuration using the same data as the Realtime configuration.  As far as I can tell, the current build tests for blind transfers do not use any Realtime configuration which would expose the issue.

By: Jonathan Thurman (jthurman) 2010-12-21 12:24:27.000-0600

Moderator please delete 20101221-sip_deadlock.diff (wrong patch uploaded)

After taking a longer look, I think it should be safe to unlock 'current.chan1' before the call to ast_parking_ext_valid instead of where it was unlocked below that section of code. Nothing that I see is done to chan1 that doesn't handle its own locking.

By: Kirill Katsnelson (kkm) 2010-12-22 02:05:15.000-0600

I have applied that to our pre-production 1.8.1.1. It certainly fixes the deadlock issue, although I could not get a reliable reproduction even before the patch. None happened after.

Thanks for the fix!

By: se (se) 2010-12-22 02:10:23.000-0600

Have applied the 20101219-blind_deadlock.diff patch to our testing servers (1.8.1.1), seems to be running stable (was able to reliably reproduce the bug before the patch).

Will apply and test 20101221-blind_deadlock.diff instead when I have time.

By: Andrea Cristofanini (hydrolife) 2010-12-22 03:57:07.000-0600

Using Fedora 13 x86_64 and realtime MySQL, with the 20101221-blind_deadlock.diff patch, SIP does not freeze anymore, but it hangs up the calls. So transfer does not work.

By: Kirill Katsnelson (kkm) 2010-12-22 04:01:28.000-0600

hydrolife: check if your issue is one of ASTERISK-16838, ASTERISK-16847, ASTERISK-16873, ASTERISK-16891, ASTERISK-16950, ASTERISK-16972. I ran into that as well.

By: Jonathan Thurman (jthurman) 2010-12-22 10:37:01.000-0600

After some more in-depth testing, it appears that the 20101221-blind_deadlock.diff patch fixes the initial deadlock issue that I was seeing.  However, I am still able to get random deadlocks in chan_sip, but they are happening before handle_request_refer.  After application of this patch there are no locks left after a transfer that could be cascading to the next deadlock, so it is probably a different issue.

By: Kirill Katsnelson (kkm) 2010-12-23 03:13:32.000-0600

jthurman: Got a lock-up with your patch, see 18403-kkm-locks-1.txt. Scenario: trunk A calls extension that ultimately rings a queue member B (via Queue()). B transfers to extension X. As far as I can tell, the extension X is not entered into -- a Verbose() first thing there did not print anything. Sounds just like what you are experiencing?

Linux pbx6 2.6.35-23-server #37-Ubuntu SMP Wed Nov 24 12:12:17 UTC 2010 x86_64 GNU/Linux; 4 CPUs.



By: Jonathan Thurman (jthurman) 2010-12-23 12:50:22.000-0600

kkm: Yes, I am still getting a deadlock randomly in find_call, same as the one you posted.  While it is related, that deadlock is happening in code that didn't change and before the changed code is executed.  Reverting the patch causes the deadlock for me every time, so it seems to be in the right direction but not a complete fix yet.

By: Kirill Katsnelson (kkm) 2010-12-23 14:43:59.000-0600

I looked at your original trace, and it does not look like a deadlock to me. There is no "Locked Here" note on your do_monitor thread! Yes, the locks *look* like they are taken in the wrong order, but handle_request_do() goes to great lengths to upgrade the pvt lock to a channel lock, and does that correctly.

In fact, the monitor thread is not locked, but looping in ast_autoservice_stop(). If you look at its code, there is a loop of the form

while (/* autoservice thread not completed iteration */) sleep 1 ms.

Now, I do not understand why autoservice is blocking. In the backtrace, it is blocked on the channel lock, but that block is limited to 50 ms. If unable to obtain the lock, the autoservice thread would just start a new iteration, incrementing the counter and thus releasing the monitor. That does not happen, and I am completely at a loss as to why. ast_waitfor_n() is supposed to return a failure after only 50 ms of waiting and hit the continue statement. Please debug that when you get the deadlock (that's all without your changes, on the original code).

Oh, and with your patch we are getting a real first class deadlock. pbx_builtin_setvar_helper() tries to obtain the channel lock, but, absent the caution of handle_request_do() in doing so, easily deadlocks.

By: Jonathan Thurman (jthurman) 2010-12-24 17:47:03.000-0600

kkm: I see the deadlock as two threads waiting for each other, not as one thread waiting on itself.  Here is the flow as I see it with unmodified source for 1.8 latest SVN:

Thread: do_monitor
---------------------------------
handle_request_do
handle_incoming
 handle_request_refer
  ast_parking_ext_valid
   pbx_find_extension
    ast_autoservice_stop
     while (chan_list_state == as_chan_list_state) { usleep(1000); }

Thread: autoservice_run
---------------------------------
autoservice_run
chan = ast_waitfor_n
 ast_waitfor_nandfds
  ast_waitfor_nandfds_classic / simple / complex (depending on your system)
   ast_channel_lock(c[x]);

At this point, autoservice_run is waiting to lock the channel that was locked by handle_request_do, and handle_request_refer isn't going to unlock it until ast_parking_ext_valid returns! Deadlock.

From my reading of the code, the timeout you are referring to in ast_waitfor_nandfds is actually how long to read from the file descriptor.  It has nothing to do with how long to wait for a lock, and I don't see any deadlock avoidance in autoservice_run.

My first patch *avoided* this problem by not passing a channel to the pbx_find_extension inside of ast_parking_ext_valid.  That would prevent ast_autoservice_start/stop from executing at all.  Russell stated clearly that this would not be accepted as a final solution as autoservice needs to be applied on these channels to handle cases of long return times (i.e. Database lookups).

The second patch attempts to unlock p->owner before the call to ast_parking_ext_valid, and seems to prevent the deadlock above. However, after a lot more testing, this doesn't prevent other deadlocks from happening (like the one you show in 18403-kkm-locks-1.txt)  Even trying to relock p->owner after the ast_parking_ext_valid call I still get deadlocks.  That makes me think that I am looking entirely in the wrong area, or there is something much larger going on here that I haven't found yet.

By: Kirill Katsnelson (kkm) 2010-12-24 18:17:08.000-0600

You are right, I did not realize that the ms parameter does not limit waiting for a lock. So ast_autoservice_stop should not be called with the channel locked, or it should unlock the channel while doing the waiting. I do not know how to fix that.

By: Kirill Katsnelson (kkm) 2010-12-24 18:32:25.000-0600

Oh, what a catch-22! handle_request_refer()'s logic distills down to

if (ast_parking_ext_valid()) {  // channel lock must be released before the call
 do_something()
 return;
}
pbx_builtin_setvar_helper(); // channel lock must still be held to call that

Original code calls ast_parking_ext_valid() with the lock held, and deadlocks.

Your patch calls pbx_builtin_setvar_helper() after having released the lock, and deadlocks.

By: Jonathan Thurman (jthurman) 2010-12-24 19:10:57.000-0600

I missed the only call to pbx_builtin_setvar_helper that actually sets something on current.chan1, and will try grabbing the lock again before that call to see if that helps.  There are other possible SIP deadlock issues in the tracker though (0018455, 0018468, 0018441) that might also be in play here.

By: Kirill Katsnelson (kkm) 2010-12-29 20:33:38.000-0600

Please also check ASTERISK-17184 I logged today. I feel it might be related. Did you ever receive these errors related to the deadlock?

By: Jonathan Thurman (jthurman) 2011-01-03 16:42:04.000-0600

Uploaded a new patch that I cannot get to deadlock after extensive tests.  This unlocks p before calling ast_parking_ext_valid, and also relocks current.chan1 before attempting to write to it (as pointed out by kkm).

kkm: Not sure if 0018554 is related or not, as I haven't had that issue come up.  Please test the new patch when you get a chance.  Thanks

By: Kirill Katsnelson (kkm) 2011-01-08 21:41:26.000-0600

I want to test your patch thoroughly, but I cannot repro the original issue any more. It happened but rarely in initial testing. I will be rolling 1.8 to production servers tonight, and watch them closely. They carry much higher load than the test server, used only in internal agent testing. On the first sign of a trouble, I'll give your patch a go. Give me a week or so.

By: Kirill Katsnelson (kkm) 2011-01-13 04:32:55.000-0600

@jthurman: A production server (1.8.1 without your patch) locked up today. Unfortunately I did not leave instructions for the ops on what to do in this case, so, lacking any guidance, they simply reset the machine. Thus I got neither a coredump nor the lock dump.

I am going to put your patch on the server, on the assumption that the lock-up was the same bug that we are hunting.

May I ask you to test something for me please? The lock-up today caused a failure of AMI, and this is how it got detected in the first place. The network monitoring pulled data from asterisks on a schedule, and, being unable to log on, raised an alarm. As you are able to repro every time, could you please check if AMI also stops responding during the lock-up? That would add significant weight to the current hypothesis that we got the same issue. The monitoring software issues the CoreShowChannels command.

There is a convenient expect(1) script attached to ASTERISK-15698 to issue AMI commands interactively.

By: Jonathan Thurman (jthurman) 2011-01-13 14:03:00.000-0600

@kkm: Sorry to hear that. At least the issue has been picked up and scheduled to be fixed in 1.8.3!

As for your issue: on the latest SVN 1.8 branch with no patches, I can still deadlock every time, and after the deadlock I can call CoreShowChannels *once* from AMI.  After that I get no response when issuing CoreShowChannels.  I am able to log in to AMI again, but still get nothing from CoreShowChannels.

With my latest patch attached to this issue, I don't get the deadlock and AMI always responds to CoreShowChannels.

By: Jeff Peeler (jpeeler) 2011-01-13 17:50:28.000-0600

After studying 20110103-blind_deadlock.diff, it looks like the right fix. I made two modifications: 1) moved the unlocking order just to enforce good habits; 2) deferred locking the sip private until later so the channel lock doesn't need to be reacquired in handle_request_refer (pbx_builtin_setvar_helper locks the channel itself).

By: Jonathan Thurman (jthurman) 2011-01-13 18:23:35.000-0600

Testing with the updated patch (issue18403.patch) resolves the deadlock.  Thanks jpeeler for the cleanup!

By: Digium Subversion (svnbot) 2011-01-14 11:32:53.000-0600

Repository: asterisk
Revision: 301790

U   branches/1.8/channels/chan_sip.c

------------------------------------------------------------------------
r301790 | jpeeler | 2011-01-14 11:32:53 -0600 (Fri, 14 Jan 2011) | 42 lines

Resolve deadlock involving REFER.

Two fixes:
1) One must always have the private unlocked before calling
pbx_builtin_setvar_helper to not invalidate locking order since it locks the
channel.
2) Unlock the channel before calling pbx_find_extension, which starts and stops
autoservice during the lookup. The problem scenario as illustrated by the
reporter:

Thread: do_monitor
-----------------------
handle_request_do
handle_incoming
 handle_request_refer
  ast_parking_ext_valid
   pbx_find_extension
    ast_autoservice_stop
     while (chan_list_state == as_chan_list_state) { usleep(1000); }

Thread: autoservice_run
-----------------------
autoservice_run
chan = ast_waitfor_n
 ast_waitfor_nandfds
  ast_waitfor_nandfds_classic / simple / complex (depending on your system)
   ast_channel_lock(c[x]);

handle_request_do and schedule_process_request_queue locks the owner
if it exists. The autoservice thread is waiting for the channel lock, which
wasn't ever released since the do_monitor thread was waiting for autoservice
operations to complete. Solved by unlocking the channel but keeping a reference
to guarantee safety.

(closes issue ASTERISK-17046)
Reported by: jthurman
Patches:
     20110103-blind_deadlock.diff uploaded by jthurman (license 614)
     issue18403.patch uploaded by jpeeler (license 325)
Tested by: jthurman


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=301790

By: Digium Subversion (svnbot) 2011-01-14 11:34:30.000-0600

Repository: asterisk
Revision: 301791

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r301791 | jpeeler | 2011-01-14 11:34:30 -0600 (Fri, 14 Jan 2011) | 48 lines

Merged revisions 301790 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

........
 r301790 | jpeeler | 2011-01-14 11:32:52 -0600 (Fri, 14 Jan 2011) | 42 lines
 
 Resolve deadlock involving REFER.
 
 Two fixes:
 1) One must always have the private unlocked before calling
 pbx_builtin_setvar_helper to not invalidate locking order since it locks the
 channel.
 2) Unlock the channel before calling pbx_find_extension, which starts and stops
 autoservice during the lookup. The problem scenario as illustrated by the
 reporter:
 
 Thread: do_monitor
 -----------------------
 handle_request_do
  handle_incoming
   handle_request_refer
    ast_parking_ext_valid
     pbx_find_extension
      ast_autoservice_stop
       while (chan_list_state == as_chan_list_state) { usleep(1000); }
 
 Thread: autoservice_run
 -----------------------
 autoservice_run
  chan = ast_waitfor_n
   ast_waitfor_nandfds
    ast_waitfor_nandfds_classic / simple / complex (depending on your system)
     ast_channel_lock(c[x]);
 
 handle_request_do and schedule_process_request_queue locks the owner
 if it exists. The autoservice thread is waiting for the channel lock, which
 wasn't ever released since the do_monitor thread was waiting for autoservice
 operations to complete. Solved by unlocking the channel but keeping a reference
 to guarantee safety.
 
 (closes issue ASTERISK-17046)
 Reported by: jthurman
 Patches:
       20110103-blind_deadlock.diff uploaded by jthurman (license 614)
       issue18403.patch uploaded by jpeeler (license 325)
 Tested by: jthurman
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=301791