
Summary: ASTERISK-25023: Deadlock in chan_sip in update_provisional_keepalive
Reporter: Arnd Schmitter (arnd)
Labels:
Date Opened: 2015-04-28 05:10:02
Date Closed: 2016-03-17 10:29:48
Priority: Major
Regression?:
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions: 13.3.2
Frequency of Occurrence:
Related Issues:
    is caused by ASTERISK-24212 testsuite: Sporadic crash due to assert on stopping RTP engine
    is caused by ASTERISK-25355 sched: ast_sched_del may return prematurely due to spurious wakeup
Environment: centos 6 / 64Bit
Attachments: (0) backtrace.txt
             (1) core-show-locks.txt
             (2) trace.txt
Description: There is a race condition / deadlock when update_provisional_keepalive is called while a scheduler job for send_provisional_keepalive_full is already pending.

update_provisional_keepalive gets called with the sip_pvt struct locked, and the first thing it does is delete the pending scheduler job via AST_SCHED_DEL_UNREF. If the scheduler job starts running after the sip_pvt lock is taken but before the AST_SCHED_DEL_UNREF call executes, the job blocks in send_provisional_keepalive_full, waiting to acquire the sip_pvt lock.

Meanwhile, AST_SCHED_DEL_UNREF waits for a condition signal indicating that the running scheduler job has finished. Neither side can make progress: a deadlock.
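The interaction can be reproduced outside Asterisk. Below is a minimal standalone sketch of the same pattern in plain pthreads (illustrative only; the mutexes and condition stand in for the sip_pvt lock and the scheduler's "task finished" signal, and the program deadlocks by design):

{noformat}
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pvt_lock = PTHREAD_MUTEX_INITIALIZER;   /* stands in for the sip_pvt lock */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER; /* guards the "task finished" flag */
static pthread_cond_t task_done = PTHREAD_COND_INITIALIZER;
static int task_running = 1;

/* The scheduler thread, playing send_provisional_keepalive_full. */
static void *scheduler_thread(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&pvt_lock); /* blocks forever: main() holds it */
	/* ... would send the keepalive here ... */
	pthread_mutex_unlock(&pvt_lock);

	pthread_mutex_lock(&sched_lock);
	task_running = 0;
	pthread_cond_signal(&task_done); /* never reached */
	pthread_mutex_unlock(&sched_lock);
	return NULL;
}

/* The channel thread, playing update_provisional_keepalive. */
int main(void)
{
	pthread_t tid;

	pthread_mutex_lock(&pvt_lock); /* sip_pvt is locked */
	pthread_create(&tid, NULL, scheduler_thread, NULL);

	/* Like AST_SCHED_DEL_UNREF: the task is running, so wait for it
	 * to finish -- while still holding pvt_lock. Deadlock. */
	pthread_mutex_lock(&sched_lock);
	while (task_running) {
		pthread_cond_wait(&task_done, &sched_lock);
	}
	pthread_mutex_unlock(&sched_lock);

	pthread_mutex_unlock(&pvt_lock);
	puts("never printed");
	return 0;
}
{noformat}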


Comments:By: Arnd Schmitter (arnd) 2015-04-28 05:27:56.204-0500

The information below is from a production system without debugging enabled. I have extracted the following information so far:

[edit by Rusty - removing debug trace from comment as per the guidelines. Always better to attach this sort of thing as a .txt file for readability and accessibility.]

By: Rusty Newton (rnewton) 2015-04-30 19:02:38.070-0500

I know it is probably hard for you to get a new trace with DEBUG_THREADS, but can you recompile with DONT_OPTIMIZE and BETTER_BACKTRACES?

By: Arnd Schmitter (arnd) 2015-05-04 02:58:03.622-0500

I will try to get a better trace, but the machine on which the deadlock appeared is a production system under quite a high load, and it has been running for days without any problem.

But in my opinion the case is quite clear, and as far as I can see there are two possible solutions:

# Release the sip_pvt lock before deleting old scheduled jobs in update_provisional_keepalive, and reacquire the lock afterwards.
# Use a trylock in send_provisional_keepalive_full and, if it fails, return an appropriate code so that the job gets rescheduled (see the sketch below).
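A hedged sketch of what the second option might look like (illustrative only: the callback signature is simplified relative to chan_sip, sip_pvt_trylock is assumed to be a non-blocking variant of sip_pvt_lock, and a non-zero return from an Asterisk scheduler callback requests that the job be run again):

{noformat}
/* Sketch of solution 2 -- back off instead of blocking in the callback. */
static int send_provisional_keepalive_full(const void *data)
{
	struct sip_pvt *pvt = (struct sip_pvt *) data;

	if (sip_pvt_trylock(pvt)) {
		/* The pvt is held elsewhere -- possibly by a thread waiting in
		 * AST_SCHED_DEL_UNREF for this very job to finish. Reschedule
		 * instead of blocking here. */
		return 1; /* non-zero: run again later */
	}

	/* ... send the provisional keepalive while holding the lock ... */

	sip_pvt_unlock(pvt);
	return 0; /* done; do not reschedule */
}
{noformat}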


By: Etienne Lessard (hexanol) 2015-10-09 12:42:19.350-0500

Hello,

We are currently experiencing a very similar problem with Asterisk 13.5.0.

I'm attaching a backtrace and the output of "core show locks". When the freeze was detected, the process was killed with an ABRT signal, which is why the backtrace shows the process terminating with signal 6. This Asterisk was compiled with DONT_OPTIMIZE and DEBUG_THREADS, but without BETTER_BACKTRACES, sorry.

There is a bit of noise in these two files since they were taken from a "load test" system. The interesting threads are thread 202 (0xa7d4bb70) and thread 504 (0xb47bbb70): thread 504 is executing ast_sched_runq and is trying to lock a sip_pvt, but thread 202 already holds a lock on this sip_pvt and is trying to call ast_sched_del, which never returns since it is waiting on a condition variable that will be signaled only when thread 504 returns.

Thank you.



By: Morten Tryfoss (mtryfoss) 2015-10-16 08:41:14.582-0500

We also seem to have a similar issue in 13.5.0 where chan_sip locks up.

I don't have any better backtraces than what's already uploaded.

By: Jirka Hlavacek (jirka) 2015-10-30 09:13:07.409-0500

Hello,
We have reproduced the deadlock with Asterisk 13.6.0; it is the same problem as reported by Etienne Lessard.

Many thanks for any help.

By: Matt Jordan (mjordan) 2015-11-24 10:56:33.691-0600

The inversion is as [~hexanol] pointed out: when a scheduled task is executing, that task isn't holding any locks. However, anyone attempting to delete the running task may block that task if they themselves are currently holding any locks.

This appears to have been caused by a combination of two commits:

{noformat}
commit 7c4ed8cc897ace58dd6b2534589fb5902eebdb14
Author: Mark Michelson <mmichelson@digium.com>
Date:   Tue Aug 26 22:13:57 2014 +0000

   Fix race condition in the scheduler when deleting a running entry.
   
   When scheduled tasks run, they are removed from the heap (or hashtab).
   When a scheduled task is deleted, if the task can't be found in the
   heap (or hashtab), an assertion is triggered. If DO_CRASH is enabled,
   this assertion causes a crash.
   
   The problem is, sometimes it just so happens that someone attempts
   to delete a scheduled task at the time that it is running, leading
   to a crash. This change corrects the issue by tracking which task
   is currently running. If that task is attempted to be deleted,
   then we mark the task, and then wait for the task to complete.
   This way, we can be sure to coordinate task deletion and memory
   freeing.
   
   ASTERISK-24212
   Reported by Matt Jordan
   
   Review: https://reviewboard.asterisk.org/r/3927
   ........
   
   Merged revisions 422070 from http://svn.asterisk.org/svn/asterisk/branches/12
   
   
   git-svn-id: https://origsvn.digium.com/svn/asterisk/branches/13@422071 65c4cc65-6c06-0410-ace0-fbb531ad65f3
{noformat}

And:

{noformat}
commit 85e1cb51b21c2d194647e16b81b5a1344d2ff911
Author: Joshua Colp <jcolp@digium.com>
Date:   Fri Aug 28 21:57:14 2015 -0300

   sched: ast_sched_del may return prematurely due to spurious wakeup
   
   When deleting a scheduled item if the item in question is currently
   executing the ast_sched_del function waits until it has completed.
   This is accomplished using ast_cond_wait. Unfortunately the
   ast_cond_wait function can suffer from spurious wakeups so the
   predicate needs to be checked after it returns to make sure it has
   really woken up as a result of being signaled.
   
   This change adds a loop around the ast_cond_wait to make sure that
   it only exits when the executing task has really completed.
   
   ASTERISK-25355 #close
   
   Change-Id: I51198270eb0b637c956c61aa409f46283432be61
{noformat}

In the first, we introduced the notion of waiting until a running task completes before attempting to delete/remove it from the scheduler. That didn't actually cause the deadlock, however, as {{ast_cond_wait}} wasn't being guarded with a predicate. As such, weird slowdowns may have occurred, but eventually {{ast_cond_wait}} would *probably* have returned spuriously, letting things flow along (although the delete would have been in a precarious position, as the running task would still be ... running).

The second commit causes the deadlock by adding the predicate. Since the predicate will never be true - as the deleter of the scheduled task is holding a lock that task is waiting on - we're deadlocked.
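Reduced to generic pthreads, the difference between the two commits looks roughly like this (illustrative only; not the actual sched.c source):

{noformat}
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static int task_running;

/* Before the second commit: a bare wait. A spurious wakeup can let the
 * deleter continue even though the task is still running. */
static void wait_for_task_unguarded(void)
{
	pthread_mutex_lock(&lock);
	if (task_running) {
		pthread_cond_wait(&done, &lock); /* may return spuriously */
	}
	pthread_mutex_unlock(&lock);
}

/* After the second commit: the textbook predicate loop. Correct in
 * isolation -- but if the running task can never finish because it is
 * blocked on a lock this caller holds, the loop never exits. That is
 * the deadlock reported here. */
static void wait_for_task_guarded(void)
{
	pthread_mutex_lock(&lock);
	while (task_running) {
		pthread_cond_wait(&done, &lock);
	}
	pthread_mutex_unlock(&lock);
}
{noformat}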

There are at least two ways to solve this:
* Use a time-bounded wait condition instead of {{ast_cond_wait}}. I don't really love this option, however, as it will result in deleting a task being in an unknown state if the time limit passes - plus, we're liable to block the entire SIP stack while we wait for that timer to expire.
* Fix the priority inversions. Generally, that means callers of {{ast_sched_del}} *CANNOT* *HOLD* *ANY* *LOCKS*. That patch is mostly tedious to write, and should probably result in a macro that deals with the fun of ref bump/unlock/re-lock that will be necessary (a sketch follows this list). Callers will also have to be aware that they may hold the only reference to the pvt/peer when it completes, which can result in some fun crashiness if we aren't careful.
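A hedged sketch of what such a macro might look like (illustrative assumptions: the pvt is an ao2 object so ao2_ref manages its lifetime, and AST_SCHED_DEL_UNREF's third argument is the code that drops the scheduler's reference, as in chan_sip; this is not the committed patch):

{noformat}
/* Sketch only. Drop the pvt lock around the delete so a concurrently
 * executing task can acquire it and finish, letting ast_sched_del return. */
#define SIP_SCHED_DEL_UNLOCKED(sched, id, pvt) \
	do { \
		ao2_ref(pvt, +1);      /* guard: we may hold the last reference */ \
		sip_pvt_unlock(pvt);   /* let the running task take the lock */ \
		AST_SCHED_DEL_UNREF(sched, id, ao2_ref(pvt, -1)); \
		sip_pvt_lock(pvt);     /* re-take the lock... */ \
		ao2_ref(pvt, -1);      /* ...and drop the guard reference */ \
	} while (0)
{noformat}

Note that after the re-lock the pvt state may have changed while it was unlocked, so callers have to re-validate whatever they were doing.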

By: ibercom (ibercom) 2016-03-16 05:05:19.258-0500

[~rmudgett]: I have a comment about Change-Id: I90f04208a089f95488a2460185a8dbc3f6acca12

Asterisk 11: chan_sip.c line 22656
In most functions, the sip_pvt_unlock comes after the channel unlock.

Is the similarity relevant?

By: Richard Mudgett (rmudgett) 2016-03-16 10:43:33.897-0500

[~ibercom] Unlocking order does not affect deadlocks and is rarely important for other reasons.  Unlocking order is mostly a convenience to the particular function.  That particular patch (https://gerrit.asterisk.org/#/c/2386/) is not relevant to this issue, and v11 is not affected by the deadlock reported here.
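For illustration of that point (generic pthreads, not Asterisk code): deadlock risk comes from lock acquisition order, not release order, since releasing a lock never blocks:

{noformat}
#include <pthread.h>

static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t pvt_lock = PTHREAD_MUTEX_INITIALIZER;

void critical_section(void)
{
	/* Acquisition order matters: every thread must take channel_lock
	 * before pvt_lock, or two threads can deadlock against each other. */
	pthread_mutex_lock(&channel_lock);
	pthread_mutex_lock(&pvt_lock);

	/* ... work on both objects ... */

	/* Release order is a matter of style: neither unlock can block,
	 * so swapping these two lines changes nothing. */
	pthread_mutex_unlock(&pvt_lock);
	pthread_mutex_unlock(&channel_lock);
}
{noformat}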