[Home]

Summary:ASTERISK-28282: AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
Reporter:Walter Doekes (wdoekes)Labels:patch
Date Opened:2019-02-11 09:18:35.000-0600Date Closed:2019-07-19 08:43:56
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:13.24.1 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Attachments:( 0) ASTERISK-28282_improving_sip_show_sched.patch
( 1) ASTERISK-28282_no_sip_poke_peer_now_if_no_qualify.patch
( 2) ASTERISK-28282_only_unref_if_ast_sched_del_succeeded.patch
( 3) ASTERISK-28282_report_deadlock_but_continue.patch
( 4) ASTERISK-28282_reproducing_sched_unref_self_deadlock.patch
( 5) ASTERISK-28282_suggest_locking_less_in_mixmonitor_b.patch
Description:Hi,

since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.

In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.

Example 1:

- thread 1: add sip_poke_peer_now schedule
- thread 2: del sip_poke_peer_now schedule

Example 2:

- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- thread 3: attempt to delete sip_poke_peer_now schedule, wait
- thread 2: callback completes
- thread 3: the delete finishes

However, under certain conditions of heavy load, the following situation can occur:

- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)

This can happen because of this macro:

{code}
#define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
...
               id = ast_sched_add_variable(sched, when, callback, data, variable); \
...
{code}
Which is used like this:
{code}
                       AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
                               0, /* Poke the peer ASAP */
                               sip_poke_peer_now, peer,
                               sip_unref_peer(_data, "removing poke peer ref"),
                               sip_unref_peer(peer, "removing poke peer ref"),
                               sip_ref_peer(peer, "adding poke peer ref"));
{code}
The scheduler gets tasked with this:
{code}
static int sip_poke_peer_now(const void *data)
{
       struct sip_peer *peer = (struct sip_peer *) data;

       peer->pokeexpire = -1;
       sip_poke_peer(peer, 0);
{code}
Which ends up here:
{code}
static int sip_poke_peer(struct sip_peer *peer, int force)
...
               AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
                               sip_unref_peer(peer, "removing poke peer ref"));
{code}
On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.

I.e.:

- thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
- thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
- thread 1: {{peer->pokeexpire = ID123}} (from macro)
- thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}

This results in a deadlock here:

{code}
               /* Wait for executing task to complete so that caller of ast_sched_del() does not
                * free memory out from under the task.
                */
               while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
                       ast_cond_wait(&s->cond, &con->lock);
               }
{code}

We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.

*Fixes:*

- Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
- An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.

But there should be other fixes we can do..

_Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._
Comments:By: Asterisk Team (asteriskteam) 2019-02-11 09:18:36.320-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: George Joseph (gjoseph) 2019-02-11 09:38:33.692-0600

Will you be submitting the appropriate patches to gerrit?

By: Walter Doekes (wdoekes) 2019-02-11 10:12:12.959-0600

Once I know what the appropriate ones are. Gathering some stuff here now first.

By: Walter Doekes (wdoekes) 2019-02-28 03:47:44.401-0600

I'm hoping you're not waiting on something from me before reviewing https://gerrit.asterisk.org/c/asterisk/+/10991 .. ?

For this particular ticket, that changeset should be right fix (I think).

I'll gladly discuss the other attached patches, which I think could be included as well, but do not solve the core of the issue.

By: Friendly Automation (friendly-automation) 2019-07-19 08:43:59.475-0500

Change 11577 merged by Friendly Automation:
sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same thread

[https://gerrit.asterisk.org/c/asterisk/+/11577|https://gerrit.asterisk.org/c/asterisk/+/11577]

By: Friendly Automation (friendly-automation) 2019-07-19 08:48:45.375-0500

Change 11578 merged by George Joseph:
sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same thread

[https://gerrit.asterisk.org/c/asterisk/+/11578|https://gerrit.asterisk.org/c/asterisk/+/11578]

By: Friendly Automation (friendly-automation) 2019-07-19 08:48:52.496-0500

Change 10991 merged by George Joseph:
sched: Don't allow ast_sched_del to deadlock ast_sched_runq from same thread

[https://gerrit.asterisk.org/c/asterisk/+/10991|https://gerrit.asterisk.org/c/asterisk/+/10991]