ASTERISK-25645: res_rtp_asterisk: Lock inversion

[Home]

Summary: ASTERISK-25645: res_rtp_asterisk: Lock inversion

Reporter: Steve Davies (one47) Labels:

Date Opened: 2015-12-23 10:15:52.000-0600 Date Closed: 2016-09-23 14:27:55

Priority: Major Regression? Yes

Status: Closed/Complete Components: Resources/res_rtp_asterisk

Versions: Frequency of
Occurrence

Related
Issues:
is related to ASTERISK-25614 DTLS negotiation delays

Environment: Attachments: ( 0) deadlocked_threads.txt
( 1) experimental_anti_deadlock
( 2) unlock_ice_before_callback

Description: Reported by Steve Davies on asterisk-dev:

commit 5e6b1476a087407a052f007d326c504cfeefebe7
ASTERISK-25614

2 code paths which approximate the following will cause a lock-inversion deadlock:

approximate call orders are:

a)
pj_timer_heap_poll (PJ_LOCK)
ast_rtp_on_ice_complete
ast_rtp_instance_set_remote_address
remote_address_set
ast_rtp_remote_address_set
(DTLS_LOCK)
...

b)
ast_pbx...
app_dial
bridge...
read
rtp_read
...
__rtp_recvfrom
(DTLS_LOCK)
dtls_srtp_check_pending
__rtp_sendto
pj_ice_sess_send_data
(PJ_LOCK)

Comments: By: Joshua C. Colp (jcolp) 2015-12-23 10:17:26.814-0600

[~dade] Have you seen this occur in your environment at all?
By: Steve Davies (one47) 2015-12-23 10:24:22.497-0600

If it helps, in my setup I use pjproject 2.4.5 and asterisk-11.21.0-rc1 on debian wheezy.

I am not sure if it is relevant, but I am using JsSIP 0.7.10 with inband signalling and was calling from/to the same endpoint.

Looking at this, I suspect that the early-audio might be the cause of this because it will start the audio bridge (b above) early, giving the opportunity for both the ICE complete event, and the bridge read/write cycle to be running at the same time.
By: Dade Brandon (dade) 2015-12-23 13:35:00.956-0600

This should be unrelated to the patch, although more info like a full backtrace of threads would help to show otherwise.

We also use early media, and JsSIP (0.6.4 due to heavy modifications, but the version should be irrelevant for an Asterisk deadlock). We ran this patch successfully in a large production environment from Dec 10 to 17 & and have been running the updated patch (6dc4aacaed merged yesterday) since Dec 17th.

I'd be curious to see the 'thread apply all bt full' to see if there's a relation to ASTERISK-25275, since the workaround we've been using for that solved all of our previous deadlock issues.

You can't tell from the mentioned codepaths, what is holding the lock in pj_ice_sess_send_data. I'm willing to bet on it being a concurrent code path looking something like:
{noformat}
res_rtp_asterisk.c __rtp_recv_from
pjnath/ice_session.c pj_ice_sess_on_rx_pkt
- locks ((pj_ice_sess*)ice)->mutex
- calls pj_stun_session_on_rx_pkt
pjnath/stun_session.c pj_stun_session_on_rx_pkt
- locks ((pj_stun_sess*)sess)->lock
- calls check_cached_response
pjnath/stun_session.c check_cached_response
- stuck in the while loop due to t never being == sess->cached_response_list
{noformat}

sess->cached_response_list is corrupted by the pj_timer_heap_schedule in pj_stun_session_send_msg -- usually this results in a SIGSEGV in either check_cached_response_list due to null dereference (per ASTERISK-25275) or null dereference in pj_stun_session_destroy, however if the cache eviction timer executes while check_cached_response_list is in its loop, then it will never find its starting point, causing an infinite loop. From our experience, that happens roughly 1 out of 10 times (10 SIGSEGVs per 1 infinite loop)

As another indicator to whether or not I'm referencing the same problem, if you are using Chrome with JsSIP, then you'd have noticed this issue spike significantly after the Chrome M47 release on Dec 1st (or after your browser upgraded to it) since they adjusted their DTLS code (an improvement) which somehow managed to expose this issue, causing it to occur roughly 50x more often than before.

By: Steve Davies (one47) 2015-12-24 04:12:55.866-0600

Hi Dade,

- I am familiar with the SEGV / 100% CPU infinite loop bug, and this is not that bug. I added locks to on_cache_timeout() in pjproject as referenced in ASTERISK-25275 to resolve this, but none of those locks are held when this issue deadlocks.
- I can cause the deadlock 4 calls in 5, not 1 call in 10,000 like the 'cached_response_list' issue, but it does take a specific call-path (below).
- If I revert the patch in ast_rtp_remote_address_set() and suffer the potential 1-second delay instead, then the issue goes away.

I will provide the backtraces of the deadlocked threads to demonstrate the issue.

I should add that our performance/load/stress test rig can make 250,000 calls and not trip over this issue, but a single call that hair-pins from JsSIP via asterisk back to the same JsSIP instance will cause this deadlock - Perhaps the fact that the same chrome JsSIP instance handles both ends of the call causes some type if synchronisation and therefore a timing-dependent deadlock.

I should also add that like yours, our JsSIP is much hacked - The biggest hack is to replace SDP 'pranswer' with 'answer' to make the early audio work more smoothly cross browsers. It is available on github at davies147/JsSIP if that helps.
{noformat}
Sequence to reproduce here:
- JsSIP initiate call
- Dialplan calls-back to same JsSIP
- Rings okay
- JsSIP answer call (seems okay)
- JsSIP put both calls on hold
...deadlock...
{noformat}
Is libpj involved in putting the call on hold? Perhaps that is the trigger. It only happens because the front-end in our app detects the contradiction in 2-calls one speaker, and tries to resolve it with a hold.

backtraces to follow.
By: Steve Davies (one47) 2015-12-24 04:22:09.452-0600

On investigation I found 3 threads that are locking up, but that does make sense.

Thread 36 is the PJ callback on ICE complete, which holds the PJ group lock, and wants the DTLS lock.

Threads 34 and 3 are Asterisk RTP threads which call dtls_srtp_check_pending() with the DTLS lock, and IF this results in a write via pj_ice_sess_send_data(), will subsequently require the PJ group lock.

ASTERISK-25275 suggests that disabling the group lock might be a thing, but that seems to fly in the face of everything sane :-O

By: Steve Davies (one47) 2015-12-24 05:26:49.819-0600

Is the attached patch 'experimental_anti_deadlock' utterly stupid?

It tries to make sure that dtls->lock is always held when calling dtls_srtp_check_pending() and then if a call to __rtp_sendto() is necessary, it releases the lock and re-acquires it to prevent a potential deadlock.

It cures my issue, but is the locking and unlocking a bit overkill?
By: Joshua C. Colp (jcolp) 2015-12-28 10:33:52.682-0600

To clarify my original nudge of [~dade] - I know he's been working heavily in the area based on the contributions he's provided and the issues he's been a part of. I knew that he'd be a good source of insight and would know rather quickly if something was up. :)
By: Steve Davies (one47) 2016-01-04 04:16:47.266-0600

OK, I have checked this and can confirm that Dade was correct about my non-standard Asterisk-11 build.

Because I have linked asterisk 11 against pjproject-2.4.5, there is a different set of locking occurring in pjnath during the on_ice_complete callback.

In the asterisk-11 built-in version, the sequence is:
{noformat}
lock ice->mutex
unlock ice->mutex
call cb.on_ice_callback
{noformat}

In pjproject-2.4.5, the sequence is:
{noformat}
pj_group_lock ice->grp_lock
call cb.on_ice_callback
pj_group_unlock ice->grp_lock
{noformat}
(This same group lock is used in pj_ice_sess_send_data, hence the potential deadlock)

So I can 'fix' my asterisk 11 issue by reverting to the previous pjproject - I would warn that this locking sequence may still bite Asterisk-13, but I have not researched that version enough to know for sure, so YMMV.
By: Dade Brandon (dade) 2016-01-04 09:14:20.563-0600

@jcolp is this actionable for reverting 24ae124e4 and 965a0eee from 13/master?

I haven't played with how pjproject links in 13, or looked in to their group locks -- if they've indeed merged locking of ice->stun_cfg.timer_heap->lock (previous versions locked this in on_ice_complete) and ice->mutex (requested in pj_ice_sess_send_data) then this problem would feasibly affect anyone with pjproject 2.4.5 using dtls, unless the new group lock architecture is somehow cognicient that they don't need the same structures. I feel like I'm missing something though, because that seems like a pjproject bug to me -- if they're locking around "user-space" code (the cb.on_ice_callback) that may call something quite obvious, like pj_ice_sess_send_data.

By: Dade Brandon (dade) 2016-01-04 09:21:31.408-0600

Briefly looking at the code from github/asterisk/pjproject HEAD, I think the problem would stem from pj_timer_heap_schedule_w_grp_lock at ice_session.c:1261 - this is where the on_ice_complete callback is scheduled in a timer. Previously, locking would not occur around the callback's execution, this timer used to instead get scheduled by pj_timer_heap_schedule, which I think is correct- potentially someone became overzealous when creating their group lock architecture. I don't think the group lock should be used there since

Patching 2.4.5 to change the pj_timer_heap_schedule_w_grp_lock call to pj_timer_heap_schedule, in the 'on_ice_complete' function of ice_session.c should fix it

Where that leaves this particular issue then is determining whether to expect users to update their dynamically linked pjproject on 13 vs reverting the dtls negotiation delay patches. I suggest patching it on github/asterisk/pjproject, and holding 24ae124e4 and 965a0eee from 13 for a few months to give people a chance to upgrade, while leaving it on 11 as 11 is unaffected.
By: Joshua C. Colp (jcolp) 2016-01-04 09:43:57.299-0600

Any other repercussions to reverting those then? As well - if you can create reviews to revert those that would be appreciated, otherwise I will get to them in a bit.
By: Steve Davies (one47) 2016-01-04 09:53:50.168-0600

I may be talking rubbish here, but pj_timer_heap_schedule_w_grp_lock() only increases the lock ref-count, so in itself should be harmless (?am I understanding that right?). I think it is the hard-locks grabbed inside on_timer() in ice_session.c that are the problem.

The current workaround that I am testing today is to patch pjproject with a really dirty patch, as attached in 'unlock_ice_before_callback'. This seems to work so far with asterisk 11 branch and pjproject 2.4.5.

The fact that the lock ref-count is increased using pj_timer_heap_schedule_w_grp_lock() will in fact make this patch somewhat safer, even if it is ugly.

By: Etienne Lessard (hexanol) 2016-08-17 10:18:47.773-0500

Hi,

I'm in a situation where we would like to use the fixes for the DTLS negotiation delays that were done for ASTERISK-25614. We are currently using Asterisk 13.10.0 with pjproject 2.5, and we can apply patches to both these softwares if needed.

@Dade, in your 2016-01-04 comment, you said "Patching 2.4.5 to change the pj_timer_heap_schedule_w_grp_lock call to pj_timer_heap_schedule, in the 'on_ice_complete' function of ice_session.c should fix it". Have you tested it / are you using such a patch ? Or have you taken another approach to the problem ?

Another question: currently in Asterisk 13.10.0, the rtp_recvfrom function calls various dtls_* and openssl functions without holding the DTLS lock (&dtls->lock). If we were to reapply the commit https://github.com/asterisk/asterisk/commit/24ae124e4f7310cfa64c187b944b2ffc060da28d on our asterisk but without adding the ast_mutex_lock/unlock in rtp_recvfrom, would this create a worst situation than it currently is in Asterisk 13.10.0 (where no locking is done in that section) ?

Thank you
By: Dade Brandon (dade) 2016-08-17 11:05:32.025-0500

@hexanol

We aren't using Asterisk 13+, so I've neither verified that the issue in here was actually present, nor looked in to resolving what would be the pjproject issue of locking around user-space code. I personally don't think this bug exists since the person who opened this issue was using a trial and error approach to locating the commit that began causing is problem, and was using a weird use-case of making several calls "to/from the same endpoint" along with "inband signalling" (I assume he means DTMF, since inband means in the audio stream, and I doubt he means he's interpreting busy signals etc). It probably would have been a lesser evil for the most users of Asterisk-13 + DTLS to deal with his perils, rather than reverting this patch, unless it was a lot larger than the limited case he described.

You could try something else that's more of a middle-ground... reverting whatever the Asterisk-13 equivalent of commit 1ad827 is, which is the commit that introduced the delay problem. My patch was to try and resolve that more responsibly. Here is the one line that was removed from 1ad827, which caused a delay to be introduced on the dtls negotiation: (your line numbers will differ)
{quote}{noformat}
@@ -1997,8 +2029,6 @@ static int __rtp_recvfrom(struct ast_rtp_instance *instance, void *buf, size_t s
}

#ifdef HAVE_OPENSSL_SRTP
{noformat}{color:red}{noformat}- dtls_srtp_check_pending(instance, rtp, rtcp);{noformat}{color}{noformat}

/* If this is an SSL packet pass it to OpenSSL for processing. RFC section for first byte value:
* https://tools.ietf.org/html/rfc5764#section-5.1.2 */

{noformat}{quote}

Putting that back in is still the wrong way to fix this. The problem is that the DTLS Hello from the client isn't responded to by Asterisk due to the client's remote IP address not being stored on an internal structure that dtls_srtp_check_pending is dependent on existing, so it buffers the response rather than sending it immediately. My patch made it flush the buffer when the remote IP is set, however this previous code worked because the next UDP packet received on the same port (ie stun or rtp) would trigger the buffer flush. The reason the connection delay is consistent, is because the client will re-try the ssl handshake if no response is received after a certain amount of time, which for Chrome happens after one second.
By: Stian Halseth (halseth) 2016-09-16 03:25:45.433-0500

I will provide some additional info about this issue as we currently depend on the reverted patch from ASTERISK-25614.

Because the patch from ASTERISK-25614 has been was reverted , this problem is still present in current Asterisk 13.11.2, downloaded and compiled from asterisk.org today.

I would really like someone to look into this and verify whether this patch will cause locking issues with current version of pjproject on current versions of Asterisk 13.

If not, I believe this patch should be reintroduced, at least provide some documentation about the issue, and compatibility requirements for pjproject versions.

When using the latest version of Chrome , 53.0.2785.113 (64-bit), we're experiencing the delay of audio on incoming calls to WebRTC users without this patch.

Happens about 1 in 3, perhaps 1 in 5 times when receiving incoming calls with WebRTC. Issue not present on outgoing calls.
It's a big enough deal braker that we cannot use current 13 LTS og 13 certified in production without manual patching.
I'm not providing too much detail on the problem itself, as it's seems like it's a known issue.

Same issue in current Asterisk 13 certified.

We have fixed this by manually applying the patch https://code.asterisk.org/code/rdiff/asterisk?csid=965a0eee46d24321f74c244e23c5a5f45e67e12b&u&N

We will be conducting extensive testing on Asterisk 13.11.2, with the patch applied.

Hopefully we can determine that it's safe to install this in production, but we would really appreciate some input from someone with extensive knowledge of the impatch of the patch, and the potensial locking issue.

By: Steve Davies (one47) 2016-09-16 04:08:03.885-0500

I have recently been running my local builds with ASTERISK-25614 applied. On 6/Jan I commented (above) about a hackish patch I was applying to pjproject-2.4.5 at the same time, which I believe may in fact be correct. It ensures that the lock inversion does not occur. I've had no obvious problems with it.

--- pjproject-2.4.5.orig/pjnath/src/pjnath/ice_session.c
+++ pjproject-2.4.5/pjnath/src/pjnath/ice_session.c
@@ -1157,9 +1157,11 @@ static void on_timer(pj_timer_heap_t *th
ice_status = ice->ice_status;
on_ice_complete = ice->cb.on_ice_complete;

+ pj_grp_lock_release(ice->grp_lock);
/* Notify app about ICE completion*/
if (on_ice_complete)
(*on_ice_complete)(ice, ice_status);
+ return;
}
break;
case TIMER_START_NOMINATED_CHECK:

By: Stian Halseth (halseth) 2016-09-16 04:34:44.341-0500

Thank you for your response, Steve.

We're currently using pjproject-2.5.5, configured asterisk-13.11.2 with bundled pjproject (--with-pjproject-bundled).

From what I can tell your patch could also be applied to pjproject-2.5.5 as the function on_timer(pj_timer_heap_t *th, pj_timer_entry *te) is unchanged between pjproject-2.4.5 and pjproject-2.5.5.

I will give it a try!

By: Dade Brandon (dade) 2016-09-18 14:15:03.586-0500

@one47
Hi Steve,

If that patch is working, is there any chance of submitting it to pjproject, or maybe in to the bundled pjproject? (I think @jcolp could confirm if we can patch the bundled pjproject directly in the asterisk repository, or if alternatively it has to go through github.com/asterisk/pjproject -- I wasn't even aware that Asterisk 13 still had a bundled copy of pjproject.)

The pjproject bundled with asterisk-11, which doesn't use group locks, looks like this:

```
/* Release mutex in case app destroy us in the callback */
ice_status = ice->ice_status;
on_ice_complete = ice->cb.on_ice_complete;
has_mutex = PJ_FALSE;
pj_mutex_unlock(ice->mutex);

/* Notify app about ICE completion*/
if (on_ice_complete)
(*on_ice_complete)(ice, ice_status);
```

lower down in this function, it does not re-acquire ice->mutex, so I imagine that your patch, which released but does not re-acquire, is also correct, although I don't fully understand the differences between the group lock ice->mutex lock. Before looking at the older pjproject code, I assumed that your suggested pjproject patch should have been first checking to see if the lock had already been acquired before releasing, and then also re-acquiring after the on_ice_complete callback (which may still be the case, but I doubt it..?)
By: Joshua C. Colp (jcolp) 2016-09-18 14:30:40.481-0500

All changes should be upstreamed. Patches only exist if a release has not yet been made of PJSIP which includes them, or if we absolutely have to maintain an external patch.