[Home]

Summary:ASTERISK-25614: DTLS negotiation delays
Reporter:Dade Brandon (dade)Labels:
Date Opened:2015-12-07 19:03:47.000-0600Date Closed:2016-01-08 12:05:29.000-0600
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:
Versions:11.20.0 Frequency of
Occurrence
Constant
Related
Issues:
is duplicated byASTERISK-25467 chan_sip/webrtc Asterisk + Chrome M47 consistent 0.9s ice handshake delay since commit 1ad827
is related toASTERISK-25645 res_rtp_asterisk: Lock inversion
Environment:Attachments:( 0) ASTERISK-25614_patch.txt
Description:Chrome M47 and higher was experiencing a *consistent* 0.9s delay bridging audio when being a webrtc peer with Asterisk.

Chrome M47 was released Dec 1st 2015

We asked the Chrome team back when M47 was beta and they suggested that they've improved their SSL negotiation code in M47, and that Asterisk was likely not yet listening at the time that they sent the handshake.  [External link to Chrome issue 536209|https://code.google.com/p/chromium/issues/detail?id=536209&thanks=536209&ts=1443219585]

We noticed that before Gerrit commit 1ad827, this issue wasn't present.

The particular line that changed in 1ad827, which when reversed corrects the issue, is in __rtp_recvfrom;  a dtls_srtp_check_pending was removed, which was happening on every call to dtls_srtp_check_pending.

Re-adding a constant dtls_srtp_check_pending resolved the delay issue permanently.  Now that M47 is live, I noticed this has greatly increased our crashes per day (up to 40 crashes today across 50 servers, having processed only 20 million calls WebRTC calls.

Digging in to this, I found that the reason the added dtls_srtp_check_pending resolved the issue, is because when we receive the handshake (*in == 22), there is no remote_address set -- the ast_sockaddr_isnull check in dtls_srtp_check_pending causes a return, keeping the response to the handshake in the write BIO.  Chrome no longer sends unnecessary repetitive handshakes (their negotiation code before M47 was quite unorganized).  This means that dtls_srtp_check_pending isn't called again until 0.9 seconds later, when Chrome times out the first handshake and sends another.  The response Asterisk is providing Chrome at this point, is actually the response to the first one!

remote_addr is set shortly after the handshake is received, when ice candidates are processed by a separate thread.

I've created a patch which triggers dtls_srtp_check_pending to send on ast_rtp_remote_address_set.

This patch causes the response to send from a thread other than the one which __rtp_recvfrom is running from.  I've added locks to the dtls mutex within __rtp_recvfrom, as I believe that the unlocked access to the BIO and DTLS structures when ice candidate negotiation happens concurrently with a filled write bio sending through dtls_srtp_check_pending, was causing extremely inconsistent crashes (they also seemed to be more significant for customers with higher jitter)

Further, in the triggered check in ast_rtp_remote_address_set, I took the care to only perform the triggered check if the DTLS structure is set to passive - as I confirmed through added debug logs, that for this issue, the code under the comment in __rtp_recvfrom "If we don't yet know if we are active or passive and we receive a packet... we are obviously passive" - has definitely executed when this issue occurs.

I'm listing this as a regression since we identified that the problem was introduced in a particular commit, however I believe that this solution is much better than a revert, based on the testing I described.
Comments:By: Asterisk Team (asteriskteam) 2015-12-07 19:03:48.603-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Dade Brandon (dade) 2015-12-07 19:14:27.166-0600

Patch attached.  My Contributor License Agreement is pending, so I can't yet openId in to Gerrit to submit.

By: Rusty Newton (rnewton) 2015-12-08 16:47:10.119-0600

You cannot attach a patch as contribution until your license agreement is accepted. It usually takes less than 24 or 48 hours.

I'll put this issue in Waiting on Feedback. When your agreement is accepted, attach the patch as a code contribution and then put it up on Gerrit. Thanks!

By: Dade Brandon (dade) 2015-12-14 14:26:07.874-0600

Patch submitted to Gerrit for review & accessible under https://gerrit.asterisk.org/#/c/1813/1

By: Dade Brandon (dade) 2015-12-17 18:50:37.439-0600

The patch is already merged, except I have updated the code review with some questions regarding an edge case that is not covered by the patch & of which I have a new verified patch for.

By: Dade Brandon (dade) 2015-12-17 22:10:33.619-0600

Additional patch submitted to Gerrit for review & accessible under https://gerrit.asterisk.org/#/c/1838/1

Once merged, this can be closed as resolved.  

By: Richard Mudgett (rmudgett) 2016-01-08 12:05:29.714-0600

All patches are merged.