[Home]

Summary:ASTERISK-22911: [patch]Asterisk fails to resume WebRTC call from hold
Reporter:Vytis Valentinavičius (xytis)Labels:
Date Opened:2013-11-26 07:35:24.000-0600Date Closed:2014-03-04 10:49:25.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip Resources/res_rtp_asterisk
Versions:12.0.0-beta2 Frequency of
Occurrence
Constant
Related
Issues:
is related toASTERISK-23213 SIP over WS: Audio problems when upgrading to 11.8 from 11.7 with endpoints behind NAT
is related toASTERISK-23145 Sporadic one way audio between a SIP hard phone and a SIPML5 browser client on LAN
is related toASTERISK-23337 One way audio issues with WebRTC - PJNATH limits number of ICE candidates
Environment:Server: asterisk:svn r403157 --with-srtp --with-pjproject pjproject:git asterisk/pjproject HEAD --with-external-srtp --enable-shared CFLAGS="-DNDEBUG" Ubuntu Precise 64, 3.2.0-23-generic. Client: Chrome 33.0.1720.0 canary http://sipml5.org/call.htm?svn=203Attachments:( 0) capture_asterisk_211_client_15.pcap.gz
( 1) issue_22911.full.log
( 2) issue_22911.full.log
( 3) issue_22911.full.pjsip.log
( 4) works_on_my_machine.patch
Description:When in call between soft-phone and WebRTC resuming from holden call does not resume the sound.
Notices:
1. Hold and resume must be made by WebRTC client. Tested with sipml5.org demo.
2. Wireshark dump showed that after call is resumed all UDP packets do not reach WebRTC client due to wrong destination port.
3. Chrome stops active channel when issued hold command and creates new channel on resume. Channel is bound to new port each time.
4. Asterisk spits out such verbose errors:
Before connection:
[Nov 26 13:27:01] ERROR[2088]: pjsip:0 <?>: icess0x7fbe000 ..Error sending STUN request: Invalid argument
Later in call (not related to Hold/Resume sequence):
[Nov 26 13:28:06] WARNING[2177][C-00000000]: res_srtp.c:406 ast_srtp_unprotect: SRTP unprotect failed with: authentication failure 10
Comments:By: Matt Jordan (mjordan) 2013-11-26 08:57:13.242-0600

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Please also load the res_pjsip_logger module and enable SIP trace debugging (pjsip set logger on). Please also attach your pjsip.conf. Thanks!

By: Vytis Valentinavičius (xytis) 2013-11-26 15:15:27.956-0600

I created a virtual environment for this specific issue:
https://github.com/xytis/asterisk-22911

Bug is reproducible in this environment.

Notable features:
All configs are based on default ones from svn.
Only sip.conf, extensions.conf and http.conf are modified.
Default pjsip.conf is effectively empty.

VM mounts ./source and ./cache directories internally and uses those directories for building.
This was done to ease the debugging and tracing the bug.

By: Vytis Valentinavičius (xytis) 2013-11-26 15:20:04.601-0600

Using the supplied environment bug can be reproduced as follows:

1. clone git repo
2. alter Vagrantfile to contain a valid (not used) ip in hosts subnet, that could be used for asterisk instance.
3. `vagrant up`
4. use WebRTC client to connect to asterisk with user 1111, pass 1111.
5. use soft-phone to connect to asterisk with user 2001, pass 2001.
6. make a call 1111<->2001.
7. issue hold from WebRTC client, resume the call.
8. sound should no longer propagate from soft-phone to WebRTC.

Hope this helps =]

By: Vytis Valentinavičius (xytis) 2013-11-27 05:39:00.242-0600

Here are the logs:
[full|https://raw.github.com/xytis/asterisk-22911/master/issue_22911.full.log]
[full with pjsip|https://raw.github.com/xytis/asterisk-22911/master/issue_22911.full.pjsip.log]

Note regarding the environment:
If returning to already initialised virtual box -- please use `vagrant ssh` to issue `sudo service asterisk restart`.
This is required because asterisk server is started BEFORE /etc/asterisk is mounted to external folder, thus server is launched with wrong configuration.

By: Matt Jordan (mjordan) 2013-11-27 08:33:07.873-0600

In general, you should attach the files to this issue. Remote repos, bug trackers, and other systems are outside of the Asterisk project's control, and may not have the data when someone looks at the issue. I've gone ahead and attached the files here.

Note that this issue is with {{chan_sip}}, not the PJSIP stack in Asterisk 12+.

By: Matt Jordan (mjordan) 2013-11-27 08:39:17.806-0600

Your log files don't illustrate the bug report.

In your first, issue_22911.full.log, two INVITE requests are received over the WS. In the first, the INVITE request is rejected due to the device not having registered. In the second, the INVITE request from the WS is accepted, but the target of the call returns a 486. This causes the call to be hung up.

Your second log file, issue_22911.full.pjsip.log, does not illustrate a call at all.

Please produce a DEBUG log illustrating the problem in the bug report. This should show the initial INVITE request over the WS and an established, bridged call. After the call is established, please show the hold taking place with restricted media via an INVITE request, followed by a subsequent INVITE request that attempts to take the call off hold.

By: Vytis Valentinavičius (xytis) 2013-11-28 00:52:28.758-0600

Added surely the correct log.

By: Vytis Valentinavičius (xytis) 2013-11-28 00:56:43.038-0600

Sorry about the log mixup. Fixed it now.
issue_22911.full.log contains described test sequence.

By: Vytis Valentinavičius (xytis) 2013-11-28 00:59:12.073-0600

The reason why I tagged PJSIP stack is because the bug seemed to originate from bad ICE resolution, which was done using pjnath.

By: Vytis Valentinavičius (xytis) 2013-11-28 09:24:01.689-0600

Just tested the same configuration on debian wheezy-backports. Bug exists.

By: Matt Jordan (mjordan) 2013-11-29 12:29:11.101-0600

Just so we know how to try and reproduce this, you are running a version of Asterisk 12, using {{chan_sip}} with a websocket connection on one end of the bridge and a standard SIP over UDP channel on the other end of the bridge. The bridge chosen is a {{simple_bridge}}, as one channel is encrypting media using SDES-SRTP while the other is unencrypted.

From a signalling perspective, there doesn't appear to be anything wrong here: Asterisk properly sends {{recvonly}} when the SIP over WS channel indicates {{sendonly}} and responds with {{sendrecv}} when the channel removes the Hold.

There are no unprotect errors in the log you supplied. I'm also not seeing any RTP errors or other indications of any issues in the log file.

You stated earlier that wireshark showed that Asterisk was sending RTP to the wrong port; it is clear that Chrome is changing the port number on the re-INVITE that initiates the hold. Can you attach a pcap demonstrating this?

By: Vytis Valentinavičius (xytis) 2013-12-01 04:25:09.850-0600

Attached pcap file.
It contains unfiltered traffic between asterisk server[10.15.38.211] and webrtc client[10.15.38.15].
Note that soft-phone is launched on the same client[10.15.38.15].

By: Vytis Valentinavičius (xytis) 2013-12-01 04:32:06.592-0600

> Just so we know how to try and reproduce this, you are running a version of Asterisk 12, using chan_sip with a websocket connection on one end of the bridge and a standard SIP over UDP channel on the other end of the bridge. The bridge chosen is a simple_bridge, as one channel is encrypting media using SDES-SRTP while the other is unencrypted.

Absolutely correct.

> You stated earlier that wireshark showed that Asterisk was sending RTP to the wrong port; it is clear that Chrome is changing the port number on the re-INVITE that initiates the hold. Can you attach a pcap demonstrating this?

I checked the 'chrome://webrtc-internals/' regarding the last notice. It surely shows the changed port. Bottom of attached pcap shows UDP traffic with ICMP 'Port unreachable' responses for each server->client packet.

By: Vytis Valentinavičius (xytis) 2013-12-01 04:41:51.297-0600

Another (perhaps irrelevant) notice:
We had issues with asterisk core crashing when WebRTC issued hold.
Crashes occurred because we had compiled PJPROJECT without NDEBUG flag, thus internal macro PJ_ASSERT_RETURN actually asserted out the core.
Check that failed due to WebRTC hold is:
   PJ_ASSERT_RETURN(rcand_cnt + ice->rcand_cnt <= PJ_ICE_MAX_CAND, PJ_ETOOMANY);
Located at pjnath/ice_session.c:pj_ice_sess_create_check_list()

Perhaps it is related to the whole wrong port issue?

By: Vytis Valentinavičius (xytis) 2013-12-04 14:20:52.117-0600

After a while I think I have found a solution:
During second (and consequent) SIP INVITE, which negotiates new port for audio channel ICE session remote candidates are not updated. This had to be fixed by removing flag which prevented re-pairing of remote candidates and by clearing RTP session container, which held SIP INVITE announced ICE candidates after pairing.

I can submit a patch for these changes. I specifically targeted 11.5.1, since it is my priority target.
Where should I submit this patch?

By: Vytis Valentinavičius (xytis) 2013-12-05 10:03:54.220-0600

-- Attached without any licence. Consider this given away with any licence that is most acceptable to You.

By: Matt Jordan (mjordan) 2013-12-05 10:38:04.418-0600

Hey Vytis -

I'm working on a more in depth reply to your -dev post, but I'll comment here on code contributions. Unfortunately, we can't accept patches that aren't submitted back with a license contributor agreement, so I've removed the file you've attached (sorry :-(). If you can, please sign a license contributor agreement (there's a button at the top of the page) and re-attach it.

Thanks!

By: Igor Skomorokh (ISkomorokh) 2014-01-29 06:20:08.753-0600

Matt Jordan, will the fix for this issue be applied to 11.x branch?  The same problem exists in Asterisk 11.7.

By: Joshua C. Colp (jcolp) 2014-01-29 06:24:07.857-0600

The change for this was placed into the Asterisk 11, 12, and trunk branches.

By: Igor Skomorokh (ISkomorokh) 2014-01-29 06:37:25.072-0600

So it is going to be in 11.8 release, isn't it?

By: Joshua C. Colp (jcolp) 2014-01-29 06:40:38.261-0600

Yes. It's in 11.8.0-rc1 already.

By: Igor Skomorokh (ISkomorokh) 2014-02-04 02:06:45.663-0600

Hi! I've tried 11.8.0-rc1 and can confirm, that now hold/unhold works properly but not in all cases. For example, if I initiate a call and I initiate hold and then unhold - everything works fine, but when I initiate a call and the other side accepts it an initiate hold and unhold then nothing works and Asterisk logs:

DEBUG[24993][C-00000006]: res_rtp_asterisk.c:2682 ast_rtp_write: No remote address on RTP instance '0x7f1420018308' so dropping frame
....
[Feb  4 10:04:58] DEBUG[24993][C-00000006]: channel.c:8070 ast_channel_bridge: Bridge stops bridging channels SIP/1-0000000c and SIP/2-0000000d
[Feb  4 10:04:58] DEBUG[24993][C-00000006]: res_rtp_asterisk.c:2607 ast_rtp_raw_write: Starting RTCP transmission on RTP instance '0x7f1420018308'
[Feb  4 10:04:58] DEBUG[24992][C-00000006]: chan_sip.c:4568 __sip_ack: Stopping retransmission on '4ba1fb9b3575d1247dc95c6d29a375cc@172.19.2.10:5060' of Response 64144: Match Not Found
[Feb  4 10:04:58] WARNING[24993][C-00000006]: res_srtp.c:406 ast_srtp_unprotect: SRTP unprotect failed with: authentication failure 10
[Feb  4 10:05:00] WARNING[24993][C-00000006]: res_srtp.c:406 ast_srtp_unprotect: SRTP unprotect failed with: authentication failure 110


This might help: https://code.google.com/p/sipml5/issues/detail?id=44#c23

By: Jonathan Rose (jrose) 2014-02-18 11:31:19.250-0600

Hey Vytis, the patch on this issue appears to have caused a regression and I'm hoping that you can help to test a replacement patch that I think will also address your issue while not causing the regression. If you can test it out, it's on ASTERISK-23213
and is called ice_candidates_excessive_clearing.patch

Or if you'd prefer a direct link: https://issues.asterisk.org/jira/secure/attachment/49328/ice_candidates_excessive_clearing.patch
It applies directly to 11 from svn right now, but I think it will also apply cleanly to 11.7.0-rc1.

By: Vytis Valentinavičius (xytis) 2014-02-18 12:06:16.526-0600

Hey Jonathan.
I have two points to make:
1. First patch DID NOT fix the problem (my original patch could not be applied since it hacked pjnath library)
2. Your patch may fail to fix this particular bug also:
There are several reasons why asterisk does not resume audio after hold. This time lets fix only on one of them: Asterisk does not change the media destination address.
This happens because Asterisk always 'append' candidates list in ice session. So after a new REINVITE is received old candidates still are at the top of ice resolve list (and are actually confirmed to work).
Thus any new packets that are dispatched to ice resolved address go to wrong (outdated) destination.
As far as I see Your patch just 'works around' the assertion in pjnath library and does not actually alter existing candidates in it. This may be enough IF webrtc client never changes it's destination address (that is client is not running on google chrome =]).
Also I am not aware about thread related issues, but
{code}ao2_container_count(rtp->remote_candidates) + rtp->ice->rcand_cnt > PJ_ICE_MAX_CAND{code}
is practically the same check that is done in pjnath, which (convieniently) is encapsulated in PJ_ASSERT_RETURN. So
{code}pj_ice_sess_create_check_list(rtp->ice, &ufrag, &passwd, ao2_container_count(rtp->remote_candidates), &candidates[0]){code}
Should never return PJ_ETOOMANY...

Maybe You are willing to alter res_rtp_asterisk engine so that an ICE session would be _actually_ recreated and started on each 'ice_start' (and 'ice_stop' should be called appropriately). Then fix proccess_sdp in chan/sip.c to do a parse step separate from ice and other logic.

=]

Either way, I know about these issues, unfortunately I have my own stuff to hack on... So any help will be highly appreciated =]

By: Jonathan Rose (jrose) 2014-02-18 12:24:25.438-0600

By the first patch, I didn't mean the one that was posted here... I meant the one that was applied in r405234.

The problem with relying on that ASSERT return is that if it triggers, Asterisk crashes. At least on my box and those of a few other developers around the office. I understand you can disable those crashes, but for the purpose of Asterisk development, we want to be PJProject assertion free whenever possible.

Also that's not the only code path that is capable of returning PJ_ETOOMANY. There is a second check later on that returns PJ_ETOOMANY without asserting.
{code}
   if (clist->count >= PJ_ICE_MAX_CHECKS) {
       pj_mutex_unlock(ice->mutex);
       return PJ_ETOOMANY;
   }
{code}
Which is why I figured I still needed to be ready to anticipate that return value and clear the candidate list if that happens. I was actually under the impression that this is what you were running into since Asterisk wasn't crashing outright at the PJNATH assertion. After rereading your comments, it seems clear that you were manually disabling the crash on assertion behavior of PJSIP, but we I don't believe we instruct users to do that normally so I don't think we can rely on that.

To be honest, I'm less concerned with fixing issues that the existing patch didn't already solve right now. This is more about fixing a regression, but I don't want to leave this issue unresolved since that would also be a regression of sorts if the problem described here was ever actually fixed.

By: Vytis Valentinavičius (xytis) 2014-02-18 12:31:52.552-0600

By the first patch I meant the same patch as You. r405234 did not fix this bug in all scenarios.

And to be fair, I have no idea in what scenario r405234 helped.

So, since the problem was never fixed and the patch introduced a regression, You can mark this bug as open/unresolved and we live happily ever after =]

By: Jonathan Rose (jrose) 2014-02-18 13:07:38.198-0600

{quote}r405234 did not fix this bug in all scenarios{quote}

What about in some scenarios? As long as the new patch covers the things that were fixed by the old patch (without breaking new things in the process), I'll at least feel a little more confident about replacing it.

By: Vytis Valentinavičius (xytis) 2014-02-18 13:49:05.674-0600

{quote}And to be fair, I have no idea in what scenario r405234 helped.{quote}

Let me think aloud:

The patch (r405234) cleared out locally stored remote candidates on create check list failure. So if we imagine that we are talking about a session with several REINVITEs, and the first INVITE created a valid ice session which resolved a path for media to travel, any subsequent requests to append the candidate list are useless since they will not be evaluated.

Now IF the first invite did not contain valid candidate pair, or all checks returned failure THEN appending the candidate list has at least some logic.

But either way, if first call to create check list succeeds, ice session is marked as started and the topmost 'if' returns early.

No, I still see no valid scenario where the patch in question fixed anything.

Maybe the person who created r405234 could explain more?

By: Kevin Harwell (kharwell) 2014-02-18 17:14:53.479-0600

Probably not much to add.  r405234 was basically the original patch on the issue refactored a bit ([re]initialized the pjproject values without having to change pjproject source).

I think the intent was that if the session check list could be created and a path found then it would always just use those candidates from then on (it had found a valid path no reason to redo).  Otherwise if it failed then on subsequent reinvites retry in hopes that it might succeed.

The patch had fixed a couple of things (albeit insufficiently), one it no longer crashed, two the call would resume after a hold/unhold.  Unfortunately, the latter was only in one direction it seems.

Apparently the assumption that remote candidates won't change on subsequent reinvites is wrong, so yeah I would agree with the idea of that list having to be "refreshed" somehow each time.  However, that probably depends on how pjsip handles things.

By: Jonathan Rose (jrose) 2014-02-27 14:03:24.441-0600

Hey Vytis, I have a new patch you could try on this code review: https://reviewboard.asterisk.org/r/3275/