[Home]

Summary:ASTERISK-28828: NOTIFY sequence upsets MS Teams SIP trunk
Reporter:Carlos Mendioroz (tron)Labels:
Date Opened:2020-04-14 09:17:27Date Closed:
Priority:MinorRegression?Yes
Status:Open/NewComponents:Resources/res_pjsip_refer
Versions:17.3.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Ubuntu 18.04 install from sourcesAttachments:
Description:Resuming an on hold call from MS Teams side often results in the call going to limbo. Trace shows out of order NOTIFYs and 500 response to a Trying event.

On Joshua's advise, rolled back https://gerrit.asterisk.org/c/asterisk/+/13852 and even though the out of order Trying subsists, it does not gerenate a 500 and the call resumes correctly.

Discussion here:
https://community.asterisk.org/t/sip-xfers-via-reinvite-refer-and-order-of-notifys/83572
Comments:By: Asterisk Team (asteriskteam) 2020-04-14 09:17:29.495-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: George Joseph (gjoseph) 2020-04-14 09:31:46.328-0500

[~lvl] can you take a look at this as it relates to your commit from last month?


By: laszlovl (lvl) 2020-04-14 09:49:21.222-0500

I can offer some context here that might be relevant.

The reason I was adding a Proceeding() in my dialplan in the first place, was that the code at https://github.com/asterisk/asterisk/blob/master/res/res_pjsip_refer.c#L141 can cause out-of-order NOTIFYs. If a 100 Trying hasn't been sent already, it will send one _right_ before sending the 200 OK. The 100 and 200 are then so close together that they would often arrive out-of-order at the client, causing various issues.

Adding the Proceeding() in the dialplan means the 100 Trying is sent at the time of calling Proceeding(), which means there's plenty of time between that and Asterisk sending the 200 OK. But after doing that, I ran into ASTERISK-28766.

I suspect that the issue here already existed before, but was hidden by the bug in ASTERISK-28766. I'll take a deeper look when I have more time later this week.

By: George Joseph (gjoseph) 2020-04-14 13:49:21.007-0500

Thanks [~lvl].  I'm going to acknowledge this issue and assign it to you for now.  If you need to send it back to us, just do a re-triage.


By: laszlovl (lvl) 2020-04-15 04:47:11.129-0500

I checked the supplied SIP trace.

So Asterisk receives a 100 Trying and then a 180 Ringing from the transferee. That prompts it to subsequently generate a 100 Trying and a 180 Ringing NOTIFY for the transferer. However, the two notifies end up being sent out-of-sequence and thus rightfully rejected by the MS Teams server.

Indeed before the patch in ASTERISK-28766 the 100 Trying NOTIFY would set {{progress->subclass = AST_CONTROL_PROCEEDING}} which would in some scenario's prevent the SIP 180 Ringing (and the 200 OK) from generating a NOTIFY. If only one NOTIFY is generated it won't get out-of-sequence.

So reverting ASTERISK-28766 might fix this specific scenario, but will cause other issues and won't fix the underlying problems. So I think this issue should look at: how can we prevent Asterisk from sending the NOTIFYs out-of-sequence.

In the mean time, the issue can probably be alleviated the same way I did: add a Proceeding() in the dialplan before the Dial().

By: laszlovl (lvl) 2020-04-15 04:49:10.144-0500

[~gjoseph] I don't think I have the permission to change the status to Triage, please do so for me.

By: Carlos Mendioroz (tron) 2020-04-15 06:27:30.330-0500

Ivi,
It is not stated here, but in the forum discussion, that reverting did make the call work but did not fix the ordering.
I was at first convinced that the order was generating the 500, but it seems there is something else.
Also, during my tests I prevented the generation of the Trying and that also did not work.
(work, as in Teams completing the resume process)

By: laszlovl (lvl) 2020-04-15 06:30:02.098-0500

Can you show a trace with the reverted version?

By: Carlos Mendioroz (tron) 2020-04-15 06:33:57.007-0500

Sure, here: https://pastebin.com/1xjUpsG1

By: laszlovl (lvl) 2020-04-15 07:15:12.327-0500

Well, I can see the difference between the two. In the "normal" version, Asterisk immediately sends three NOTIFYs:

* 180 Ringing
* 100 Trying
* 200 OK

In the "reverted" version, only two:

* 180 Ringing
* 100 Trying

That makes somewhat sense, since the patch from ASTERISK-28766 fixes {{refer_progress_framehook}} to allow generating a 200 OK after a previous 100/180.

The 200 is sent very early though, if the logging's order is correct before the call is even answered/bridged. Seems like it's triggered too early, perhaps by {{if (f->frametype == AST_FRAME_VOICE}} or {{} else if (f->subclass.integer == AST_CONTROL_ANSWER) {}}. You'd have to add some debug statements here and then somebody smarter than me could investigate why :)

Why the trunk gets upset about 3 out-of-sequence NOTIFYs but not about 2, I don't know. You could try if the dialplan "Proceeding()" (in combination with the "regular" version) would fix the issue for you for now.

By: Carlos Mendioroz (tron) 2020-04-15 07:19:40.974-0500

I don't follow.  I can see the OK NOTIFY at line 692 of the log, or am I seeing something else ?

By: laszlovl (lvl) 2020-04-15 07:24:53.244-0500

Yes, but it comes at a different time. In the first trace, immediately after the 180/100; in the second trace only once the call was answered.

By: Carlos Mendioroz (tron) 2020-04-15 07:35:10.274-0500

All is happening in a very short time.
The NOTIFY/OK can not possible happen before the actual call answer, which is the trace next message.
Tracing has for sure some locking mech that is making the received OK follow the sent NOTIFY. Multithreded magic.

I don't understand the upset, nor am I an expert SIP person. I do know about protocols in general :)
But as I said in the forum, I do have this end's control. Not the other one.
I'm open to help dig this out. Teams is going to be a big player in the not so distant future, IMHO.


By: laszlovl (lvl) 2020-04-15 07:39:51.570-0500

You could add some debugging (ast_log and/or ast_log_backtrace) at https://github.com/asterisk/asterisk/blob/master/res/res_pjsip_refer.c#L235 and https://github.com/asterisk/asterisk/blob/master/res/res_pjsip_refer.c#L256 to check if & why the NOTIFY 200 OK is indeed sent too early.

Not completely sure that it relates to the problem you're experiencing, but it's worth a try.

By: Carlos Mendioroz (tron) 2020-04-15 07:46:01.652-0500

Ah, now I see. You mean the OK was an inferred one by a media frame and not in response to the actual (following ?) OK.

I'll see to test that...

By: Carlos Mendioroz (tron) 2020-04-15 09:13:59.478-0500

Well, timing is everything here. If I enable debug collection, to get the debug messages, Teams gets upset again.


By: Carlos Mendioroz (tron) 2020-04-16 21:02:51.471-0500

This is above my level of understanding.
But it seems to me that the reordering of messages is in fact a trigger of the issue. I do have a trace where an inverted sequence did not upset Teams, but the order of the responses is the canonical one (Trying, Ringing) so my take is that somehow they made it to the link in the right order.
So given that there is some kind of queuing and message generation order is not always kept, I decided to revert the patch that causes the delayed Trying (Asterisk-25771, change id I09cfc9a5d6ed4c007bc70625e0972b470393bf16) and that seems to fix things.
Correct fix, I guess, would be to somehow enforce TX order, but that is inside PJSIP library ?