ASTERISK-21677: NOTIFYs for BLF start queuing up and fail to be sent out

[Home]

Summary: ASTERISK-21677: NOTIFYs for BLF start queuing up and fail to be sent out

Reporter: Dan Martens (dlmarten) Labels:

Date Opened: 2013-04-24 08:50:39 Date Closed: 2013-05-08 02:25:38

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: 11.3.0 Frequency of
Occurrence Frequent

Related
Issues:

Environment: Centos 5.6, Quad Core Intel Xeon @ 3.0GHZ, 4GB RAM Attachments: ( 0) review2475.diff2.txt

Description: Hello,

We have noticed that since we upgraded to Asterisk 11, the BLF lamps on phones (multiple makes and models) stop working from time to time. To get them to work again, we have to bring the device offline and back online again.

When this happens, we start to see a lot of "queued" messages in the logs regarding the extension that is not working. For example:

Extension Changed 100[witgoffice-local] new state Ringing for Notify User witg_116 (queued)

Once a device is listed as "queued", it will never be dequeued unless you make it go offline. It gets stuck in this state.

SIP network traces show that once the device goes into queued state, Asterisk will no longer send any NOTIFY messages. It only sends them when it is in non-queued state.

A brief look at the code, shows that the flag which gets reset to allow these notifications to get through only gets reset in a single branch of code:

chan_sip.c@22939 in handle_response_notify
ast_clear_flag(&p->flags[1], SIP_PAGE2_STATECHANGEQUEUE);

This will only occur when a device sends back a 200 OK request to a previous NOTIFY message. If this response never comes back (ie. packet loss etc.), then the flag gets stuck in this state forever.

I would propose that a fix to this would be to reset the flag:

ast_clear_flag(&p->flags[1], SIP_PAGE2_STATECHANGEQUEUE);

during either a SUBSCRIBE request or REGISTER request. That way, if the flag is stuck, it will get reset in a short amount of time when the device performs its next registration routine. That is, unless I am completely wrong or there is a better way of doing things.

Your help is greatly appreciated.

Thanks,

Comments: By: Michael L. Young (elguero) 2013-04-24 13:03:43.937-0500

I wonder if this is related to what is on Reviewboard: https://reviewboard.asterisk.org/r/1813/
By: Alec Davis (alecdavis) 2013-04-24 16:47:35.367-0500

I had wondered why BLF's would stop randomly, then rebooting the device they would stop again within minutes!
Review1813 deals with buggy Grandstream firmware that accepts a unsolicited (because the device was rebooted) BLF update. The version number of the BLF subscription gets set to the old subscription.

This lost notify issue, is more interesting.

{code}
referring to RFC4235
3.10. Rate of Notifications

For reasons of congestion control, it is important that the rate of
notifications not be excessive. It is RECOMMENDED that the server
not generate notifications for a single subscriber faster than once
every 1 second.
{code}

The 'for a single subscriber', is that rate control for a peer that may have many BLF subscriptions, or for each subscription?

Implementing a timer on each subscription then would allow retries and rate control.

Your suggestion to remove the flag, similar code was removed in Asterisk 1.2, which would remove old subscriptions related to a peer.

Not that this fixes your problem right now.

By: Dan Martens (dlmarten) 2013-04-24 17:09:11.432-0500

Hi Alec,

I was wondering why this flag existed in the first place, now I know.

The way I read the RFC you posted would be for each subscription. If it were to be per device, updates to phones with many subscriptions (ie. reception consoles) could take a very long time and would not be acceptable to the user.

I like your idea of a timer on the notify messages. It makes more sense to implement this type of mechanism instead of my suggestion (which is basically just a quick hack to get things working). In my opinion, the flag method for implementing this RFC as it currently stands is a poor way of doing things if it is just meant as a rate limiting feature.

I'm new to posting for asterisk, so not sure where this process goes from here. Is it up to me to submit a patch or is there someone who looks after chan_sip that has to verify the bug and approve all proposed changes?

Thanks.
By: Rusty Newton (rnewton) 2013-04-24 17:29:01.901-0500

When filing an issue the guidelines were listed: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines (we need to make it a required field)

You can always submit a patch. Post it this issue following the guidelines found in the article above. Submitting your own patch is typically going to help things progress quicker, but of course it could always end up that the patch is not appropriate.

It would be good to also help others verify this bug. Please post an Asterisk log with VERBOSE and DEBUG messages at level 5 or above, with SIP debug enabled, plus a PCAP of the SIP traffic happening at the same time. That should be a good example of the issue happening. Then it's probably a good idea to go ahead and provide all relevant .conf files that would allow someone to reproduce the issue. Please attach all debug and configuration to the issue as separate .txt files for ease of use. Hit "Send Back" to change the state of the ticket back.
By: Alec Davis (alecdavis) 2013-04-24 21:44:01.716-0500

uploaded bug21677.diff.txt

Different tact.
Prevent the lockup, by tracking when the last notify went out.

Now, when the next notify event happens if there is still an outstanding notify for longer than 2 seconds, then assume it's lost.
By: Alec Davis (alecdavis) 2013-04-24 21:53:29.258-0500

Testing.

Device A has a hint to watch Device B
Device C rings Device B
On Device A, the BLF for Device B now show RINGING.
pull the Ethernet from device A
Hang up Device C
Wait for "2013-04-25 14:35:00.759512] WARNING[3845]: chan_sip.c:4215 retrans_pkt: Retransmission timeout reached on transmission 695f11d65c5c13c8@192.168.xxx.yyy for seqno 105...
reinsert Ethernet in Device A
On Device A, the BLF for Device B still shows RINGING.

Device C rings Device B
On Device A, the BLF for Device B still shows RINGING (it actually hasn't changed)
Hang up Device C
On Device A, the BLF for Device B still shows RINGING (this is where any further updates to that BLF stop).

After patch, Device A will update it's BLF on the next Notify event.

By: Alec Davis (alecdavis) 2013-04-25 05:44:56.741-0500

Correct way to fix this issue.

{code}
RFC6665 4.2.2. Sending State Information to Subscribers
"If the NOTIFY request fails due to expiration of SIP Timer F (transaction timeout), the notifier SHOULD remove the subscription."

Upon client restart or reestablishment of a network connection, it is
expected that clients will send SUBSCRIBE requests to refresh
potentially stale state information; such requests will reinstall
subscriptions in all relevant nodes.
{code}

patch reveiw2475.diff.txt as on reviewboard https://reviewboard.asterisk.org/r/2475/
By: Rusty Newton (rnewton) 2013-04-25 09:38:01.648-0500

Dan, please test Alec's patch when you get a chance and report back.
By: Alec Davis (alecdavis) 2013-04-26 15:29:01.657-0500

updated patch review2475.diff2.txt as from reviewboard.

This fix, although correct by the RFC6665, is going to break a lot of installs, as sip.conf:maxexpiry default is 3600.

For 1.8 and 11 we need to fix the logic error, in retrans_pkt() I'd propose to clear the pvt->pendinginvite and clear the SIP_PAGE_STATECHANGEQUEUE flag after the retries fail. This will allow further NOTIFYs to happen. But not remove the subscription.

For trunk, remove the subscription, and let the client re-subscribe.

edit: ========================

What am I thinking, regarding breaking a lot of installs?
1.8 and 11 already are broken after a Notify fails.
The only fix today is rebooting the client, which will get a new subscription, the old subscription will expire in it's own time.

So the proposed fix review2475.diff2.txt for 1.8 to trunk is to remove the subscription when a State NOTIFY fails, and let the client re-subscribe after it's own expiry timer fires.

By: David Brillert (aragon) 2013-04-26 18:53:17.587-0500

I work with Dan and we were preparing to test the patch.
But I am really confused...
The reviewboard is showing diff3 and the ticket is showing diff2
Which patch should we be testing?
Or should we just hold off until the reviewboard patch is reviewed and shipped?
By: Alec Davis (alecdavis) 2013-04-26 20:22:21.225-0500

Testing helps get a reviewboard 'ship it'.

I think the reason for reviewboard showing 'diff r3' is that it's the 3rd update I've uploaded to reviewboard.

on this ticket, I've got the same patch review2475.diff2.txt.

So either review2475.diff2.txt from this ticket will work, or downloading https://reviewboard.asterisk.org/r/2475/diff/3/raw/ will achieve the same result.
By: David Brillert (aragon) 2013-04-29 16:05:38.971-0500

Using the same steps to reproduce:

Device A has a hint to watch Device B

Device C rings Device B
On Device A, the BLF for Device B now show RINGING.
pull the Ethernet from device A
Hang up Device C
Wait for "2013-04-25 14:35:00.759512] WARNING[3845]: chan_sip.c:4215 retrans_pkt: Retransmission timeout reached on transmission 695f11d65c5c13c8@192.168.xxx.yyy for seqno 105...
reinsert Ethernet in Device A
On Device A, the BLF for Device B still shows RINGING.

I finished testing the BLF issue with the patch review2475.diff2.txt and the subscriptions work again after the phone resubscribes.
Just like the patch was designed to do...

Dan is testing on a production server soon.
By: Alec Davis (alecdavis) 2013-04-29 17:10:27.797-0500

On 3 of our production servers. 2x Asterisk 11, 1 asterisk 1.8

The 11 systems are update BLF on re-subscribe.

The 1.8 system, as expected doesn't refresh the BLF when the phone re-subscribes, but does on the next Notify event, but I expected that.

By: Dan Martens (dlmarten) 2013-05-01 17:33:50.710-0500

Been running patch on 1 test and 2 production servers for 2 days. No crashes and BLF lights are still all working properly.
By: Alec Davis (alecdavis) 2013-05-05 02:34:49.486-0500

I think we agree, this is the correct way to fix a network issue, where the response doesn't come back, or request didn't make it to device thus a response was never going to come back.

Dan/David you say 'since upgrading to 11':

This network pacing control (interlock) where the device is required to respond has been there since asterisk 1.4.
On a 1.8 system without the patch, simulating the fault (with the same steps to reproduce as noted earlier) should also prevent further BLF updates, waiting until the next re-subscribe **WONT** fix the BLF updates. Only by restarting the phone will a new subscription be obtained, and BLF will be correct.

On a 1.8 system with the patch, after simulated fault, the phone will re-subscribe (asterisk will create new subscription), and the BLF will be correct.

There is a change in 11, where the device state notify is now sent out immediately after each re-subscribe.
Refer https://reviewboard.asterisk.org/r/2048 comment in code {code}"/* RFC 3265: A notification must be sent on every subscribe, so force it */"{code}

Prior to 11, the device state notify would only be sent on the initial subscribe, re-subscribes didn't cause a state notify.
{code}transmit_state_notify(p, &data, 1, FALSE); /* Send first notification */{code}

I have a few questions:
1). What models of phones are these?
2). On a non patched system are you able to provide a sip debug trace of one failing (but not by disconnecting the network).

By: David Brillert (aragon) 2013-05-06 07:48:28.505-0500

1) I tested with Yealink phone using steps to reproduce.
2) Dan reported the issue so he'll have to answer that one. But as Dan has told me all is good on his production servers with the patch and he has seen no bad side effects.
By: Dan Martens (dlmarten) 2013-05-06 08:47:37.938-0500

Hi Alec,

We never noticed this issue prior to 11. If it existed, it was never brought to our attention by any of our users.

We have a wide variety of phone makes and models in our user base (Polycom, Yealink, Panasonic, Astra, Softphones etc.). From what I can tell, it was not vendor specific.

If I look to a non-patched server, I could probably find a phone that is currently in blocked state (ie. no notifications sent out), however a SIP trace will not help as there are no SIP messages being sent out at all (outside from call INVITE's). We had already confirmed this prior to posting this bug report. So, I'm not sure what this trace would show you.

What I can tell you though is that we have not had a single issue with any of the 3 production servers we have patched. All seem to be working well.

By: Alec Davis (alecdavis) 2013-05-06 15:35:56.726-0500

Dan.
I knew that obtaining a SIP trace of a failing phone was an awkward question, I should have clarified it by saying a trace started before it went wrong to when it was noticed it was wrong. This may be impossible, as the probability of failure is so small.

Did you change maxexpiry in sip.conf, as the default is 3600 seconds, a BLF could be wrong for this long. We set ours 300.

Thanks for the testing.

By: Dan Martens (dlmarten) 2013-05-06 15:39:36.031-0500

Hi Alec,

Yes, we always change maxexpiry to 360 or less. Typically it is set to 60 seconds for most of our clients to play nicely with NAT.
By: Steven Sedory (stevensedory) 2016-07-08 11:42:43.696-0500

Hi all,

We are on Asterisk 13.9.1 using Chan_SIP and are experiencing this same issue. Once the BLF keys stop working correctly, you lots of queue messages like:

Extension Changed 521[ext-local] new state InUse for Notify User 523 (queued)
Extension Changed 521[ext-local] new state Idle for Notify User 523 (queued)
Extension Changed 521[ext-local] new state Ringing for Notify User 523 (queued)

Obviously this issue was last discussed over 3 years ago, but how would I go about fixing it now? I have configured maxexpiry=300 in sip.conf.
By: Steven Sedory (stevensedory) 2016-07-17 00:50:38.504-0500

Any help out there anyone?