[Home]

Summary:ASTERISK-25476: chan_sip loses registrations after a while
Reporter:Michael Keuter (mkeuter)Labels:
Date Opened:2015-10-19 07:30:17Date Closed:2015-12-03 16:01:01.000-0600
Priority:BlockerRegression?Yes
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:11.20.0 13.6.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:AstLinux 1.2.4, Linux 3.2.66, x86 32bitAttachments:( 0) 11.20.0-SIP-options.txt
( 1) asterisk-11-issue-ASTERISK-25476.patch
( 2) asterisk-13-issue-ASTERISK-25476.patch
( 3) ASTERISK-25476-11.20.0-combined.patch
( 4) ASTERISK-25476-13.diff
Description:In Asterisk 11.20.0 chan_sip looses registrations of some IP-phones (on several different systems) after a while. In Asterisk the phone then shows "unreachable", while the phone's status shows "Registered". Reverting back to Asterisk 11.19.0 solves the problem. The phones are mostly Yealink phones, different models with different firmwares on different systems, none of the phones firmware was recently upgraded. Dis-/enbling the phone's SIP account or rebooting the phone makes it work again for a while, but then  Asterisk shows the phone again as unreachable.

I attach a SIP trace of a Yealink W52P DECT base with 2 handsets (331/332). "qualify=yes" is enabled, after a while Asterisk stops sending Option packets to ext. 331 while keeps sending packets to 332.
Comments:By: Asterisk Team (asteriskteam) 2015-10-19 07:30:18.886-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: abelbeck (abelbeck) 2015-10-19 10:33:59.001-0500

Possibly this may have something to do with the main/sched.c change:

scheduler: Use queue for allocating sched IDs.
http://git.asterisk.org/gitweb/?p=asterisk/asterisk.git;a=commit;h=5c713fdf18f

But the clue by the Reporter that a qualify simply stopped to one account and not the other on the same phone seems odd, but a potentially useful clue.

By: ibercom (ibercom) 2015-10-19 14:01:37.767-0500

[~mkeuter] mark this as regression.
It happens to me too with realtime.
At night everything is stable, peers ok, with traffic during the day I lose 10% of peers.


By: Matt Jordan (mjordan) 2015-10-19 19:06:13.828-0500

I've attached a patch for 13 that uncovered two other locations we missed when the scheduler got updated. It *may* fix the issue, but I'm not entirely sure as it appears it mostly affected the CLI command 'sip unregister'.

By: abelbeck (abelbeck) 2015-10-20 10:17:08.933-0500

Hi Matt,

The file "asterisk-11.20.0/res/res_rtp_asterisk.c" contains two lines containing "rtp->rtcp->schedid > 0"
{noformat}
asterisk-11.20.0/res/res_rtp_asterisk.c:4638: if (rtp->rtcp->schedid > 0) {
asterisk-11.20.0/res/res_rtp_asterisk.c:4866: if (rtp->rtcp && rtp->rtcp->schedid > 0) {
{noformat}
Should those be "rtp->rtcp->schedid > -1" ?

By: Richard Mudgett (rmudgett) 2015-10-20 10:56:32.940-0500

Yes.  Yes they should be {{rtp->rtcp->schedid > -1}}

By: abelbeck (abelbeck) 2015-10-20 11:12:16.507-0500

Good, that could be the culprit for this issue.

Addendum to commit: http://git.asterisk.org/gitweb/?p=asterisk/asterisk.git;a=commit;h=33c945f471e92046d50cb35482f4d94e5ed86e86

And issue: ASTERISK-25449

By: abelbeck (abelbeck) 2015-10-20 13:25:53.283-0500

Attached patches for res/res_rtp_asterisk.c

Will test over the next few days.

By: Dirk Wendland (kesselklopfer79) 2015-10-27 04:29:18.485-0500

Please see the comment in ASTERISK-25449 about chan_sip in asterisk 11.20.0 and REGISTER expiry threads problems with scheduler ID 0:
https://issues.asterisk.org/jira/browse/ASTERISK-25449?focusedCommentId=228015&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-228015

By: abelbeck (abelbeck) 2015-10-27 07:52:24.120-0500

Hi Dirk,

Related to your scheduler ID 0 is not handled in:
sip_unregister(): {{peer->expire > 0}}
complete_sip_registered_peer(): {{peer->expire > 0}}

Those both are addressed here by Matt with his ASTERISK-25476-13.diff attachment, though that seems only related to the CLI command 'sip unregister'.

Though your set_peer_defaults() problem has not yet been addressed, and it seems {{if (peer->expire == 0) {}} should be {{if (peer->expire < 0) {}} in set_peer_defaults(), though others should verify that is the best fix.

Additionally, another problem is in res_rtp_asterisk.c which I placed patch attachments here for Asterisk 11/13, and Michael Keuter has been actively testing.

By: Michael Keuter (mkeuter) 2015-10-27 08:07:45.351-0500

I tested the Asterisk 11 patch from abelbeck with Asterisk 11.20.0 on 6 different boxes for a few days. The problem has not re-occured so far. But I will continue to monitor the behavior.

By: Michael Keuter (mkeuter) 2015-10-28 04:04:43.563-0500

Update: 5 of the 6 boxes looked fine but one still had these issues although the patch was installed. After a while I figured out, that this was the only box of the 6 that had res_stun_monitor enabled. I now disabled res_stun_monitor and restarted Asterisk and now the issue are gone (at least for 10 hours). Could there been a relation between res_stun_monitor and this issue here? There is one open issue related to res_stun_monitor: ASTERISK-21856

By: Gerald Schnabel (geraldsf) 2015-10-29 11:13:26.091-0500

bq. Those both are addressed here by Matt with his ASTERISK-25476-13.diff attachment, though that seems only related to the CLI command 'sip unregister'.
The same problem exists for the CLI command 'iax2 unregister':
{noformat}
[gerald@buildsys asterisk-11.20.0]$ grep -F -r -n "p->expire > 0" *
channels/chan_iax2.c:7071:              if (p->expire > 0) {
channels/chan_iax2.c:7104:                              ++which > state && p->expire > 0) {
{noformat}


By: Michael Keuter (mkeuter) 2015-11-05 04:27:17.963-0600

Update2: After over 14 days the patch from abelbeck seems to solve this issue, for the most part, but there still are occasional, but rare occurrences of this issue, especially for devices with more than 1 SIP account configured (e.g. ATAs, DECT base stations).  Matt and others have noted other fixes that are needed after the "scheduler: Use queue for allocating sched IDs." commit was made.

So, what is the next step?  Add the patch from abelbeck here, and others along the line of ASTERISK-25449 that Matt and other noted, or should we consider reverting "scheduler: Use queue for allocating sched IDs."?

This is a major issue, and not much discussion.

By: Steve Davies (one47) 2015-11-11 03:39:34.430-0600

Adding my 2p since I am surprised that we've not seen 11.20.1 and 13.6.1 releases yet!

In my assumed scenario:

- chan_sip is up and running, and several registrations are current.
- one of these registrations has a peer->expire sched_id value of 0
- SIP RELOAD is called, resulting in a call to set_peer_defaults(peer)

set_peer_defaults(peer) contains the following code:

       if (peer->expire == 0) {
               /* Don't reset expire or port time during reload
                  if we have an active registration
               */
               peer->expire = -1;
               peer->pokeexpire = -1;
               peer->keepalivesend = -1;
               set_socket_transport(&peer->socket, SIP_TRANSPORT_UDP);
       }

At which point we leak a scheduled expire_register call. The knock on effect of this is an hourly line de-registration that is only resolved by stopping the phone for an hour to let the leaked event trickle out. We may also leak 'pokeexpire' and 'keepalivesend' - Should they not be cleared using AST_SCHED_DEL_UNREF?

NOTE: There are 2 more peer->expire > 0 comparisons in chan_sip but they only affect a CLI command so I'll let them get mopped up as I've already seen them mentioned.



By: Steve Davies (one47) 2015-11-11 04:13:45.809-0600

A roll-up of all the scheduler issues that affect me personally. It compiles, so I am putting it here in case it helps others. I will endeavour to push this through Gerrit too.

By: Corey Farrell (coreyfarrell) 2015-11-19 15:30:06.511-0600

It looks like this ticket should have been closed by commit e741101, does anyone know why it's still open?  Does the commit resolve the registration issues?

By: Richard Mudgett (rmudgett) 2015-11-19 15:52:45.453-0600

I think the commit resolved the registration issues identified by this issue.  However, I think an audit of the scheduler users is needed to find any other misuses of the API.


By: Gerald Schnabel (geraldsf) 2015-11-20 03:45:30.979-0600

{quote}
Should they not be cleared using AST_SCHED_DEL_UNREF?
{quote}
At least your patch changes the old behaviour when calling _peer_sched_cleanup(peer)_ now. The {{if (peer->expire == 0)}} block was a special case that (mis?)used scheduler ID 0 (that didn't exist before the scheduler change) to initialise the sip_peer structure with default values when it was allocated and zero-filled by ao2_t_alloc().


By: Richard Mudgett (rmudgett) 2015-12-03 16:01:01.574-0600

Audit complete and patches committed to branches.

By: Carl Farrington (carl0s) 2016-01-18 18:06:47.565-0600

This problem plagued me in October/November. I gave up having multiple extensions on the same phone (Cisco SPA-508G).

I installed another system yesterday and the DECT Gigaset N300IP handset is becoming intermittently unreachable also. The Asterisk build on this box hasn't been updated since I prepared the box around the end of November.

Is an update to Asterisk 13.7.0 going to cure this then?

Here was my original post on the previous system, which seems to be working OK now, but I'm aware of this issue, and have had to remove additional extensions from the phones that wanted dual-extensions:
https://supportforums.cisco.com/discussion/12345136/spa504g-and-pbx-asterisk-different-registration-state

and then today what looks like the same issue at the new install with the Gigaset handset:
http://pbxinaflash.com/community/threads/extensions-going-offline-at-two-sites.18467/

As Michael Keuter says above, this is/was a major issue, yet hardly anybody seemed to encounter it? Even though I did on two installs in a row?

Just wondering if it's fixed or if I should look at Asterisk 11?


By: Joshua C. Colp (jcolp) 2016-01-18 18:12:05.088-0600

This issue has been closed and marked as fixed in 11.21.0 and 13.7.0, so yes. Those versions should be fixed when it comes to this specific issue.

By: Stephan Eisvogel (eisvogel) 2016-01-18 18:16:09.825-0600

I can confirm for 11-head with PAP2 and Snom 8xx devices that this has been fixed since around December 2015.

By: Carl Farrington (carl0s) 2016-01-18 18:20:35.136-0600

Thanks very much. I'll sleep better tonight, and get compiling tomorrow !