[Home]

Summary:ASTERISK-10103: Registration/Un-Registration Problem with chan_skinny
Reporter:sbisker (sbisker)Labels:
Date Opened:2007-08-17 06:54:16Date Closed:2011-06-07 14:07:46
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_skinny
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan_skinny-reregister.txt
Description:Having a problem with channels becoming partially unregistered.  This is happening with 7920 and 7921 phones.  At some point during the day, asterisk thinks that the  device becomes unregistered.  If you do a "skinny show devices"  the device appears that it is registered, and the phone screen indicates that it is registered.  However when making a call, the following is in the log.


[Aug 16 09:46:14] VERBOSE[13870] logger.c:     -- Starting simple switch on '5264@5264'
[Aug 16 09:46:15] DEBUG[12158] chan_skinny.c: Current sub [Skinny/5264@5264-148] already has owner
[Aug 16 09:46:15] WARNING[13870] chan_skinny.c: Asked to transmit to a non-existant session!
[Aug 16 09:46:15] NOTICE[13870] chan_skinny.c: Asked to indicate 'Stop tone' condition on channel Skinny/5264@5264-148, but sessio
n does not exist.
[Aug 16 09:46:15] WARNING[13870] chan_skinny.c: Asked to transmit to a non-existant session!
[Aug 16 09:46:15] NOTICE[13870] chan_skinny.c: Asked to indicate 'Stop tone' condition on channel Skinny/5264@5264-148, but sessio
n does not exist.
[Aug 16 09:46:15] WARNING[13870] chan_skinny.c: Asked to transmit to a non-existant session!
[Aug 16 09:46:15] NOTICE[13870] chan_skinny.c: Asked to indicate 'Stop tone' condition on channel Skinny/5264@5264-148, but sessio
n does not exist.
[Aug 16 09:46:16] WARNING[13870] chan_skinny.c: Asked to transmit to a non-existant session!
[Aug 16 09:46:16] VERBOSE[13870] logger.c:     -- Executing [7848@longdistance:1] Dial("Skinny/5264@5264-148", "IAX2/isdpbx:password@isdpbx/7848||r") in new stack



If I do a "core show channels" no skinny sessions are shown, and no channels are active for that device.  On a side note  can't do a "skinny set debug" because asterisk crashes because the session doesn't exist, and the debug code references session and not device.  I'll submit another bug for this one too, if you want.
Comments:By: Damien Wedhorn (wedhorn) 2007-08-17 17:57:44

sbisker, that last issue (debug causing crash) is corrected in the patch in bug 10396. You may like to apply that patch (trasmit.diff) so you can turn on debugging.

By: sbisker (sbisker) 2007-08-23 07:39:15

transmit_a.diff doesn't apply cleanly 1.4.10/11.  Any chance of a 1.4 patch for this, or some steps to manually fix the debug problem?

By: sbisker (sbisker) 2007-08-27 15:12:04

I'm going to try to debug this one better tomorrow.  One more thing to note.  I had a phone do this today.  It thinks it's registered, but asterisk doesn't.  The phone is reachable via http, and continues to send keepalives to the asterisk server, which in turn continues to responds with

SKINNY KeepAliveAckMessage
SKINNY DefineTimeDate

Some how the session just session/device just get's dropped without any warning.

By: sbisker (sbisker) 2007-09-13 10:05:21

I've been watching this for quite some time now, and the phones seem to be losing registration quite often, even with a good signal.

I was getting a lot of "Skinny Client was lost, unregistering" followed immediately by a re-registration.  I tracked down all occurences of this, and flagged them 001 - 004.  All of the registrations were being lost for a 002 90% of the time and a 003 the other 10%.

It seems that most of the unregisters are a result of

res = poll(fds, 1, (keep_alive * 1100));

Where res == 0

Why would res = 0   so often.  The phone is still sending keepalives every 30 seconds, so it's beyond me.



Here's a snip of the code with my modified log messages for 002-004.

       fds[0].fd = s->fd;
       fds[0].events = POLLIN;
       fds[0].revents = 0;
       res = poll(fds, 1, (keep_alive * 1100)); /* If nothing has happen, client is dead */
                                                /* we add 10% to the keep_alive to deal */
                                                /* with network delays, etc */
       if (res < 0) {
               if (errno != EINTR) {
                       ast_log(LOG_WARNING, "Select returned error: %s\n", strerror(errno));
                       return res;
               }
       } else if (res == 0) {
               if (skinnydebug)
                       ast_verbose("Skinny Client was lost, unregistering\n");
               ast_log(LOG_WARNING, "Skinny Client was lost, unregistering: 002\n");
               skinny_unregister(NULL, s);
               return -1;
       }

       if (fds[0].revents) {
               ast_mutex_lock(&s->lock);
               memset(s->inbuf,0,sizeof(s->inbuf));
               res = read(s->fd, s->inbuf, 4);
               if (res < 0) {
                       ast_log(LOG_WARNING, "read() returned error: %s\n", strerror(errno));

                       if (skinnydebug)
                               ast_verbose("Skinny Client was lost, unregistering\n");
                       ast_log(LOG_WARNING, "Skinny Client was lost, unregistering: 003\n");
                       skinny_unregister(NULL,s);
                       ast_mutex_unlock(&s->lock);
                       return res;

By: dea (dea) 2007-09-13 12:58:49

Can you check the firmware versions on the problem phones?

When I reworked the lost registration code earlier this year
another reporter had a similar problem.  We tracked it down
to firmware 3.02 on his phone.  My phone with 3.01 was solid,
and once he downgraded, so was his.

Also what do you have in skinny.conf for your keepAlive value?
Increasing it may help.

By: sbisker (sbisker) 2007-11-15 12:45:25.000-0600

I've downgraded the firmware to 3.01 and the 7920's are perfect.  The 7921's still present a bit of a problem  

At times a 7921 will get unregistered for no reason, but still think it's registered because asterisk still responds to keepalive messages.  Is there any interface in the keepalive code to not send a keepalive message if the phone isn't registered?  Or even better, is there a way in chan_skinny to force a re-registration, if a keepalive is received but the phone isn't registered?

By: dea (dea) 2007-11-15 13:23:49.000-0600

I'm not sure I understand.  The phone is responsible
for sending the keepalive, and chan_skinny only
'expects' to get those messages from registered
phones.  A message from an unregistered phone
should trigger a registration.

If that is not happening, then there's a bug that
needs tracking, but my small development/test environment
does seem to work in the manner expected.

By: sbisker (sbisker) 2007-11-15 13:29:22.000-0600

That's what I thought as well.  At some point asterisk un-registers the phone, i.e., no IP address shows up, and the R column is N in "skinny show devices".  The phone continues to send keep alives to the server, and the server continues to respond with the appropriate SKINNY KeepAliveAckMessage.  Because of this, the phone still thinks it's registered.

By: dea (dea) 2007-11-15 13:55:26.000-0600

Thanks for the details.  That will make it easy to track down, and
I'll try to make some time for it today.

By: dea (dea) 2007-11-15 14:11:34.000-0600

VERY limited testing performed on the attached patch.

7940 allowed to register, pulled it's network connection
and waited to see Asterisk drop the registration.  The
phone still appeared to think it was registered.  Reconnected
the network, and Asterisk processed the keepalive and
triggered a reregistration.

The change is pretty trivial and localized to one function,
but bad things might be hiding, and it may not be as trivial
as it appears to me.  Let me know if it works, and if not I'll
dig deeper.

By: Damien Wedhorn (wedhorn) 2007-11-15 14:42:59.000-0600

Dan, long time no post. Only had a really quick look, but it looks like when we don't receive a keep alive, we unregister chan_skinny but don't send an unregister/reregister message to the device.

Arguably your patch will do this when the device next sends a message, but we should probably send the message when we decide to unregister the device as well.

By: dea (dea) 2007-11-15 16:46:45.000-0600

I thought about that, and perhaps it should be added, but this particular
case might be triggered with a momentary network disruption.

With the right timing, Asterisk does not hear from the phone, tries to
send the unregister and drops the device, but the phone never gets the
message and continues to think it is registered and plugs along sending
the keepalive packets and never tries to re-register....

By: Damien Wedhorn (wedhorn) 2007-11-15 17:30:07.000-0600

Yeah, I was only thinking of having it as additional code. But on further thinking, what happens at the transport layer. If there is a temporary network disruption and we send a reset (because of not receiving a keepalive), what happens to the reset tcp packet that we send? Does it just get dropped with no feedback, does it keep on trying until it suceeds?

By: Jason Parker (jparker) 2007-11-15 17:37:57.000-0600

Keep in mind that tcp is a guaranteed delivery protocol.  The connection is up all the time - if it is dropped, you'll hear about it.

By: dea (dea) 2007-11-15 17:51:05.000-0600

Well, yes we will hear about it and the phone will become
unregistered in Asterisk.  But for some reason the phone
itself still thinks that it is registered.

We could send the unregister, and then the phone SHOULD do
the right thing and re-register.  On the other hand if we
are getting keepAlive packets from a phone that we do not
see is registered, why not just register it?  The phone
clearly thinks it is registered to us, and we have a valid
session (otherwise we would not be getting the keepAlive
messages), so the least amount of work would seem to be
to re-register.

If the opinions are strong in favor of forcing an
unregister/register cycle, I'll rework the patch to do it
that way.

By: Damien Wedhorn (wedhorn) 2007-11-15 18:17:38.000-0600

If we wait for a keepalive, we could end up with the device and asterisk being in disagreement about the connection status for _up_ to network_failure_time + keepalivetime. If we can send a reset when the network is corrected, the disagreement will only be for the duration of the network failure.

Also, what happens if the device does another action (such as offhook) during the disagreement?

By: dea (dea) 2007-11-15 18:41:05.000-0600

And if the phone was unplugged, which results in the loss of registration?
Do we end up with Asterisk/chan_skinny spinning its wheels trying to issue
an unregister message to a phone that no longer exists?

Sure we can code around that issue, but additional care is required.
We can also recommend tuning the keepAlive value to a smaller number
of seconds (most phones seem to honor it after the message structure
cleanup).  We could further rework the keepalive logic to request a keepalive
message 1/2 the configured value and track the number of missed messages.
After two missed messages we then consider the phone gone.

The worst problem likely during the disruption is the phone is not
responsive.  The window is actually keepAlive time - time since last
keepAlive message.  Tuning the keepAlive value to a smaller number
of seconds will increase network traffic, albeit a trivial amount, but
does reduce recovery time.

By: Damien Wedhorn (wedhorn) 2007-11-15 19:03:58.000-0600

I'm doing a lot of speculating about how the tcp stuff will be handled. It's trivial to cause this to occur, so I think we should play around with sending the reset packet and see what actually happens when the network has failed. I'll have a go at this tomorrow.

By: sbisker (sbisker) 2007-11-16 13:12:38.000-0600

Shouldn't it be

} else {
        skinny_register(req, s);
        return 1;
}

Compiler throws a warning that nothing is ever returned at this point.

By: dea (dea) 2007-11-16 14:04:39.000-0600

Odd that my compiler does not complain, but yes we need a return
there.

I'd actually suggest:

} else {
        if (skinny_register(req, s))
            return 1;
        else
            return 0;
}

The first patch was more proof of concept, and if the concept is acceptable
the patch can be cleaned up.

By: Damien Wedhorn (wedhorn) 2007-11-16 20:00:46.000-0600

I've had a bit of a play and can't force this issue. With 30VIPs and communicator, if I break the network connection, once a keepalive is missed, when I reconnect the devices reset of their own accord and issue a CM-aborted-TCP information message to asterisk.

On this basis, there doesn't seem to be any reason to issue a reset. However, other devices may behave differently.

I think this problem may be more insiduous then just a network issue. From memory, I vaguely recall a similar issue a while ago and the asterisk box was actually receiving the keepalive (tcpdump) but it was not being passed on to chan_skinny (not sure if asterisk was receiving it). The patch suggested will only mask the issue and set the session state to down when it should actually still be connected (for a short time anyway).

sbisker, could you try this. Set skinny debug and physically disconnect the phone. When you receive a message "NOTICE[31739]: chan_skinny.c:5284 skinny_session: Skinny Session returned: Success", reconnect the phone. Do you then get a message like "Received Alarm Message: Name=SEPxx Load=3.1(0.3)     Parms=Status/IPaddr Last=CM-aborted-TCP"?

If you do, the TCP connection is being maintained when asterisk is unregistering your phone and the keepalive is probably being lost somewhere between the kernel and chan_skinny.

By: sbisker (sbisker) 2007-11-17 06:38:57.000-0600

This type of test isn't possible with the 7921 phone.  It's wireless and there is nothing to "disconnect".  The only way to simulate this would be to pop the battery out, in turn powering off the device.  When you power the device back on, it re-registers, because it knows it couldn't have been registered.

This problem seems to be isolated to the 7920 & 7921 phones.  The problem is reduced by adding more access points, but sometimes even with full strength, asterisk unregisters the phone, with no reason being given.  

I'm hesitant to run skinny debug because the debug code core dumps if a unregistered phone attempts to make a call.  The patch to fix this is in trunk, not 1.4.x and trunk has proven to be unstable for my production asterisk boxes.

By: sbisker (sbisker) 2007-12-05 14:33:46.000-0600

The patch hasn't really helped much, however, I did notice that I'm getting a lot of these in the log

[Dec  5 15:32:05] WARNING[28754]: chan_skinny.c:2796 skinny_indicate:  Don't know how to indicate condition 18
[Dec  5 15:32:07] WARNING[28533]: chan_skinny.c:4387 get_input:  read() returned error: Connection reset by peer
[Dec  5 15:32:07] WARNING[28533]: chan_skinny.c:4391 get_input:  Skinny Client was lost, unregistering:
[Dec  5 15:32:07] NOTICE[28533]: chan_skinny.c:4485 skinny_session:  Skinny Session returned: Connection reset by peer

By: sbisker (sbisker) 2008-03-05 07:48:34.000-0600

You can close this one.  It seems that most of the issues are related to Wireless.  Changing firmware on the phones has helped a lot.  I've also increased my WAP density and the problem has all but gone away.  Occasionally, I run into issues, but nothing like it was before.

By: Michiel van Baak (mvanbaak) 2008-04-23 12:45:20

Closed on reporters request