[Home]

Summary:ASTERISK-26007: res_pjsip: Endpoints deleting early after upgrade from 13.8.2 to 13.9
Reporter:Greg Siemon (gsiemon)Labels:
Date Opened:2016-05-10 07:19:19Date Closed:2016-05-12 17:25:31
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Resources/res_pjsip
Versions:13.9.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:FreePBX Distro 10.13.66-11 - Fully patched running Edge Track System runs as a VM on ESXi 6.0u2 - 1 Processor 4 GB RAM Endpoints (connecting using PJSIP) consist of 2 x Polycom 335, 1 x Polycom 450 and 1 x Polycom 670, 1 x Gigaset N510 IP Pro & 1 x Mobotix Doorstation T25. All of the endpoints are connected to the same subnet as the FreePBX box (except the Mobotix).Attachments:( 0) asterisk_full
( 1) git-bisect-bad-13.8.2-13.9.txt
( 2) PJSIP_Config_Files.zip
( 3) pjsip-expiry-error.txt
( 4) pjsip_set_logger_on
Description:System works without issues on Asterisk 13.8.2.

After updating to 13.9 I find all of the endpoints start dropping out of Asterisk (logs and PJSIP config files attached).  

Moving back to 13.8.2 solves the issue.
Comments:By: Asterisk Team (asteriskteam) 2016-05-10 07:19:19.746-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: xrobau (xrobau) 2016-05-10 14:50:20.277-0500

I have also other unconfirmed reports of this happening with 13.9.0 that is resolved when dropping back to 13.8.2. I'm trying to duplicate it here to see if I can get some better information.

By: xrobau (xrobau) 2016-05-10 20:07:58.026-0500

After watching some SIP debugging, it appears that pjsip is expiring 32 seconds too early.    I've attached the relevant parts of the sip trace.

By: xrobau (xrobau) 2016-05-10 20:38:33.078-0500

That 32 seconds is probably a red herring - This is easily able to be replicated..

Set an endpoint to reregister every 60 seconds, and then set contact_expiration_check_interval=5 - you'll see that the contact is removed within 5 seconds, every time.

{code}
Contact 402/ has been deleted
Endpoint 402 is now Unreachable
   -- Added contact 'sip:402@192.168.15.125:5060' to AOR '402' with expiration of 60 seconds
Contact 402/sip:402@192.168.15.125:5060 has been created
Endpoint 402 is now Reachable
Contact 402/sip:402@192.168.15.125:5060 is now Reachable.  RTT: 9.414 msec
Contact 402/ has been deleted
Endpoint 402 is now Unreachable
   -- Added contact 'sip:402@192.168.15.125:5060' to AOR '402' with expiration of 60 seconds
Contact 402/sip:402@192.168.15.125:5060 has been created
Endpoint 402 is now Reachable
Contact 402/sip:402@192.168.15.125:5060 is now Reachable.  RTT: 12.453 msec
{code}

Suggest this is a regression brought on by ASTERISK-25885


By: Mark Michelson (mmichelson) 2016-05-11 14:39:15.666-0500

So far, I'm not able to reproduce the behavior. I set up a SIPp scenario to register every 60 seconds, and placed the contact_expiration_check_interval=5 in my global settings, but it didn't cause the problem. I see the contact time out at 60 seconds as I'd expect.

Right now I'm testing against the 13 branch directly, but I can also try testing against the 13.9.0 tag instead to make sure that something didn't already fix it. I've been looking at the provided pjsip.conf files and trying to cherry-pick the necessary settings to make this happen for me. I may have to move to wholesale copying of the configs though if necessary.

By: xrobau (xrobau) 2016-05-11 15:06:13.495-0500

I've noticed that some phones do NOT display this problem and I've been unable to determine a difference.

Currently known NOT working:
* Sangoma S300
* Aastra 55i
* Grandstream GXP2200

Currently known WORKING:
* Digium D70
* Yealink T41P
* Cisco SPA504

By: Greg Siemon (gsiemon) 2016-05-11 16:51:15.968-0500

Include the following in known NOT working:

* Polycom Soundpoint 335
* Polycom Soundpoint 450
* Polycom Soundpoint 670
* Gigaset N510 IP Pro

By: xrobau (xrobau) 2016-05-11 20:49:57.436-0500

Attached the git bisect, showing where the issue in 13.9 was introduced.  This is fixed in the current 13 branch, with the a01ce2b88912cd802bb045e40fe264906e55bc45 commit that was merged a few hours ago.

There also appears to be some debugging left in the 13 branch, as every time a device re-registers there are two lines output:
{code}
Contact 101/sip:101@139.130.56.101:9110 is now Updated.  RTT: 0.000 msec
Contact 101/sip:101@139.130.56.101:9110 is now Reachable.  RTT: 52.902 msec
Contact 101/sip:101@139.130.56.101:9110 is now Updated.  RTT: 0.000 msec
Contact 101/sip:101@139.130.56.101:9110 is now Reachable.  RTT: 52.661 msec
{code}

On a large system, this could easily flood the logs!


By: Mark Michelson (mmichelson) 2016-05-12 09:35:38.829-0500

Thanks for the bisect, Rob! I'm going to look at that commit and try to figure out what part of it fixed the issue.

By: Mark Michelson (mmichelson) 2016-05-12 15:30:42.098-0500

All right, I have this figured out now. There are a couple of reviews up on gerrit now for this:

https://gerrit.asterisk.org/#/c/2819/

This fixes an issue where we were attempting to retrieve all contacts whose expiration was prior to now. The problem is that the conversion of times from string to float was losing the least significant bits, resulting in wrong values being used. This resulted in contacts matching this criterion well before they were set to be retrieved. However, this is not the only thing that was wrong. Even if a contact were retrieved early, there was a second comparison of the current time to the time on the contact.

https://gerrit.asterisk.org/#/c/2821/

The problem here is that objects retrieved from the astdb were all messed up. We would retrieve a contact and set its expiration to 0, rather than the actual timestamp when it was set to expire. So, because of the first issue, we'd retrieve the contact earlier than we were supposed to. Then we'd perform the second time comparison and see that we're WAY beyond when the contact was supposed to expire, so we'd delete it. This second issue is actually fixed in the 13 branch in the commit you pointed out in your previous comment. The review linked here is for the 13.9 branch and contains only the bug fix without the rest of the content of the commit in the 13 branch.

Once these have been merged, we will create a 13.9.1 regression release.