[Home]

Summary:ASTERISK-27192: res_pjsip: Loss of SIP registrations causing unavailable endpoints
Reporter:Richard Mudgett (rmudgett)Labels:pjsip
Date Opened:2017-08-09 17:37:43Date Closed:2017-10-11 06:34:40
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip_registrar
Versions:13.17.0 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-27147 Either asterisk or pjproject isn't re-using tcp connections (again)
Environment:Attachments:
Description:When {{rewrite_contact}} is enabled, the {{max_contacts}} count can block re-registrations because the source port from the endpoint can be random.  When the re-registration is blocked, the endpoint may give up re-registering and require manual intervention.  While working on ASTERISK-27147 this issue was identified and discussed on IRC.
{quote}
(11:41:44 AM) rmudgett: file: gtjoseph  So...  What is the benefit to having a set number of max_contacts to pjsip AORs?  In light of rewrite_contact this can get in the way if you re-register after a restart.
(11:42:55 AM) gtjoseph: not sure actually other than limit 1
(11:43:39 AM) rmudgett: Setting to one actually makes the issue with rewrite_contact more likely.
(11:43:43 AM) gtjoseph: is the issue having a limit of having more than 1?
(11:44:06 AM) file: the issue is that if it's not a refresh then the REGISTER would be rejected
(11:44:14 AM) file: until a contact expires
(11:44:38 AM) rmudgett: Yes.  And if the register is rejected, the endpoint may give up trying to register.
(11:45:51 AM) gtjoseph: well, we also have "remove_existing"
(11:45:53 AM) file: I think having remove_existing work in that case smarter would be the option, it would still limit to a max but allow registrations
(11:46:07 AM) rmudgett: There is also the chan_sip compatibility option of remove_existing in the mix.
(11:46:55 AM) rmudgett: file: How could remove_existing work smarter?
(11:47:06 AM) file: what does it do right now in that scenario?
(11:49:46 AM) gtjoseph: are we thinking that multiple contacts will be an issue or that having the set limit will be?
(11:50:06 AM) file: the set limit is the problem
(11:50:08 AM) rmudgett: The set limit is the problem.
(11:50:55 AM) file: I think remove_existing should remove the number of contacts required to allow the new REGISTER to succeed
(11:51:31 AM) gtjoseph: but which ones?  Oldest?
(11:51:43 AM) file: the ones that will expire naturally soonest
(11:51:51 AM) gtjoseph: yeah makes sense
(11:52:30 AM) rmudgett: Remove existing removes all contacts that were not just registered.  After it ignores any currently in the contacts container
(11:52:53 AM) rmudgett: remove_existing won't let you add more than max_contacts
(11:53:13 AM) file: I know, I'm saying I think it is reasonable for the behavior to change to the above
(11:53:21 AM) file: which solves the problem
(11:54:35 AM) rmudgett: I'm just wondering why have a set limit.  Either max_contacts=0 to disable registrations or max_contacts=infinity.
(11:55:42 AM) gtjoseph: not sure i can think of an exact use case immediately but we really can't change it now
(11:55:45 AM) rmudgett: Is there some benefit to setting max_contacts=3 for instance?
(11:56:31 AM) rmudgett: It could be documented that max_contacts is not a security feature and is not recommended to set it to a low value.
(11:56:31 AM) file: outright benefit? no - but it can limit exposure
(11:56:49 AM) file: specifying a recommendation is fine
(11:56:56 AM) file: but as gtjoseph said - it's not getting removed
(11:57:35 AM) gtjoseph: it can also be used by a service provider to only allow a certain number of servers to register to an endpoint.
(11:58:56 AM) rmudgett: Which said service provider is likely to need rewrite_contact enabled and cause the problem of hitting the limit because of the random remote port number being registered.
(11:59:18 AM) file: which is why I said I think it's reasonable for remove_existing to behave as I stated, which solves the problem
(11:59:31 AM) gtjoseph: exactly
(12:01:49 PM) rmudgett: That changes remove_existing behavior from chan_sip behavior as it is documented to enable.
(12:02:41 PM) file: in a configuration that mirrors chan_sip it would function the same
(12:02:55 PM) file: the only scenario where it functions differently is if multiple contacts are enabled, which is a chan_pjsip feature
(12:03:09 PM) file: and it would function in a way which is better for the user
(12:03:46 PM) file: and if you are enabling multiple contacts then the existing behavior is against that from the first place - as others would be removed defeating it in the first place
(12:06:58 PM) rmudgett: So.  The new behavior would be:
(12:06:58 PM) rmudgett: 1) Don't check max_contacts if remove_existing is enabled
(12:06:58 PM) rmudgett: 2) Process the register contacts.
(12:06:58 PM) rmudgett: 3) If remove_existing enabled then remove the soonest to expire contacts until there are at most max_contacts.
(12:06:58 PM) rmudgett: 4) send register response
(12:07:35 PM) file: if you are referring to new logic, yes, that sounds about right
(12:12:44 PM) rmudgett: With the new behavior, remove_existing could use a better name.
(12:42:23 PM) gtjoseph: rmudgett: careful with that logic...   depending on Expires,  times the pass that removes the soonest to expire might remove the one just registered.
(12:42:40 PM) gtjoseph: s/times//
(12:42:41 PM) infobot: gtjoseph meant: rmudgett: careful with that logic...   depending on Expires,   the pass that removes the soonest to expire might remove the one just registered.
(12:43:21 PM) gtjoseph: i think you have to remove until there's an open spot, then register.
(12:50:43 PM) rmudgett: We certainly cannot add more than the max.  We already keep track of the unmodified contacts so we can remove the soonest to expire unmodified contacts.
(12:52:43 PM) gtjoseph: ok, so long as it's "unmodified contacts"
{quote}

Proposed new algorythm in res_pjsip_registrar.c:register_aor_core() for {{remove_existing}} option behavior:
{code}
if remove_existing
 if MAX(added + updated - deleted, 0) > max_contacts
    respond forbidden
else
 if MAX(added - deleted, 0) + current_contacts_count > max_contacts
    respond forbidden

process the REGISTER contacts keeping track of which contacts were not affected by this REGISTER

if remove_existing
 while current_contacts_count > max_contacts
   ast_assert(unmodified contact count > 0)
   remove the soonest to expire unmodified contact

respond successful with current contacts
{code}

The fix for ASTERISK-27147 will take care of connection oriented transports like TCP and TLS.  However, the UDP transport will still have this issue.
Comments:By: Scott Pabin (tuxd00d) 2017-08-31 04:07:58.677-0500

I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest *contact* to make room for the registering device.  This would obviously cause a loop of registrations when *max_contacts* = x and NumOfDevices > x, if the devices are set to reregister automatically.  Such a loop happens now when *remove_existing* = yes is used with *max_contacts* > 1, as it seems (without looking the code) that *remove_existing* removes +all+ contacts for that AOR when a new device attempts to register against that AOR.

The 'bump off the oldest pollable (responding) contact' when a new device is attempting to register in excess of *max_contacts* seems to be the less desirable option as the device should instead be denied with a *403 Forbidden*.  There are scenarios where bumping off the oldest contact (not in a call) is desirable, and that is where a configuration item named *remove_existing* would make sense by allowing the latest registration attempt to always register successfully.

Perhaps it would be advantageous to poll existing contacts +when+ a device attempts to register in excess of *max_contacts*, instead of waiting for them to time out. You can call it *auto_purge* and make it the default.  With this solution, Asterisk can respond within a new device's SIP registration dialog with either a *200 OK* (when there is space) or *403 Forbidden* (when it's full).  A request in excess of *max_contacts* should continue to throw a warning, and it should trigger an event if it does not already.  In this scenario, *rewrite_contact* would work just fine, if I'm understanding its mechanism.  The polling option also allows for the quickest device reregistration when ports have changed, such as is possible after a client-side router restart.  The timeout method results in tech-support calls, as it's not an _instantaneous_ recovery.

By: Richard Mudgett (rmudgett) 2017-08-31 19:00:13.108-0500

{quote}
I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest *contact* to make room for the registering device.
{quote}

The change would *only* have an effect if *remove_existing* is enabled.  If it is not enabled then you cannot add any more contacts than *max_contacts* allows and the register is rejected with a *403 Forbidden*.  If it is enabled then you still cannot add more contacts than *max_contacts* allows.  However, the proposal is to remove the contacts that are next to expire and that were not updated by the register until there is room.  The proposal will not change the behavior of sending a *403 Forbidden* where a single registration attempt is trying to register more contacts than *max_contacts*.

I think you have a misunderstanding of AOR's.  An AOR is either statically configured for an endpoint or dynamically registered by the endpoint using the SIP REGISTER method.  An AOR is only used to *initiate* new requests (such as calls) to an endpoint.  AOR changes do not affect calls that are already in progress.

{quote}
This would obviously cause a loop of registrations when *max_contacts* = x and NumOfDevices > x
{quote}

This is an invalid situation as you are attempting to register more endpoint contacts than are configured to be allowed by *max_contacts*.  If you have four devices independently registering themselves for an endpoint when *max_contacts* = 3 then it is either a configuration error or the user is trying to abuse the system.

{quote}
Such a loop happens now when *remove_existing* = yes is used with *max_contacts* > 1, as it seems (without looking the code) that *remove_existing* removes +all+ contacts for that AOR when a new device attempts to register against that AOR.
{quote}

This is one of the reasons the issue is discussing a change in behavior for *remove_existing*.  Currently *remove_existing* does effectively remove all contacts and replace them with the new register's contacts.  It is a behavior that doesn't work very well when *max_contacts* is greater than one.  The *remove_existing* option was originally created only as a means to emulate chan_sip's behavior when *max_contacts* = 1.

The initial reason for the issue is the effect *rewrite_contact* has on registered AORs.  The *rewrite_contact* option is one of the behaviors that allow SIP to operate with a NAT/firewall in the picture.  When an endpoint device behind a NAT/firewall registers then it doesn't know the real contact address because the NAT/firewall effectively creates it when the register message goes through.  If the device re-registers then it may have a different contact address than before.  The re-registration would look like a new device and could be blocked by the *max_contacts* count.  In this case the fix for ASTERISK-27147 takes care of connection oriented transports like TCP and TLS.  UDP on the other hand needs some more help in this exception case.

Polling to remove stale AOR contacts is not that viable a solution.
* The poll would happen after the re-registration fails.
* The device may not try re-registering again because it was forbidden.
* Polling is optional.  (qualifying endpoints)
* Polling an invalid contact takes quite awhile.  The query has to wait for all of the poll re-transmissions to complete before the contact can be declared dead.
* Polling is handled by a different module so it is not that accessible.

Removing the next to expire contact is simple and does allow a quick recovery.  The removed next to expire contact is likely the old and invalid contact the re-registration was attempting to refresh anyway.  If it was not then the removed contact needs to be refreshed the soonest.


By: Friendly Automation (friendly-automation) 2017-10-11 06:34:41.356-0500

Change 6630 merged by Jenkins2:
res_pjsip_registrar.c: Update remove_existing AOR contact handling.

[https://gerrit.asterisk.org/6630|https://gerrit.asterisk.org/6630]

By: Friendly Automation (friendly-automation) 2017-10-11 06:43:09.107-0500

Change 6632 merged by Jenkins2:
res_pjsip_registrar.c: Update remove_existing AOR contact handling.

[https://gerrit.asterisk.org/6632|https://gerrit.asterisk.org/6632]

By: Friendly Automation (friendly-automation) 2017-10-11 06:44:54.611-0500

Change 6631 merged by Jenkins2:
res_pjsip_registrar.c: Update remove_existing AOR contact handling.

[https://gerrit.asterisk.org/6631|https://gerrit.asterisk.org/6631]

By: Friendly Automation (friendly-automation) 2017-10-11 06:50:33.684-0500

Change 6633 merged by Jenkins2:
res_pjsip_registrar.c: Update remove_existing AOR contact handling.

[https://gerrit.asterisk.org/6633|https://gerrit.asterisk.org/6633]