Asterisk
  1. Asterisk
  2. ASTERISK-27192

res_pjsip: Loss of SIP registrations causing unavailable endpoints

    Details

    • Frequency of Occurrence:
      Frequent

      Description

      When rewrite_contact is enabled, the max_contacts count can block re-registrations because the source port from the endpoint can be random. When the re-registration is blocked, the endpoint may give up re-registering and require manual intervention. While working on ASTERISK-27147 this issue was identified and discussed on IRC.

      (11:41:44 AM) rmudgett: file: gtjoseph So... What is the benefit to having a set number of max_contacts to pjsip AORs? In light of rewrite_contact this can get in the way if you re-register after a restart.
      (11:42:55 AM) gtjoseph: not sure actually other than limit 1
      (11:43:39 AM) rmudgett: Setting to one actually makes the issue with rewrite_contact more likely.
      (11:43:43 AM) gtjoseph: is the issue having a limit of having more than 1?
      (11:44:06 AM) file: the issue is that if it's not a refresh then the REGISTER would be rejected
      (11:44:14 AM) file: until a contact expires
      (11:44:38 AM) rmudgett: Yes. And if the register is rejected, the endpoint may give up trying to register.
      (11:45:51 AM) gtjoseph: well, we also have "remove_existing"
      (11:45:53 AM) file: I think having remove_existing work in that case smarter would be the option, it would still limit to a max but allow registrations
      (11:46:07 AM) rmudgett: There is also the chan_sip compatibility option of remove_existing in the mix.
      (11:46:55 AM) rmudgett: file: How could remove_existing work smarter?
      (11:47:06 AM) file: what does it do right now in that scenario?
      (11:49:46 AM) gtjoseph: are we thinking that multiple contacts will be an issue or that having the set limit will be?
      (11:50:06 AM) file: the set limit is the problem
      (11:50:08 AM) rmudgett: The set limit is the problem.
      (11:50:55 AM) file: I think remove_existing should remove the number of contacts required to allow the new REGISTER to succeed
      (11:51:31 AM) gtjoseph: but which ones? Oldest?
      (11:51:43 AM) file: the ones that will expire naturally soonest
      (11:51:51 AM) gtjoseph: yeah makes sense
      (11:52:30 AM) rmudgett: Remove existing removes all contacts that were not just registered. After it ignores any currently in the contacts container
      (11:52:53 AM) rmudgett: remove_existing won't let you add more than max_contacts
      (11:53:13 AM) file: I know, I'm saying I think it is reasonable for the behavior to change to the above
      (11:53:21 AM) file: which solves the problem
      (11:54:35 AM) rmudgett: I'm just wondering why have a set limit. Either max_contacts=0 to disable registrations or max_contacts=infinity.
      (11:55:42 AM) gtjoseph: not sure i can think of an exact use case immediately but we really can't change it now
      (11:55:45 AM) rmudgett: Is there some benefit to setting max_contacts=3 for instance?
      (11:56:31 AM) rmudgett: It could be documented that max_contacts is not a security feature and is not recommended to set it to a low value.
      (11:56:31 AM) file: outright benefit? no - but it can limit exposure
      (11:56:49 AM) file: specifying a recommendation is fine
      (11:56:56 AM) file: but as gtjoseph said - it's not getting removed
      (11:57:35 AM) gtjoseph: it can also be used by a service provider to only allow a certain number of servers to register to an endpoint.
      (11:58:56 AM) rmudgett: Which said service provider is likely to need rewrite_contact enabled and cause the problem of hitting the limit because of the random remote port number being registered.
      (11:59:18 AM) file: which is why I said I think it's reasonable for remove_existing to behave as I stated, which solves the problem
      (11:59:31 AM) gtjoseph: exactly
      (12:01:49 PM) rmudgett: That changes remove_existing behavior from chan_sip behavior as it is documented to enable.
      (12:02:41 PM) file: in a configuration that mirrors chan_sip it would function the same
      (12:02:55 PM) file: the only scenario where it functions differently is if multiple contacts are enabled, which is a chan_pjsip feature
      (12:03:09 PM) file: and it would function in a way which is better for the user
      (12:03:46 PM) file: and if you are enabling multiple contacts then the existing behavior is against that from the first place - as others would be removed defeating it in the first place
      (12:06:58 PM) rmudgett: So. The new behavior would be:
      (12:06:58 PM) rmudgett: 1) Don't check max_contacts if remove_existing is enabled
      (12:06:58 PM) rmudgett: 2) Process the register contacts.
      (12:06:58 PM) rmudgett: 3) If remove_existing enabled then remove the soonest to expire contacts until there are at most max_contacts.
      (12:06:58 PM) rmudgett: 4) send register response
      (12:07:35 PM) file: if you are referring to new logic, yes, that sounds about right
      (12:12:44 PM) rmudgett: With the new behavior, remove_existing could use a better name.
      (12:42:23 PM) gtjoseph: rmudgett: careful with that logic... depending on Expires, times the pass that removes the soonest to expire might remove the one just registered.
      (12:42:40 PM) gtjoseph: s/times//
      (12:42:41 PM) infobot: gtjoseph meant: rmudgett: careful with that logic... depending on Expires, the pass that removes the soonest to expire might remove the one just registered.
      (12:43:21 PM) gtjoseph: i think you have to remove until there's an open spot, then register.
      (12:50:43 PM) rmudgett: We certainly cannot add more than the max. We already keep track of the unmodified contacts so we can remove the soonest to expire unmodified contacts.
      (12:52:43 PM) gtjoseph: ok, so long as it's "unmodified contacts"

      Proposed new algorythm in res_pjsip_registrar.c:register_aor_core() for remove_existing option behavior:

      if remove_existing
        if MAX(added + updated - deleted, 0) > max_contacts
           respond forbidden
      else
        if MAX(added - deleted, 0) + current_contacts_count > max_contacts
           respond forbidden
      
      process the REGISTER contacts keeping track of which contacts were not affected by this REGISTER
      
      if remove_existing
        while current_contacts_count > max_contacts
          ast_assert(unmodified contact count > 0)
          remove the soonest to expire unmodified contact
      
      respond successful with current contacts
      

      The fix for ASTERISK-27147 will take care of connection oriented transports like TCP and TLS. However, the UDP transport will still have this issue.

        Issue Links

          Activity

          Hide
          Scott Pabin added a comment -

          I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest contact to make room for the registering device. This would obviously cause a loop of registrations when max_contacts = x and NumOfDevices > x, if the devices are set to reregister automatically. Such a loop happens now when remove_existing = yes is used with max_contacts > 1, as it seems (without looking the code) that remove_existing removes all contacts for that AOR when a new device attempts to register against that AOR.

          The 'bump off the oldest pollable (responding) contact' when a new device is attempting to register in excess of max_contacts seems to be the less desirable option as the device should instead be denied with a 403 Forbidden. There are scenarios where bumping off the oldest contact (not in a call) is desirable, and that is where a configuration item named remove_existing would make sense by allowing the latest registration attempt to always register successfully.

          Perhaps it would be advantageous to poll existing contacts when a device attempts to register in excess of max_contacts, instead of waiting for them to time out. You can call it auto_purge and make it the default. With this solution, Asterisk can respond within a new device's SIP registration dialog with either a 200 OK (when there is space) or 403 Forbidden (when it's full). A request in excess of max_contacts should continue to throw a warning, and it should trigger an event if it does not already. In this scenario, rewrite_contact would work just fine, if I'm understanding its mechanism. The polling option also allows for the quickest device reregistration when ports have changed, such as is possible after a client-side router restart. The timeout method results in tech-support calls, as it's not an instantaneous recovery.

          Show
          Scott Pabin added a comment - I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest contact to make room for the registering device. This would obviously cause a loop of registrations when max_contacts = x and NumOfDevices > x, if the devices are set to reregister automatically. Such a loop happens now when remove_existing = yes is used with max_contacts > 1, as it seems (without looking the code) that remove_existing removes all contacts for that AOR when a new device attempts to register against that AOR. The 'bump off the oldest pollable (responding) contact' when a new device is attempting to register in excess of max_contacts seems to be the less desirable option as the device should instead be denied with a 403 Forbidden . There are scenarios where bumping off the oldest contact (not in a call) is desirable, and that is where a configuration item named remove_existing would make sense by allowing the latest registration attempt to always register successfully. Perhaps it would be advantageous to poll existing contacts when a device attempts to register in excess of max_contacts , instead of waiting for them to time out. You can call it auto_purge and make it the default. With this solution, Asterisk can respond within a new device's SIP registration dialog with either a 200 OK (when there is space) or 403 Forbidden (when it's full). A request in excess of max_contacts should continue to throw a warning, and it should trigger an event if it does not already. In this scenario, rewrite_contact would work just fine, if I'm understanding its mechanism. The polling option also allows for the quickest device reregistration when ports have changed, such as is possible after a client-side router restart. The timeout method results in tech-support calls, as it's not an instantaneous recovery.
          Hide
          Richard Mudgett added a comment -

          I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest contact to make room for the registering device.

          The change would only have an effect if remove_existing is enabled. If it is not enabled then you cannot add any more contacts than max_contacts allows and the register is rejected with a 403 Forbidden. If it is enabled then you still cannot add more contacts than max_contacts allows. However, the proposal is to remove the contacts that are next to expire and that were not updated by the register until there is room. The proposal will not change the behavior of sending a 403 Forbidden where a single registration attempt is trying to register more contacts than max_contacts.

          I think you have a misunderstanding of AOR's. An AOR is either statically configured for an endpoint or dynamically registered by the endpoint using the SIP REGISTER method. An AOR is only used to initiate new requests (such as calls) to an endpoint. AOR changes do not affect calls that are already in progress.

          This would obviously cause a loop of registrations when max_contacts = x and NumOfDevices > x

          This is an invalid situation as you are attempting to register more endpoint contacts than are configured to be allowed by max_contacts. If you have four devices independently registering themselves for an endpoint when max_contacts = 3 then it is either a configuration error or the user is trying to abuse the system.

          Such a loop happens now when remove_existing = yes is used with max_contacts > 1, as it seems (without looking the code) that remove_existing removes all contacts for that AOR when a new device attempts to register against that AOR.

          This is one of the reasons the issue is discussing a change in behavior for remove_existing. Currently remove_existing does effectively remove all contacts and replace them with the new register's contacts. It is a behavior that doesn't work very well when max_contacts is greater than one. The remove_existing option was originally created only as a means to emulate chan_sip's behavior when max_contacts = 1.

          The initial reason for the issue is the effect rewrite_contact has on registered AORs. The rewrite_contact option is one of the behaviors that allow SIP to operate with a NAT/firewall in the picture. When an endpoint device behind a NAT/firewall registers then it doesn't know the real contact address because the NAT/firewall effectively creates it when the register message goes through. If the device re-registers then it may have a different contact address than before. The re-registration would look like a new device and could be blocked by the max_contacts count. In this case the fix for ASTERISK-27147 takes care of connection oriented transports like TCP and TLS. UDP on the other hand needs some more help in this exception case.

          Polling to remove stale AOR contacts is not that viable a solution.

          • The poll would happen after the re-registration fails.
          • The device may not try re-registering again because it was forbidden.
          • Polling is optional. (qualifying endpoints)
          • Polling an invalid contact takes quite awhile. The query has to wait for all of the poll re-transmissions to complete before the contact can be declared dead.
          • Polling is handled by a different module so it is not that accessible.

          Removing the next to expire contact is simple and does allow a quick recovery. The removed next to expire contact is likely the old and invalid contact the re-registration was attempting to refresh anyway. If it was not then the removed contact needs to be refreshed the soonest.

          Show
          Richard Mudgett added a comment - I may have read through the IRC too quickly, but it looked liked it was proposed to simply remove the oldest contact to make room for the registering device. The change would only have an effect if remove_existing is enabled. If it is not enabled then you cannot add any more contacts than max_contacts allows and the register is rejected with a 403 Forbidden . If it is enabled then you still cannot add more contacts than max_contacts allows. However, the proposal is to remove the contacts that are next to expire and that were not updated by the register until there is room. The proposal will not change the behavior of sending a 403 Forbidden where a single registration attempt is trying to register more contacts than max_contacts . I think you have a misunderstanding of AOR's. An AOR is either statically configured for an endpoint or dynamically registered by the endpoint using the SIP REGISTER method. An AOR is only used to initiate new requests (such as calls) to an endpoint. AOR changes do not affect calls that are already in progress. This would obviously cause a loop of registrations when max_contacts = x and NumOfDevices > x This is an invalid situation as you are attempting to register more endpoint contacts than are configured to be allowed by max_contacts . If you have four devices independently registering themselves for an endpoint when max_contacts = 3 then it is either a configuration error or the user is trying to abuse the system. Such a loop happens now when remove_existing = yes is used with max_contacts > 1, as it seems (without looking the code) that remove_existing removes all contacts for that AOR when a new device attempts to register against that AOR. This is one of the reasons the issue is discussing a change in behavior for remove_existing . Currently remove_existing does effectively remove all contacts and replace them with the new register's contacts. It is a behavior that doesn't work very well when max_contacts is greater than one. The remove_existing option was originally created only as a means to emulate chan_sip's behavior when max_contacts = 1. The initial reason for the issue is the effect rewrite_contact has on registered AORs. The rewrite_contact option is one of the behaviors that allow SIP to operate with a NAT/firewall in the picture. When an endpoint device behind a NAT/firewall registers then it doesn't know the real contact address because the NAT/firewall effectively creates it when the register message goes through. If the device re-registers then it may have a different contact address than before. The re-registration would look like a new device and could be blocked by the max_contacts count. In this case the fix for ASTERISK-27147 takes care of connection oriented transports like TCP and TLS. UDP on the other hand needs some more help in this exception case. Polling to remove stale AOR contacts is not that viable a solution. The poll would happen after the re-registration fails. The device may not try re-registering again because it was forbidden. Polling is optional. (qualifying endpoints) Polling an invalid contact takes quite awhile. The query has to wait for all of the poll re-transmissions to complete before the contact can be declared dead. Polling is handled by a different module so it is not that accessible. Removing the next to expire contact is simple and does allow a quick recovery. The removed next to expire contact is likely the old and invalid contact the re-registration was attempting to refresh anyway. If it was not then the removed contact needs to be refreshed the soonest.
          Hide
          Friendly Automation added a comment -

          Change 6630 merged by Jenkins2:
          res_pjsip_registrar.c: Update remove_existing AOR contact handling.

          https://gerrit.asterisk.org/6630

          Show
          Friendly Automation added a comment - Change 6630 merged by Jenkins2: res_pjsip_registrar.c: Update remove_existing AOR contact handling. https://gerrit.asterisk.org/6630
          Hide
          Friendly Automation added a comment -

          Change 6632 merged by Jenkins2:
          res_pjsip_registrar.c: Update remove_existing AOR contact handling.

          https://gerrit.asterisk.org/6632

          Show
          Friendly Automation added a comment - Change 6632 merged by Jenkins2: res_pjsip_registrar.c: Update remove_existing AOR contact handling. https://gerrit.asterisk.org/6632
          Hide
          Friendly Automation added a comment -

          Change 6631 merged by Jenkins2:
          res_pjsip_registrar.c: Update remove_existing AOR contact handling.

          https://gerrit.asterisk.org/6631

          Show
          Friendly Automation added a comment - Change 6631 merged by Jenkins2: res_pjsip_registrar.c: Update remove_existing AOR contact handling. https://gerrit.asterisk.org/6631
          Hide
          Friendly Automation added a comment -

          Change 6633 merged by Jenkins2:
          res_pjsip_registrar.c: Update remove_existing AOR contact handling.

          https://gerrit.asterisk.org/6633

          Show
          Friendly Automation added a comment - Change 6633 merged by Jenkins2: res_pjsip_registrar.c: Update remove_existing AOR contact handling. https://gerrit.asterisk.org/6633

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development