ASTERISK-21194: chan_sip can fail to find a peer during reload

[Home]

Summary: ASTERISK-21194: chan_sip can fail to find a peer during reload

Reporter: Jaco Kroon (jkroon) Labels:

Date Opened: 2013-02-28 15:17:10.000-0600 Date Closed:

Priority: Minor Regression?

Status: Open/New Components: Channels/chan_sip/General

Versions: 11.2.1 13.18.4 Frequency of
Occurrence

Related
Issues:
is related to ASTERISK-21377 After sip reload 80% peers is UNREACHEBLE

Environment: Attachments:

Description: During a global system reload I saw this:

{noformat}
[Feb 28 16:50:26] VERBOSE[2712][C-0000317a] pbx.c: -- Executing [number@prov:5] Dial("Local/number@foo-0000377b;2", "SIP/bar/number,,") in new stack
[Feb 28 16:50:26] VERBOSE[2712][C-0000317a] netsock2.c: == Using SIP RTP CoS mark 5
[Feb 28 16:50:26] ERROR[2712][C-0000317a] netsock2.c: getaddrinfo("bar", "(null)", ...): Name or service not known
[Feb 28 16:50:26] WARNING[2712][C-0000317a] chan_sip.c: No such host: bar
[Feb 28 16:50:26] WARNING[2712][C-0000317a] app_dial.c: Unable to create channel of type 'SIP' (cause 20 - Subscriber absent)
{noformat}

sip show peer (after reload):

{noformat}
* Name : bar
Description :
Secret : <Not set>
MD5Secret : <Not set>
Remote Secret: <Not set>
Context : uls-makecall
Record On feature : automon
Record Off feature : automon
Subscr.Cont. : <Not set>
Language :
Tonezone : <Not set>
Accountcode : bar
AMA flags : Unknown
Transfer mode: open
CallingPres : Presentation Allowed, Not Screened
Callgroup :
Pickupgroup :
Named Callgr :
Nam. Pickupgr:
MOH Suggest :
Mailbox :
VM Extension : 8579
LastMsgsSent : 0/0
Call limit : 2147483647
Max forwards : 0
Dynamic : No
Callerid : "" <>
MaxCallBR : 384 kbps
Expire : -1
Insecure : no
Force rport : Auto (No)
Symmetric RTP: No
ACL : No
DirectMedACL : No
T.38 support : Yes
T.38 EC mode : Redundancy
T.38 MaxDtgrm: -1
DirectMedia : No
PromiscRedir : No
User=Phone : No
Video Support: No
Text Support : No
Ign SDP ver : No
Trust RPID : No
Send RPID : No
Subscriptions: Yes
Overlap dial : No
DTMFmode : rfc2833
Timer T1 : 500
Timer B : 32000
ToHost : 10.0.0.14
Addr->IP : 10.0.0.14:5060
Defaddr->IP : (null)
Prim.Transp. : UDP
Allowed.Trsp : UDP
Reg. exten :
Def. Username:
SIP Options : (none)
Codecs : (g729)
Codec Order : (g729:20)
Auto-Framing : No
Status : OK (1 ms)
Useragent :
Reg. Contact :
Qualify Freq : 60000 ms
Keepalive : 0 ms
Variables :
__noivr = yes
Sess-Timers : Accept
Sess-Refresh : uas
Sess-Expires : 1800 secs
Min-Sess : 90 secs
RTP Engine : asterisk
Parkinglot :
Use Reason : No
Encryption : No
{noformat}

And it would have looked exactly the same just before reload. The section in sip.conf:

{noformat}
[bar]
type=friend
host=10.0.0.14
qualify=yes
disallow=all
allow=g729
context=uls-makecall
directmedia=no
dtmfmode=rfc2833
accountcode=IS
jbforce=no
setvar=__noivr=yes
transport=udp
{noformat}

Comments: By: Michael L. Young (elguero) 2013-02-28 16:46:31.532-0600

I believe that this is correct. _During_ a module reload, non-active peers are destroyed and then reloaded. So, this call is probably coming in while the peer does not exist and has not been reloaded yet.
By: Jaco Kroon (jkroon) 2013-03-01 01:17:44.782-0600

Michael, please clarify what you mean with non-active. The particular peer in question typically carries anything between 30 and 50 concurrent calls between 8:00 and 17:00 so I seriously doubt that it will be non-active (at least, according to my definition) at any given point in time.

I would have expected that the new peers are created and set up and then atomically replaced, or potentially even just using a mark & update mechanism where the config is scanned, if a peer is to be replaced, create the new peer structure, clone any qualify state parameters over (so that not all peers needs to be re-qualified on reload) permitting that the host= line hasn't changed, and then atomically replace the peer, after new peers is loaded, purge peers that's no longer in the new config ... ?
By: Michael L. Young (elguero) 2013-03-01 10:09:08.100-0600

Sure... this is the documentation I looked at briefly before replying. Not sure if it is very clear though.

{code}
/*! \brief Re-read SIP.conf config file
\note This function reloads all config data, except for
active peers (with registrations). They will only
change configuration data at restart, not at reload.
SIP debug and recordhistory state will not change
*/
static int reload_config(enum channelreloadreason reason)
{
{code}

If I am understanding the code correctly (I am sure someone will chime in if I am wrong), I see the peers are marked, the default settings are set, config settings then are set, peers are then built, if a peer is found it is retrieved and unlinked from the peers table, settings are then applied to it and then added (linked) back in to the peers table, peers that were marked and but not found during the build_peer process (user deleted it) will be unlinked at the end, then peers are poked (qualified).

So I would think that as things currently are, there is room for this to occur when the peer is not linked in the peers table while its settings are being checked and applied before being linked back in.

I am still learning chan_sip.so module's code. We can wait for some confirmation as to whether my assessment is correct or not.
By: Jaco Kroon (jkroon) 2013-03-01 11:00:18.020-0600

Michael,

According that definition the particular peer in question isn't active since the remote end doesn't register to us. It's a static peer with host=10.0.0.14.

Anyway, the way I look at the code a simple "sip reload" command is an extremely dangerous thing to do.

Firstly, the global settings are actually reset to default (whilst other threads may be using them). So for starters there are massive potential issues with stuff going on in the sip_cfg structure. It then iterates through the options from the config file and config options are updated appropriately. This seems sensible until you start thinking about high-volume systems (and I don't consider my systems with 30-50 concurrent calls to be that high volume even)! But they do get issued with reloads from time to time to add new clients, or change parameters for specific clients. Even without the load the race remains, it's just much less likely to strike. The point being that there is a brief interval during the reload where the settings may not be what I want it to be - and this CAN influence (and plainly does) my clients.

Out of hand, unfortunately I don't see a simple way to make the global settings change atomic, but a better way may be to load the config into an on-stack structure, and to then just update those settings that are not the same. Not sure how atomic a single write really is, but to the best of my understanding (on x86 and x86_64 at least) a typical variable assignment to a 32-bit (or 64 on x86_64) is atomic to the memory bus. This changing setting by setting will already be a heck of a lot better than current.

Secondly, it seems a lot of error checking happens *after* the updates to the global config structures. The above strategy would deal with that too - invalid config? Ok, no harm done, just refuse to clone into global config (or as in some cases from the code, just restrict the values appropriately).

If the above suggestion is acceptable I'm willing to attempt a patch.

Regarding the peers themselves - that code looks like black magic to me. From what little I do understand the build_peer function is used to construct a peer. It would seem that you're correct in that any non-realtime peers, and peers that isn't rt-cached (don't quite follow the logic of that first iff, but realtime is false for the peers coming from the config file, so the condition ends up being true) are *removed* from the peer list at this time. After this, if the peer is NOT marked, we set firstpass to false (ie, peers are marked before we scan the config file, and if we bump into them multiple times in the config file it's dealt with differently.

At this point if this is a firstpass over the peer some variables are stored, peer is reset to default, and then we start updating it again. Point being that whilst the peer is unlinked (nothing about active vs non-active - it's according to the code about realtime vs non-realtime) the update to the config is happening. Only after the config is rebuilt does the peer get re-added. A better mechanism might be to *clone* the peer, work on the cloned copy, and then atomically replace the original with the clone. Never had to deal with the a2o voodoo but this might be something that the a2o containers might be perfectly capable of dealing with. Perhaps again, build up a whole new a2o container and swap the containers?

Disclaimer: Whilst I've written my fair share of code I honestly have no idea of the scope of my suggestions.
By: Richard Mudgett (rmudgett) 2013-03-01 13:59:09.982-0600

Jaco,

The reload issue you have run into is why the configuration framework was created. The configuration framework behaves like what you describe the configuration reload should work. The new configuration is swapped into use atomically. It does take advantage of ao2 reference counted objects so you will need to become familiar with them. Look at ConfBridge for an example. (apps/confbridge/conf_config_parser.c:conf_load_config()) It is not an easy task to retrofit large modules like chan_sip to use the configuration framework and to access their parameters safely.
By: Matt Jordan (mjordan) 2013-03-01 14:39:22.652-0600

There is a [wiki article|https://wiki.asterisk.org/wiki/display/AST/Using+the+Configuration+Framework] on using the Configuration Framework as well.

I should note that {{chan_sip}} has several design limitations that make it difficult to fix this particular problem:
* Use of {{users.conf}}
* Support for realtime (sorcery was created to address this, however, that is not available in 11)
* Use of {{[general]}} options as opposed to only using templated configuration items

I'd also note that refactoring the reload operation in {{chan_sip}} in a release branch would be a very intrusive change that would be discouraged. Since a new SIP channel driver is being written for 12 that uses sorcery (which in turn uses the Configuration Framework under the hood), these kinds of reload operations won't be a problem in 12.

In general, the reload operation in {{chan_sip}} is what it is. If it doesn't crash, deadlock, or otherwise permanently harm a system, I'm inclined not to try and change its behavior.
By: Jaco Kroon (jkroon) 2013-03-02 07:18:04.645-0600

It's good to know work on this is happening. The risk seems very small for the config being in limbo at the time it's required.

Matt, I hear, and respect what you're saying about the reload operation being what it is. My question becomes this - how far is asterisk 12 off? I have to agree with you that such a patch would be extremely intrusive based on what I've seen of the code. Especially if we're switching to a new config mechanism/framework.

Let me ask this - if I (or someone else for that matter) puts in the effort to generate the patch -

a) would it at least be given some consideration?
b) would I be able expect some level of assistence from #asterisk(-dev)?
c) who would be the best person to assist?
d) should the patch not be merged - would someone at least still be willing to review it for me for obvious errors so that I can carry it on my own installations?

Assuming that I might not be able to get full atomicity, would a compromise where each peer is at least loaded/rejected atomically, and the global config done atomically be acceptable? In other words - rather than getting it 100%, get similar *results* as is current (ie, some peers can load whilst others fail), but at least make the portions that succeed more thread safe?
By: Matt Jordan (mjordan) 2013-03-04 09:50:40.999-0600

{quote}
It's good to know work on this is happening. The risk seems very small for the config being in limbo at the time it's required.

Matt, I hear, and respect what you're saying about the reload operation being what it is. My question becomes this - how far is asterisk 12 off? I have to agree with you that such a patch would be extremely intrusive based on what I've seen of the code. Especially if we're switching to a new config mechanism/framework.
Let me ask this - if I (or someone else for that matter) puts in the effort to generate the patch -
a) would it at least be given some consideration?
{quote}

So, first let's establish some of what would need to happen in order to ensure that during a reload, operations currently in-flight complete successfully, operations issued during a reload block until the reload is finished, and all blocked and subsequent operations get the new information.

# Global information is still tracked via static variables in {{chan_sip}}. Those would all have to be moved into a global ao2 object, and all references to that global information would have to obtain that object properly and dispose of it properly.
# There are multiple containers tracking dialogs and peers. They have to be properly synchronized and the lifetimes of the objects properly managed during a reload.
# Information has to be loaded from multiple configuration sources during a reload: {{sip.conf}}, {{users.conf}}, as well as realtime backends. That information has to be properly merged and done so atomically.

The question isn't really "will it be given consideration" - the question is, can the things done above be performed in such a fashion such that there is a high level of confidence that the patch does not introduce a regression? That is, can we do all of those things and know that:
# There are no crashes
# There are no deadlocks
# There are no resource/memory leaks
# Information is properly loaded/reloaded

My guess is: No. And that's not an indictment on anyone's abilities; that's just the result of 35000 lines of code being difficult to maintain.

We actually did think about tackling {{chan_sip}} using the configuration framework, and realized we'd be better off devoting development resources on the new SIP channel driver and making sure we made use of the configuration frameworks available there, rather than trying to refactor {{chan_sip}} some more (and probably breaking everything in the process).

{quote}
b) would I be able expect some level of assistence from #asterisk(-dev)?
{quote}

By assistance, would you be able to ask questions and get pointers from developers in #asterisk-dev? Absolutely!

By assistance, do you mean people will contribute code? Unknown.

If you're interested in this problem, I wouldn't start with {{chan_sip}}. Start with something that has similar problems, i.e., something that can be reloaded but isn't completely thread-safe, but is much more manageable. A few candidates:
* dnsmgr
* cdr
* res_rtp_asterisk

Note that these are still plenty challenging, but will get you involved in a refactoring effort that will give you familiarity with the various frameworks involved.

{quote}
c) who would be the best person to assist?
{quote}

In general, just ask a question. A fair number of people are familiar with the frameworks/code involved.

{quote}
d) should the patch not be merged - would someone at least still be willing to review it for me for obvious errors so that I can carry it on my own installations?
{quote}

Yes, it'd be put up for review. If the patch was deemed too intrusive for a release branch, it could still be evaluated for trunk.

{quote}
Assuming that I might not be able to get full atomicity, would a compromise where each peer is at least loaded/rejected atomically, and the global config done atomically be acceptable? In other words - rather than getting it 100%, get similar results as is current (ie, some peers can load whilst others fail), but at least make the portions that succeed more thread safe?
{quote}

I think you're going to find that doing each peer atomically and the global config atomically is just as hard as doing the whole thing. The configuration framework is designed to take care of that aspect of it for you - if you attempt to do it without the framework's assistance, you'll end up re-inventing a lot of what it already did.

By: Jaco Kroon (jkroon) 2013-03-04 10:07:41.218-0600

Thanks. Think I'll probably start with something simpler then. Probably dnsmgr which just about only has two knobs. From that I should then be able to get OK from someone as to what I'm doing is correct. From there I have no choice but to tackle chan_sip. It's just too serious for me to leave alone, unless you're telling me * 12 is less than a month away and that an upgrade from 11=>12 will be seamless.

Just to make a note: I *am* seeing sporadic lockups with sip reload. Had one on Friday. Don't have sufficient information for a proper bugreport though, pretty much only that chan_sip stopped servicing the udp socket bound to port 5060 as indicated by a massive (2MB) recv-q (netstat). That and even peer monitoring went away (ie, INVITE packets would be transmitted on channel creation, and if I explicitly executed a sip qualify peer I'd see an outbound OPTIONS packet).

And yea, by assistance I meant exactly that - getting people to point me in the right directions on #asterisk-dev. Since I don't ever change the configs for the three modules you mentioned they're not particularly serious for me, but I agree that something a little simpler is probably good for figuring out how it works before tackling the big project.
By: Matt Jordan (mjordan) 2013-03-04 11:11:14.655-0600

{quote}
From there I have no choice but to tackle chan_sip. It's just too serious for me to leave alone, unless you're telling me * 12 is less than a month away and that an upgrade from 11=>12 will be seamless.
{quote}

# The [Asterisk versions|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions] page shows the expected timeline of all major versions. (It isn't a month away :-))
# The upgrade from 11 to 12 will *not* be seamless. Given the scale of the things being tackled, major functionality is expected to change. If nothing else, the new SIP channel driver will not be 100% backwards compatible with {{chan_sip}} (doing so would be silly, given the poor configuration schema that {{chan_sip}} adopted)

By: Rusty Newton (rnewton) 2013-03-06 16:38:34.511-0600

Ack'ing this and assigning to Jaco since it looks like he'll take a crack at it.
By: Jaco Kroon (jkroon) 2013-04-03 07:47:34.445-0500

Been sitting on this ... fixing other more critical * issues atm (crashes are worse than call failures). Haven't forgotten.
By: Jaco Kroon (jkroon) 2013-12-30 15:06:29.096-0600

Re-looked at the code, and it's above my skill level (well, I actually just don't know the asterisk internals well enough, but from the looks of it we have to remove the object from the ao2 container and then re-add it, which is what causes the problem). In the meantime I've managed to reduce my reloads quite a bit and then only reload portions of asterisk as per required (ok fine, I still do a full asterisk reload but managed to prevent the config file timestamps from updating unless the content of the config files actually change, so most of the time the individual reloads just short-out due to the config file not changing) so my risk of this has also reduced significantly. Also managed to track the issue with stopping to service the udp port to a combination of DNS time-outs (which is outside of control, and non-trivial to fix but shouldn't affect 12) and blocking of astdb (sqlite) which I've since moved onto ramfs (I don't care if it goes lost, it's SIP and IAX/2 registry information only in there), DNS is highly sporadic and mostly cached, so honestly, for me the risk of this triggering again is incredibly low now, to the point where I'd rather address other concerns. If someone else would like to take a wack at this, please feel free, or close, however you see fit.