
Summary: ASTERISK-22833: channel: Channel reference leak
Reporter: Mitch Rodrigues (twitch)
Labels:
Date Opened: 2013-11-08 17:49:09.000-0600
Date Closed: 2014-01-16 16:12:14.000-0600
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Core/Channels
Versions: 11.5.1, 11.6.0
Frequency of Occurrence: Constant
Related Issues:
Environment: CentOS
Attachments: ( 0) extensions.txt
( 1) fd_after_1000.txt
( 2) fd_at_idle.txt
( 3) refcounter_output.txt
( 4) refs
( 5) refs.txt
( 6) sip.txt
Description: FD pipes linger from channel internals after the call has been terminated. It seems to also leave around a lot of anon_inodes when running lsof.

Comments:By: Mitch Rodrigues (twitch) 2013-11-08 17:52:08.846-0600

This is a big one: given any high call volume, you can hit the server's fd limit in a couple of days, as we are noticing.

Steps to Reproduce:
-Turn on Asterisk
-Make some calls
-- This could potentially be audiohook related; it's unclear (use MixMonitor if tests fail)
-Wait for scheduler threads to die
-View open file descriptors
-With lsof, see open anon_inodes

List of FDs that remain open: http://pastebin.com/DdgM3W1s

Lsof of process: http://pastebin.com/yZHjmAxU

We have investigated AGI for fd leaks; all pipes are closed. We also looked at the scheduler for leaks, and that checked out. The leaks seem to be in the channel internals.


By: Matt Jordan (mjordan) 2013-11-10 18:26:19.821-0600

The problem is not really a leak of the channel alert pipe file descriptors, although that is a symptom. Those file descriptors are properly closed out via {{ast_channel_internal_alertpipe_close}} in {{ast_channel_destructor}}.

What you actually have is a channel reference leak; that is, the channel object itself is never properly disposed of. If {{core show channels}} does not actually show the channel (which it probably won't, as the channels have probably been hung up), then you'll need to use reference count debugging to determine which channels are leaking.

[https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging]

You will probably want to create a reproducible scenario. Otherwise, it will be very difficult to determine where the channel reference leak lies.

By: Mitch Rodrigues (twitch) 2013-11-12 11:08:42.077-0600

Getting a lot of these:
==============
Problem: net Refcount not zero for object 7eb0df8
Object 7eb0df8 history:
  0x7eb0df8 =1   taskprocessor.c:449:ast_taskprocessor_get ()
  0x7eb0df8 +1   taskprocessor.c:477:ast_taskprocessor_get () [@1]
==============
Problem: net Refcount not zero for object 7eafff8
Object 7eafff8 history:
  0x7eafff8 =1   chan_iax2.c:2788:create_callno_pools ()
  0x7eafff8 +1   chan_iax2.c:2797:create_callno_pools () [@1]
  0x7eafff8 -1   chan_iax2.c:2800:create_callno_pools () [@2]
==============

And a lot of bad ref counts (I'm truncating the paste):
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
  0x2aaaaf7f4c38 +1   res_timing_pthread.c:301:find_timer () [@1]
  0x2aaaaf7f4c38 -1   res_timing_pthread.c:217:pthread_timer_ack () [@2]
==============



By: Mitch Rodrigues (twitch) 2013-11-12 12:48:40.232-0600

Without digging too much: is there a path in which ast_channel_destructor does not get called on channel destroy? I.e., some set of circumstances, say with ast_queue_hangup or any other mechanism, that can skip ast_channel_destructor from being called? Or where the onus is on the user, who needs to call it themselves?

By: Mitch Rodrigues (twitch) 2013-11-12 14:11:20.850-0600

We also seem to be getting timing threads hanging around, which could be contributing to the pipe growth:

  25 res_timing_pthread.c:129     (pthread_timer_open       ): pipe({25,26})
  26 res_timing_pthread.c:129     (pthread_timer_open       ): pipe({25,26})


By: Mitch Rodrigues (twitch) 2013-11-18 11:15:26.462-0600

Feedback provided.

By: Matt Jordan (mjordan) 2013-11-18 11:52:26.379-0600

{quote}
Without digging too much: is there a path in which ast_channel_destructor does not get called on channel destroy? I.e., some set of circumstances, say with ast_queue_hangup or any other mechanism, that can skip ast_channel_destructor from being called? Or where the onus is on the user, who needs to call it themselves?
{quote}

No. If the reference count on a channel reaches 0, it will always call {{ast_channel_destructor}}. If the reference count on a channel does not reach 0, then it will not be called.



By: Matt Jordan (mjordan) 2013-11-21 09:18:38.221-0600

I don't think we have sufficient information to diagnose or reproduce this problem.

Do you have a particular dialplan and configuration that results in the growth of file descriptors that someone can use to attempt to reproduce what you are seeing?

By: Mitch Rodrigues (twitch) 2013-11-21 12:59:45.731-0600

{noformat}
[testlines]
exten => 120,1,Answer()
exten => 120,n,Playback(please-wait)
exten => 120,n,Dial(SIP/10001@10001)
exten => 120,n,Set(loopc=0)
exten => 120,n(mloop),Set(loopc=$[${loopc} + 1])
exten => 120,n,SayNumber(${loopc},f)
exten => 120,n,Wait(1)
exten => 120,n,Goto(mloop)
{noformat}


By: Mitch Rodrigues (twitch) 2013-11-21 12:59:57.122-0600

Information is above.

By: Matt Jordan (mjordan) 2013-12-07 20:13:00.168-0600

The dialplan is helpful, but I'm unclear how you're really using this.

# When you dial SIP 10001@10001, what is it doing? Is it answering, thus putting the channel into a bridge?
# What channel enters into 120@testlines? Are you generating a SIP call into this, a Local channel, or something else?
# When looping over the SayNumber, how does the channel exit?

By: Mitch Rodrigues (twitch) 2013-12-16 17:52:37.463-0600

Sorry about the delayed response. Here is a better example of a dialplan, without business logic nested in it.

{noformat}
exten => 127,1,Answer()
exten => 127,n,Set(loopc=0)
exten => 127,n(spyloop127),Set(loopc=$[${loopc} + 1])
exten => 127,n,SayNumber(2,f)
exten => 127,n,Wait(2)
exten => 127,n,Goto(spyloop127)
exten => 127,n,Dial(SIP/<PHONENUMBER>@<Outboundprox>)
exten => 127,n,Hangup()
{noformat}


By: Mitch Rodrigues (twitch) 2013-12-16 17:52:46.290-0600

Information above.

By: Walter Doekes (wdoekes) 2013-12-17 02:12:54.925-0600

Mitch: you have an infinite loop. How does it ever get to the Dial() part?

By: Walter Doekes (wdoekes) 2013-12-17 02:25:30.037-0600

-But nevertheless, you are right.-

-It uses 4 extra fds for a call and it releases only 2.-

-Remove the Answer() and the leak is gone.-

I spoke too soon. I was looking at objects that were properly destroyed after a while anyway.

By: Mitch Rodrigues (twitch) 2013-12-17 14:53:04.598-0600

That is an infinite loop; shoot, I pasted the wrong thing, apologies. Basically, the point I was attempting (and failing) to illustrate was: doing something, then dialing, then ending the call. It seemed to leave the pipes and internal fds behind. This seems to be a bigger issue on CentOS 6 as opposed to Ubuntu. (Not seeing the same bad behavior on CentOS 5.5, however.)

By: Rusty Newton (rnewton) 2013-12-17 16:30:17.375-0600

1. Mitch, sounds like Walter was unable to reproduce using the dialplan above.

2. To confirm, you only saw the behavior on CentOS 6?

3. Do you think if I used the dialplan you showed above (modified to not be an infinite loop) on Asterisk installed on CentOS 6 that I would see the issue?

By: Mitch Rodrigues (twitch) 2013-12-18 11:38:26.207-0600

Theoretically yes, but honestly this has been a pain to track down on our end as well; we can see it happening, but finding an explanation has been elusive.

By: Mitch Rodrigues (twitch) 2013-12-21 14:55:47.908-0600

Replied above.

By: Mitch Rodrigues (twitch) 2013-12-21 15:16:52.939-0600

So this is a server we have been running for 8 weeks; we moved a very small portion of our load onto it. It seems that the OS and Asterisk differ as to the type of FD each says is open; it isn't exhibiting more of a leak in PIPEs. (I don't have debug mode running on it, unfortunately, as that crashed at about 50 simultaneous calls, and this is a production box.)

http://pastebin.com/LMazGkWH

One thing to note is that most extensions are being transferred from an inbound hub. We also have monitoring that connects and runs "core show channels count" constantly to make sure we don't have issues across the board. We did notice that calling "core show channels" without the count option causes locking issues that back up Asterisk, a behavior that changed between 1.4.43 and now.

This is all I've got so far.

Edit: update 1/8/2014
  We notice that not all calls exhibit this behavior; at the rate it is leaking, it takes about 1000 calls to get a noticeable number of pipes lingering on the system.


By: Rusty Newton (rnewton) 2014-01-09 16:31:04.740-0600

Attempted to reproduce, using your suggestions.

* Fresh install of CentOS 6.5
* Using your example dialplan loop, modified a little bit (see extensions.txt)
* Using simply configured SIP extensions (see sip.txt)

Test was to make a thousand dial attempts using the dialplan outbound call loop.

After starting Asterisk, I looked at the output of "*lsof -a -p 19194 | wc -l*" and it was at 324.
A few minutes after making a thousand dial attempts using the loop, the open file count was the same.

I checked "*core show fd*" before and after the test, and the open fds shown were the same. See fd_at_idle.txt and fd_after_1000.txt.

I also set up ref count debugging per the wiki and logged /tmp/refs for chan_sip. See refs.txt and the output of utils/refcounter in refcounter_output.txt.

From the debug I didn't notice any growth in file descriptors, though refcounter output quite a number of "Problem"s. I'll ping a dev to take a look at /tmp/refs and the refcounter output, as I'm no Asterisk dev.

By: Rusty Newton (rnewton) 2014-01-09 17:40:21.754-0600

@Mitch,

I had a couple of developers look at the refcounter output, and it appears it is all benign; some of it results from ref debugging only chan_sip rather than all modules.

At this point we are unable to reproduce the issue and therefore can't move forward with diagnosis.

Would you be able to provide a full /tmp/refs file from a system where the issue occurs (and where ref count debugging is enabled for all Asterisk modules)?  If possible, you'll probably want to set that up on a non-production system.

Otherwise we'll need more specific instructions on exactly how to reproduce the issue, as what we have so far didn't work.

By: Mitch Rodrigues (twitch) 2014-01-14 17:19:06.467-0600

This one is leaking pipes, though only a minimal number of pipes are leaking, due to the number of calls we can generate in our QA stack.

By: Mitch Rodrigues (twitch) 2014-01-14 17:19:16.298-0600

Attached

By: Mitch Rodrigues (twitch) 2014-01-14 17:19:39.478-0600

Attached refs file.

By: Mitch Rodrigues (twitch) 2014-01-14 17:20:59.205-0600

You know, if I can be pointed in the right direction I can possibly patch it; however, the ramp-up needed to follow these pipes through the system is more bandwidth than I have currently.

By: Rusty Newton (rnewton) 2014-01-16 16:12:00.606-0600

Matt Jordan analyzed your provided refs file, but it too turned out to be benign, so there is still nothing else to go on, unfortunately.

He did notice that you were using res_timing_pthread.so, which is extended support and is known to have issues. He and others recommend using res_timing_timerfd.so if possible. [You can find additional info on timing interfaces here|https://wiki.asterisk.org/wiki/display/AST/Timing+Interfaces].

I re-tested with your described conditions and dialplan using res_timing_pthread.so to see what would happen. After around 1200 call attempts I still didn't see any growth in file descriptors, stuck channels, or anything strange for that matter.

I suppose that in the scenarios where you reproduce the issue, there is likely something the channel is doing that isn't happening in my test.

I'm going to close this out for now as Cannot Reproduce. If you can provide an exact configuration scenario to reproduce the issue then we can re-open, but otherwise the refs info and other debug haven't been helpful to any of the developers looking at it so far. At least the issue is logged here, so if someone else reports a similar issue then we have more knowledge on it. I know this is frustrating, but we don't have anything else to go on.

By: Rusty Newton (rnewton) 2014-01-16 16:13:04.863-0600

I wanted to also note that I was testing with the very latest SVN of the 11 branch, so you might try with that if you haven't already.

By: Mitch Rodrigues (twitch) 2014-01-24 18:23:39.792-0600

I totally understand. I will try against the latest SVN and see if the issue is resolved there.