ASTERISK-24837: chan_sip calls to Asterisk result in file descriptors growing exponentially while channels remain up

[Home]

Summary: ASTERISK-24837: chan_sip calls to Asterisk result in file descriptors growing exponentially while channels remain up

Reporter: Private Name (falves11) Labels:

Date Opened: 2015-02-26 23:05:32.000-0600 Date Closed: 2015-11-10 19:20:06.000-0600

Priority: Major Regression?

Status: Closed/Complete Components: Resources/res_rtp_asterisk

Versions: 13.2.0 Frequency of
Occurrence

Related
Issues:
is related to ASTERISK-25460 UDP leak

Environment: Linux 64 Attachments: ( 0) asterisktest.sh
( 1) core_show_fd_120_channels.txt
( 2) sip.conf
( 3) trace.txt
( 4) trace.txt
( 5) trace.txt
( 6) valgrind.txt
( 7) valgrind.txt
( 8) valgrind.txt
( 9) valgrind.txt

Description: [Edit by Rusty - Please use the wiki formatting available to make reports easy to read - I'm going to clean this up for now.]

When I originate several hundred calls using a call file, no dialplan, using app Echo() or app MusicOnHold, the calls connect, but when the other side starts to send media, after some 200+ calls I get several errors:
{noformat}
"ERROR[2433] res_rtp_asterisk.c: RTCP SR transmission error to 208.78.162.174:34443, rtcp halted Operation not permitted"
{noformat}
and the handle count explodes, measured at the end
{noformat}
lsof | grep asterisk | grep FIFO | wc -l
1025046
{noformat}
yes one million and change. The handle count never decreases as long as the channels are open. There are no active calls, only 500 channels.

The call files are all identical:
{noformat}
Channel: SIP/0000000000@demo
CallerID: "0000000000" <>
WaitTime: 45
MaxRetries: 0
RetryTime: 0
Application: Echo
Data:
Archive: no
{noformat}
where demo is a simple peer like this
{noformat}
[demo]
host=xxx.xxx.xxx.xx
type=peer
insecure=port,invite
context=reject
disallow=all
allow=ulaw
allow=g729
session-timers=accept
port=5060
faxdetect=no
transport=udp
directmedia=yes
{noformat}
*Note 1:* the caller ID may vary call by call it makes no difference. The issue here is the very high handle count, which slows down and kills the machine, and the errors which show that many calls do not connect.

*Note 2:* The calls do no across the internet, they go to a local box.
I need to send as many calls in real life, with media, so this is a killer for my business model.
If I set up debug=10 and verbose=20, I get thousands of lines identical like this ones
{noformat}
Feb 26 23:47:13] DEBUG[2433]: acl.c:963 ast_find_ourip: Attached to given IP address
[Feb 26 23:47:13] DEBUG[5763]: res_rtp_asterisk.c:3958 ast_rtcp_read: Got RTCP report of 64 bytes
[Feb 26 23:47:13] DEBUG[5763]: acl.c:958 ast_find_ourip: Not an IPv4 nor IPv6 address, cannot get port.
[Feb 26 23:47:13] DEBUG[5763]: netsock2.c:172 ast_sockaddr_split_hostport: Splitting 'dasaro' into...
[Feb 26 23:47:13] DEBUG[5763]: netsock2.c:226 ast_sockaddr_split_hostport: ...host 'dasaro' and port ''.
[Feb 26 23:47:13] DEBUG[5763]: acl.c:958 ast_find_ourip: Not an IPv4 nor IPv6 address, cannot get port.
[Feb 26 23:47:13] DEBUG[5763]: acl.c:963 ast_find_ourip: Attached to given IP address
{noformat}

{noformat}
ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1048576
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
{noformat}

*Note 3:* Please note how the handles start to grow with each additional call
{noformat}
Count idle
46
1 call
lsof | grep asterisk | grep FIFO | wc -l
104
2 calls
lsof | grep asterisk | grep FIFO | wc -l
168
3 calls
lsof | grep asterisk | grep FIFO | wc -l
240
4 calls
lsof | grep asterisk | grep FIFO | wc -l
320
{noformat}
As you can see, the handles do not grow linearly, but close to exponentially.

Comments: By: Rusty Newton (rnewton) 2015-02-27 06:50:42.186-0600

To analyze this further, we'll need the following:
* Relevant Asterisk configuration files, including channel configuration files and dialplan configuration (for example your complete sip.conf)
* A scenario with step-by-step instructions that reproduces this behavior
* Asterisk logs covering the time of operation that exhibit the issue

You will need to recompile Asterisk using the DEBUG_FD_LEAKS compiler flag. You can then observe and record the output of 'core show fd' [1] before and after the issue begins.

In this case you might get the output after one call, two calls, three calls, etc.

When collecting debug information, please follow the instructions on the wiki [2].

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/CLI+commands+useful+for+debugging?src=search#CLIcommandsusefulfordebugging-coreshowfd
[2] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
By: Rusty Newton (rnewton) 2015-02-27 06:51:47.165-0600

Be sure to use the "Enter Feedback" button to provide feedback to make sure we don't miss the response.
By: Private Name (falves11) 2015-02-27 08:19:37.514-0600

Note: This is not a chan_sip problem. The issue is identical with chan_pjsip. This an RTP issue, I guess.

By: Private Name (falves11) 2015-02-28 07:44:21.777-0600

This is is my sip.conf and the command "core show fd" for one to 20 channels.

By: Private Name (falves11) 2015-02-28 07:53:03.491-0600

I just uploaded my sip.conf and the results of the command "core show fd", from 1 open channel to 20, one by one
lsof | grep asterisk | grep FIFO | wc -l
2440
However, once I sent 100 calls at once, I got
lsof | grep asterisk | grep FIFO | wc -l
62640
Please see the file
core_show_fd_120_channels
Note: there is no dialplan, all the calls are originated with a call file and the channel dropped into Echo(). On the other side, there a music on hold that send back RTP

By: Private Name (falves11) 2015-02-28 07:53:45.197-0600

This is the command core show fd with 120 channels open
By: Private Name (falves11) 2015-02-28 09:41:36.518-0600

This with 183 channels open. I try to open more channels but it is as far as it goes. It kicks me out, but it does not crash.
lsof | grep asterisk | grep FIFO | wc -l
143152

By: Private Name (falves11) 2015-02-28 09:43:53.301-0600

I just uploaded a core show fd trace with 183 channels. It is as far as it lets me go.
lsof | grep asterisk | grep FIFO | wc -l
143152

By: Private Name (falves11) 2015-03-01 08:00:33.853-0600

Every half hour or so my Asterisk 13 crashes. It may be related to this bug or not. I think it may be the same bug. I am uploading the while file. I cannot do debugging in production.

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/usr/sbin/safe_asterisk: line 164: 5222 Aborted (core dumped) nice -n $PRIORITY "${ASTS BINDIR}/asterisk" -f ${CLIARGS} ${ASTARGS} > /dev/null 2>&1 < /dev/null
*** buffer overflow detected ***: /sbin/asterisk terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f80971ba9c7]
/lib64/libc.so.6(+0x10bb90)[0x7f80971b8b90]

By: Private Name (falves11) 2015-03-01 08:01:44.607-0600

This is what comes out in my screen on these crashes that happen every 30+ minutes
By: Private Name (falves11) 2015-03-05 11:17:18.216-0600

I wonder if is there anything else I may do to help identify the issue.
By: Corey Farrell (coreyfarrell) 2015-03-05 11:32:59.466-0600

This is a long shot, could this be related to ASTERISK-22417? If you can easily reproduce this issue it might be worth testing that patch. Disclaimer: I have not tried the patch, I don't know it it's safe or effective.
By: Private Name (falves11) 2015-03-05 11:38:09.905-0600

The patch is for a function that executes when a call or a channel closes.
My issue is when they are up, I never close any call, just create the channels, and the machine collapses.
Please correct me if I am wrong.
I certainly see the case reported on the other issue.
I start 10 channels, do a netstat -tunap, the RTP ports show. On the other side I do a core restart now, and sure enough, the count "core show channels" goes to zero. But the same ports seem to be allocated, the output for netstat -tunap is identical. They never get freed.
I applied the patch for the issue noted, and it does solve this issue, neither solves that issue.
With or without the patch, the output of netstat -tunap is identical after the channels do close. However, the RTP ports do close after a few seconds, not immediately. I guess that makes the other issue a non-issue.
This issue, instead, is real.
By: Private Name (falves11) 2015-03-05 13:16:27.014-0600

Asterisk idle
lsof | grep asterisk | grep FIFO | wc -l
32
10calls.sh
1
2
3
4
5
6
7
8
9
10
lsof | grep asterisk | grep FIFO | wc -l
572
10calls.sh
1
2
3
4
5
6
7
8
9
10
lsof | grep asterisk | grep FIFO | wc -l
1512
10calls.sh
1
2
3
4
5
6
7
8
9
10
lsof | grep asterisk | grep FIFO | wc -l
2852

As you can see, this means
0 channels ----------> 32 handles
10 channels--------> 572
20 channels------->1512
30 channels------->2852
40 channels------->4952
..
60 channelss----->9272
Note: this is SVN-branch-11-r432484M
But all version show the same behavior. It is exponential

By: Private Name (falves11) 2015-03-09 15:42:26.376-0500

I just checked on my IVR server and with 250 open calls, real calls, playing prompts, getting digits
asterisk -rx 'core show channels count'
249 active channels
249 active calls
116652 calls processed
lsof | grep asterisk | grep FIFO | wc -l
127757
It seems to be stable, up and down, but it makes the server way slower than it should,making scalability impossible.
I would like to offer a fee to somebody who can produce a fix. It is affecting my ability to do business.

By: ibercom (ibercom) 2015-03-10 07:44:12.182-0500

[~falves11]: Remember to use "Enter Feedback" to send the issue back. It is "Waiting for Feedback" in this moment.
By: Private Name (falves11) 2015-03-10 07:48:30.433-0500

I did already upload all the requested information. Is there anything else I may do to help fix this issue?
By: Corey Farrell (coreyfarrell) 2015-03-13 01:31:05.386-0500

Can you reproduce this issue with [REF_DEBUG|https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging] enabled? I don't recommend running with REF_DEBUG any longer than it takes to reproduce the issue.

Lifetime of RTP instances is controlled by AO2. If the RTP sockets are not being closed after calls end, it probably means you are leaking references to the RTP instances.
By: Corey Farrell (coreyfarrell) 2015-03-13 01:40:24.621-0500

I forgot to mention, since you are running {{chan_sip}} on Asterisk 13, please disable {{res_pjsip}} in your REF_DEBUG build. Not all pjsip modules shutdown, so loading them causes many leaks to be reported. Not normally a production issue, but it would interfere with REF_DEBUG results.
By: Private Name (falves11) 2015-03-13 05:26:47.709-0500

Dear Corey
I already compiled Asterisk with ref_debug, and uploaded the result. Kindly look at the files I uploaded many days ago. It seems nobody noticed it.
The issue is not related to chan_sip or chan_pjsip, since it happens identically with both channels.
Please let me know what else can I do to help making Asterisk a robust telecom development tool.
By: Corey Farrell (coreyfarrell) 2015-03-13 11:29:44.246-0500

I just looked at the attachments again and you have not uploaded any refs logs. Please take a look at the REF_DEBUG wiki page [https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging]. Follow all 8 steps under the heading "Enable Reference Count Log".

I know this is not specifically chan_sip or chan_pjsip, but it will be easier to troubleshoot this using chan_sip and excluding chan_pjsip.
By: Private Name (falves11) 2015-03-19 22:02:11.271-0500

I find it impossible to debug the refs count because as per the instructions,
"Shutdown Asterisk using asterisk -rx 'core stop gracefully'. Other methods of shutdown produce hundreds or thousands of false positives."

My asterisk never stops with "core stop gracefully".

Please give me some advice.

By: Corey Farrell (coreyfarrell) 2015-03-19 22:39:13.316-0500

I think you just need to shorten your test. Memory leaks don't immediately bring you from 0 leaks to a crashed server - I assume it takes hundreds of calls (more?) to get to the point of crashing. You want to reproduce the leak with the minimum number of calls. So instead of running calls until you start getting errors, try running only 25 calls. Shutdown gracefully, check the refs debug results to see if you produced leaks. If that doesn't give you any leaks then repeat the test but with 50 calls. The point at which you start leaking is VERY likely to happen before Asterisk becomes unrecoverable, likely well before you even see any errors in the normal logs.

It might be possible that you are leaking on every single call (given the number of FD's you are leaking). Might not hurt to try running a test with a single call to see if that produces any ref_debug results.
By: Private Name (falves11) 2015-03-19 23:20:42.478-0500

Very good idea: this is for a single call
==== Leaked Object 0x2c7f868 history ====
[185] res_odbc.c:881 load_odbc_config: +1 - [**constructor**]

==== Leaked Object 0x2c7f3c8 history ====
[185] res_odbc.c:865 load_odbc_config: +1 - [**constructor**]
[185] res_odbc.c:1029 odbc_register_class: +1 - [1]
[185] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [2]
[185] res_odbc.c:424 odbc_obj_destructor: -1 - [3]
[185] res_odbc.c:920 load_odbc_config: -1 - [2]
[2982] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [1]
[2982] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [2]
[2982] res_odbc.c:1288 _ast_odbc_request_obj2: -1 - [3]
[2982] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [2]
[2982] res_odbc.c:1288 _ast_odbc_request_obj2: -1 - [3]
[2982] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [2]
[2982] res_odbc.c:1288 _ast_odbc_request_obj2: -1 - [3]

==== Leaked Object 0x7f4fc4005b88 history ====
[2982] res_odbc.c:1259 _ast_odbc_request_obj2: +1 - [**constructor**]
[2982] res_odbc.c:1280 _ast_odbc_request_obj2: +1 - [1]
[2982] res_odbc.c:1103 odbc_release_obj2: -1 - [2]
[2982] res_odbc.c:1252 _ast_odbc_request_obj2: +1 - [1]
[2982] res_odbc.c:1103 odbc_release_obj2: -1 - [2]
[2982] res_odbc.c:1252 _ast_odbc_request_obj2: +1 - [1]
[2982] res_odbc.c:1103 odbc_release_obj2: -1 - [2]
[2982] res_odbc.c:1252 _ast_odbc_request_obj2: +1 - [1]
[2982] res_odbc.c:1103 odbc_release_obj2: -1 - [2]

==== Leaked Object 0x2c78f58 history ====
[185] res_odbc.c:1885 load_module: +1 - [**constructor**]

But I am now doing my original test.

By: Corey Farrell (coreyfarrell) 2015-03-19 23:51:12.253-0500

Are you using ODBC in any way? If not it will help to put {{noload=res_odbc.so}} into {{modules.conf}}. I'm almost positive these leaks have nothing to do with your issue, and is a simple matter of res_odbc not unloading at exit.

Based on the descriptions you've given there may be a minimum threshold for producing the leak - a single call didn't do it. As I said 25 is probably a good number to start with, setup your call files to do originations at the same rate you would for a larger test. It seems like it will take a certain amount/intensity of activity to produce any leaks. Too much activity and Asterisk fails, causing the debug information to be incomplete (unusable). Unfortunately you will have to experiment to find an activity level which produces usable information.

I believe we are looking for leaked objects that are touched by one or more of: {{rtp_engine.c}}, {{res_rtp_asterisk.c}}, {{chan_sip.c}}, or any of the files in {{channels/sip}}.
By: Private Name (falves11) 2015-03-19 23:54:32.269-0500

with 200 channels and the application echo returning audio.
lsof | grep asterisk | grep FIFO | wc -l
9129
The result below, does not indicate the problem. I believe that 9129 FIFO objects for 200 channels, is too much. I already uploaded the FD_LEAKS files, do they indicate any issue?

==== Leaked Object 0x223fbd8 history ====
[5335] res_odbc.c:1885 load_module: +1 - [**constructor**]

==== Leaked Object 0x2245fd8 history ====
[5335] res_odbc.c:865 load_odbc_config: +1 - [**constructor**]
[5335] res_odbc.c:1029 odbc_register_class: +1 - [1]
[5335] res_odbc.c:1243 _ast_odbc_request_obj2: +1 - [2]
[5335] res_odbc.c:920 load_odbc_config: -1 - [3]

==== Leaked Object 0x2246478 history ====
[5335] res_odbc.c:881 load_odbc_config: +1 - [**constructor**]

==== Leaked Object 0x2246148 history ====
[5335] res_odbc.c:1259 _ast_odbc_request_obj2: +1 - [**constructor**]
[5335] res_odbc.c:1280 _ast_odbc_request_obj2: +1 - [1]
[5335] res_odbc.c:1103 odbc_release_obj2: -1 - [2]

By: Private Name (falves11) 2015-03-19 23:58:53.923-0500

with 500 channels
lsof | grep asterisk | grep FIFO | wc -l
101821
As you can see, 200 channels=9000 FIFO, 500 channels, 100.000
The question is, what part of Asterisk is producing these objects.

By: Corey Farrell (coreyfarrell) 2015-03-20 00:24:01.639-0500

So it looks like {{core show fd}} doesn't match {{lsof|grep asterisk|grep FIFO}}. {{core show fd}} is saying that you have 4 open FD's per active channel (approximately). This would be correct, 2 for the alertpipe, 1 for RTP, 1 for RTCP. So about 800 FD's should be open when you have 200 active channels. The exact number may vary depending on how many channels have been hung up but not yet freed.

So about other parts of the system, can you attach your {{extconfig.conf}}? Can you try testing with valgrind? Be sure Asterisk was compiled with {{DONT_OPTIMIZE}} and without {{MALLOC_DEBUG}} menuselect Compiler Flags. Start Asterisk with {{valgrind --track-fds=yes --show-leak-kinds=all --log-file=valgrind.txt aserisk -c}}. This will run very slow, you should only run a limited number of calls. Assuming you can run it without crashing it should give a complete backtrace for any FD's that were left open. Once complete please attach {{valgrind.txt}} - do not post it as a comment.

From a previous post:
{quote}
0 channels ----------> 32 handles
10 channels--------> 572
20 channels------->1512
30 channels------->2852
40 channels------->4952
{quote}

This tells we that the problem occurred as a result of 10 calls (there should not be 572 handles). So for this reason I think 10-20 calls for a test should be adequate.
By: Private Name (falves11) 2015-03-20 00:41:56.587-0500

You hit it right on the nail. I just went back and enabled FD leaks debug and connected 200 calls, with audio coming and returining with echo(). core show fd shows106 lines, but
lsof | grep asterisk | grep FIFO | wc -l
86832
note: my extconfig.conf is empty, and there is no dialplan, nada, zero, I reproduce the issue with no dialplan. I am posting here my script, in the hope that somebody with better training can crack this issue. This script will connect any number of calls, defined in variable K, with a delay of one second between calls.
----------------------
#!/bin/bash
random-string()
{
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w ${1:-32} | head -n 1
}

mdir=/var/spool/asterisk/outgoing
delay=1
K=200
y="0"

while [ $y -lt $K ]
do
y=$[$y+1]
#echo -e "$delay seconds $i\n"
#b=$(shuf /usr/src/src.txt | awk 'NR==1 {print}');b=${b:1:10}
b=s
printf -v x "Channel: SIP/%s@demo\n\
CallerID: \"%s\" <%s>\n\
WaitTime: 45\n\
MaxRetries: 0\n\
RetryTime: 10\n\
Application: Echo\n\
Data:\n\
Archive: no\n\n" $i $b $b
f=/tmp/$(random-string 16 ).call
echo -e "$y"
rm $f -f
echo -e "$x" > $f
touch -d "+ $delay seconds" $f
mv $f $mdir
delay=$(( $delay + 1))
done
---------------
on the far end, just play music on hold.

By: Corey Farrell (coreyfarrell) 2015-03-20 01:11:50.674-0500

Do you have highpriority=yes in asterisk.conf? If so please turn it off for the valgrind test. Valgrind could slow asterisk to much causing canary to think it froze, canary would then kill asterisk.
By: Private Name (falves11) 2015-03-20 01:16:55.772-0500

I solved the "killing" issue, but my script is running so slow that it may take the whole night to open 20 calls. Should I then use "core stop gracefully" to stop asterisk or how should I stop valgrind?
By: Corey Farrell (coreyfarrell) 2015-03-20 01:30:46.243-0500

Definitely do {{core stop gracefully}}. Anything else short-circuit's the shutdown process, leaving tons of leaked FD's and memory behind. 20 is just an arbitrary number of calls that (based on what you've said) should produce leaks. It's possible that 10 calls would be enough (or even 5). Either way since the other methods of leak debugging have failed I think we need the valgrind output. One limitation of Asterisk DEBUG options is that they don't know anything about FD's opened by external libraries. So valgrind is capable of telling us if libodbc is leaking FD's, and what part of Asterisk made the call that resulted in the leak.
By: Private Name (falves11) 2015-03-20 01:39:17.894-0500

This is with 20 calls established using ulaw and using core stop now on the far end and the core stop gracefully on this side. The command "core stop gracefully" hangs on until the calls close by themselves.
Do you see anything relevant?

By: Private Name (falves11) 2015-03-20 01:43:09.483-0500

I already uploaded the valgrind trace. without valgrind, 20 calls connected in the manner produce these objects while they are connected:
lsof | grep asterisk | grep FIFO | wc -l
1470

By: Private Name (falves11) 2015-03-20 02:05:49.946-0500

under valgrind I established 100 calls.
lsof | grep memcheck | grep FIFO | wc -l
23780
I uploading the valgrind trace shortly
By: Corey Farrell (coreyfarrell) 2015-03-20 02:12:31.187-0500

This valgrind.txt shows that after graceful shutdown most FD's are closed - it's not showing the leak you are experiencing. Now we know that the FD's do not survive shutdown. If I understand correctly, you have 1470 FD's when 20 calls are active. What does that number drop to after the calls hangup?

Please redo the valgrind test, but this time do {{core stop now}} without hanging up any channels. It will produce a ton of results but since something in shutdown or call hangup is cleaning the FD's we will need it. No need for 100 calls, 20 is fine.
By: Private Name (falves11) 2015-03-20 02:19:05.600-0500

This is a valgrind log file for 20 calls and "core stop now"

By: Private Name (falves11) 2015-03-20 02:21:55.418-0500

I just uploaded the requested file. I wonder if there is a valgrind option that shows real allocations one by one, when they happen. These options do not uncover the issue. I know they happen under valgrind because I can count the FIFO objects generated by memcheck. So valgrind knows about them, but it is now showing ownership.
By: Private Name (falves11) 2015-03-20 02:26:39.822-0500

I found this, please evaluate its importance
"Also, there are certain space leaks that aren't detected by traditional leak-checkers, such as Memcheck's. That's because the memory isn't ever actually lost -- a pointer remains to it -- but it's not in use. Programs that have leaks like this can unnecessarily increase the amount of memory they are using over time. Massif can help identify these leaks."
http://valgrind.org/docs/manual/ms-manual.html
By: Private Name (falves11) 2015-03-20 02:31:27.022-0500

valgrind --tool=massif --track-fds=yes --log-file=valgrind.txt asterisk -c

maybe this will show who is allocating the objects.

By: Private Name (falves11) 2015-03-20 02:38:03.475-0500

valgrind --leak-check=full --show-reachable=yes --leak-resolution=high --num-callers=100 --trace-children=yes --track-fds=yes --log-file=valgrind.txt asterisk -c

this command was used by somebody with a similar problem.
http://stackoverflow.com/questions/11242795/how-to-get-the-full-call-stack-from-valgrind
By: Ingenierie (ingenierie) 2015-03-20 06:22:38.447-0500

You can appy this patch to fix that https://issues.asterisk.org/jira/secure/attachment/51948/timer.patch
By: Private Name (falves11) 2015-03-20 08:09:29.339-0500

I am confused as how to apply this patch, since it references
a/res/res_timing_pthread.c b/res/res_timing_pthread.c
which I dont' have.
I did this, from inside the top asterisk source directory
patch -p0 < ../timer.patch
where timer.patch has the contents below:

--- res/res_timing_pthread.c
+++ res/res_timing_pthread.c
@@ -153,6 +153,12 @@ static void pthread_timer_close(void *data)
{
struct pthread_timer *timer = data;

+ ao2_lock(pthread_timers);
+ ast_mutex_lock(&timing_thread.lock);
+ ao2_unlink(pthread_timers, timer);
+ ast_mutex_unlock(&timing_thread.lock);
+ ao2_unlock(pthread_timers);
+
ao2_ref(timer, -1);
}

patching file res/res_timing_pthread.c
Hunk #1 succeeded at 148 with fuzz 2 (offset -5 lines).
make
CC="gcc" CXX="g++" LD="" AR="" RANLIB="" CFLAGS="" LDFLAGS="" make -C menuselect CONFIGURE_SILENT="--silent" makeopts
make[1]: Entering directory `/usr/src/asterisk/menuselect'
make[1]: `makeopts' is up to date.
make[1]: Leaving directory `/usr/src/asterisk/menuselect'
Generating input for menuselect ...
menuselect/menuselect --check-deps menuselect.makeopts
menuselect/menuselect --check-deps menuselect.makeopts
Generating embedded module rules ...
[CPP] res_timing_pthread.c -> res_timing_pthread.i
[CCi] res_timing_pthread.i -> res_timing_pthread.o
[LD] res_timing_pthread.o -> res_timing_pthread.so
[CCi] cdr_mysql.i -> cdr_mysql.o
[LD] cdr_mysql.o -> cdr_mysql.so
[CCi] format_mp3.i -> format_mp3.o
[LD] format_mp3.o mp3/common.o mp3/dct64_i386.o mp3/decode_ntom.o mp3/layer3.o mp3/tabinit.o mp3/interface.o -> format_mp3.so
/usr/bin/xmllint --dtdvalid doc/appdocsxml.dtd --noout doc/core-en_US.xml
+--------- Asterisk Build Complete ---------+
+ Asterisk has successfully been built, and +
+ can be installed by running: +
+ +
+ make install +
+-------------------------------------------+
[root@fedora-1 asterisk]# make install
..send calls..
lsof | grep asterisk | grep FIFO | wc -l
1470
so I don't think this code working. Maybe the code is never executed

By: Private Name (falves11) 2015-03-20 08:36:05.262-0500

I was using version 11, and the patch is not for that version.
Then I downloaded version 13, and the patch fails to apply.
Can somebody help me understand what version is the patch for and maybe produce a patch that references the common source directory tree?

By: Private Name (falves11) 2015-03-20 09:45:12.036-0500

Although this patch was not for the timing object that I use, which is res_timing_timerfd.so, I applied the patch to Asterisk 13 and disabled res_timing_timerfd.so, such as to force Asterisk to use res_timing_pthread.so. In 50 calls it raises to 12648 FIFO objects. The functions executes only at the end of the call, so it does not solve the issue.
What we need is a patch that makes an open channel consume the necessary number of FIFO objects. It must be a patch to res_timing_timerfd.so
By: Private Name (falves11) 2015-03-20 11:34:28.503-0500

in version 12 I added the patch manually, and sent 50 calls, setting up a 30 seconds timer on the other end, so the calls would end normally. The FIFO counts goes up up and down, to 46, normal. However, the function never executes, since I don't see the fake error message. In my opinion this functions has never executed in any version of Asterisk, that is why the FIFO count is out of control, while the calls are open.
By: ibercom (ibercom) 2015-03-20 12:52:09.880-0500

Probably you're using res_timing_timerfd and not res_timing_pthread as timing module. Therefore, this patch does nothing for you.

{code}
asterisk*CLI> module show like timing
Module Description Use Count
res_timing_timerfd.so Timerfd Timing Interface 1

1 modules loaded
{code}

By: Private Name (falves11) 2015-03-20 13:08:20.802-0500

Matthias gave me that patch, why would he indicate that it would work. He saw the valgrind trace
By: Private Name (falves11) 2015-03-20 13:09:44.276-0500

ibercom is right, I am using res_timing_timerfd.so

Maybe somebody could look into the trace and write the patch for that module?
By: Private Name (falves11) 2015-03-20 19:26:42.394-0500

Kindly look at my latest notes on the issue.
By: Private Name (falves11) 2015-04-16 20:23:39.969-0500

This issue is affecting my business dramatically. This is production
lsof | grep asterisk | grep FIFO | wc -l
41892
[root@survey asterisk]# asterisk -rx "core show channels count"
139 active channels
all channels are Echo()
I am using Asterisk 11, if somebody wants to test some patch, please do not hesitate and post it here.

By: Private Name (falves11) 2015-04-24 18:05:45.847-0500

With 859 calls, all in Echo(), the box collapses under the weight of FIFO handles.
Please help.

859 NUMBER IS MINE "1" Call from XXX Channel SIP/ricardo-00016b49
[Apr 24 19:00:28] ERROR[7303]: res_rtp_asterisk.c:3019 ast_rtcp_write_sr: RTCP SR transmission error to 54.85.108.173:12779, rtcp halted Operation not permitted

By: Rusty Newton (rnewton) 2015-07-09 19:09:48.340-0500

We haven't been able to reproduce this. I'm concerned our configuration or the way we do things may be different than yours. Can you provide a step by step instruction guide to reproduce the issue from a clean install? Please detail in the guide the configuration, operation and steps to identify the issue.
By: Private Name (falves11) 2015-07-09 23:09:46.736-0500

I can set up a test bed, it has to be two machines back to back, and use a script to start the channels. I already did provide the script, however, I am uploading it again.
If you agree, then contact me for the credentials and you can see it.
As I said before, there is no dialplan. Just a simple script that generates calls and when they connect, calls Echo().
Once the channels are up, check the issue with lsof.
I cannot imagine a simpler description than this, but maybe I am wrong.
In case yo want to try again, please set up two machines, one sends and one receives calls and plays music on hold, two lines of code.
This is what I get in the receiving machine

SIP/XXXXXXXXX-000 19545551212@inbound: Up Playback(campaign)
221 active channels
221 active calls
801 calls processed
system233*CLI> exit
Asterisk cleanly ending (0).
Executing last minute cleanups
[root@system233 sbin]# lsof | grep asterisk | grep FIFO | wc -l
42072
As you can see, 42072 handles is insane. This is Asterisk 11 SVN-branch-11-r434708M
By: Private Name (falves11) 2015-07-09 23:12:09.278-0500

This the sending script. On the receiving machine simply
Answer()
MusicOnHold()

By: Rusty Newton (rnewton) 2015-07-10 16:27:34.879-0500

Thanks, I'll try again next week.
By: Private Name (falves11) 2015-07-10 17:42:14.673-0500

Note: The issue shows up in the leg that plays the Music on Hold, not in the leg that is using Echo(). You can alter the logic and play the music a the sender. The problem follows Music On Hold.
By: Private Name (falves11) 2015-08-11 15:26:19.930-0500

I wonder if there are any news on this bug. It seems to me that Asterisk is famous for "not supporting many simultaneous calls", precisely for this issue. This morning I lost a potential client when I was forced to disclose that I use Asterisk as my underlying "telecom library", for my switching platform. The guy almost hanged up on me. A long time has passed and nothing moves on this bug.
By: Private Name (falves11) 2015-10-08 18:08:53.459-0500

The issue of the handle leak remains identical in Asterisk GIT-11-966265dM
I have an application that plays IVRs and dials out, only a few calls, waits for DTMF,etc.
The calls are always almost exactly 1300 calls
core show channels count
1292 active channels
1284 active calls
core show uptime
System uptime: 2 hours, 29 minutes, 32 seconds
Last reload: 1 hour, 27 minutes, 20 seconds

The operating system shows
ss -s
Total: 29574 (kernel 30296)
TCP: 28 (estab 9, closed 0, orphaned 0, synrecv 0, timewait 0/0), ports 22

Transport Total IP IPv6

30296 - -
RAW 0 0 0
UDP 29441 29433 8
TCP 28 20 8
INET 29469 29453 16
FRAG 0 0 0

And it keeps growing until no more ports can be open, and the app needs to be restarted, which may happen again in 3 hours.
This app cannot be debugged because there are too many calls per second and it is in production.
However, if there is anything I may do to get to the bottom of this, let me know.

core show settings

PBX Core settings
-----------------
Version: GIT-11-966265dM
Build Options: AST_DEVMODE, LOADABLE_MODULES, G711_NEW_ALGORITHM, G711_REDUCED_BRANCHING
Maximum calls: Not set
Maximum open file handles: Not set
Root console verbosity: 0
Current console verbosity: 0
Debug level: 0
Maximum load average: 0.000000
Minimum free memory: 0 MB
Startup time: 16:27:47
Last reload time: 17:30:00
System: Linux/2.6.32-573.3.1.el6.x86_64 built by root on x86_64 2015-10-08 19:31:32 UTC
System name: X2
Entity ID: 00:50:56:b1:24:65
Default language: en
Language prefix: Enabled
User name and group: /
Executable includes: Disabled
Transcode via SLIN: Enabled
Transmit silence during rec: Disabled
Generic PLC: Enabled
Min DTMF duration:: 80

Subsystems
-------------
Manager (AMI): Enabled
Web Manager (AMI/HTTP): Disabled
Call data records: Enabled
Realtime Architecture (ARA): Disabled

Directories
-------------
Configuration file:
Configuration directory: /etc/asterisk
Module directory: /usr/lib/asterisk/modules
Spool directory: /var/spool/asterisk
Log directory: /var/log/asterisk
Run/Sockets directory: /var/run/asterisk
PID file: /var/run/asterisk/asterisk.pid
VarLib directory: /var/lib/asterisk
Data directory: /var/lib/asterisk
ASTDB: /var/lib/asterisk/astdb
IAX2 Keys directory: /var/lib/asterisk/keys
AGI Scripts directory: /var/lib/asterisk/agi-bin

By: Private Name (falves11) 2015-10-09 02:59:23.849-0500

[Oct 9 03:56:22] ERROR[3424][C-0006f354]: res_rtp_asterisk.c:2516 ast_rtp_new: Oh dear... we couldn't allocate a port for RTP instance '0x7f09c440e538'
[Oct 9 03:56:22] ERROR[3424][C-0006f355]: res_rtp_asterisk.c:2516 ast_rtp_new: Oh dear... we couldn't allocate a port for RTP instance '0x7f09c440e538'
[Oct 9 03:56:22] ERROR[3424][C-0006f356]: res_rtp_asterisk.c:2516 ast_rtp_new: Oh dear... we couldn't allocate a port for RTP instance '0x7f09c440e538'
bridge4xx*CLI> exit
Asterisk cleanly ending (0).
Executing last minute cleanups
[root@bridge4xx asterisk]# ss -s
Total: 57913 (kernel 57938)
TCP: 27 (estab 4, closed 4, orphaned 0, synrecv 0, timewait 1/0), ports 17

Transport Total IP IPv6
* 57938 - -
RAW 0 0 0
UDP 57779 57771 8
TCP 23 15 8
INET 57802 57786 16
FRAG 0 0 0

By: Private Name (falves11) 2015-10-09 23:04:40.005-0500

I decided to create a script to reset Asterisk, that is, kill -9 asterisk, since the UDP handles reach 55000 in about 1 hour.
I wonder if anybody would have any idea how to fix this or troubleshoot the issue.
By: Private Name (falves11) 2015-10-10 20:50:58.849-0500

I found a very simple way to reproduce this bug. In sip.conf, set contex=reject, where "reject" is a context that does not exist.
The UDP port count goes 60.000 in less than 3 minutes, at least in my box. Asterisk responds with 404 Not Found.
Basically, Asterisk is not closing any of the allocated ports,ever.
I can give access to my box to anybody who wants to see it.
Please contact me to
v e n e f a x
g m a i l
By: Rusty Newton (rnewton) 2015-11-10 18:06:40.995-0600

I've had some time to attempt reproduction of your issue. I'm unable to reproduce it by following your instructions (though the instructions are not very clear).

{quote}
I can set up a test bed, it has to be two machines back to back, and use a script to start the channels. I already did provide the script, however, I am uploading it again.
If you agree, then contact me for the credentials and you can see it.
As I said before, there is no dialplan. Just a simple script that generates calls and when they connect, calls Echo().
Once the channels are up, check the issue with lsof.
I cannot imagine a simpler description than this, but maybe I am wrong.
In case yo want to try again, please set up two machines, one sends and one receives calls and plays music on hold, two lines of code.
{quote}

I setup two machines,

Asterisk A : Asterisk 13.6.0
Asterisk B : Asterisk 11.20.0

On A I used your script to generate calls with the PJSIP channel driver.
B receives the calls and uses the sip.conf you provided with only a change to the host address for the demo peer.

I tested in two cases.

Case one: Calls received by B are answered and put into music on hold using dialplan as you described.
{noformat}
[reject]
exten = s,1,Answer()
exten = s,2,Musiconhold()
{noformat}
Case two: B has empty extensions.conf , Calls received by B are rejected by asterisk as the context 'reject' does not exist.

In both cases, with hundreds of calls up and established, and thousands of calls processed, the file handles and UDP ports allocated both appear sane and *do not* increase exponentially. As the calls hangup the UDP ports and file handles decrease as expected.

I viewed them using the same commands you use to be sure we are measuring the same way.

Since no one else is reporting these issues and we cannot reproduce the issues following your instructions the burden is on you to produce a step by step guide demonstrating how to reproduce the issue from scratch. There is likely a missing element in your environment or configuration that is required for the problem to occur. It'll be up to you to figure this out and demonstrate what that element is.

I'll leave these issues in Waiting on Feedback for a couple more weeks to give you a chance to hunt down this missing element and communicate to us what it is.

I'll also repeat that we do not provide tech support via this issue tracker. We will not be logging onto your machine to look at the issue. We don't have the time or resources to do that. If you need help locating the issue I encourage you to hire a developer or consultant to assist you in tracking down the crucial details needed to reproduce the problem and demonstrate the bug in a way where we can go fix it.

Please discuss the file handle leak issue *only* on this issue ASTERISK-24837.

Please discuss the UDP port leak issue *only* on ASTERISK-25460.
By: Private Name (falves11) 2015-11-10 18:51:31.565-0600

I have not been able to find a developer who can spend a few minutes to gather whatever is necessary to feed this case. Digium does not provide any service like this for a fee.
If you can find somebody who can help for an hour, for a fee, I can show it.
The test is much simpler, just send thousands of calls to an Asterisk that rejects them with 404 because in sip.conf the default context does not exist, and the UDP handles go through the roof.

By: Rusty Newton (rnewton) 2015-11-10 19:19:57.607-0600

The Asterisk community does provide a path for bug bounties to be offered. https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

You might also ask on the asterisk-biz mailing list which is available for the purpose of Asterisk related job offers and commercial discussions.

I'll go ahead and close out this issue for now. If you find more information on how to reproduce the issue in the future, please open up a new issue at that time with all the relevant details on reproduction.
By: Rusty Newton (rnewton) 2015-11-10 19:22:05.592-0600

{quote}
The test is much simpler, just send thousands of calls to an Asterisk that rejects them with 404 because in sip.conf the default context does not exist, and the UDP handles go through the roof.
{quote}
I tested exactly this using your script and sip.conf. The issue could not be reproduced. You'll have to describe what else is needed to reproduce the issue.

Feel free to pursue this issue on the mailing list if you need help troubleshooting. This issue is now closed. If you have new information in the future please open up a new issue.