
Summary: ASTERISK-17339: [patch] memory leak with IAX in 1.8
Reporter: Tzafrir Cohen (tzafrir)
Labels:
Date Opened: 2011-02-02 12:43:28.000-0600
Date Closed: 2012-01-09 15:56:01.000-0600
Priority: Minor
Regression?:
Status: Closed/Complete
Components: Channels/chan_iax2
Versions:
Frequency of Occurrence:
Related Issues: is related to ASTERISK-17825 [patch] dynamic thread exits with joinable state, which leaves resource of thread (250kb stack per thread etc.) unreleased
Environment:
Attachments:
( 0) 18735_pbx_free_callback_162.diff
( 1) 18735_pbx_free_callback.diff
( 2) asterisk-1.8.4.4-chan_iax2-detach-thread-on-non-stop-exit.patch
( 3) live_ast_valgrind_v2.diff
( 4) live_ast_valgrind.diff
( 5) maps-11560.txt.gz
( 6) maps-11560-2.txt.gz
( 7) valgrind_compare
Description: I'm trying to track down a memory leak: the process sizes (rss, size, vsize) keep increasing, whereas the totals shown by 'memory show summary' and 'astobj2 show allocations' do not inflate.

When I stop the remote Asterisk server, the memory inflation stops, so it appears to be triggered by the registration requests.

I managed to reproduce this on several CentOS 5 systems, but not on my laptop (x86_64, Debian Squeeze). It has been around for at least a number of months.

Update: it seems easy to reproduce on other 32-bit systems. I'm not sure whether it happens on 64-bit systems (rss increases the same way, but size and vsize seem to grow in much larger steps).

****** ADDITIONAL INFORMATION ******

Configuration:

modules.conf:
{code}
[modules]
autoload = no
load => chan_iax2.so
{code}

iax.conf:
{code}
[general]
[template](!)
type=friend
host=dynamic
deny=0.0.0.0/0.0.0.0
permit=0.0.0.0/0.0.0.0
qualify=yes
secret=400

[400](template)
[401](template)
[402](template)
[403](template)
[404](template)
[405](template)
[406](template)
[407](template)
[408](template)
[409](template)
[410](template)
[411](template)
[412](template)
[413](template)
[414](template)
[415](template)
[416](template)
{code}

A remote Asterisk system (1.4.31 in my current tests) connects here. Its iax.conf:
{code}
[general]
disallow=all
allow=ulaw
allow=alaw
allow=gsm
mailboxdetail=yes
tos=ef
register=400:400@xr3-0
register=401:400@xr3-0
register=402:400@xr3-0
register=403:400@xr3-0
register=404:400@xr3-0
register=405:400@xr3-0
register=406:400@xr3-0
register=407:400@xr3-0
register=408:400@xr3-0
register=409:400@xr3-0
register=410:400@xr3-0
register=412:400@xr3-0
register=413:400@xr3-0
register=414:400@xr3-0
register=415:400@xr3-0
register=416:400@xr3-0

[template](!)
type=peer
context=from-pstn
host=192.168.0.50
secret=400
[400](template)
[401](template)
[402](template)
[403](template)
[404](template)
[405](template)
[406](template)
[407](template)
[408](template)
[409](template)
[410](template)
[411](template)
[412](template)
[413](template)
[414](template)
[415](template)
[416](template)
{code}

Logging is done using the following cron job:
{code}
# cat /etc/cron.d/asterisk_ps
* * * * * root ps h -o rss,size,vsize `cat /var/run/asterisk/asterisk.pid` >>$HOME/ast_size.log
{code}

Example output:
{code}
18560 23152  31136
18572 23392  31376
18584 23632  31616
18584 23632  31616
18592 23872  31856
18592 23872  31856
18592 23872  31856
18624 25136  33120
{code}
Comments:

By: Leif Madsen (lmadsen) 2011-02-07 12:40:37.000-0600

Any chance of providing backtraces and perhaps valgrind output?

By: Tzafrir Cohen (tzafrir) 2011-02-14 13:35:25.000-0600

I ran valgrind twice. The first time I stopped it quickly; the second time I stopped it after an hour (or more, I'm not sure).

valgrind was run with: -v --leak-check=full

The main difference between the logs (besides obvious ones, such as process IDs) is:

+==18036== 167,264 bytes in 1,101 blocks are possibly lost in loss record 139 of 143
+==18036==    at 0x4021E22: calloc (vg_replace_malloc.c:397)
+==18036==    by 0x4010BAB: _dl_allocate_tls (in /lib/ld-2.7.so)
+==18036==    by 0x430C672: pthread_create@@GLIBC_2.1 (in /lib/i686/cmov/libpthread-2.7.so)
+==18036==    by 0x81AC655: ast_pthread_create_stack (utils.c:1064)
+==18036==    by 0x6C8593E: find_idle_thread (chan_iax2.c:1437)
+==18036==    by 0x6C9CE0B: socket_read (chan_iax2.c:9609)
+==18036==    by 0x8122BF0: ast_io_wait (io.c:288)
+==18036==    by 0x6C91CDA: network_thread (chan_iax2.c:12193)
+==18036==    by 0x81AA72A: dummy_start (utils.c:1010)
+==18036==    by 0x430D4BF: start_thread (in /lib/i686/cmov/libpthread-2.7.so)
+==18036==    by 0x516184D: clone (in /lib/i686/cmov/libc-2.7.so)
+==18036==
+==18036==

By: Tzafrir Cohen (tzafrir) 2011-02-15 11:04:07.000-0600

I suspect the previous one is not relevant. Here's another one. In one run:

24,920 bytes in 1,916 blocks are definitely lost in loss record 78 of 85
  at 0x4021E22: calloc (vg_replace_malloc.c:397)
  by 0x814DD42: device_state_cb (utils.h:495)
  by 0x80F3866: handle_event (event.c:1415)
  by 0x819AB32: tps_processing_function (taskprocessor.c:310)
  by 0x81AA72A: dummy_start (utils.c:1010)
  by 0x430D4BF: start_thread (in /lib/i686/cmov/libpthread-2.7.so)
  by 0x516184D: clone (in /lib/i686/cmov/libc-2.7.so)

And in another one:

17,861 bytes in 1,373 blocks are definitely lost in loss record 75 of 86
  at 0x4021E22: calloc (vg_replace_malloc.c:397)
  by 0x814DD42: device_state_cb (utils.h:495)
  by 0x80F3866: handle_event (event.c:1415)
  by 0x819AB32: tps_processing_function (taskprocessor.c:310)
  by 0x81AA72A: dummy_start (utils.c:1010)
  by 0x430D4BF: start_thread (in /lib/i686/cmov/libpthread-2.7.so)
  by 0x516184D: clone (in /lib/i686/cmov/libc-2.7.so)

(Again: this is the diff between a short run and a long run under valgrind)

By: Tzafrir Cohen (tzafrir) 2011-02-15 15:27:36.000-0600

The attached diff seems to fix at least some of the problem: the callback pbx.c:handle_statechange() did not release the data it was handed.
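
For illustration, a minimal sketch of the kind of fix described above, assuming the 1.8-era shape of the state-change task in pbx.c (the body is illustrative, not the literal patch; see 18735_pbx_free_callback.diff for the real change):

{code}
/* Illustrative sketch only: the taskprocessor hands handle_statechange()
 * a heap-allocated copy of the device name, so the callback must release
 * it on every exit path. In Asterisk the release call is ast_free(). */
#include <stdlib.h>

static int handle_statechange(void *datap)
{
	char *device = datap;

	/* ... look up the hints for 'device' and run the watcher callbacks ... */

	free(device); /* previously missing: the calloc() valgrind traced back to device_state_cb */
	return 0;
}
{code}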

By: Tzafrir Cohen (tzafrir) 2011-02-16 05:40:38.000-0600

BTW: I only now noticed contrib/valgrind.supp. After fixing this and running with --suppressions=contrib/valgrind.supp, I get only the following warnings on a short run:

==1267== 738 bytes in 15 blocks are possibly lost in loss record 43 of 49
==1267==    at 0x402377E: operator new(unsigned) (vg_replace_malloc.c:224)
==1267==    by 0x5001C03: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.10)
==1267==    by 0x5002864: (within /usr/lib/libstdc++.so.6.0.10)
==1267==    by 0x50029D5: std::string::string(char const*, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.10)
==1267==    by 0x4D43593: (within /usr/lib/libpt.so.1.10.10)
==1267==    by 0x4D7B6D4: (within /usr/lib/libpt.so.1.10.10)
==1267==    by 0x4BD46F7: (within /usr/lib/libpt.so.1.10.10)
==1267==    by 0x400DD13: (within /lib/ld-2.7.so)
==1267==    by 0x400DE43: (within /lib/ld-2.7.so)
==1267==    by 0x400084E: (within /lib/ld-2.7.so)
==1267==
==1267==
==1267== 9,344 bytes in 4 blocks are possibly lost in loss record 49 of 49
==1267==    at 0x4021E22: calloc (vg_replace_malloc.c:397)
==1267==    by 0x80869B5: __ao2_container_alloc (utils.h:495)
==1267==    by 0x819A341: ast_tps_init (taskprocessor.c:124)
==1267==    by 0x8081884: main (asterisk.c:3212)

By: Tzafrir Cohen (tzafrir) 2011-02-17 12:16:43.000-0600

Attached is the corresponding patch for 1.6.2 and 1.8. Somehow this seems to be less of an issue there.

By: Tzafrir Cohen (tzafrir) 2011-03-01 11:35:10.000-0600

The attached patch helped on trunk, but not on 1.6.2 (nor on 1.8).


The issue can still be reproduced on a CentOS 5 system (though no longer on a Debian one), so there seems to be more than one issue at play.

When comparing the output of valgrind from a short run to that of a longer run, I see the following significant additions:

12,376 bytes in 17 blocks are still reachable in loss record 760 of 7

  at 0x4004C42: calloc (vg_replace_malloc.c:418)
  by 0x4A34A07: iax_frame_new (utils.h:462)
  by 0x4A13791: iax2_send (chan_iax2.c:6166)
  by 0x4A1B265: registry_authrequest (chan_iax2.c:7178)
  by 0x4A2C7CB: socket_process (chan_iax2.c:10858)
  by 0x4A31A8C: iax2_process_thread (chan_iax2.c:11263)
  by 0x815523A: dummy_start (utils.c:970)
  by 0x52A831: start_thread (in /lib/libpthread-2.5.so)
  by 0x469F6D: clone (in /lib/libc-2.5.so)


12,597 bytes in 17 blocks are still reachable in loss record 761 of 7

  at 0x4004C42: calloc (vg_replace_malloc.c:418)
  by 0x4A34A07: iax_frame_new (utils.h:462)
  by 0x4A13791: iax2_send (chan_iax2.c:6166)
  by 0x4A143B3: send_command_final (chan_iax2.c:7178)
  by 0x4A1CF94: update_registry (chan_iax2.c:8536)
  by 0x4A2C6F7: socket_process (chan_iax2.c:10846)
  by 0x4A31A8C: iax2_process_thread (chan_iax2.c:11263)
  by 0x815523A: dummy_start (utils.c:970)
  by 0x52A831: start_thread (in /lib/libpthread-2.5.so)
  by 0x469F6D: clone (in /lib/libc-2.5.so)




15,672 bytes in 1,306 blocks are possibly lost in loss record 763 of

  at 0x4004C42: calloc (vg_replace_malloc.c:418)
  by 0x807D346: __ao2_link (utils.h:462)
  by 0x807E311: _ao2_link (astobj2.c:579)
  by 0x49FF048: replace_callno (chan_iax2.c:2510)
  by 0x8143732: ast_sched_runq (sched.c:621)
  by 0x814391D: sched_run (sched.c:118)
  by 0x815523A: dummy_start (utils.c:970)
  by 0x52A831: start_thread (in /lib/libpthread-2.5.so)
  by 0x469F6D: clone (in /lib/libc-2.5.so)

Current version: SVN-branch-1.6.2-r308990


When I build it with dev-mode, I still get basically the same memory leak, but the output from 'astobj2 show stats' does not inflate like the output of ps.
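
(Aside on why the iax_frame_new() blocks show up as "still reachable" rather than "lost": chan_iax2 keeps a per-thread cache of iax frames, so parked frames stay referenced until their owning thread exits. A stripped-down sketch of that pattern, assuming the thread-local cache design and not quoting chan_iax2's actual code:)

{code}
/* Illustrative sketch of a per-thread frame cache: "freed" frames are
 * parked on a thread-local list for reuse, so valgrind reports them as
 * still reachable until the owning thread exits. */
#include <stdlib.h>

struct frame_stub {
	struct frame_stub *next;
	size_t datalen;
};

static __thread struct frame_stub *frame_cache; /* one cache per thread */

static struct frame_stub *frame_new(size_t datalen)
{
	struct frame_stub *f = frame_cache;
	if (f) {                         /* reuse a parked frame (size check omitted for brevity) */
		frame_cache = f->next;
		return f;
	}
	return calloc(1, sizeof(*f) + datalen);
}

static void frame_free(struct frame_stub *f)
{
	f->next = frame_cache;           /* park it instead of calling free() */
	frame_cache = f;
}

int main(void)
{
	struct frame_stub *f = frame_new(64);
	frame_free(f);                   /* parked, not freed: still reachable at exit */
	return 0;
}
{code}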

By: Tzafrir Cohen (tzafrir) 2011-05-22 10:57:38

Issue reproduced with 1.8 (1.8.4 as well as 1.8 r319469). Noticeable and constant increase in memory size (with the above simple, minimal configuration).

Ran it under valgrind twice. Here's the diff in the memory leak summary from valgrind between a short run (2 minutes or so) and a long run (1 hour):

HEAP SUMMARY:
-    in use at exit: 3,092,515 bytes in 70,312 blocks
-  total heap usage: 272,153 allocs, 201,841 frees, 11,183,749 bytes allocated
+    in use at exit: 3,092,559 bytes in 70,300 blocks
+  total heap usage: 336,041 allocs, 265,741 frees, 26,464,754 bytes allocated

-Searching for pointers to 70,312 not-freed blocks
-Checked 10,831,304 bytes
+Searching for pointers to 70,300 not-freed blocks
+Checked 11,069,116 bytes

...

LEAK SUMMARY:
   definitely lost: 12,584 bytes in 143 blocks
   indirectly lost: 13 bytes in 1 blocks
-     possibly lost: 2,814,215 bytes in 67,014 blocks
-   still reachable: 265,703 bytes in 3,154 blocks
+     possibly lost: 2,814,551 bytes in 67,031 blocks
+   still reachable: 265,411 bytes in 3,125 blocks
        suppressed: 0 bytes in 0 blocks
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-reachable=yes

Is it a real issue?

Anyway, I needed to run valgrind, so attached are:
1. a patch to live_ast to run valgrind
2. a small script I used to diff two valgrind outputs

By: Ronald Chan (loloski) 2011-05-23 06:47:42

One of our Asterisk installs was hit badly by this issue. I might be wrong, but returning to SIP fixes the symptom, not the cause :D

By: Noach Sumner (nsumner) 2011-05-24 23:54:38

I am also hit by this issue. Using hylafax etc. we have a good number of IAX trunks, and the memory leak forces us to restart about once a week.

By: Tzafrir Cohen (tzafrir) 2011-05-25 04:29:46

I played a bit with the configuration. It seems I needed at least 12 different peers here for this to reproduce.

By: Ian Anderson (seadweller) 2011-05-31 13:26:54

Seeing this on CentOS as well. I have 3 boxes running 1.6.2.18; only the one built with and running chan_iax2.so shows huge VIRT numbers in top.

The other two run chan_sip.so only and are fine.

The IAX2 trouble box runs 32-bit CentOS; the other two, without IAX2, run 32-bit and 64-bit CentOS.



By: Alex Villacís Lasso (a_villacis) 2011-06-13 12:26:21.729-0500

I have seen this bug too, on asterisk-1.8.4.2 32-bit. We have noticed that defining the LOW_MEMORY macro considerably reduces (but does not eliminate) the memory leak, by around an order of magnitude.

By: Paul Belanger (pabelanger) 2011-06-14 15:13:59.977-0500

An update to your valgrind patch; it logs output to valgrind.txt.

By: Alex Villacís Lasso (a_villacis) 2011-06-28 17:41:09.564-0500

Now testing with Asterisk 1.8.4.3, 32-bit.

I have seen an interesting clue that might help pinpoint the cause of the memory leak. I have been looking at /proc/ASTERISK_PID/maps for an Asterisk process that is currently using over 550 MB of virtual memory while idle (after running for one day), with 20 instances of iaxmodem running on localhost and connected to Asterisk. I see the following memory map:
...(executable mappings omitted)...
09283000-092b5000 r-xp 00000000 fd:00 131332 /usr/lib/asterisk/modules/app_voicemail.so
092b5000-092b7000 rwxp 00031000 fd:00 131332 /usr/lib/asterisk/modules/app_voicemail.so
092b7000-092b8000 rwxp 092b7000 00:00 0
09a0f000-09b8a000 r-xp 00000000 fd:00 8924151 /usr/lib/libnetsnmpmibs.so.10.0.3
09b8a000-09b93000 rwxp 0017a000 fd:00 8924151 /usr/lib/libnetsnmpmibs.so.10.0.3
09b93000-09bc2000 rwxp 09b93000 00:00 0
0a0cf000-0a0f0000 r-xp 00000000 fd:00 131299 /usr/lib/asterisk/modules/app_meetme.so
0a0f0000-0a0f2000 rwxp 00020000 fd:00 131299 /usr/lib/asterisk/modules/app_meetme.so
0a0f2000-0a0f3000 rwxp 0a0f2000 00:00 0
0a11c000-0ba7c000 rw-p 0a11c000 00:00 0 [heap]
9c754000-9c755000 ---p 9c754000 00:00 0
9c755000-9c790000 rw-p 9c755000 00:00 0
9c790000-9c791000 ---p 9c790000 00:00 0
9c791000-9c7cc000 rw-p 9c791000 00:00 0
9c7cc000-9c7cd000 ---p 9c7cc000 00:00 0
9c7cd000-9c808000 rw-p 9c7cd000 00:00 0
9c808000-9c809000 ---p 9c808000 00:00 0
9c809000-9c844000 rw-p 9c809000 00:00 0
...3680 lines of repeating pattern omitted...
b7ef6000-b7ef7000 ---p b7ef6000 00:00 0
b7ef7000-b7f32000 rw-p b7ef7000 00:00 0
b7f32000-b7f33000 ---p b7f32000 00:00 0
b7f33000-b7f6e000 rw-p b7f33000 00:00 0
b7f6e000-b7f6f000 ---p b7f6e000 00:00 0
b7f6f000-b7fb0000 rw-p b7f6f000 00:00 0
b7fb2000-b7fb6000 rw-p b7fb2000 00:00 0
bfb64000-bfb79000 rw-p bffe9000 00:00 0 [stack]

Several things I noticed here:
1) The mapping for [heap] is a mere 25 MB (0a11c000-0ba7c000) after one day. From what I understand, this is the region from which ordinary (malloc) allocations are served.
2) The (main) process stack sits at the top of memory, at bfb64000, spanning just around 84 KB.
3) There are thousands of small memory mappings that account for the ballooning of the virtual memory usage. They come in pairs: one small mapping (4 KB) with attributes ---p, followed by a big mapping of exactly 266240 bytes (65 pages) with attributes rw-p. This pattern repeats over and over, starting a bit below the bottom of the stack and growing downwards.
When the process balloons a bit more, I notice the following change:
--- maps-11560.txt 2011-06-28 16:26:29.000000000 -0500
+++ maps-11560-2.txt 2011-06-28 16:26:29.000000000 -0500
@@ -593,6 +593,8 @@
0a0f0000-0a0f2000 rwxp 00020000 fd:00 131299 /usr/lib/asterisk/modules/app_meetme.so
0a0f2000-0a0f3000 rwxp 0a0f2000 00:00 0
0a11c000-0ba7c000 rw-p 0a11c000 00:00 0 [heap]
+9c718000-9c719000 ---p 9c718000 00:00 0
+9c719000-9c754000 rw-p 9c719000 00:00 0
9c754000-9c755000 ---p 9c754000 00:00 0
9c755000-9c790000 rw-p 9c755000 00:00 0
9c790000-9c791000 ---p 9c790000 00:00 0

This confirms that:
1) the above-described pattern is followed, and
2) the process heap is not involved in the memory leak.

My guess about this issue is that it is somehow related to a thread resource leak; an even bigger guess is that the leaked resource is thread stacks. This ties in with the fact that chan_iax2 is heavily threaded. I will upload my map files as evidence. Please comment.
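
(A self-contained way to reproduce the paired-mapping pattern, assuming the thread-stack mechanism guessed at above rather than anything taken from the attached patches: create joinable threads that exit without ever being joined, then count the leftover ---p guard pages.)

{code}
/* Demo (not Asterisk code; build with: cc -pthread demo.c): each joinable
 * thread that exits un-joined leaves its stack mapping plus a 4 KB
 * PROT_NONE guard page behind, reproducing the ---p / rw-p pairs seen
 * in /proc/PID/maps. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	return NULL; /* exits joinable; its stack stays mapped until pthread_join() */
}

int main(void)
{
	char cmd[64];
	for (int i = 0; i < 100; i++) {
		pthread_t t;
		pthread_create(&t, NULL, worker, NULL);
		/* deliberately no pthread_join() / pthread_detach() */
	}
	sleep(1); /* let the workers finish */
	snprintf(cmd, sizeof(cmd), "grep -c -- ---p /proc/%d/maps", (int)getpid());
	system(cmd); /* counts the ---p guard-page mappings left behind */
	return 0;
}
{code}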


By: Alex Villacís Lasso (a_villacis) 2011-06-28 17:45:08.283-0500

Map files for the ballooning Asterisk process. Gzipped because the 137 KB file is being rejected.

By: Alex Villacís Lasso (a_villacis) 2011-06-28 18:00:49.609-0500

In Asterisk 1.8.4.2, the only users of ast_sched_thread_create() are:
main/ccss.c
res/res_stun_monitor.c
channels/chan_iax2.c

Maybe the bug affects all of the above, but the only major user is IAX2...?

By: Alex Villacís Lasso (a_villacis) 2011-06-28 18:09:30.439-0500

Another thing I have just noticed: an attempt to unload chan_iax2 from the Asterisk console, with the 20 iaxmodem instances connected and MALLOC_DEBUG compiled in, results in Asterisk crashing.

By: Alex Villacís Lasso (a_villacis) 2011-06-28 18:43:00.487-0500

The size of one of the mappings (266240 bytes) is very close to AST_STACKSIZE (245760 on 32-bit). I see this as further evidence of thread resource mismanagement.
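
(To check the arithmetic: the AST_STACKSIZE definition below follows the 1.8-era include/asterisk/utils.h, so verify it against your tree, and the guess about the extra pages is mine, not from this report.)

{code}
#include <stdio.h>

/* Definition as in 1.8-era include/asterisk/utils.h (verify locally): */
#define AST_STACKSIZE (((sizeof(void *) * 8 * 8) - 16) * 1024)

int main(void)
{
	/* 32-bit: (4*8*8 - 16) * 1024 = 245760 bytes = 60 pages.
	 * The observed mappings are 266240 bytes = 65 pages; the extra
	 * 5 pages plausibly hold glibc's per-thread TLS/descriptor area. */
	printf("AST_STACKSIZE = %zu\n", (size_t)AST_STACKSIZE);
	return 0;
}
{code}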

By: Arnd Schmitter (arnd) 2011-06-28 21:46:25.359-0500

It seems to be related to the following bug: ASTERISK-17825


By: Alex Villacís Lasso (a_villacis) 2011-06-29 14:40:04.997-0500

Patch to auto-detach a thread when it is about to exit without an explicit stop. I have been testing this patch for about 2 hours now, and it seems to work.
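
(The idea in sketch form; dynamic_thread and stopped_by_module are made-up names used for illustration, so see the attached asterisk-1.8.4.4-chan_iax2-detach-thread-on-non-stop-exit.patch for the real change:)

{code}
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the worker's main loop; returns nonzero only
 * when the module is shutting down and will pthread_join() the thread. */
static int stopped_by_module(void *data) { (void)data; return 0; }

static void *dynamic_thread(void *data)
{
	if (!stopped_by_module(data)) {
		/* Nobody will ever pthread_join() this thread: detach it so
		 * its ~240 KB stack and guard page are reclaimed on exit. */
		pthread_detach(pthread_self());
	}
	return NULL;
}
{code}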

By: Rick Dale (rjdalejr) 2011-06-29 19:47:05.632-0500

Would this patch be applicable for the 1.6 branch as well? I have the same issue on 1.6.2.17.2.

By: Alex Villacís Lasso (a_villacis) 2011-06-30 11:05:50.893-0500

Rick Dale: it seems to apply correctly, even if at a different offset:

[alex@srv64 asterisk-memleak]$ tar -xzf asterisk-1.6.2.19.tar.gz
[alex@srv64 asterisk-memleak]$ cd asterisk-1.6.2.19
[alex@srv64 asterisk-1.6.2.19]$ patch -p1 --dry-run < ../asterisk-1.8.4.4-chan_iax2-detach-thread-on-non-stop-exit.patch
patching file channels/chan_iax2.c
Hunk #1 succeeded at 11313 (offset -360 lines).

Thread-wise, chan_iax2 seems identical in 1.6.2.x and 1.8.4.4, so I guess it fixes the bug in 1.6.2.x as well.

Now testing patched 1.8.4.4 for a full day: no ill effects, and the memory leak seems cured for good.