ASTERISK-24555: Memory leak with T.38 fax and SLINEAR format

[Home]

Summary: ASTERISK-24555: Memory leak with T.38 fax and SLINEAR format

Reporter: James Lamanna (jlamanna1576) Labels:

Date Opened: 2014-11-24 22:11:02.000-0600 Date Closed: 2018-01-04 05:24:45.000-0600

Priority: Major Regression?

Status: Closed/Complete Components: Channels/chan_iax2 Core/General Resources/res_fax

Versions: 11.14.0 13.18.4 Frequency of
Occurrence Constant

Related
Issues:
is related to ASTERISK-19133 Memory leak using asterisk T.38 to/from T.30 gateway

Environment: Ubuntu 11.10 x64 Kernel 3.0.0-32 cat /proc/cpuinfo: $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz stepping : 7 cpu MHz : 2400.074 cache size : 4096 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow bogomips : 4800.14 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz stepping : 7 cpu MHz : 2400.074 cache size : 4096 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow bogomips : 4800.18 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz stepping : 7 cpu MHz : 2400.074 cache size : 4096 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow bogomips : 4800.18 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz stepping : 7 cpu MHz : 2400.074 cache size : 4096 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow bogomips : 4800.18 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: Attachments: ( 0) ASTERISK-24555-11.diff

Description: The memory allocation of Asterisk 11 seems to constantly increase.
Specifically, the cache allocated in frame.c seems to increase without bound. I haven't let it go this far, but it seems it would likely max out system memory and be killed by the OOM killer.

{noformat}
$ asterisk -rx "memory show summary" | sort -rn
3950471103 bytes allocated (3923172359 in caches) in 5289609 allocations
3922949255 bytes (3922882279 cache) in 5184358 allocations in file frame.c
..

$ asterisk -rx "core show channels"
8 active channels
4 active calls
14745 calls processed

$ asterisk -rx "core show uptime"
System uptime: 2 days, 14 hours, 17 minutes, 32 seconds
Last reload: 2 days, 14 hours, 17 minutes, 32 seconds
{noformat}

Comments: By: James Lamanna (jlamanna1576) 2014-11-28 13:33:35.961-0600

A quick valgrind run after attempting a couple of faxes left several entries like this in the valgrind log:

{noformat}
==30640== 211,203 bytes in 279 blocks are definitely lost in loss record 1,772 of 1,789
==30640== at 0x4C267CC: calloc (vg_replace_malloc.c:467)
==30640== by 0x4DC50E: ast_frdup (utils.h:523)
==30640== by 0x47125F: __ast_queue_frame (channel.c:1284)
==30640== by 0x1EF75589: __do_deliver (chan_iax2.c:3102)
==30640== by 0x1EF76C5A: schedule_delivery (chan_iax2.c:4374)
==30640== by 0x1EF8F497: socket_process_helper (chan_iax2.c:12010)
==30640== by 0x1EF99C37: iax2_process_thread (chan_iax2.c:12030)
==30640== by 0x56C458: dummy_start (utils.c:1192)
==30640== by 0x5E359C9: start_thread (pthread_create.c:300)
==30640== by 0x271A66FF: ???
{noformat}
By: Mark Michelson (mmichelson) 2014-12-01 09:51:47.053-0600

I have to admit that frame cache internals and chan_iax2 are not two of my strongest points in the Asterisk code-base, but I may be able to shed a bit of light on things here.

The frame code operates by having a thread-local cache of up to 10 frames (unless a compile-time constant is changed to a different value). This way, when a thread wants to duplicate a frame, it may be able to grab a cached structure instead of having to allocate a new one. Similarly, when a frame is freed, if there is room in the cache, then the frame is not freed but instead added to the cache. The cache is limited to 10 frames, so any thread that processes frames could presumably "leak" 10 frames since they are purposely not freed when requested. Since the cache is thread-local, a destructor function to free the frames is called when the thread is reclaimed.

Interestingly, taking a glance at chan_iax2 code, it appears that the iax2_process_threads are only reclaimed when the module is unloaded. By default, the max thread count is 100 (and can be configured up to 256 if desired). If each of these threads is queuing frames, then you could end up with 10 * 100 = 1000 cached frames, or 10 * 256 = 2560 cached frames. The problem is, even with those numbers, the numbers you're seeing are much larger than the memory that would be allocated by 2560 frames. Plus, as you have shown in your "memory show summary" output, there are apparently over 5 million allocations in frame.c that are contributing to the count.

My guesses here are either:
1) There are lots of threads that are queuing frames beyond just the chan_iax2 threads, and these threads are not being reclaimed. If you connect to your running Asterisk process with gdb and issue a "thread apply all backtrace" command, then you should see a list of all running threads. If the number of threads is ridiculous, then this may be a reasonable explanation of what's happening, and looking into why threads are not being reclaimed would be worthwhile.
2) Somewhere in the code path, a frame is being duplicated without a corresponding call to free it. In this case, the frame cache code isn't really the issue. The problem is that there's a straight memory leak that needs to be fixed up.

My guess here is that number 2 is the more reasonable explanation, especially given the rate at which the memory is being allocated. As you've noted on the -dev list though, discovering this sort of thing can be tricky.
By: James Lamanna (jlamanna1576) 2014-12-01 10:30:08.116-0600

This morning with an uptime of 12 hours and only 3468 calls processed I have 69 threads.
Allocations: 357744087 bytes ( 357543159 cache) in 472538 allocations in file frame.c
I'll check back this evening after processing a full day of faxes.

By: Mark Michelson (mmichelson) 2014-12-02 09:46:36.665-0600

I think it's safe to say that the problem isn't orphaned threads then. It sounds like a missing ast_frfree() or ast_frame_free() to me.
By: James Lamanna (jlamanna1576) 2014-12-02 11:12:30.732-0600

Definitely looks that way. I'll take a look but I'm not familiar with the code at all.
I didn't see this behavior in 10.x so maybe I can find something that changed.

Hopefully someone can help get this resolved. Makes this Asterisk completely unusable for me..
By: James Lamanna (jlamanna1576) 2014-12-03 14:22:26.792-0600

This appears to be an issue with using SIP/ulaw -> IAX/slin.
When I changed by IAXModems to use ulaw rather than slin (letting IAXModem do the conversion) the huge memory leakage appears to have subsided.

By: James Lamanna (jlamanna1576) 2014-12-03 15:07:05.033-0600

Wondering if this is similar to ASTERISK-19133
By: Matt Jordan (mjordan) 2014-12-03 16:26:14.664-0600

I can't recall. Are you using T.38 gateway?
By: Matt Jordan (mjordan) 2014-12-03 16:50:18.472-0600

So, I took a look at that area of the code again. I think there is another frame leak here that I missed when I wrote the patch for ASTERISK-19133. The code looks a bit different, but this is suspicious:

{code}
/* framehooks are called in __ast_read() before frame format
* translation is done, so we need to translate here */
if ((f->frametype == AST_FRAME_VOICE) && (f->subclass.format.id != AST_FORMAT_SLINEAR)
&& (readtrans = ast_channel_readtrans(active))) {
if ((f = ast_translate(readtrans, f, 1)) == NULL) {
f = &ast_null_frame;
ao2_ref(details, -1);
return f;
}
/* XXX we ignore the return value here, perhaps we should
* disable the gateway if a write fails. I am not sure how a
* write would fail, or even if a failure would be fatal so for
* now we'll just ignore the return value. */
gateway->s->tech->write(gateway->s, f);
ast_frfree(f);
} else {
gateway->s->tech->write(gateway->s, f);
}

f = &ast_null_frame;
ao2_ref(details, -1);
return f;
{code}

A few comments on this block:
* When we call {{ast_translate}}, we're calling it with the {{consume}} parameter set to true. That will automatically call {{ast_frfree}} on {{f}} when the function returns the translated frame. If we're calling this here, then we think that {{f}} has the potential to be {{malloc}}'d.
** Given how often a frame is dup'd or isolated, this is a safe bet.
* In that case, let's assume we don't call {{ast_translate}} here - which would be the case if we had a {{SLINEAR}} frame. In that case, we write the frame out to the fax technology, and then *immediately* blow over it with the {{ast_null_frame}}. That seems wrong. The gateway is never going to free the frame for us.
* We then return the {{ast_null_frame}}. The caller of this will be invoking the framehooks as such:
{code}
/* Perform the framehook write event here. After the frame enters the framehook list
* there is no telling what will happen, how awesome is that!!! */
if (!(fr = ast_framehook_list_write_event(ast_channel_framehooks(chan), fr))) {
res = 0;
goto done;
}
{code}
So {{fr}} is definitely nuked out by the invocation here. We're not going to free it anywhere else.

I've attached a patch that *may* fix this memory leak in your original setup. If you can test it out, that'd be hugely appreciated. Thanks!
By: James Lamanna (jlamanna1576) 2014-12-03 16:53:50.749-0600

I am using the T.38 gateway when the provider supports T.38. It is still passing through res_fax however, even in V.21 mode.
I'll apply this patch tonight, revert back to slin, and let you know how it goes.
By: Matt Jordan (mjordan) 2014-12-03 17:15:22.310-0600

By the way: if this doesn't completely resolve all of the memory leaks, I suspect there are more in this section of code. Let me know if it makes a bit of a difference at least however - I'm running some tests on a more aggressive patch now.
By: Corey Farrell (coreyfarrell) 2014-12-03 19:35:56.177-0600

The frame cache interferes with the results of valgrind. The leak backtrace you see likely isn't the code path that called {{ast_frdup}} without calling {{ast_frfree}}, it's more likely just the first caller of {{ast_frdup}} in the thread. Disabling saving frames to the cache in {{__frame_free}} should fix this. It will have some effect on performance but that shouldn't matter since you'd only do this for testing with valgrind.

[~mjordan] how would you feel about having frame.c support disabling cache through defined(LOW_MEMORY) || defined(DEBUG_VALGRIND)? Or something similar, some way to add a #define separate from LOW_MEMORY for testing purposes. DEBUG_VALGRIND would be for conditional code changes that could effect performance but fixes faulty results like incorrect leak backtraces caused by caches.
By: James Lamanna (jlamanna1576) 2014-12-04 07:39:58.699-0600

The attached patch results in Asterisk crashes.
By: Matt Jordan (mjordan) 2014-12-04 09:42:53.233-0600

[~jlamanna]: Nuts. Sorry about that. Running it through some fax tests didn't result in a crash on my system, but I might have just gotten lucky.

I'll take another look at this. I still suspect something isn't free'ing the frames appropriately in the framehook, but the usage of the frames isn't what I expected.

[~coreyfarrell]: I'm not against the notion - any tools we have to make debugging things easier is always good - but I'd want to be pretty careful about that. Generally, I'm against new compilation directives for simplicity alone, but I'm also not sure what to combine such an idea with. Can you think of other {{DEBUG_VALGRIND}} variants? If so, a discussion on the -dev list may be a good idea.
By: James Lamanna (jlamanna1576) 2014-12-05 08:34:50.951-0600

The key with this memory leak is that it goes through res_fax AND there is a codec translation. In my case, there was a translation from ulaw to slin. Once I stopped the codec translation in Asterisk, the memory leak went away.
By: dtryba (dtryba) 2015-01-15 10:23:02.768-0600

I think I'm affected by this bug, patching with ASTERISK-24392 was not enough to stop leaking. But fun fact is that source and destination agree on alaw (the only joint capability), but while FAXOPT(gateway) is attached the read/write format is slin. This machine runs out of memory every 2 weeks since FAXOPT(gateway) was enabled.

{quote}
res_fax.c: channel 'SIP/inas1sbc01-00003fe6' setting FAXOPT(gateway) to 'yes,10'
res_fax.c: Selected FAX technology module (Spandsp FAX Driver) does not support reserving sessions.
res_fax.c: Attached T.38 gateway to channel SIP/inas1sbc01-00003fe6.
features.c: Removing dialed interfaces datastore on SIP/sip.pocos.nl-00003fe8 since we're bridging
channel.c: Set channel SIP/inas1sbc01-00003fe6 to read format slin
channel.c: Set channel SIP/inas1sbc01-00003fe6 to write format slin
channel.c: Set channel SIP/sip.pocos.nl-00003fe8 to read format slin
channel.c: Set channel SIP/sip.pocos.nl-00003fe8 to write format slin
res_rtp_asterisk.c: Setting the marker bit due to a source update
res_rtp_asterisk.c: Setting the marker bit due to a source update
res_rtp_asterisk.c: 0x7f70efc06130 -- Probation learning mode pass with source address 109.235.32.40:35260
res_rtp_asterisk.c: Ooh, format changed from unknown to alaw
res_rtp_asterisk.c: Created smoother: format: alaw ms: 20 len: 160
res_rtp_asterisk.c: Starting RTCP transmission on RTP instance '0x7f70f08f1418'
chan_sip.c: Stopping retransmission on 'SDsefm801-282901960c8ebe0ac043c1da4b1dc7df-a8b8583' of Response 98699154: Match Found
res_rtp_asterisk.c: 0x7f70f0c8eb90 -- Probation learning mode pass with source address 139.156.204.193:23766
res_rtp_asterisk.c: Ooh, format changed from unknown to alaw
res_rtp_asterisk.c: Created smoother: format: alaw ms: 20 len: 160
res_fax.c: no fax activity between SIP/inas1sbc01-00003fe6 and SIP/sip.pocos.nl-00003fe8 after 10000 ms, disabling gateway
channel.c: Set channel SIP/inas1sbc01-00003fe6 to read format alaw
channel.c: Set channel SIP/sip.pocos.nl-00003fe8 to read format alaw
channel.c: Set channel SIP/sip.pocos.nl-00003fe8 to write format alaw
channel.c: Set channel SIP/inas1sbc01-00003fe6 to write format alaw
{quote}
By: Corey Farrell (coreyfarrell) 2017-12-19 20:40:49.018-0600

A reference leak related to v21 was fixed by ASTERISK-26141 in 2016, is this still an issue in current versions of asterisk?
By: Etienne Allovon (etienne_pf) 2018-01-04 03:34:53.109-0600

Hello,

We were tracking this issue as we saw the problem on several asterisk at customer site.
Following [~coreyfarrell] comment I checked on several production asterisk :
- the problem is no longer seen on two asterisk that were upgraded in 13.10.0
- whereas the problem is still seen on an asterisk that is still in 13.7.2

So, IMHO, I would answer, to [~coreyfarrell] above question: _No, it is no longer an issue in current asterisk 13 version_