
Summary:      ASTERISK-25905: Memory leak during perf testing
Reporter:     Robert McGilvray (rmcgilvray)
Labels:
Date Opened:  2016-04-07 16:41:03
Date Closed:  2016-04-19 11:07:08
Priority:     Minor
Regression?:
Status:       Closed/Complete
Components:   Applications/app_confbridge, pjproject/pjsip
Versions:     13.8.0
Frequency of Occurrence: Constant
Related Issues: is related to ASTERISK-26020 memory leak
Environment:  Red Hat Enterprise Linux Server release 7.2 (Maipo), Linux ykt1cfbprd1 3.10.0-327.13.1.el7.x86_64, certified/13.8-cert1-rc1, pjproject-2.4.5
Attachments:  ( 0) loadtest.txt
              ( 1) memory-summary.txt
Description:** I've been testing against the certified branch, last cloned yesterday from certified/13.8-cert1-rc1. The issue tracker would not allow me to select that as a version, however. **

While using sipp as a load generator to test Asterisk, I've come across a memory leak that very quickly exhausts the host's resources.

The testing methodology is pretty simple: use sipp to launch 1500 concurrent calls to Asterisk at a call rate of 25/sec. On the Asterisk side, use the RAND function to generate two numbers: one is the ConfBridge conference number, and the other (either 0 or 1) determines whether to use the moderator profile or the participant profile. The call is then dropped into a ConfBridge for 60s and hung up.
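A minimal dialplan sketch of that scenario might look like the following. The context name, profile names, and conference-number range below are illustrative placeholders, not the reporter's actual test configuration:
{noformat}
[loadtest]                               ; hypothetical context name
exten => _X.,1,Answer()
 same => n,Set(CONFNUM=${RAND(1,100)})   ; random conference number (range assumed)
 same => n,Set(ISMOD=${RAND(0,1)})       ; 0 = participant, 1 = moderator
 same => n,ExecIf($[${ISMOD}=1]?Set(PROFILE=moderator_profile):Set(PROFILE=participant_profile))
 same => n,Set(TIMEOUT(absolute)=60)     ; cap the call at 60s (the sipp scenario can also hang up on its own)
 same => n,ConfBridge(${CONFNUM},default_bridge,${PROFILE})
 same => n,Hangup()
{noformat}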

After a few thousand completed calls the memory usage grows and eventually exhausts the host's resources. I recompiled with MALLOC_DEBUG enabled; the output of {{memory show allocations}} is attached. It looks like the allocations are in stasis_channels, well after all channels have been disconnected.
{noformat}
ykt1cfbprd1:/home/netops# ps -C asterisk u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
asterisk 20927  113  9.2 6292308 3029204 ?     Sl   16:32  24:47 /home/asterisk/asterisk-cert-13.8/sbin/asterisk -f -C /home/asterisk/asterisk-cert
root     32052  0.0  0.0  47428  2840 pts/0    S+   16:43   0:00 rasterisk risk/asterisk-cert-13.8/sbin/asterisk -r
{noformat}

Please let me know if you need any further information.

Thanks!!

Comments:By: Asterisk Team (asteriskteam) 2016-04-07 16:41:04.080-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Robert McGilvray (rmcgilvray) 2016-04-07 21:38:18.099-0500

I was able to reproduce the issue on certified/13.1-cert4.

Asterisk processed around 22k calls before consuming 38 GB of RAM; things fell apart shortly after.
{noformat}
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
9842 asterisk  20   0 37.780g 0.029t   1684 S  64.2 95.8  58:14.80 asterisk
{noformat}


By: Richard Mudgett (rmudgett) 2016-04-13 22:21:41.139-0500

In my setup, I was able to reproduce what seems to be the same thing. The output of “core show taskprocessors” demonstrates that the CDR engine is working off a backlog of tasks due to the continuous stream of calls from the stress test. Working off that backlog can take quite some time once the bombardment of calls stops. Unfortunately, “core show taskprocessors” only gives meaningful task processor names in newer Asterisk versions, whereas it’s a lot of meaningless UUID names in v13.1-cert.

Once the CDR engine has worked off the backlog, the memory reported by MALLOC_DEBUG goes back to a normal level. However, the memory is not given back to the operating system by glibc's allocator; the glibc allocator does not normally return memory to the OS. So when Asterisk gets into this situation and consumes a large portion of the system memory, the memory is not returned to the OS until Asterisk is restarted.

There is a glibc-specific malloc_trim() function that tells the allocator to return free memory at the end of the heap back to the OS. For testing purposes, I created a temporary CLI command to invoke the malloc_trim() function. My test machine has 3GB of memory. After a stressful pounding that grabbed 62% of available memory, malloc_trim() gave back all but 2.6% to the OS once all the channels and the taskprocessor backlog were gone. That 2.6% is about the expected size of the heap before the backlog started to explode. The amount of memory released shows that there isn't a systemic memory leak, as a leak like that would cause severe heap fragmentation and prevent the release of that much memory.
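For readers who want to try the same experiment, a rough sketch of such a throwaway CLI command is below. This is not the actual code used here; the module name ({{res_malloctrim}}), command name, and messages are invented, and it assumes the stock Asterisk 13 CLI/module registration API plus glibc's {{malloc.h}}:
{noformat}
/* res_malloctrim.c - hypothetical test module; AST_MODULE is normally set by the build system */
#define AST_MODULE "res_malloctrim"

#include "asterisk.h"
#include "asterisk/module.h"
#include "asterisk/cli.h"
#include "asterisk/utils.h"

#include <malloc.h>   /* glibc malloc_trim() */

static char *handle_malloc_trim(struct ast_cli_entry *e, int cmd, struct ast_cli_args *a)
{
    switch (cmd) {
    case CLI_INIT:
        e->command = "malloc trim";
        e->usage =
            "Usage: malloc trim\n"
            "       Ask glibc to return free memory at the end of the heap to the OS.\n";
        return NULL;
    case CLI_GENERATE:
        return NULL;
    }

    /* malloc_trim(0) returns nonzero if any memory was actually released. */
    if (malloc_trim(0)) {
        ast_cli(a->fd, "Some free heap memory was released back to the system.\n");
    } else {
        ast_cli(a->fd, "No memory could be released.\n");
    }
    return CLI_SUCCESS;
}

static struct ast_cli_entry cli_malloctrim[] = {
    AST_CLI_DEFINE(handle_malloc_trim, "Release free heap memory back to the OS"),
};

static int load_module(void)
{
    if (ast_cli_register_multiple(cli_malloctrim, ARRAY_LEN(cli_malloctrim))) {
        return AST_MODULE_LOAD_DECLINE;
    }
    return AST_MODULE_LOAD_SUCCESS;
}

static int unload_module(void)
{
    return ast_cli_unregister_multiple(cli_malloctrim, ARRAY_LEN(cli_malloctrim));
}

AST_MODULE_INFO_STANDARD(ASTERISK_GPL_KEY, "malloc_trim test CLI command");
{noformat}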

Please note that MALLOC_DEBUG does have a performance penalty.  The backlog shown in the task processor queues will go down faster when MALLOC_DEBUG is disabled.

# There is no memory leak.  All that extra memory is being consumed by the backlog of tasks in the task processors.  Once the backlog clears, the memory usage goes back to normal.  But the memory isn't returned to the OS by the glibc allocator.
# It is certain, though, that Asterisk is being delivered more calls than it can handle for a sustained period. This is not a particular fault of Asterisk.

By: Robert McGilvray (rmcgilvray) 2016-04-14 09:22:40.326-0500

Richard,

First, thank you for looking at this. Your analysis makes sense, but it does raise an issue for me: I need high concurrency and I also need CDR logs. Once you pointed out that CDR was at the root of the issue I did more testing: without CDR I can easily run 1800 concurrent calls and Asterisk is happy, never going beyond 500M RES.

Granted, my load tests generate a higher call rate than normal (60s call duration), so under normal circumstances we wouldn't have as many calls in the same time period. Still, it seems CDR and high concurrency don't mix at all.

I just did a test where I ran 1500 concurrent calls for a total of around 15k. The task processor queue grew to a depth of over 9 million (9127584 to be exact); almost 25 minutes after pausing all new calls, the queue is still deeper than the total number of calls processed.

93777b48-6756-4e4b-a723-65ffb8358e2d             4758483      5909429      9127584

I see around 1500 PJSIP channels in 'cdr show active', but I'm not sure what the box is actually doing. It's not writing the CDRs anywhere (the CSVs are empty, no backends loaded), yet one of my CPUs is pegged at 100% while the other 7 are idle. It seems the code that handles the CDR records and processes the backlog is single-threaded, CPU-bound and incredibly inefficient.

If this isn't a memory leak, can anything be done about the single-threaded code or the speed at which Asterisk processes the backlog? I don't think I'm doing anything crazy here; 1000-1500 concurrent calls shouldn't be an issue with the hardware I'm throwing at Asterisk: 8 vCPUs @ 3.30 GHz and 32 GB of RAM.

In my production environment we peak at 400 concurrent calls daily with chan_sip and MeetMe. We have 3 CallManager clusters (>5000 users) that all feed into Asterisk for conferencing, so I need this to scale. Any help you can provide to make that happen is much appreciated.

Regards

By: Robert McGilvray (rmcgilvray) 2016-04-14 09:56:29.546-0500

Update: it took about an hour for Asterisk to clear the task processor queue. Keep in mind this was only 15k calls, with zero new calls in the last hour.

93777b48-6756-4e4b-a723-65ffb8358e2d            10667913            0      9127584

By: Richard Mudgett (rmudgett) 2016-04-14 12:29:34.729-0500

If you still have MALLOC_DEBUG enabled, you should disable it for performance testing.  MALLOC_DEBUG writes over every freed block of memory with a fill pattern and holds onto the block for a while in a deferred free pool.  Access to the deferred free pool serializes freeing memory.  These necessary actions of MALLOC_DEBUG reduce performance because Asterisk allocates and frees a lot of memory blocks from a lot of threads.

Unless you disable CDRs in the {{cdr.conf}} file, CDRs are generated.  If you don't have a backend enabled, the generated records have nowhere to go but the bit bucket.  The CDR engine's task processor queue holds the raw events from the calls the engine monitors to generate CDRs.  There can be only one thread processing that queue to generate CDRs.
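For anyone following along, turning the CDR engine off entirely is a one-line change in {{cdr.conf}} (option name per the stock sample configuration):
{noformat}
[general]
; Stop generating CDRs altogether; this also stops the raw call events
; that feed the single-threaded CDR engine task processor.
enable = no
{noformat}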

I don't think CDRs are a good fit for your situation because you wind up with a combinatorial explosion of CDRs to represent every pairing in a conference.  I suggest you look at CEL to create your own records of conferences.

https://wiki.asterisk.org/wiki/display/AST/Asterisk+12+CDR+Specification
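As an illustration of that suggestion, a minimal CEL setup using the cel_custom backend might look roughly like the sketch below; the event list and CSV field mapping are assumptions based on the stock sample configs, not a tested configuration:
{noformat}
; cel.conf
[general]
enable = yes
; Log only the events needed to reconstruct conference activity.
events = CHAN_START,CHAN_END,ANSWER,HANGUP,APP_START,APP_END,BRIDGE_ENTER,BRIDGE_EXIT
apps = ConfBridge

; cel_custom.conf
[mappings]
Master.csv => ${CSV_QUOTE(${eventtype})},${CSV_QUOTE(${eventtime})},${CSV_QUOTE(${CALLERID(num)})},${CSV_QUOTE(${CHANNEL(name)})},${CSV_QUOTE(${eventextra})}
{noformat}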

By: Richard Mudgett (rmudgett) 2016-04-14 12:57:20.177-0500

Also for CEL information:
https://wiki.asterisk.org/wiki/display/AST/Asterisk+12+CEL+Specification

By: Robert McGilvray (rmcgilvray) 2016-04-19 10:06:24.748-0500

Hey Richard,

You're exactly right. I guess the behavior of CDR changed from 1.8; I don't think the old CDR logic created entries for every pairing, and that's what's killing it.

Anyway, I replaced CDR with CEL and the issue has more or less gone away. I ran 1500 concurrent calls through the box overnight for a total of 1M calls. The RES size did grow to around 4 GB, but this was with the extended cel_sqlite3_custom module. I'm running the test again with just cel_custom to see if the same thing occurs.

Either way it's probably not a big deal; our weekly restart will clear up that small amount of extra memory.

{noformat}
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     19138  0.0  0.0  47348  2796 pts/3    S+   10:27   0:00 rasterisk r
asterisk 29127 39.9 13.3 9120652 4360128 ?     Sl   Apr14 2899:04 /home/asterisk/asterisk/certified-13.1-cert4/sbin/asterisk -f -C /home/asterisk/asterisk/certified-13.1-cert4

ykt1cfbprd1*CLI> core show calls
2413 of 5000 max active calls (48.26% of capacity)
1064261 calls processed
{noformat}

Thanks for your help.