[Home]

Summary:ASTERISK-24307: Unintentional memory retention in stringfields
Reporter:Etienne Lessard (hexanol)Labels:
Date Opened:2014-09-08 10:51:38Date Closed:2014-11-06 03:07:37.000-0600
Priority:CriticalRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/General Core/AstMM
Versions:SVN 1.8.31.1 11.12.0 12.6.1 13.0.0-beta3 Frequency of
Occurrence
Related
Issues:
causesASTERISK-24535 stringfields: Fix regression from fix for unintentional memory retention and another issue exposed by the fix
is related toASTERISK-24152 memory leak and segfault with in libpthread-2.17.so
Environment:Attachments:( 0) ASTERISK-24307.patch
( 1) ASTERISK-24307-more-agressive-r2.patch
( 2) sip.conf
Description:I could also have called this ticket "memory leak in chan_sip on sip reload", but since this not the typical memory leak where the reference to the memory is defintely lost, I used another name. Anyway.

Here's how I'm reproducing it:

{noformat}
Given I have a Debian wheezy i686 with asterisk 11.12.0
Given I have a modules.conf with only permits loading of chan_sip.so (optional)
Given I have a sip.conf with 500 peers
Then the output of "ps -p $(pidof asterisk) --no-headers -o rss,vsz" is:
  21072  30616
After 100 "sip reload", it is:
  25056  34676
After 500 "sip reload", it is:
  39700  50656
After 1000 "sip reload", it is:
  73728 110704
After a "module unload chan_sip.so", it is:
  20952  29716
{noformat}

Note that there's nothing else going on on the asterisk process -- no AMI connection, no SIP packet received, etc.

I've taken a look at the code and done some testing, and the problem comes from how chan_sip manipulates the string fields in the sip_peer structure during a reload for a peer that hasn't changed. In the build_peer function, it first tries to reset these values to their default by calling the "set_peer_defaults" function, and then sets the new values. The current implementation of string fields doesn't handle well that scenario and this is leading to an unintentional memory retention, i.e. the string fields pool for each peer keeps growing on each sip reload.

One naive way to "fix" this problem would be to call "ast_string_field_init(peer, 512)" just before calling set_peer_defaults, but this clears all string fields, which is probably not something we want to do since they are not all repopulated on sip reload, like for example, the "useragent" or "fullcontact" field.

I found this problem while looking for a suspected memory leak we were observing on a production asterisk server.  The (generated) sip.conf on this server has a lot of peers, and most of the time only one peer gets added/modified/deleted for each sip reload done.  The sip reload operation happens quite frequently, i.e. more than 250 times per week. After a few weeks of uptime, there's a noticeable increase in the memory used by the process.
Comments:By: Etienne Lessard (hexanol) 2014-09-08 10:53:47.163-0500

Attached example sip.conf with 500 peers

By: ibercom (ibercom) 2014-09-09 07:38:14.135-0500

I think that this problem also affects too much "realtime peers", because the build_peer function is used a lot of in this case. And not just "sip reload".


By: dominic paquette (dpaquette) 2014-09-09 08:40:30.592-0500

We also use realtime peers extensivelly and are having massive memory leaks and build_peer was if I recall coming up in valgrind as a possbile source of the leak. I will have to redo a few tests to confirm, but I will posts my results.

By: ibercom (ibercom) 2014-10-01 13:57:23.356-0500

[~hexanol] Have you anything more ? Any solutions ?

By: Etienne Lessard (hexanol) 2014-10-06 12:32:40.655-0500

[~ibercom] Nope. I did not work on finding a solution to the problem.

By: Corey Farrell (coreyfarrell) 2014-10-07 05:23:38.911-0500

I've only briefly looked into this issue, I do not believe {{chan_sip}} is the issue.  Comment before {{add_string_pool()}} in {{main/utils.c}}:
{code}
/*! \brief add a new block to the pool.
* We can only allocate from the topmost pool, so the
* fields in *mgr reflect the size of that only.
*/
{code}

I suspect this is the cause of continuous expansion of memory in use by stringfields, and in theory would effect all code that uses stringfields for data that changes.  I don't have time to work on this now, I hope to have a chance soon.

As a work-around, one thing that might help is to increase the size of the allocation used by chan_sip peer's.  For example if you increased from 512 to 1024 or 2048, that might make the first memory pool large enough to prevent needing to add more pools.  I'm unsure, if someone wanted to test this I'd be interested in the results.  It would either allow sip peers to stay within a single memory pool, or make the problem worse by allocating more memory per pool.

By: ibercom (ibercom) 2014-10-15 12:43:05.251-0500

[~coreyfarrell]: I tried the work-around and it didn't improve. I increased from 512 to 2048 in "ast_string_field_init(peer, 512)". The system continues to leak MORE memory.


By: Corey Farrell (coreyfarrell) 2014-10-20 01:07:28.815-0500

I've just tested this using a sip.conf with 3 peers and a loop that did {{sip reload}} 1000 times, touching sip.conf before each reload.  I can confirm that this issue exists for static peers.  Not every reload caused increased allocations, this could be due to space still being available in the existing allocations.  I did this test with {{MALLOC_DEBUG}}, unfortunately all stringfield operations appear to be coming from the line that called ast_string_field_init, so this doesn't give helpful information.  So unfortunately I have not determined the root cause or a fix yet.

By: Etienne Lessard (hexanol) 2014-10-20 06:45:23.122-0500

I originally created a small asterisk module to confirm that the leak was in the stringfield usage. Here it is, if it can be useful: https://github.com/elessard1/asterisk-lab/blob/master/examples/lab_stringfields_leak.c

The idea is to repetitively call the asterisk CLI command "lab test", which calls ast_string_field_set twice (similar to what chan_sip does when reloading a static peer) and print some debug information about the stringfield pool usage.

By: Corey Farrell (coreyfarrell) 2014-10-21 03:55:09.676-0500

[~hexanol]: Thanks for the test code, it's helped narrowed the issue further.  The issue is not caused by setting the stringfield twice, caused by setting it empty.  After it's empty (where active == 0), stringfields are failing to reclaim space.

Please try [^ASTERISK-24307.patch].  I've only tested with code derived from lab_stringfields_leak.c.  WIth this patch 'used' increased until the first pool was completely used.  A second pool was allocated, then the next operation caused the first pool to be freed.  This updated code has slightly higher overhead, but if it avoids continuously increasing memory usage then it's worth it.  This patch likely has effects on *many* parts of Asterisk.

By: Corey Farrell (coreyfarrell) 2014-10-21 04:10:26.912-0500

[~hexanol]: please also check out [^ASTERISK-24307-more-agressive.patch].  This patch has a higher potential to cause crashes or memory corruption, but it also does a better job (against your test) at keeping the memory to an absolute minimum.

By: Etienne Lessard (hexanol) 2014-10-21 06:32:21.571-0500

I'll probably have the time to test the patches tomorrow.

By: Corey Farrell (coreyfarrell) 2014-10-26 23:39:18.150-0500

[^ASTERISK-24307-more-agressive-r2.patch] fixed for segfault caused by use of {{pool}} after free.

By: Corey Farrell (coreyfarrell) 2014-10-27 03:20:51.327-0500

Review for latest patch posted to https://reviewboard.asterisk.org/r/4114/

By: Etienne Lessard (hexanol) 2014-10-27 14:05:17.095-0500

I've just done a few tests. Looks positive. Memory is not "leaking" anymore when reloading sip peers. I've ran asterisk under valgrind and this looks fine.

By: Etienne Lessard (hexanol) 2014-10-27 14:33:19.856-0500

That said, it looks like it is still "leaking" a little, but there's now an upper bound. When doing the same test I originally described (with the sip.conf I attached), I get these results:

{noformat}
After 0 "sip reload", the output of "ps -p $(pidof asterisk) --no-headers -o rss,vsz" is:
  20568  30652
After 30 "sip reload", it is:
  24548  34576
After 400 "sip reload", it is:
  24548  34576
{noformat}

I don't have an explanation for this behaviour.

By: ibercom (ibercom) 2014-10-27 15:00:20.441-0500

I have used ASTERISK-24307-more-agressive.patch "version 1" without problems.
1200 realtime peers for 4 days and 18000 processed calls without segfault.
It looks like it is still "leaking" although less, more slowly.

Thank you Corey, you are "leak's man" :-)

By: Corey Farrell (coreyfarrell) 2014-10-27 15:22:22.214-0500

You should switch to version 2 immediately.  The testsuite easily caused segfaults (crashes) with version 1 of the patch.

By: ibercom (ibercom) 2014-11-03 05:06:33.596-0600

Corey, is there another version of the patch ?
I have seen https://reviewboard.asterisk.org/r/4114/ but it has the same version than here. Although it has new suggestions for the problem.


By: Corey Farrell (coreyfarrell) 2014-11-04 07:21:48.591-0600

I've changed the summary of this bug so release summaries accurately reflect that this effects the system as a whole, not just chan_sip.