[Home]

Summary:ASTERISK-22945: [patch] Memory leaks in chan_sip.c with realtime peers
Reporter:ibercom (ibercom)Labels:
Date Opened:2013-12-06 07:16:22.000-0600Date Closed:2014-10-01 04:57:40
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:1.8.28.2 11.6.0 11.10.0 Frequency of
Occurrence
Constant
Related
Issues:
is duplicated byASTERISK-23999 (Memory leak) Realtime peers are never unref
is duplicated byASTERISK-24317 Crash preceded by potential memory leak.
is related toASTERISK-24152 memory leak and segfault with in libpthread-2.17.so
Environment:Attachments:( 0) asterisk11.patch
( 1) asterisk11-chan_sip-simplifies.patch
( 2) cli-command.txt
( 3) memory-allocations.gz
Description:I have 2 servers with SIP realtime config and asterisk 1.8 and 11 each one.

Server 1.8 have about 2000 register peers without obvious problems.

Server 11 have about 20 register peers and chan_sip.c is constantly leaking memory with each call to the other peers.

The SIP realtime config is the same:

rtcachefriends=yes
rtsavesysname=yes
rtupdate=yes
rtautoclear=yes << I think than here is the problem
ignoreregexpire=yes

When Server 11 look for some peer (not register peer) in sippeers table, it works with this info. Another (+1) static (realtime?) object. When the registration expires, memory leak. You can look for the same peer after some time and asterisk allocates memory again and it isn't freed.

I have attached cli-command.txt which shows the asterisk's state with only 112 calls processed.
Comments:By: ibercom (ibercom) 2013-12-06 07:27:32.646-0600

In memory-allocations.gz file the line:

    18352 bytes allocated by           build_peer() line 30504 of chan_sip.c

is the key. One by each static object.


By: ibercom (ibercom) 2014-02-19 12:15:25.848-0600

Can I do anything to help here ?
It is difficult to me debug the sip code.
Where can i look for ?


By: ibercom (ibercom) 2014-03-10 09:13:07.328-0500

I have managed detect a reference leak related to realtime peers in chan_sip.

{noformat}
       /* Startup regular pokes */
       if (!devstate_only && realtime && peer->lastms > 0) {
-               sip_ref_peer(peer, "schedule qualify");
+               /* sip_ref_peer(peer, "schedule qualify"); */
               sip_poke_peer(peer, 0);
       }
{noformat}

If I comment this line of code, rtautoclear works ok. sip show objects shows correct counters and destructor function for these peers (objects) is called.

But I don't know exactly the functionality of it. Although this works.


By: Matt Jordan (mjordan) 2014-03-10 09:30:09.063-0500

That is probably a bad idea.

The bumping of the peer's ref count is done to ensure that the peer exists during the poke. By removing the ref bump, you're potentially creating a situation where a poke gets scheduled on a peer that may get destroyed before the scheduler fires.

While there may be a ref leak here - something needs to clear the ref bump at some point - simply removing the reference bump is most likely not the correct answer.

By: ibercom (ibercom) 2014-03-10 12:45:15.379-0500

You're right.
I haven't been able to analyze correctly /tmp/refs.
The comment (schedule qualify) has me confused.

{noformat}
==============
Problem: net Refcount not zero for object 1b12bb8
Object 1b12bb8 history:
  0x1b12bb8 =1   chan_sip.c:30504:build_peer (allocate a peer struct)
  0x1b12bb8 +1   chan_sip.c:31137:build_peer (schedule qualify) [@1]
  0x1b12bb8 +1   chan_sip.c:29534:sip_poke_peer (setting the relatedpeer field in the dialog) [@2]
  0x1b12bb8 +1   chan_sip.c:29550:sip_poke_peer (adding poke peer ref) [@3]
  0x1b12bb8 +1   chan_sip.c:5610:realtime_peer (add registration ref) [@4]
  0x1b12bb8 +1   chan_sip.c:5612:realtime_peer (link peer into peers table) [@5]
  0x1b12bb8 +1   chan_sip.c:5614:realtime_peer (link peer into peers_by_ip table) [@6]
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]
  0x1b12bb8 +1   chan_sip.c:5661:sip_find_peer_full (ao2_find in peers table) [@6]
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]
  0x1b12bb8 +1   chan_sip.c:5661:sip_find_peer_full (ao2_find in peers table) [@6]
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]
  0x1b12bb8 -1   chan_sip.c:23543:handle_response_peerpoke (removing poke peer ref) [@6]
  0x1b12bb8 +1   chan_sip.c:23543:handle_response_peerpoke (adding poke peer ref) [@5]
  0x1b12bb8 -1   chan_sip.c:6511:__sip_destroy (unsetting a dialog relatedpeer field in sip_destroy) [@6]
  0x1b12bb8 -1   chan_sip.c:3352:unlink_peer_from_tables (ao2_unlink of peer from peers table) [@5]
  0x1b12bb8 -1   chan_sip.c:3354:unlink_peer_from_tables (ao2_unlink of peer from peers_by_ip table) [@4]
  0x1b12bb8 -1   chan_sip.c:15711:expire_register (removing peer ref for expire_register) [@3]
  0x1b12bb8 -1   chan_sip.c:15726:sip_poke_peer_s (removing poke peer ref) [@2]
==============
{noformat}

But here, who is responsible for this ref leak ?


By: ibercom (ibercom) 2014-03-17 13:26:24.715-0500

Asterisk is working fine for a week without problems. It has processed 16284 calls and it has between 180 and 200 peers.

The patch works, more or less. I don't see memory leaks in chan_sip.

Matt: In chan_sip code, the sip_poke_peer() function is called four times and all of them are different. Which is the correct form ?

I have reached my limit in chan_sip.c

By: ibercom (ibercom) 2014-04-25 13:24:10.661-0500

Can anyone look into this bug ?

The patch has been working for a month without problems. I don't see memory leaks in chan_sip.

I have attached another small patch which simplifies the source code. It deletes a function (unlink_peer_from_tables) which is used only once.

By: Matt Jordan (mjordan) 2014-06-11 20:02:51.825-0500

From the issue description and your comments, I'm not sure I understand the cause of the memory leak, so analyzing the patches isn't going to be terribly easy.

Looking at your refs, I attempted to pair them up:

{noformat}
  0x1b12bb8 +1   chan_sip.c:31137:build_peer (schedule qualify) [@1]


  0x1b12bb8 =1   chan_sip.c:30504:build_peer (allocate a peer struct)
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]

  0x1b12bb8 +1   chan_sip.c:5661:sip_find_peer_full (ao2_find in peers table) [@6]
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]

  0x1b12bb8 +1   chan_sip.c:5661:sip_find_peer_full (ao2_find in peers table) [@6]
  0x1b12bb8 -1   chan_sip.c:22097:function_sippeer (sip_unref_peer from function_sippeer, just before return) [@7]

  0x1b12bb8 +1   chan_sip.c:5612:realtime_peer (link peer into peers table) [@5]
  0x1b12bb8 -1   chan_sip.c:3352:unlink_peer_from_tables (ao2_unlink of peer from peers table) [@5]

  0x1b12bb8 +1   chan_sip.c:5614:realtime_peer (link peer into peers_by_ip table) [@6]
  0x1b12bb8 -1   chan_sip.c:3354:unlink_peer_from_tables (ao2_unlink of peer from peers_by_ip table) [@4]

  0x1b12bb8 +1   chan_sip.c:29550:sip_poke_peer (adding poke peer ref) [@3]
  0x1b12bb8 -1   chan_sip.c:23543:handle_response_peerpoke (removing poke peer ref) [@6]

  0x1b12bb8 +1   chan_sip.c:29534:sip_poke_peer (setting the relatedpeer field in the dialog) [@2]
  0x1b12bb8 -1   chan_sip.c:6511:__sip_destroy (unsetting a dialog relatedpeer field in sip_destroy) [@6]

  0x1b12bb8 +1   chan_sip.c:5610:realtime_peer (add registration ref) [@4]
  0x1b12bb8 -1   chan_sip.c:15711:expire_register (removing peer ref for expire_register) [@3]

  0x1b12bb8 +1   chan_sip.c:23543:handle_response_peerpoke (adding poke peer ref) [@5]
  0x1b12bb8 -1   chan_sip.c:15726:sip_poke_peer_s (removing poke peer ref) [@2]
{noformat}

The "odd man out" looks to be this ref:
{code}
/* Startup regular pokes */
if (!devstate_only && enablepoke) {
sip_ref_peer(peer, "schedule qualify");
sip_poke_peer(peer, 0);
}
{code}

Note that this only happens if {{realtime}} is enabled:

{code}
if (realtime) {
int enablepoke = 1;

if (!sip_cfg.ignore_regexpire && peer->host_dynamic) {
time_t nowtime = time(NULL);

if ((nowtime - regseconds) > 0) {
destroy_association(peer);
memset(&peer->addr, 0, sizeof(peer->addr));
peer->lastms = -1;
enablepoke = 0;
ast_debug(1, "Bah, we're expired (%d/%d/%d)!\n", (int)(nowtime - regseconds), (int)regseconds, (int)nowtime);
}
}

/* Startup regular pokes */
if (!devstate_only && enablepoke) {
sip_ref_peer(peer, "schedule qualify");
sip_poke_peer(peer, 0);
}
}
{code}

It doesn't appear as if your patch deals with this particular ref.

You'd like some more input on your proposed patch, you may want to post it to review-board for further peer review. That being said, ref leaks are tricky, and removing a ref that is not contributing to the leak may result in a crash - so you may want to dig in a bit further to determine what is causing the issue.

By: ibercom (ibercom) 2014-06-12 13:15:08.474-0500

The first one, my English is bad.

My patch is for asterisk 11.7 and it has worked for asterisk 11.9 and 11.10 now.
I haven't had problems and no crash. At present, my system has a big load:

asterisk*CLI> core show uptime
System uptime: 1 week, 5 days, 1 hour, 21 minutes, 58 seconds
Last reload: 1 week, 3 days, 5 hours, 35 minutes, 59 seconds

asterisk1*CLI> core show calls
33 active calls
165771 calls processed

Now "sip show peers" and "sip show objects" are synchronized:

2265 sip peers [Monitored: 2265 online, 0 offline Unmonitored: 0 online, 0 offline]
= Peer objects: 2265 static, 0 realtime, 1 autocreate =

The patch is very simple but it works.


By: ibercom (ibercom) 2014-06-12 13:30:02.737-0500

Matt:
The first line is the problem. It isn't paired:

0x1b12bb8 +1   chan_sip.c:31137:build_peer (schedule qualify) [@1]

Yes, realtime is enabled and the memory leak is at peers autoclear and it isn't at registered peers. I use 2 asterisk systems with the same DB. You can call a registered peer on another server.

Inside the chan_sip.c, the sip_poke_peer() function is called four times and it never needs previously calls to sip_ref_peer().

Is it clearer now ?

By: dominic paquette (dpaquette) 2014-08-21 09:10:18.069-0500

Hi, we had a very similar issue, we have 5 production servers running on 1.8.25,  and 1 production server running on 11.11.0 (plus 3 non-production on 11.11.0 aswell). We use SIP realtime aswell and have memory leaks on the asterisk 11 box and none on the 1.8. The Asterisk11 box has to be restarted nightly. I have applied the above patches to the non-production asterisk 11 boxes, and now to the production server (11.11.0) and although memory leak has perhaps slowed down, I still need to restart asterisk nightly.

Ibercom: did this above patches solve all your memory leak issues?

Thanks

By: ibercom (ibercom) 2014-08-26 12:43:29.233-0500

dominic: You only need comment out this line

{noformat}
sip_ref_peer(peer, "schedule qualify");
{noformat}

My memory leaks have decreased considerably. Although I still have memory leaks but I don't know where. Now, my server can run for several weeks (3 or 4) without reboot.

I need help from developers to solve this problem correctly.

By: dominic paquette (dpaquette) 2014-08-27 08:28:52.595-0500

Thanks Ibercom. I did try this fix, and unfortunately, for me I still have considerable memory leaks. Yesterday, I deployed asterisk 11.12.0 (with the commented out line above) on my test servers. I will see the results in a few days, but so far, looks like there is still a leak.





By: ibercom (ibercom) 2014-09-09 03:27:43.135-0500

dominic: About your other leaks you can look at this bug [ASTERISK-24307] too. You can research about this. If you find anything, comment it.


By: dominic paquette (dpaquette) 2014-09-09 07:11:53.285-0500

sure will ibercom thx

By: dominic paquette (dpaquette) 2014-09-09 07:16:53.469-0500

ibercom, I read through the bug description although I probably have that leak aswell, we probably dident notice it since we don't do that many reloads. thanks though

By: ibercom (ibercom) 2014-09-09 07:34:18.134-0500

dominic: I think that this problem [ASTERISK-24307] also affects to realtime peers, because the build_peer function is used a lot of in this case.


By: dominic paquette (dpaquette) 2014-09-22 08:12:41.840-0500

Ibercom: I found my main leak last friday and ran it in prod through the week end, looks like my main memory leak was comming from one of our custom patches. I just ran the fixed version through valgrind, and, Ièm probably seeing the same leak you have (before I had to restart my server daily, and now, looking at my mem usage graphs I probably will have to restart every 2 weeks or so). I see this in valgrind:
==18320== 20,628,228 (14,412,096 direct, 6,216,132 indirect) bytes in 786 blocks are definitely lost in loss record 2,159 of 2,159
==18320==    at 0x4C2AAE8: calloc (vg_replace_malloc.c:618)
==18320==    by 0x5A3A66: _ast_calloc (utils.h:514)
==18320==    by 0x44EA71: internal_ao2_alloc (astobj2.c:565)
==18320==    by 0x44ECC3: __ao2_alloc (astobj2.c:641)
==18320==    by 0xC09E81C: build_peer (chan_sip.c:30673)
==18320==    by 0xC017A1D: realtime_peer (chan_sip.c:5434)
==18320==    by 0xC018058: sip_find_peer_full (chan_sip.c:5529)
==18320==    by 0xC01814F: sip_find_peer (chan_sip.c:5568)
==18320==    by 0xC05290B: register_verify (chan_sip.c:16880)
==18320==    by 0xC090423: handle_request_register (chan_sip.c:28128)
==18320==    by 0xC091C62: handle_incoming (chan_sip.c:28436)
==18320==    by 0xC09262F: handle_request_do (chan_sip.c:28604)

I ran a sipp script that does a bunch of registrations (also does subscribes, which was where my custom patch was leaking). I can reproduce at will, on an asterisk running in valgrind. I'll keep you posted if I find anything.


By: Rusty Newton (rnewton) 2014-09-24 15:04:21.285-0500

[~ibercom] , as [~mjordan] said:

{quote}
You'd like some more input on your proposed patch, you may want to post it to review-board for further peer review
{quote}

Make sure to follow the [Code Review|https://wiki.asterisk.org/wiki/display/AST/Code+Review] process and get the patch on Reviewboard so others can help get this patch worked out.

You might note on the review that the patch has been tested by the reporter on the issue I just linked - ASTERISK-23999