Summary: ASTERISK-16710: memory leak in chan_sip.c
Reporter: Eugene M. Zheganin (drookie)
Labels:
Date Opened: 2010-09-22 00:06:42
Date Closed: 2012-10-17 09:00:50
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 2010-01-27_growth.pdf
( 1) bt-coredump.txt
( 2) invitetest.xml
( 3) leaking.txt
( 4) leaking-additional-info-no-T.38.txt
( 5) leaking-allocations.zip
( 6) leaking-allocations-additional-no-T.38.zip
( 7) subscribetest.inf.txt
Description: chan_sip.c is still constantly leaking memory.
Two files are provided below. The first, leaking.txt, shows the output of the 'memory show summary' and 'core show calls uptime' commands, in which the allocation count and size for chan_sip.c grow over time. The second, leaking-allocations.txt, shows the final output of 'memory show allocations'.

The 'summary' file starts at 293358 bytes and 300 allocations and ends at 55402745 bytes and 105513 allocations. The RSS of the asterisk process is 187M at the end of the observation and continues to grow. In my experience, memory usage will grow until all of the server's memory is exhausted, and asterisk will be killed/restarted by the watchdog, which in my case is snmpd.

It seems that most allocations are made around lines 23905 and 23908 in build_peer() in chan_sip.c.

The second file is provided in a zip archive due to its extremely large size.
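
To put the growth figures from the description in perspective, the following minimal sketch subtracts the first summary sample from the last one and derives an average size per extra allocation. The function and constant names are illustrative only; the numbers are the ones quoted in the description.

```python
# chan_sip.c figures quoted in the description (first and last summary samples).
START_BYTES, START_ALLOCS = 293358, 300
END_BYTES, END_ALLOCS = 55402745, 105513


def growth(start_bytes, start_allocs, end_bytes, end_allocs):
    """Return (extra bytes, extra allocations, average bytes per extra allocation)."""
    extra_bytes = end_bytes - start_bytes
    extra_allocs = end_allocs - start_allocs
    return extra_bytes, extra_allocs, extra_bytes / extra_allocs


extra_bytes, extra_allocs, avg = growth(START_BYTES, START_ALLOCS, END_BYTES, END_ALLOCS)
print(f"{extra_bytes} bytes in {extra_allocs} extra allocations, about {avg:.0f} bytes each")
```

This works out to roughly 55 MB spread over about 105 thousand additional allocations, which points at many small objects accumulating rather than a few large buffers.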


****** ADDITIONAL INFORMATION ******

Asterisk is running on FreeBSD 8.0-RELEASE-p2. This is a production PBX with more than 2K calls per day.
Comments:
By: Stefan Schmidt (schmidts) 2010-09-22 01:18:56

Could you please add the output of 'sip show objects' (at the beginning and end), 'sip show channels' and also 'sip show sched'?

Do you use T.38? Maybe you could deactivate the t38pt_udptl option in sip.conf and see if it changes anything.

By: Eugene M. Zheganin (drookie) 2010-09-22 01:21:24

Yes, I do use T38.

Well... it's not that easy to just decide and kill faxes on a production box... I'll think about the way I'll do it.

By: Eugene M. Zheganin (drookie) 2010-09-22 01:22:21

Additional info on the way, will restart * in the evening, now it's like the middle of the day.

By: Stefan Schmidt (schmidts) 2010-09-22 01:34:14

The problem is that on every call with udptl activated, a udptl structure is defined and allocated. There was an issue about this which I can't find right now (I saw it here in the last few days), but this allocation should be freed after the call and its dialogs are finished.

By: Leif Madsen (lmadsen) 2010-09-23 09:56:35

I think schmidts is referring to issue ASTERISK-16698 which says that the udptl structure uses more memory than is necessary, and is used on every SIP channel regardless whether T.38 is being used or not. This may be part of the issue, so I've marked it as related to this issue.

By: Eugene M. Zheganin (drookie) 2010-09-24 07:36:01

Here are new dumps.
This time asterisk was restarted, udptl was switched off, and asterisk was '-rx reload'ed.

Looks to me like it's still leaking.

By: Stefan Schmidt (schmidts) 2010-09-24 09:00:51

What are you using realtime for? I see that the realtime count shown in "sip show objects" keeps climbing (from 300 at first to 40000!).

Do you have allowguest or something similar enabled, together with realtime peers?

I looked at my test system running trunk, and I have
125982725 bytes in 60135 allocations in file 'chan_sip.c', but with 20,000 peers.
With only 20 peers it looks like this:
276085 bytes in 99 allocations in file 'chan_sip.c'
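
For comparing samples like the two above, here is a small sketch that parses a summary line and divides by the peer count. The regular expression is an assumption modeled only on the line shape quoted in this comment, and the helper names are hypothetical.

```python
import re

# Assumed line shape, taken from the lines quoted in this comment:
#   "<bytes> bytes in <count> allocations in file '<name>'"
SUMMARY_RE = re.compile(r"(\d+) bytes in (\d+) allocations in file '([^']+)'")


def parse_summary_line(line):
    """Return a dict with file, bytes and allocations, or None if the line does not match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return {"file": m.group(3), "bytes": int(m.group(1)), "allocations": int(m.group(2))}


def per_peer(sample, peers):
    """Average bytes and allocations per configured peer for one sample."""
    return sample["bytes"] / peers, sample["allocations"] / peers


big = parse_summary_line("125982725 bytes in 60135 allocations in file 'chan_sip.c'")
small = parse_summary_line("276085 bytes in 99 allocations in file 'chan_sip.c'")
for sample, peers in ((big, 20000), (small, 20)):
    avg_bytes, avg_allocs = per_peer(sample, peers)
    print(f"{peers} peers: {avg_bytes:.0f} bytes and {avg_allocs:.1f} allocations per peer")
```

The 20-peer sample is dominated by fixed overhead, so only the 20,000-peer figure gives a usable per-peer estimate, and it comes from a trunk build rather than the reporter's version.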



By: Eugene M. Zheganin (drookie) 2010-09-24 13:24:41

I'm using realtime cdr, extensions and sip peers:

asterisk=> select count(*) from sip;
count
-------
  191
(1 row)

asterisk=> select count(*) from extensions;
count
-------
  191
(1 row)

'allowguest' is left at its default and thus is 'yes'; however, the default context includes only local peers (so there's not much to call). Plus, I don't see that many calls from anonymous peers; to be exact, I don't see them at all. Spontaneous SIP calls from scanners usually fail at authentication, and I get fewer than a dozen attempts each day.

I turned 'allowguest' to 'no', and will report tomorrow whether memory consumption keeps increasing. Right now it seems like it does, but I may be wrong.

I turned T.38 back on, by the way; it seems to have nothing to do with this issue.



By: Stefan Schmidt (schmidts) 2010-09-24 14:26:01

I don't think it has to do with calls, more with peers, because the realtime count of the sip peer objects increases that much. It should be stable, or at least not increase by such a large amount.

By: Leif Madsen (lmadsen) 2010-10-04 12:15:51

Or maybe it has something to do with the patterns in the realtime extensions? Tilghman was mentioning something like this in another email thread in that because of the way the database works with realtime extensions, it can use a LOT of memory if you are matching on pattern matches.

If you use a static dialplan and not from realtime, does the memory usage change?

By: Eugene M. Zheganin (drookie) 2010-10-21 12:30:34

It's hard to get 3K calls on a test pbx, and it's REALLY hard to roll back all the realtime configuration to static files on a production one.

Any ideas on how to simplify the test case? I can't just throw away realtime, it actually contains all the data.



By: Eugene M. Zheganin (drookie) 2010-10-21 13:01:15

Follow-up: found a production box with 200 calls per day, but it's easy there to convert to static. Will post new data after some weeks.

By: Stefan Schmidt (schmidts) 2010-10-21 13:10:46

If you have a test system, I can give you a sipp test scenario to create calls.
Or, if you want, I can also use my test system to send SIP messages to yours (my system is only on a 100mbit switch, but this should be enough, I think ;)

By: Eugene M. Zheganin (drookie) 2010-10-22 13:35:05

Yeah, scenario would be nice. I suppose there has to be an easy way to simulate calls, but right now nothing comes to mind. I can build a test system easily.

By: Stefan Schmidt (schmidts) 2010-10-22 14:25:53

The easy way is called sipp.

Have a look at the attached scenario and inf file. This scenario sends an INVITE, waits 2 seconds after the 200 OK, and then sends a BYE. You can easily change this any way you want.

You only have to install sipp (http://sipp.sourceforge.net/), and after installing just start it with this command:
sipp -r 1 -m 1 -sf invitetest.xml -inf subscribetest.inf DESTINATIONIP
The parameter after -r is the call rate per second, and -m is the number of calls after which sipp stops.
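
For repeated test runs, the command above can be assembled from a script. This is only a sketch: the helper name is hypothetical, 192.0.2.10 is a documentation placeholder address, and the rate and call count in the example are arbitrary. Only the sipp options already shown in this comment are used.

```python
import shlex


def build_sipp_command(destination, rate=1, max_calls=1,
                       scenario="invitetest.xml", injection="subscribetest.inf"):
    """Build the argument list for the sipp invocation shown in this comment."""
    return ["sipp", "-r", str(rate), "-m", str(max_calls),
            "-sf", scenario, "-inf", injection, destination]


# Example values are placeholders, not taken from the issue.
cmd = build_sipp_command("192.0.2.10", rate=5, max_calls=2000)
print(shlex.join(cmd))

# To actually run it (needs sipp installed and both files in the working directory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if the scenario path ever contains spaces.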

By: Ilya (justmann) 2010-11-10 06:39:23.000-0600

This memory leak occurs only on the i386 arch. After migrating to amd64, asterisk works perfectly for me :)

By: Chris Young (cyoung92612) 2010-12-03 13:16:17.000-0600

I'm seeing the same problem on 1.8.0. I know where the problem is, but I'm not sure of the best way to fix it. It's a reference-count-related memory leak.

Every 5 minutes, my endpoints re-register and register_verify() is called.  There are a couple places where the reference count for the 'peer' object is incremented.  The problem is that the 'unref_peer' is done through AST_SCHED_xxx and peer->expire is -1, causing the unref_peer() to never be called.  The reference count never goes down to zero and the object is never destroyed.

This happens every time a re-register occurs and the memory allocation in build_peer() continues to grow.
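
The mechanism described above can be illustrated with a toy model. This is not Asterisk's code: the class and function names are hypothetical, and the real paths through register_verify() are more involved. It only shows how a release that is skipped when expire is -1 leaves one live object behind per re-registration.

```python
class Peer:
    """Toy stand-in for a reference-counted peer object (not Asterisk's struct)."""
    live = 0  # number of Peer objects that have not been destroyed

    def __init__(self, name):
        self.name = name
        self.refcount = 1   # reference held by the creator
        self.expire = -1    # -1 means "no expiry task scheduled"
        Peer.live += 1

    def ref(self):
        self.refcount += 1
        return self

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            Peer.live -= 1  # the object would be freed here


def cancel_expiry_and_unref(peer):
    """Release the reference owned by the scheduled expiry task, if one exists."""
    if peer.expire == -1:
        return              # nothing scheduled, so the unref is skipped
    peer.expire = -1
    peer.unref()


def handle_register(name):
    """One re-registration of an uncached realtime peer in this toy model."""
    peer = Peer(name)               # built fresh on every REGISTER
    peer.ref()                      # extra reference taken during registration
    cancel_expiry_and_unref(peer)   # expire is still -1, so nothing is released
    peer.unref()                    # the creator's reference is dropped
    return peer


for _ in range(1000):
    handle_register("alice")
print(Peer.live)  # one leaked object per re-registration
```

In this model every call leaves the count at one instead of zero, so the number of live objects grows linearly with the number of REGISTER requests.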

By: Chris Young (cyoung92612) 2010-12-04 20:57:13.000-0600

I've found that setting rtcachefriends=yes in sip.conf 'fixes' this problem on my system. The reason is that when caching is enabled, there isn't the repeated creation and failed destruction of the peer objects in memory every time a client re-registers. There is just a single object that is created, and maintained, in memory for each client that registers.

By: Andrew L. Davydov (aldtass) 2011-01-02 07:42:50.000-0600

Yes, rtcachefriends=yes fixes this problem, but then realtime for sip.conf works statically, like peers defined in sip.conf.
If you need dynamic peers, this method (rtcachefriends=yes) is not an option.

I'm still waiting for this bug to be fixed.



By: Eugene M. Zheganin (drookie) 2011-01-11 06:45:02.000-0600

I have two more things to say. Or even three. :)

1) This isn't related to architecture at all. I now have a couple of * instances running on FreeBSD/amd64, and they still leak.

2) Actually rtcachefriends=yes isn't a silver bullet. It helps, the * leak slows down a bit, but it still leaks.
These are the VSZ/RSS values for a lightly loaded * (taken between call counts of 5K and 7.5K) over 2 days (rtcaching is on):

403356 125268
403356 125276
403356 125308
403356 125340
403356 125324
403356 125308
403356 125292
403356 125340
403356 125528
403356 125528
403356 125596
407708 126892
407708 126892
407708 127076
407708 128732
407708 128788
407708 128776
409756 129880
409756 129864
411804 133236
413852 134300
413852 134312
413852 134388

Constantly leaks. And yeah, those are kilobytes.
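
As a rough reading of the listing above, this sketch compares the first and last samples. The names are illustrative, the values are in kilobytes as stated, and the 2-day period is the one given in the comment.

```python
# First and last (VSZ, RSS) samples from the listing above, in kilobytes.
FIRST = (403356, 125268)
LAST = (413852, 134388)
DAYS = 2  # observation period stated in the comment


def growth_kb(first, last):
    """Return (VSZ growth, RSS growth) in kilobytes between two samples."""
    return last[0] - first[0], last[1] - first[1]


vsz_kb, rss_kb = growth_kb(FIRST, LAST)
print(f"VSZ +{vsz_kb} KB, RSS +{rss_kb} KB over {DAYS} days "
      f"(about {rss_kb / DAYS / 1024:.1f} MB of RSS per day)")
```

That is roughly 9 MB of resident growth in two days with caching on, far slower than the uncached case but not flat.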

That said, I know plenty of guys running FreeBSD/* without leaks. I can't say what the difference is, but it's not obvious that it's realtime, because some of their reported installations use it.

By: Eugene M. Zheganin (drookie) 2011-01-11 07:27:43.000-0600

Finally (sorry for the continued bother and for not finding the time sooner), I found time to get sipp and play with the INVITE scenario. I ran 2K calls on the * above (the one in the VSZ/RSS listing) and it's definitely not that. Not leaking.

It's more and more likely that it's a REGISTER-related realtime leak. I'm running HA clusters on carp, and to handle node failure I set the re-registration interval to 3 minutes instead of the common default (seen on various SIP equipment) of one hour; maybe that's why my setups are affected so badly. I will keep playing with sipp and try to create a REGISTER scenario for a realtime peer, or at least something less simple than just the INVITE sequence.
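
The effect of the shorter interval is easy to quantify. The sketch below assumes every peer re-registers exactly once per interval, and it uses the 191 rows of the sip table reported earlier in this issue as the peer count; both are simplifications, and the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def registers_per_day(peers, interval_minutes):
    """REGISTER requests per day if every peer re-registers once per interval."""
    return peers * MINUTES_PER_DAY // interval_minutes


PEERS = 191  # row count of the sip table reported earlier in this issue
short = registers_per_day(PEERS, 3)    # 3-minute interval used on these clusters
hourly = registers_per_day(PEERS, 60)  # the common one-hour default
print(short, hourly, short // hourly)
```

Under these assumptions a 3-minute interval produces 20 times as many REGISTER requests per day as an hourly one, so any per-registration leak would grow 20 times faster.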



By: Andrew L. Davydov (aldtass) 2011-01-13 11:43:20.000-0600

Today I checked the behaviour of asterisk with rtcachefriends=yes on and off.
(I used realtime for sip peers, FreeBSD-8.1-STABLE, i386)

If rtcachefriends=yes, Asterisk 1.6.2.15 behaves like 1.6.0.26: after processing 16k calls in 5 days it grows to 80 MB.

I think that is good behaviour...

If rtcachefriends=no, Asterisk 1.6.2.15 grows to 500 MB in 4 days while processing only 1k calls.

By: Eugene M. Zheganin (drookie) 2011-01-13 23:03:19.000-0600

For my part, I did everything I promised and ran a couple of tests with sipp:
- sipp calling a static Answer()/Hangup extension (no leak here).
- sipp calling a non-registered realtime peer
- sipp calling a registered realtime peer
- sipp calling a registered realtime peer with another sipp instance answering

Although I was able to raise RSS from 50M to 320M, I couldn't figure out the exact sequence that recreates the memory leak, because on the way from 50M to 320M asterisk at times had an RSS above 320M and was then able to bring it back down. So I'm pretty much stuck here.

By: Shaheryar S. Sheikh (shaheryarkh) 2011-01-22 21:07:13.000-0600

I have a busy asterisk box with lots of realtime sip peers (3K+). This box is now crashing almost daily, I guess due to this issue, after processing roughly 10K calls.

I have attached bt-coredump.txt, which shows the backtrace from the core dump file generated by Asterisk (some information, such as the DNS name of the server, is censored for security reasons).

One strange thing I found, however, is that this server also has a large number (500+) of static sip peers with IP-based provisioning, which are there for emergency fail-over of another asterisk box. I see that the reference counts for these static peers are constantly increasing (25+ per peer), even though these peers cannot send or receive anything to or from this asterisk box since the other asterisk box is working fine. I wonder if this has anything to do with the bug.

On the other hand, dynamic sip peers seem to have very normal reference counts (1-2 per peer).

These figures aren't affected by whether or not I enable the rtcachefriends flag in sip.conf. SIP qualify is already disabled for all peers (static or dynamic).

I am running Asterisk 1.6.2.14 on CentOS 5.5 64-bit and obtaining the reference counts from the 'sip show objects' command. Let me know if I can help you guys further in solving this problem.

By: John Covert (jcovert) 2011-01-27 13:06:39.000-0600

We were seeing this on 1.6.1.6 and are still seeing it on 1.8.1.1.

We, too, are using ldap realtime peers.

See 2010-01-27_growth.pdf

By: Walter Doekes (wdoekes) 2011-03-07 01:53:17.000-0600

For those using uncached realtime peers, see ASTERISK-17510.

By: Walter Doekes (wdoekes) 2011-07-01 18:22:37.440-0500

Does anyone still have a problem with this with the latest versions?

By: Rusty Newton (rnewton) 2012-09-24 11:37:58.123-0500

Has anyone been able to verify whether these symptoms still occur in recent versions?

By: Matt Jordan (mjordan) 2012-10-17 09:00:30.810-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.  After testing with Asterisk 1.8, if you find this problem has not been resolved, please open a new issue against Asterisk 1.8.

If you are still experiencing this problem, please retest with a later version of Asterisk 1.8.  If this is still a problem, we can reopen this issue.  Thanks!