Summary: ASTERISK-28523: Asterisk 16.5.0 Memory leak
Reporter: Cyril Ramière (Cyril.r)
Labels: pjsip webrtc
Date Opened: 2019-09-04 09:55:09
Date Closed: 2019-09-24 08:03:08
Priority: Major
Regression?:
Status: Closed/Complete
Components: General
Versions: 16.5.0
Frequency of Occurrence:
Related Issues: is a clone of ASTERISK-28541 Asterisk 16.5.0 Memory leak
Environment: Ubuntu 16.04.6 LTS / AWS C5.* instance
Attachments:
( 0) ast_ram_24h.PNG
( 1) ast_ram_up.PNG
( 2) ast_sorcery_cache_endpoints_2000vs500.PNG
( 3) config.zip
( 4) endpoints.1000.sql
( 5) endpoints.2000.sql
( 6) extconfig.conf
( 7) extensions.conf
( 8) invite_recv.xml
( 9) invite.csv
(10) invite.xml
(11) leak_caching.txt
(12) leak_without_caching.txt
(13) memory_show_allocations_24h.zip
(14) memory_show_allocations.zip
(15) memory_show_summary_24h.txt
(16) memory_show_summary.txt
(17) modules.conf
(18) pjsip.conf
(19) res_odbc.conf
(20) sorcery.conf
Description:
Hello everyone,

I noticed that our Asterisk servers are consuming large amounts of RAM (it can exceed 1.6 GB after some days or weeks, depending on server load).

We are using Asterisk in realtime mode for all of our calls with MySQL (res_mysql).
Everything is handled through the ARI interface (dialplan passes to Stasis) connected to our application.

I attached the full configuration of our Asterisk server; I only replaced sensitive information so as not to expose our passwords/endpoints.

I'm currently running an instance built with MALLOC_DEBUG and without optimizations. It has been a couple of hours since I started my test and it is using almost 300 MB so far (it will continue to grow).

The results of the memory show commands are attached.

Best regards.
Comments:
By: Asterisk Team (asteriskteam) 2019-09-04 09:55:10.457-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Cyril Ramière (Cyril.r) 2019-09-04 09:56:07.643-0500

config.zip : full configuration
memory show : result of commands.

By: Cyril Ramière (Cyril.r) 2019-09-04 09:57:24.738-0500

Ram creeping up.

By: Cyril Ramière (Cyril.r) 2019-09-05 04:16:05.730-0500

Hi,

After 24 hours, RAM usage is at 550M and continues to rise.

I attached the results of the memory show commands after 24 hours of running.
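
For anyone who wants to track this kind of growth numerically rather than from graphs, here is a minimal sketch of reading a process's resident memory on Linux. It assumes the standard `VmRSS:` line in `/proc/<pid>/status`; the function names and the sample text are made up for illustration and are not part of Asterisk or of the reporter's setup.

```python
import re
from pathlib import Path


def parse_vmrss_kb(status_text: str) -> int:
    """Return the VmRSS value (in kB) found in the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process, in kB (Linux only)."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())


# Illustrative sample only; the values are invented, not taken from the report.
SAMPLE_STATUS = "Name:\tasterisk\nVmPeak:\t  900000 kB\nVmRSS:\t  563200 kB\nThreads:\t60\n"
sample_rss_kb = parse_vmrss_kb(SAMPLE_STATUS)
```

Calling `read_rss_kb()` with the Asterisk PID at a fixed interval and logging the result would give a series comparable to the attached RAM graphs.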

By: Kevin Harwell (kharwell) 2019-09-09 17:37:20.888-0500

Has anything changed lately? For example, did you upgrade or change the configuration, and then notice the memory growing?

What happens if you turn off caching in sorcery? Does the memory still grow?

By: Cyril Ramière (Cyril.r) 2019-09-10 08:29:41.059-0500

Hi Kevin,

No recent configuration changes. The only thing that changed is that we started to rely more on WebRTC, with dozens of active contacts; before that, it was too lightly used for us to notice anything.

This morning I disabled the sorcery cache by commenting out the cache lines in the config file and re-ran our test.

I can confirm that RAM is no longer going up.

By: Kevin Harwell (kharwell) 2019-09-10 16:28:40.980-0500

What does a typical use case look like when running? For instance, do you have any long running sessions? Active sessions would continue to hold a reference to the old endpoint object until they ended.
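
The point about active sessions holding a reference to the old endpoint object can be illustrated with a small analogy. This is only a sketch using Python's own reference counting; the `Endpoint` and `Session` classes and the `cache` dict are hypothetical stand-ins and do not reflect how Asterisk's objects or the sorcery cache are actually implemented.

```python
import gc
import weakref


class Endpoint:
    """Stand-in for a cached configuration object (hypothetical)."""

    def __init__(self, name: str, revision: int) -> None:
        self.name = name
        self.revision = revision


class Session:
    """Stand-in for a long-running call that keeps a reference to its endpoint."""

    def __init__(self, endpoint: Endpoint) -> None:
        self.endpoint = endpoint


cache = {"agent1": Endpoint("agent1", revision=1)}

# A call starts and takes its own reference to the cached object.
session = Session(cache["agent1"])
old_endpoint = weakref.ref(cache["agent1"])

# The cache entry goes stale and is replaced by a freshly loaded object.
cache["agent1"] = Endpoint("agent1", revision=2)

# The old object has left the cache but is still referenced by the session.
still_held = old_endpoint() is not None

# Once the session ends, the last reference is dropped and the old object is freed.
del session
gc.collect()
freed_after_hangup = old_endpoint() is None
```

In this picture, memory that is only pinned by long sessions is released when those sessions end, whereas a real leak keeps growing even after all calls have stopped.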

By: Kevin Harwell (kharwell) 2019-09-10 16:31:55.247-0500

Sounds like we might have narrowed it down a bit to the sorcery caching. What happens if you only disable caching for one of the types? For example, disable it on endpoints but not on aors, auths, etc.

By: Cyril Ramière (Cyril.r) 2019-09-11 02:19:06.467-0500

Hi,

I will re-enable caching step by step, watch the memory usage, and report back.

Our use case: we receive phone calls on SIP trunks (pjsip driver), and agents on WebRTC answer those calls.
Active WebRTC sessions can last from a couple of hours to the whole day.
I already ran a test in which WebRTC agents connect and disconnect without taking calls, and memory barely moves, so the growth seems to happen only when they handle calls.



By: Cyril Ramière (Cyril.r) 2019-09-11 08:11:13.408-0500

Re-enabling `endpoint/cache` in sorcery.conf seems to make the problem happen again: RAM slowly went up from 120M to 230M in 5 hours and is still growing.

I have not tested with other lines yet (it takes time).
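
As a rough worked example of the numbers above, the growth rate can be computed as follows. This is a sketch that assumes linear growth, which the thread does not establish; the function name is invented for illustration.

```python
def growth_rate_per_hour(start_mb: float, end_mb: float, hours: float) -> float:
    """Average memory growth in MB per hour over an observation window."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_mb - start_mb) / hours


# Figures reported above: 120M to 230M in 5 hours.
rate = growth_rate_per_hour(120, 230, 5)

# Projection over a full day, ASSUMING the growth stays linear.
projected_per_day = rate * 24
```

With the reported figures, `rate` comes out to 22 MB per hour, or 528 MB per day under the linear assumption.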

By: Kevin Harwell (kharwell) 2019-09-11 09:59:27.974-0500

Roughly how many endpoints do you have? Is your configuration fairly dynamic, with new endpoints being added or updated regularly? Are you using ARI push configuration, for instance?

Are any reloads occurring regularly?

Another thing you can try is reducing maximum_objects in the cache, for instance to below 500. If you do, does memory continue to grow in the same way and by the same amount?

By: Cyril Ramière (Cyril.r) 2019-09-11 10:11:39.903-0500

We have 123 endpoints on the test server (they appear in pjsip show endpoints).

The configuration is mostly static: our SIP trunks do not change at all, and our WebRTC agents change only slightly. Some agents can be created or deleted during the day in production, but in our test setup nothing changes.

No ARI push configuration; everything is in the database (see extconfig.conf; we hold extensions, moh, endpoints, auths and aors in the database).

No reloads are needed, since everything is handled by our Stasis application or realtime.

I will test reducing maximum_objects in the cache tomorrow so I can compare two days with different settings.

By: Kevin Harwell (kharwell) 2019-09-11 10:29:33.489-0500

Oh also you should be able to use the following CLI command to monitor the objects in the cache:
{noformat}
*CLI> sorcery memory cache show <name>
{noformat}
By taking "snapshots" over time you might be able to see if objects are not expiring, not being removed, etc. when they should be.

By: Cyril Ramière (Cyril.r) 2019-09-11 10:55:48.453-0500

Oh, I tried this command for 5 minutes and the output did not change.

It seems that there are not many objects in the cache:

{noformat}
sorcery memory cache show res_pjsip/endpoint
Sorcery memory cache: res_pjsip/endpoint
Number of objects within cache: 23
Maximum allowed objects: 2000
Number of seconds before object expires: 3600
Number of seconds before object becomes stale: 60
Expire all objects on reload: On
{noformat}
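
To compare such snapshots over time, output of the form shown above can be parsed into a dictionary and diffed. The following is a minimal sketch that assumes each line has the `label: value` shape seen in this output; the function names are made up for illustration and the script does not talk to Asterisk itself.

```python
def parse_cache_show(output: str) -> dict:
    """Parse 'label: value' lines into a dict; purely numeric values become int."""
    fields = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon, such as the echoed command
        value = value.strip()
        fields[label.strip()] = int(value) if value.isdigit() else value
    return fields


def changed_fields(before: dict, after: dict) -> dict:
    """Return {label: (old, new)} for every label whose value differs."""
    return {k: (before.get(k), v) for k, v in after.items() if before.get(k) != v}


SAMPLE = """sorcery memory cache show res_pjsip/endpoint
Sorcery memory cache: res_pjsip/endpoint
Number of objects within cache: 23
Maximum allowed objects: 2000
Number of seconds before object expires: 3600
Number of seconds before object becomes stale: 60
Expire all objects on reload: On
"""

snapshot = parse_cache_show(SAMPLE)
```

Feeding two snapshots taken some minutes apart to `changed_fields()` returns an empty dict when nothing moved, which matches the observation above.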

By: Cyril Ramière (Cyril.r) 2019-09-12 02:31:26.209-0500

Setting sorcery endpoint cache maximum_objects from 2000 to 500 doesn't make a difference.

First run was with 2000 then second run was with 500 calls.

By: Kevin Harwell (kharwell) 2019-09-12 15:21:55.532-0500

I've set up a testbed here to try to replicate this. Can you attach your log file (full or messages)? If it has sensitive data, can you search it for something like the following: "Excessive refcount"

Or did you notice any other error or warning messages?

By: Michael Maier (micha) 2019-09-16 11:37:48.260-0500

Some additional information:
I have seen a massive memory leak since Asterisk >= 16.5. It is enough to run 4 idle trunks and 2 extensions, each using pjsip; the trunks use SIPS. They only do re-INVITEs or OPTIONS, and memory usage rises by 792 kBytes per hour.

For me, the memory leak is fixed (= same behavior as with <= 16.4) by using pjsip 4.8 (instead of 4.9, which came with 16.5.0). From my point of view, there is a problem with pjsip 4.9.

Could you please try running >= 16.5 with pjsip 4.8 and check whether the memory leak disappears?

By: Kevin Harwell (kharwell) 2019-09-17 18:00:34.391-0500

I believe I was able to duplicate this issue. I ran two sets of tests: one with caching enabled and one without. I recorded CPU and memory usage, as well as the MALLOC_DEBUG memory summary information, at different intervals. See the following:

[^leak_caching.txt]
[^leak_without_caching.txt]

Interestingly, both show memory climbing, and not being fully released once calls were stopped. However, when sorcery caching is enabled memory seemed to climb faster. So there might be two separate leaks going on here.

It looks like there might be something holding onto format_cache objects.



By: Kevin Harwell (kharwell) 2019-09-17 18:13:57.759-0500

I disabled most modules from loading. See [^modules.conf]. The other _conf_ files were the base configs used for testing. I ran the test once with sorcery caching enabled, and then a second time after disabling caching (i.e. commented out the caching line):

[^extconfig.conf]
[^extensions.conf]
[^modules.conf]
[^pjsip.conf]
[^res_odbc.conf] (use your username/password here)
[^sorcery.conf]

Using a postgresql database I created the necessary tables using the Asterisk alembic scripts. I then inserted 2000 endpoints and associated data into the appropriate tables. I used the following files to easily insert the records:

[^endpoints.1000.sql]
[^endpoints.2000.sql]

I used something like the following command to do so:
{noformat}
$ psql -U your_db_user < endpoints.1000.sql
$ psql -U your_db_user < endpoints.2000.sql
{noformat}

Using the above configs, I started Asterisk, and then executed the attached SIPp scenarios to make calls through Asterisk. Start [^invite_recv.xml] first:
{noformat}
$ sipp 127.0.0.1 -p 5062 -aa -sf invite_recv.xml
{noformat}
Then I started [^invite.xml], which initiates the calls. I let the scenarios run for around 10 minutes before stopping:
{noformat}
$ sipp 127.0.0.1 -p 5061 -sf invite.xml -inf invite.csv -aa -r 30
{noformat}

By: Friendly Automation (friendly-automation) 2019-09-24 08:03:10.200-0500

Change 12924 merged by Friendly Automation:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12924|https://gerrit.asterisk.org/c/asterisk/+/12924]

By: Friendly Automation (friendly-automation) 2019-09-24 08:11:32.156-0500

Change 12919 merged by Friendly Automation:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12919|https://gerrit.asterisk.org/c/asterisk/+/12919]

By: Friendly Automation (friendly-automation) 2019-09-24 08:16:59.648-0500

Change 12920 merged by Friendly Automation:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12920|https://gerrit.asterisk.org/c/asterisk/+/12920]

By: Friendly Automation (friendly-automation) 2019-09-24 08:47:36.739-0500

Change 12918 merged by Friendly Automation:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12918|https://gerrit.asterisk.org/c/asterisk/+/12918]

By: Friendly Automation (friendly-automation) 2019-09-24 08:48:00.237-0500

Change 12923 merged by George Joseph:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12923|https://gerrit.asterisk.org/c/asterisk/+/12923]

By: Friendly Automation (friendly-automation) 2019-09-24 08:48:51.561-0500

Change 12926 merged by George Joseph:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12926|https://gerrit.asterisk.org/c/asterisk/+/12926]

By: Friendly Automation (friendly-automation) 2019-09-24 10:29:03.923-0500

Change 12921 merged by Kevin Harwell:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12921|https://gerrit.asterisk.org/c/asterisk/+/12921]

By: Friendly Automation (friendly-automation) 2019-09-24 10:29:32.084-0500

Change 12925 merged by Kevin Harwell:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12925|https://gerrit.asterisk.org/c/asterisk/+/12925]

By: Friendly Automation (friendly-automation) 2019-09-24 10:30:32.011-0500

Change 12922 merged by Kevin Harwell:
res_sorcery_memory_cache: stale item update leak

[https://gerrit.asterisk.org/c/asterisk/+/12922|https://gerrit.asterisk.org/c/asterisk/+/12922]