
Summary: ASTERISK-28521: pjsip: Memory Leak
Reporter: Mark (mpa)
Labels: patch pjsip
Date Opened: 2019-09-03 07:51:21
Date Closed: 2019-09-24 14:23:54
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_pjsip, Resources/res_pjsip_outbound_registration
Versions: 16.4.0, 16.4.1, 16.5.0
Frequency of Occurrence: Constant
Related Issues:
Environment: Asterisk 16.5.0, OpenBSD 6.6-beta
Attachments:( 0) 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch
( 1) 0030-revert_ssl.patch
( 2) 0031-revert_ssl.patch
( 3) pjsip.ast2.conf
( 4) pjsip.conf
( 5) pjsip.conf
( 6) summary_12h
( 7) summary_start
( 8) test_revert_ssl_remote_cert.patch
Description: After updating to OpenBSD 6.6-current about 4 weeks ago, I also
updated Asterisk from 16.4.0 to 16.4.1. Since then, Asterisk has been
using more and more memory. Right now I'm running Asterisk 16.5.0 on
OpenBSD 6.6-beta in a very small test environment and I still see the
same issue.

Weeks ago, Asterisk started with ~60MB and stayed around that value.
Now it starts with ~60MB and grows by about 80MB per 24h without any
calls being made.

After compiling with MALLOC_DEBUG and running "memory show
summary" every few hours, there's only one line that goes through
the roof:

* after start
5542784 bytes in        464 allocations in file ../src/pj/pool_policy_malloc.c

* 12h later
67805312 bytes in      25962 allocations in file ../src/pj/pool_policy_malloc.c

The same applies to "memory show allocations":

* after start
$ grep "pool_policy_malloc.c" allocations_start | wc -l
468

* 12h later
$ grep "pool_policy_malloc.c" allocations_12h | wc -l
25968

The only change that stops Asterisk from allocating more and more
memory is removing the "registration" and "endpoints" sections from
pjsip.conf. With that, Asterisk only listens on 5061/TLS and does
nothing.
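
For reference, the snapshots above were captured roughly like this from the shell (a sketch; the file names are just the ones used in this report, and the commands assume a MALLOC_DEBUG build):
{noformat}
# Sketch: capture CLI memory stats to files and compare them later.
asterisk -rx "memory show summary" > summary_start
asterisk -rx "memory show allocations" > allocations_start

# ...12 hours later...
asterisk -rx "memory show summary" > summary_12h
asterisk -rx "memory show allocations" > allocations_12h

# Count the pjproject pool allocations in each snapshot:
grep "pool_policy_malloc.c" allocations_start | wc -l
grep "pool_policy_malloc.c" allocations_12h | wc -l
{noformat}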
Comments:

By: Asterisk Team (asteriskteam) 2019-09-03 07:51:21.914-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Mark (mpa) 2019-09-03 08:01:10.642-0500

summary_start = output of "memory show summary" right after start
summary_12h = output of "memory show summary" 12h later

By: Joshua C. Colp (jcolp) 2019-09-03 08:41:02.019-0500

That would be the PJSIP memory pool, which can make things extremely difficult to track down.

Can you narrow it down further? What's an example configuration that exhibits it? Are you doing reloads often? Are there calls going on?

By: Mark (mpa) 2019-09-04 01:09:44.888-0500

The test configuration I set up about 9h ago:

* a very simple pjsip.conf with only the registration part and no endpoints (see attached file)
* no extensions.ael configured
* there were no calls made
* Asterisk has not been restarted since startup

/var/log/asterisk/mmlog shows
1567544069 - New session

* memory show summary    - after start
5015680 bytes in        258 allocations in file ../src/pj/pool_policy_malloc.c

* memory show summary    - 9h later
26144896 bytes in       9428 allocations in file ../src/pj/pool_policy_malloc.c
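
For illustration only (this is not the attached file), a registration-only pjsip.conf over TLS of the kind described could look roughly like this; the host names, credentials, and certificate paths are placeholders:
{noformat}
; Hypothetical sketch of a registration-only pjsip.conf using TLS
; (placeholder host names, credentials, and certificate paths).
[transport-tls]
type = transport
protocol = tls
bind = 0.0.0.0:5061
cert_file = /etc/asterisk/keys/asterisk.crt
priv_key_file = /etc/asterisk/keys/asterisk.key
method = tlsv1_2

[provider-auth]
type = auth
auth_type = userpass
username = example-user
password = example-pass

[provider-reg]
type = registration
transport = transport-tls
outbound_auth = provider-auth
server_uri = sips:sip.example.com
client_uri = sips:example-user@sip.example.com
retry_interval = 60
{noformat}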

By: Joshua C. Colp (jcolp) 2019-09-04 07:56:19.240-0500

Re-reading your initial description, you mentioned it was going from 16.4.0 to 16.4.1 that caused the problem? Is my understanding correct? There were only two changes between those versions (at least in what we ship), and neither touched such areas at all. One change was in res_pjsip_messaging and the other in chan_sip, which would be a strange thing to cause a regression in this area.

By: Mark (mpa) 2019-09-04 09:35:14.751-0500

Besides Asterisk, I've also updated dependent packages and the base OS. Right now I can reproduce the error with all Asterisk versions listed above (16.4.0, 16.4.1, 16.5.0).

By: Kevin Harwell (kharwell) 2019-09-09 17:06:33.746-0500

Are the registrations and endpoint qualifications succeeding? Any kind of errors or warnings in the log?

If you can, please try other changes to your configuration. For instance (a rough config sketch follows this list):

* Remove the acl section. Does the leak still occur?
* Try a non-TLS transport if you can.
* If possible, set qualify_frequency to '0' (turning it off) and see if the leak still occurs. If it doesn't, try turning it back on at a lower value to see if the leak occurs faster.
* If that doesn't affect things, then try removing just the outbound registrations, but leave the endpoints and see if you still notice a change.
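
A minimal sketch of the two configuration experiments above (the section names and contact URI are hypothetical):
{noformat}
; Sketch only: hypothetical section names and hosts.

; 1) Use a plain UDP transport instead of TLS
[transport-udp]
type = transport
protocol = udp
bind = 0.0.0.0:5060

; 2) Disable OPTIONS qualification on the remote AOR
[provider-aor]
type = aor
contact = sip:sip.example.com
qualify_frequency = 0
{noformat}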

By: Mark (mpa) 2019-09-12 07:36:33.395-0500

Even with increased debug/verbose logging, I only see normal communication between Asterisk and the SIP provider/endpoints (one registered number + one endpoint during tests, so the produced log is fairly small).

I've tried with the following changes to the configuration:
- delete acl section = same issue
- endpoints only  (no registration) = same issue
- registration only (no endpoints) = same issue
- qualify_frequency set to 0 = same issue

As I didn't see any increased memory usage with zero registrations/endpoints before, I also thought about changing parameters that are always involved in both registration and endpoint connection attempts.

To make it short: after switching from TLS to UDP for registration and endpoints, I no longer see any memory issues, and the "pool_policy_malloc.c" entries even decrease over time:

# start
$ grep "pool_policy_malloc.c" allocations_start | wc -l
289

# after 2h
$ grep "pool_policy_malloc.c" allocations_2h | wc -l
273

Further tests show:
- registration (tls), endpoints (udp) = same issue
- registration (udp), endpoints (tls) = same issue
- registration (udp), endpoints (udp) = no issue

UPDATE: 6h later...everything's fine. No noticeable change in memory consumption.

By: Kevin Harwell (kharwell) 2019-09-13 12:25:16.704-0500

{quote}
UPDATE: 6h later...everything's fine. No noticeable change in memory consumption.
{quote}
I assume you were still running with the udp transport?

Also, just to be clear, the only messages being transmitted are REGISTERs and OPTIONS, correct? What happens if you configure it to use the TLS transport, but set qualify_frequency=0 (essentially disabling OPTIONS)?

I've set up a test here between two Asterisk instances, with instance 1 registering (using your configuration) to instance 2. So far I have not seen memory really going up. I even made a bunch of outbound calls.

By: Mark (mpa) 2019-09-15 02:31:54.781-0500

{quote}I assume you were still running with the udp transport?{quote}

Yes

{quote}Also, just to be clear, the only messages being transmitted are REGISTERs and OPTIONS, correct?{quote}

Yes, normal keep-alive messages

{quote}What happens if you configure it to use the TLS transport, but set qualify_frequency=0 (essentially disabling OPTIONS)?{quote}

I've tried this before with no luck. Asterisk starts with approx. 60MB allocated and reaches 150MB in under a day.

{quote}I've set up a test here between two Asterisk instances, with instance 1 registering (using your configuration) to instance 2. So far I have not seen memory really going up. I even made a bunch of outbound calls.{quote}

Are you also running OpenBSD for your tests? I'm not 100% sure right now (it needs more testing), but I see differences when running Asterisk 16.5.1 on OpenBSD 6.5 (no memory increase) versus OpenBSD 6.6-beta.

By: Kevin Harwell (kharwell) 2019-09-16 10:48:27.847-0500

{quote}
Are you also running OpenBSD for your tests?
{quote}
Not at this time. I am currently running Ubuntu 18.04.

It's odd that upgrading OpenBSD appears to cause the leak. What version of OpenSSL is installed in each version? Did the OpenSSL version change between OpenBSD 6.5 and 6.6-beta?

By: Michael Maier (micha) 2019-09-16 11:39:18.069-0500

Some additional information:
I can see a massive memory leak since Asterisk >= 16.5. It's enough to run 4 idle trunks and 2 extensions, each using pjsip; the trunks are using SIPS. They're doing just re-INVITEs or OPTIONS, and memory usage rises by 792 kBytes per hour.

For me, the memory leak is fixed (= same behavior as with <= 16.4) by using pjproject 2.8 (instead of 2.9, which came with 16.5.0). From my point of view, there is a problem with pjproject 2.9.

Could you please try to run >= 16.5 with pjproject 2.8 and check whether the memory leak disappears?

By: Kevin Harwell (kharwell) 2019-09-16 18:18:26.109-0500

I believe I was able to replicate this. I've also attached the [^pjsip.conf] file I used. I added a few more endpoints and registrations in hopes it would speed up the memory growth. After running for a bit, memory did slightly increase.

To speed things up further, I then altered outbound registrations in Asterisk to send one every second. Here's some output from that (at startup):
{noformat}
6581856 bytes in        653 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
~30 minutes later:
{noformat}
VIRT    RES     SHR   S %CPU %MEM   TIME+   COMMAND
2946888 188544  29928 S  2.0  2.4   0:28.58 asterisk
9759328 bytes in       1164 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:20:01.570-0500

I then also switched to a UDP transport (at startup):
{noformat}
7766688 bytes in        800 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
~30 minutes later:
{noformat}
VIRT    RES     SHR   S %CPU %MEM   TIME+   COMMAND
2948928 183840  29440 S  2.3  2.3   0:35.31 asterisk
7783072 bytes in        804 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
It even went down after some more time:
{noformat}
7580512 bytes in        789 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:22:18.516-0500

I was running the current 16 branch of Asterisk with pjproject bundled 2.9 on Ubuntu 18.04.
{noformat}
$ openssl version
OpenSSL 1.1.1  11 Sep 2018
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:33:45.386-0500

Oh, and to duplicate I just set up two Asterisk instances with the attached configurations (a few items like host and key files will need changing), and watched the memory grow.

Attaching second config used to receive registrations [^pjsip.ast2.conf]

By: Mark (mpa) 2019-09-17 02:30:39.821-0500

Thanks for testing. I'm also using the bundled pjproject 2.9, but as I wrote earlier, the issue happens with 16.4.0 (pjproject 2.8) as well. Nonetheless I'll try a few things and report back. Hopefully we can nail it down then.

* OpenBSD 6.5
$ openssl version
LibreSSL 2.9.1

* OpenBSD 6.6-beta
$ openssl version
LibreSSL 3.0.0


By: Mark (mpa) 2019-09-17 11:52:51.483-0500

Update: My latest tests show that it's definitely not related to the local OpenSSL/LibreSSL version, but to the changes between bundled pjproject 2.8 and 2.9 (as hinted by Michael). To be more precise, the following library makes the difference:

libasteriskpj.so.0.0

When running Asterisk with the libasteriskpj.so.0.0 that ships with Asterisk 16.4.0 (pjproject 2.8), everything's fine; it doesn't matter whether you're running Asterisk 16.4.* or 16.5.*.
As soon as I copy in the libasteriskpj.so.0.0 from Asterisk 16.5.1 (pjproject 2.9), the memory issue starts.

By: Michael Maier (micha) 2019-09-17 12:12:48.771-0500

Thanks for confirming what I already wrote here: https://issues.asterisk.org/jira/browse/ASTERISK-28521?focusedCommentId=248020&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-248020

By: George Joseph (gjoseph) 2019-09-18 10:31:21.268-0500

Between the comments here and the thread on asterisk-dev I'm a little confused.
Are we saying this is a TLS-only issue, or does it affect UDP as well?


By: Michael Maier (micha) 2019-09-18 11:09:21.432-0500

Mark wrote here https://issues.asterisk.org/jira/browse/ASTERISK-28521?focusedCommentId=247997&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-247997 :

------------------------------------------------------------------------
Further tests show:

   registration (tls), endpoints (udp) = same issue
   registration (udp), endpoints (tls) = same issue
   registration (udp), endpoints (udp) = no issue
------------------------------------------------------------------------

My own trunk/register tests described here https://www.mail-archive.com/asterisk-dev@lists.digium.com/msg44806.html also use SIPS (TLS 1.2). I never tested without SIPS. Based on Mark's tests and mine, it looks like it could be a TLS-only problem. I can see a lot of SSL changes between pjproject 2.8 and 2.9 (but I'm not able to assess them). The local OpenSSL library can't be the problem (regarding my test cases), because it was always the same version.

One small addition:
I ran Asterisk with SIPS under valgrind for a very short time (about 20 minutes; my machine is much too slow for this). Valgrind reported definitely lost memory related to OpenSSL library usage (I don't remember the exact wording any more). Sorry.

Another finding:
The bigger the requests, the bigger the memory leak (I tested with and without the mediasec patches, see here: https://www.mail-archive.com/asterisk-dev@lists.digium.com/msg44812.html). That patch creates much bigger requests (REGISTER and OPTIONS) and adds handling of a new response type (494). With pjproject 2.9 it is unusable; with pjproject 2.8 it works as it should - no rising memory usage at all, no matter which Asterisk version is used (<= 16.4.x or >= 16.5.x).

By: George Joseph (gjoseph) 2019-09-19 10:45:45.080-0500

[~micha] Thanks.  I figured it out. :)

It looks like there was indeed a regression with TLS transports between pjproject 2.8 and 2.9. I've got patches up on Gerrit.
If you or [~mpa] could test, that would be great.
I'll attach the patch here if you can't get to Gerrit.


By: George Joseph (gjoseph) 2019-09-19 10:47:17.060-0500

Apply this patch from the root of the asterisk source tree with
{{patch -p1 < 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch}}


By: Michael Maier (micha) 2019-09-19 13:11:18.872-0500

Hello George,
I tested the patch 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch, and I see exactly the same behavior as without it. Maybe you detected another problem, but that's not the problem that I (and maybe Mark) am facing.

By: George Joseph (gjoseph) 2019-09-19 13:34:42.196-0500

After you applied the patch, did you do a make distclean and then run ./configure again?
If not, then sorry - I should have mentioned that.
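
For reference, the full sequence would look roughly like this (a sketch; the source path is a placeholder and the configure options depend on your local build):
{noformat}
# Sketch: apply the test patch and rebuild with the bundled pjproject.
cd /path/to/asterisk-source            # placeholder path
patch -p1 < 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch
make distclean
./configure --with-pjproject-bundled   # plus whatever options you normally use
make
make install
{noformat}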


By: Michael Maier (micha) 2019-09-19 13:49:58.593-0500

I integrated the patch into the existing asterisk-16.5.1.tar.gz as two patch files, asterisk-16.5.1/third-party/pjproject/patches/0020-revert-multi-listener-6021.patch and 0021-revert-multi-listener-6002.patch. This way it is integrated into the normal build based on the spec file for asterisk-16.5.1 with pjproject 2.9. I verified the existence of the patch in the rpmbuild/BUILD/... path.

By: George Joseph (gjoseph) 2019-09-19 14:30:10.864-0500

OK.  When you're testing, are you seeing any pjproject SSL errors reported?


By: Michael Maier (micha) 2019-09-19 14:50:38.129-0500

No. No errors can be seen that wouldn't also be seen with 2.8. The only one is at the end of a registration (on unregister):
SSL 6 [SSL_ERROR_ZERO_RETURN] (Read) ret: 0 len: 32000
But that's a warning.

By: Mark (mpa) 2019-09-20 04:48:30.860-0500

I don't see any changes or errors with your diff here either.

By: George Joseph (gjoseph) 2019-09-20 08:41:13.233-0500

Here's another patch to test.

[^test_revert_ssl_remote_cert.patch] is an Asterisk patch file that you apply from the root of the Asterisk source tree, or you can copy [^0030-revert_ssl.patch] and [^0031-revert_ssl.patch] directly into {{third-party/pjproject/patches}}.

*THESE PATCHES ARE FOR TESTING ONLY*
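
Roughly, the two ways to apply them would be (a sketch; the source path is a placeholder):
{noformat}
# Option A: apply the combined Asterisk patch from the source tree root
cd /path/to/asterisk-source            # placeholder path
patch -p1 < test_revert_ssl_remote_cert.patch

# Option B: drop the pjproject patches into the bundled patches directory,
# where the build system picks them up when it extracts pjproject
cp 0030-revert_ssl.patch 0031-revert_ssl.patch third-party/pjproject/patches/

# Either way, rebuild from a clean tree afterwards:
make distclean && ./configure --with-pjproject-bundled && make
{noformat}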

By: Mark (mpa) 2019-09-20 09:15:57.298-0500

This one looks promising.

I've been running with your new diff applied, and after 15 minutes with a few test calls there is no noticeable memory increase. I'll post again in the next 1-2h.

*UPDATE after 2,5h:*
Everything's fine

* memory show summary - after start
5260160 bytes in        382 allocations in file ../src/pj/pool_policy_malloc.c

* memory show summary - 2,5h and multiple calls later it's almost identical
5398144 bytes in        413 allocations in file ../src/pj/pool_policy_malloc.c

*UPDATE 2 after 6h:*
6h and 5 calls later, Asterisk consumes < 1MB more memory. This looks absolutely normal, and most notably the allocations in "pool_policy_malloc.c" decrease after you hang up. That wasn't the case before.

With respect to Michael's post, I'll also try again with pjproject 2.8 to compare both results.

Thanks.

By: George Joseph (gjoseph) 2019-09-20 09:26:23.745-0500

Good news [~mpa].  Keep me posted.

By: Michael Maier (micha) 2019-09-20 12:27:52.799-0500

0030/0031 seem to address the problem. I can't see the typical 15-minute memory growth any more. But memory consumption, compared to the same Asterisk version (16.5.1) with pjproject 2.8, seems significantly higher.
After 3 hours and 3 calls, pjproject 2.9 uses 71160 kBytes. The same Asterisk with pjproject 2.8 consumed 72084 kBytes after 20 hours and more calls. Let's wait.

Update:
PS: the connection-reuse revert patches are applied, too.

By: Sean Bright (seanbright) 2019-09-20 13:01:57.082-0500

I'm not sure that I would consider a ≈1% increase in memory consumption to be "significantly higher." Are your test machines memory constrained in some way?

I think it's also safe to assume that changes internal to PJSIP between 2.8 and 2.9 - other than the ones that [~gjoseph] is patching away - may be contributing to the increased memory footprint.

*Edit:* Ah, I read the numbers backwards. Sorry for the noise.

By: Michael Maier (micha) 2019-09-22 02:09:17.168-0500

Meanwhile I can say that patches 0030/0031 addressed the problem in idle mode for me. It's hard for me to say whether they address the problem during calls, too.

But I have to say that I noticed one more problem: Asterisk 16.5.1 with pjproject 2.9, compared with pjproject 2.8, shows consistently higher CPU usage per hour in idle mode (4 SIPS trunks, 2 SIP extensions): 1:16 vs. 1:05 minutes - that's a big difference of 16.9%. I'm not sure if this is related to the patches here, but it should urgently be taken into consideration, too. I would like to wait until the final patches are here to see if this problem still occurs.

By: Mark (mpa) 2019-09-22 02:39:08.663-0500

*My results with pjproject 2.8 (again 5 registered numbers, 7 endpoints)*

* memory show summary - after start
5270112 bytes in        382 allocations in file ../src/pj/pool_policy_malloc.c

* memory show summary - 24h and a few test calls later it's almost identical
5383008 bytes in        389 allocations in file ../src/pj/pool_policy_malloc.c

Compared to the tests with 2.9 + diffs 30/31 above, memory consumption as shown by 'top' is almost the same after >24h. Also, when inspecting the graphs from my monitoring system regarding CPU/memory, I don't see any real differences between 2.8 and 2.9 + diffs 30/31 here. CPU user time stays < 6% during a call.


By: George Joseph (gjoseph) 2019-09-23 06:14:54.111-0500

The upstream maintainers of pjproject have acknowledged the issue with the remote certificate chain leak and are working on a permanent solution.


By: Friendly Automation (friendly-automation) 2019-09-24 14:23:55.334-0500

Change 12937 merged by Friendly Automation:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12937|https://gerrit.asterisk.org/c/asterisk/+/12937]

By: Friendly Automation (friendly-automation) 2019-09-24 15:39:59.503-0500

Change 12894 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12894|https://gerrit.asterisk.org/c/asterisk/+/12894]

By: Friendly Automation (friendly-automation) 2019-09-24 15:40:15.594-0500

Change 12896 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12896|https://gerrit.asterisk.org/c/asterisk/+/12896]

By: Friendly Automation (friendly-automation) 2019-09-24 15:40:35.966-0500

Change 12897 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12897|https://gerrit.asterisk.org/c/asterisk/+/12897]

By: Friendly Automation (friendly-automation) 2019-09-24 15:40:53.027-0500

Change 12895 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12895|https://gerrit.asterisk.org/c/asterisk/+/12895]

By: Friendly Automation (friendly-automation) 2019-09-24 15:41:12.440-0500

Change 12898 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12898|https://gerrit.asterisk.org/c/asterisk/+/12898]

By: Friendly Automation (friendly-automation) 2019-09-24 15:41:31.935-0500

Change 12936 merged by George Joseph:
pjproject_bundled:  Revert pjproject 2.9 commits causing leaks

[https://gerrit.asterisk.org/c/asterisk/+/12936|https://gerrit.asterisk.org/c/asterisk/+/12936]