Summary: ASTERISK-28521: pjsip: Memory Leak
Reporter: Mark (mpa) | Labels: patch pjsip
Date Opened: 2019-09-03 07:51:21 | Date Closed: 2019-09-24 14:23:54
Priority: Major | Regression?:
Status: Closed/Complete | Components: Resources/res_pjsip, Resources/res_pjsip_outbound_registration
Versions: 16.4.0, 16.4.1, 16.5.0 | Frequency of Occurrence: Constant
Related Issues:
Environment: Asterisk 16.5.0, OpenBSD 6.6-beta
Attachments:
(0) 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch
(1) 0030-revert_ssl.patch
(2) 0031-revert_ssl.patch
(3) pjsip.ast2.conf
(4) pjsip.conf
(5) pjsip.conf
(6) summary_12h
(7) summary_start
(8) test_revert_ssl_remote_cert.patch
Description:

After updating to OpenBSD 6.6-current about 4 weeks ago, I also updated Asterisk 16.4.0 -> 16.4.1. Since then, Asterisk has been using more and more memory. Right now I'm running Asterisk 16.5.0 on OpenBSD 6.6-beta in a very small test environment and I still see the same issue. Weeks ago, Asterisk started with ~60MB and stayed around that value; now it starts with ~60MB and grows by 80MB/24h without any calls being made.

After compiling with MALLOC_DEBUG and running "memory show summary" every few hours, there's only one line that goes through the roof:

* after start
5542784 bytes in 464 allocations in file ../src/pj/pool_policy_malloc.c
* 12h later
67805312 bytes in 25962 allocations in file ../src/pj/pool_policy_malloc.c

The same applies to "memory show allocations":

* after start
$ grep "pool_policy_malloc.c" allocations_start | wc -l
468
* 12h later
$ grep "pool_policy_malloc.c" allocations_12h | wc -l
25968

The only configuration change that stops Asterisk from allocating more and more memory is removing "registration" and "endpoints" from pjsip.conf. This way, Asterisk only listens on 5061 TLS and does nothing.
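The grep/wc counting used above can be wrapped in a small script. A minimal sketch: the two files stand in for real "memory show allocations" dumps (sample contents are inlined here, not real Asterisk output; only the filenames follow the report's naming):

```shell
#!/bin/sh
# Stand-ins for real "memory show allocations" dumps (sample data, not real output).
printf 'alloc in file ../src/pj/pool_policy_malloc.c\nalloc in file astobj2.c\n' > allocations_start
printf 'alloc in file ../src/pj/pool_policy_malloc.c\nalloc in file ../src/pj/pool_policy_malloc.c\nalloc in file astobj2.c\n' > allocations_12h

# Same counting technique as in the report: match the pool file and count lines.
start=$(grep -c "pool_policy_malloc.c" allocations_start)
later=$(grep -c "pool_policy_malloc.c" allocations_12h)
echo "pool_policy_malloc.c: $start -> $later allocations (delta $((later - start)))"
```

A steadily positive delta between snapshots, with no calls in flight, is what pointed at a leak inside the PJSIP pools here.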
Comments:

By: Asterisk Team (asteriskteam) 2019-09-03 07:51:21.914-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Mark (mpa) 2019-09-03 08:01:10.642-0500

summary_start = memory show summary after start
summary_12h = memory show summary 12h later

By: Joshua C. Colp (jcolp) 2019-09-03 08:41:02.019-0500

That would be the PJSIP memory pool, which can make things extremely difficult to track down. Can you narrow it down further? What's an example configuration that exhibits it? Are you doing reloads often? Are there calls going on?
By: Mark (mpa) 2019-09-04 01:09:44.888-0500

The test configuration I set up about 9h ago:
* a very simple pjsip.conf with only the registration part and no endpoints (see attached file)
* no extensions.ael configured
* no calls were made
* Asterisk has not been restarted since start

/var/log/asterisk/mmlog shows:
1567544069 - New session

* memory show summary - after start
5015680 bytes in 258 allocations in file ../src/pj/pool_policy_malloc.c
* memory show summary - 9h later
26144896 bytes in 9428 allocations in file ../src/pj/pool_policy_malloc.c

By: Joshua C. Colp (jcolp) 2019-09-04 07:56:19.240-0500

Re-reading your initial description, you mentioned it was going from 16.4.0 to 16.4.1 that caused the problem? Is my understanding correct? There were only two changes between those (at least in what we ship), and neither touched such areas at all. One change was in res_pjsip_messaging and the other in chan_sip, so it would be a strange thing to cause a regression in this area.

By: Mark (mpa) 2019-09-04 09:35:14.751-0500

Besides Asterisk, I also updated dependent packages and the base OS. Right now I can reproduce the error with all Asterisk versions listed above (16.4.0, 16.4.1, 16.5.0).

By: Kevin Harwell (kharwell) 2019-09-09 17:06:33.746-0500

Are the registrations and endpoint qualifications succeeding? Any kind of errors or warnings in the log?

If you can, please try other changes to your configuration. For instance:
* Remove the acl section. Does the leak still occur?
* Try a non-TLS transport if you can.
* If possible, set qualify_frequency to '0' (turning it off) and see if the leak still occurs. If it doesn't, try turning it back on, but at a lesser value, to see if the leak occurs faster.
* If that doesn't affect things, then try removing just the outbound registrations, but leave the endpoints, and see if you still notice a change.
By: Mark (mpa) 2019-09-12 07:36:33.395-0500

Even with increased debug/verbose logging, I only see normal communication between Asterisk and the SIP provider/endpoints (one registered number + one endpoint during tests, so the produced log is fairly small).

I've tried the following changes to the configuration:
- deleted acl section = same issue
- endpoints only (no registration) = same issue
- registration only (no endpoints) = same issue
- qualify_frequency set to 0 = same issue

As I didn't see any increased memory usage with zero registrations/endpoints before, I also thought about changing parameters which are always involved in both registration and endpoint connection attempts. To make it short: when switching from TLS to UDP for registration and endpoints, I no longer see memory issues, and "pool_policy_malloc.c" entries even decrease over time:

# start
$ grep "pool_policy_malloc.c" allocations_start | wc -l
289
# after 2h
$ grep "pool_policy_malloc.c" allocations_2h | wc -l
273

Further tests show:
- registration (tls), endpoints (udp) = same issue
- registration (udp), endpoints (tls) = same issue
- registration (udp), endpoints (udp) = no issue

UPDATE: 6h later...everything's fine. No noticeable change in memory consumption.

By: Kevin Harwell (kharwell) 2019-09-13 12:25:16.704-0500

{quote}
UPDATE: 6h later...everything's fine. No noticeable change in memory consumption.
{quote}
I assume you were still running with the UDP transport? Also, just to be clear, the only messages being transmitted are REGISTER and OPTIONS requests, correct? What happens if you've configured it to use the TLS transport, but set qualify_frequency=0 (essentially disabling OPTIONS)?

I've set up a test here between two Asterisk instances, with instance 1 registering (using your configuration) to instance 2. So far I have not seen memory really going up. I even made a bunch of outbound calls.
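Mark's transport matrix can be reproduced with a pjsip.conf along these lines. This is a hedged sketch, not the attached config: section names, URIs, and certificate paths are placeholders, and only the transport-selection knobs relevant to the test are shown:

```ini
; Two transports; pointing the registration at transport-udp vs transport-tls
; is what flips the leak on and off in the tests above.
[transport-tls]
type=transport
protocol=tls
bind=0.0.0.0:5061
cert_file=/etc/asterisk/keys/asterisk.crt      ; placeholder path
priv_key_file=/etc/asterisk/keys/asterisk.key  ; placeholder path

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060

[provider-reg]
type=registration
transport=transport-udp              ; switch to transport-tls to reproduce the leak
server_uri=sip:sip.example.com       ; placeholder URI
client_uri=sip:user@sip.example.com  ; placeholder URI
```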
By: Mark (mpa) 2019-09-15 02:31:54.781-0500

{quote}I assume you were still running with the udp transport?{quote}
Yes

{quote}As well just to be clear the only messages being transmitted are REGISTERs and OPTIONs correct?{quote}
Yes, normal keep-alive messages

{quote}What happens if you've configured it to use the tls transport, but set the qualify_frequency=0 (essentially disabling OPTIONs)?{quote}
I've tried this before with no luck. Asterisk starts with approx. 60MB allocated and reaches 150MB in under a day.

{quote}I've setup a test here between two Asterisk instances, with instance 1 registering (using your configuration) to instance 2. So far I have not seen memory really going up. I even made a bunch of outbound calls.{quote}
Are you also running OpenBSD for your tests? I'm not 100% sure right now (needs more testing), but I see differences when running Asterisk 16.5.1 on OpenBSD 6.5 (no memory increase) versus OpenBSD 6.6-beta.

By: Kevin Harwell (kharwell) 2019-09-16 10:48:27.847-0500

{quote}
Are you also running OpenBSD for your tests?
{quote}
Not at this time. I am currently running Ubuntu 18.04. It's odd that upgrading OpenBSD appears to cause the leak. What version of OpenSSL is installed in each version? Did the OpenSSL version change between OpenBSD 6.5 and 6.6-beta?

By: Michael Maier (micha) 2019-09-16 11:39:18.069-0500

Some additional information: I can see a massive memory leak since Asterisk >= 16.5. It's enough to run 4 idle trunks and 2 extensions - each using PJSIP, the trunks using SIPS. They're doing just re-INVITEs or OPTIONS, and memory usage rises by 792 kBytes hourly. For me, the memory leak is fixed (= same behavior as with <= 16.4) by using pjproject 2.8 (instead of 2.9, which came with 16.5.0). From my point of view, there is a problem with pjproject 2.9. Could you please try to run >= 16.5 with pjproject 2.8 and check if the memory leak disappears?

By: Kevin Harwell (kharwell) 2019-09-16 18:18:26.109-0500

I believe I was able to replicate this.
I've attached my [^pjsip.conf] file that I used as well. I added a few more endpoints and registrations in hopes it would speed up the memory growth. After running for a bit, memory did slightly increase. To speed things up further, I then altered outbound registrations in Asterisk to send one every second. Here's some output from that (at startup):
{noformat}
6581856 bytes in 653 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
~30 minutes later:
{noformat}
VIRT RES SHARED CPU MEM
2946888 188544 29928 S 2.0 2.4 0:28.58 asterisk

9759328 bytes in 1164 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:20:01.570-0500

I then switched to a UDP transport (at startup):
{noformat}
7766688 bytes in 800 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
~30 minutes later:
{noformat}
VIRT RES SHARED CPU MEM
2948928 183840 29440 S 2.3 2.3 0:35.31 asterisk

7783072 bytes in 804 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}
It even went down after some more time:
{noformat}
7580512 bytes in 789 allocations in file ../src/pj/pool_policy_malloc.c
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:22:18.516-0500

I was running the current 16 branch of Asterisk with bundled pjproject 2.9 on Ubuntu 18.04.
{noformat}
$ openssl version
OpenSSL 1.1.1  11 Sep 2018
{noformat}

By: Kevin Harwell (kharwell) 2019-09-16 18:33:45.386-0500

Oh, and to duplicate, I just set up two Asterisk instances with the attached configurations (a few items like host and key files will need changing) and watched the memory grow. Attaching the second config used to receive registrations: [^pjsip.ast2.conf]

By: Mark (mpa) 2019-09-17 02:30:39.821-0500

Thanks for testing. I'm also using the bundled pjproject 2.9, but as I wrote earlier, the issue happens with 16.4.0 (pjproject 2.8) as well. Nonetheless I'll try a few things and report back. Hopefully we can nail it down then.
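The "N bytes in M allocations in file ..." lines quoted throughout this issue have a fixed shape, so the growth between two snapshots can be computed mechanically. A small sketch: the sample lines are copied from Kevin's output above, but the parsing helper is illustrative, not an Asterisk tool:

```python
import re

# Sample "memory show summary" lines, taken from the comments above.
SNAP_START = "6581856 bytes in 653 allocations in file ../src/pj/pool_policy_malloc.c"
SNAP_LATER = "9759328 bytes in 1164 allocations in file ../src/pj/pool_policy_malloc.c"

LINE_RE = re.compile(r"(\d+) bytes in (\d+) allocations in file (.+)")

def parse_summary_line(line):
    """Return (bytes, allocations, filename) from one summary line."""
    nbytes, nallocs, fname = LINE_RE.match(line.strip()).groups()
    return int(nbytes), int(nallocs), fname

b0, a0, fname = parse_summary_line(SNAP_START)
b1, a1, _ = parse_summary_line(SNAP_LATER)
print(f"{fname}: +{b1 - b0} bytes, +{a1 - a0} allocations")
# → ../src/pj/pool_policy_malloc.c: +3177472 bytes, +511 allocations
```

Diffing snapshots this way makes the ~3MB/30min growth under TLS, and the flat line under UDP, easy to see at a glance.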
* OpenBSD 6.5
$ openssl version
LibreSSL 2.9.1
* OpenBSD 6.6-beta
$ openssl version
LibreSSL 3.0.0

By: Mark (mpa) 2019-09-17 11:52:51.483-0500

Update: My latest tests show that it's definitely not related to the local OpenSSL/LibreSSL version, but to the changes between bundled pjproject 2.8 and 2.9 (as hinted by Michael). To be more precise, the following library makes the difference: libasteriskpj.so.0.0

When running Asterisk with the libasteriskpj.so.0.0 that ships with Asterisk 16.4.0 (pjproject 2.8), everything's fine. It doesn't matter whether you're running Asterisk 16.4.* or 16.5.*. As soon as I copy in the libasteriskpj.so.0.0 from Asterisk 16.5.1 (pjproject 2.9), the memory issue starts.

By: Michael Maier (micha) 2019-09-17 12:12:48.771-0500

Thanks for confirming what I already wrote here: https://issues.asterisk.org/jira/browse/ASTERISK-28521?focusedCommentId=248020&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-248020

By: George Joseph (gjoseph) 2019-09-18 10:31:21.268-0500

Between the comments here and the thread on asterisk-dev I'm a little confused. Are we saying this is a TLS-only issue, or does it affect UDP as well?

By: Michael Maier (micha) 2019-09-18 11:09:21.432-0500

Mark wrote here https://issues.asterisk.org/jira/browse/ASTERISK-28521?focusedCommentId=247997&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-247997 :
------------------------------------------------------------------------
Further tests show:
registration (tls), endpoints (udp) = same issue
registration (udp), endpoints (tls) = same issue
registration (udp), endpoints (udp) = no issue
------------------------------------------------------------------------
My own trunk/register tests described here https://www.mail-archive.com/asterisk-dev@lists.digium.com/msg44806.html relate to SIPS, too (TLS 1.2). I never tested without SIPS. Based on Mark's and my tests, it looks like it could be a TLS-only problem.
I could see a lot of SSL changes between pjproject 2.8 and 2.9 (but I'm not able to assess them). The local OpenSSL library can't be the problem (regarding my test cases), because it was always the same version.

One small addition: I ran Asterisk / SIPS for a very short time (about 20 minutes - my machine is much too slow for this operation) under valgrind. Valgrind reported definitely lost memory related to openssl library usage (I don't recall the exact wording any more). Sorry.

Another finding: the bigger the requests, the bigger the memory leak (I tested with and without the mediasec patches, see here: https://www.mail-archive.com/asterisk-dev@lists.digium.com/msg44812.html). That patch creates much bigger requests (REGISTER and OPTIONS) and adds a new response code (494). With pjproject 2.9 -> unusable. With pjproject 2.8 -> works as it should - no rising memory usage at all - no matter which Asterisk version is used (<= 16.4.x or >= 16.5.x).

By: George Joseph (gjoseph) 2019-09-19 10:45:45.080-0500

[~micha] Thanks. I figured it out. :) It looks like there was indeed a regression with TLS transports between pjproject 2.8 and 2.9. I've got patches up on Gerrit. If you or [~mpa] could test, that would be great. I'll attach the patch here if you can't get to Gerrit.

By: George Joseph (gjoseph) 2019-09-19 10:47:17.060-0500

Apply this patch from the root of the Asterisk source tree with {{patch -p1 < 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch}}

By: Michael Maier (micha) 2019-09-19 13:11:18.872-0500

Hello George, I tested the patch 0001-pjproject_bundled-Revert-pjproject-commits-causing-l.patch - I can see exactly the same behavior as without this patch. Maybe you detected another problem - but that's not the problem I, or maybe Mark, am facing.

By: George Joseph (gjoseph) 2019-09-19 13:34:42.196-0500

After you applied the patch, did you do a make distclean and then run ./configure again? If not, then sorry - I should have mentioned that.
By: Michael Maier (micha) 2019-09-19 13:49:58.593-0500

I integrated the patch into the existing asterisk-16.5.1.tar.gz as 2 patch files, asterisk-16.5.1/third-party/pjproject/patches/0020-revert-multi-listener-6021.patch and 0021-revert-multi-listener-6002.patch. This way, it is integrated into the normal build based on the spec file for asterisk-16.5.1 with pjproject 2.9. I verified the existence of the patch in the rpmbuild/BUILD/... path.

By: George Joseph (gjoseph) 2019-09-19 14:30:10.864-0500

OK. When you're testing, are you seeing any pjproject SSL errors reported?

By: Michael Maier (micha) 2019-09-19 14:50:38.129-0500

No. No errors can be seen that wouldn't be seen with 2.8, too. The only one is at the end of a registration (on unregister):
SSL 6 [SSL_ERROR_ZERO_RETURN] (Read) ret: 0 len: 32000
But that's a warning.

By: Mark (mpa) 2019-09-20 04:48:30.860-0500

I don't see any changes or errors with your diff here either.

By: George Joseph (gjoseph) 2019-09-20 08:41:13.233-0500

Here's another patch to test. [^test_revert_ssl_remote_cert.patch] is an Asterisk patch file that you apply from the root of the Asterisk source tree, or you can copy [^0030-revert_ssl.patch] and [^0031-revert_ssl.patch] directly into {{third-party/pjproject/patches}}. *THESE PATCHES ARE FOR TESTING ONLY*

By: Mark (mpa) 2019-09-20 09:15:57.298-0500

This one looks promising. I've been running with your new diff applied, and after 15 min with a few test calls there's no noticeable memory increase. I'll post again in the next 1-2h.

*UPDATE after 2.5h:* Everything's fine.
* memory show summary - after start
5260160 bytes in 382 allocations in file ../src/pj/pool_policy_malloc.c
* memory show summary - 2.5h and multiple calls later, it's almost identical
5398144 bytes in 413 allocations in file ../src/pj/pool_policy_malloc.c

*UPDATE 2 after 6h:* > 6h and 5 calls later, Asterisk consumes < 1MB more memory.
This looks absolutely normal, and most notably, allocations in "pool_policy_malloc.c" decrease after you hang up. That wasn't the case before. With respect to Michael's post, I'll also try again with pjproject 2.8 to compare both results. Thanks.

By: George Joseph (gjoseph) 2019-09-20 09:26:23.745-0500

Good news [~mpa]. Keep me posted.

By: Michael Maier (micha) 2019-09-20 12:27:52.799-0500

0030/0031 seem to address the problem. I can't see the typical 15-minute memory growth any more. But memory consumption compared to the same Asterisk version (16.5.1) with pjproject 2.8 seems significantly higher: after 3 hours and 3 calls, pjproject 2.9 uses 71160 kBytes. The same Asterisk with pjproject 2.8 consumed 72084 kBytes after 20 hours and more calls. Let's wait.

Update: PS: the connection-reuse revert patches are applied, too.

By: Sean Bright (seanbright) 2019-09-20 13:01:57.082-0500

I'm not sure that I would consider a ≈1% increase in memory consumption to be "significantly higher." Are your test machines memory constrained in some way? I think it's also safe to assume that changes internal to PJSIP between 2.8 and 2.9 - other than the ones that [~gjoseph] is patching away - may be contributing to the increased memory footprint.

*Edit:* Ah, I read the numbers backwards. Sorry for the noise.

By: Michael Maier (micha) 2019-09-22 02:09:17.168-0500

Meanwhile I can say that patches 0030/0031 addressed the problem in idle mode for me. It's hard for me to say if they address the problem during calls, too. But I have to mention one more problem I noticed: Asterisk 16.5.1 with pjproject 2.9, compared with pjproject 2.8, shows a consistently higher CPU usage per hour in idle mode (4 SIPS trunks, 2 SIP extensions): 1:16 vs. 1:05 minutes - a difference of 16.9%. I'm not sure if this is related to the patches here - but it should urgently be taken into consideration, too. I would like to wait until the final patches are here to see if this problem still occurs.
By: Mark (mpa) 2019-09-22 02:39:08.663-0500

*My results with pjproject 2.8 (again 5 registered numbers, 7 endpoints):*
* memory show summary - after start
5270112 bytes in 382 allocations in file ../src/pj/pool_policy_malloc.c
* memory show summary - 24h and a few test calls later, it's almost identical
5383008 bytes in 389 allocations in file ../src/pj/pool_policy_malloc.c

Compared to the tests with 2.9 + diffs 0030/0031 above, memory consumption as shown by 'top' is almost the same after >24h. Also, when inspecting graphs from my monitoring system regarding CPU/memory, I don't see any real differences between 2.8 and 2.9 + diffs 0030/0031 here. CPU user time stays < 6% during a call.

By: George Joseph (gjoseph) 2019-09-23 06:14:54.111-0500

The upstream maintainers of pjproject have acknowledged the issue with the remote certificate chain leak and are working on a permanent solution.

By: Friendly Automation (friendly-automation) 2019-09-24 14:23:55.334-0500

Change 12937 merged by Friendly Automation:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12937|https://gerrit.asterisk.org/c/asterisk/+/12937]

By: Friendly Automation (friendly-automation) 2019-09-24 15:39:59.503-0500

Change 12894 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12894|https://gerrit.asterisk.org/c/asterisk/+/12894]

By: Friendly Automation (friendly-automation) 2019-09-24 15:40:15.594-0500

Change 12896 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12896|https://gerrit.asterisk.org/c/asterisk/+/12896]

By: Friendly Automation (friendly-automation) 2019-09-24 15:40:35.966-0500

Change 12897 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12897|https://gerrit.asterisk.org/c/asterisk/+/12897]

By: Friendly Automation
(friendly-automation) 2019-09-24 15:40:53.027-0500

Change 12895 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12895|https://gerrit.asterisk.org/c/asterisk/+/12895]

By: Friendly Automation (friendly-automation) 2019-09-24 15:41:12.440-0500

Change 12898 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12898|https://gerrit.asterisk.org/c/asterisk/+/12898]

By: Friendly Automation (friendly-automation) 2019-09-24 15:41:31.935-0500

Change 12936 merged by George Joseph:
pjproject_bundled: Revert pjproject 2.9 commits causing leaks
[https://gerrit.asterisk.org/c/asterisk/+/12936|https://gerrit.asterisk.org/c/asterisk/+/12936]