Summary: | ASTERISK-29232: Memory Leak since 16.13.0 | ||
Reporter: | Luke Escude (lukeescude) | Labels: | fax |
Date Opened: | 2021-01-04 12:42:18.000-0600 | Date Closed: | |
Priority: | Major | Regression? | |
Status: | Open/New | Components: | Core/PBX |
Versions: | 16.15.0 16.20.0 16.22.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | CentOS 7 x64 | Attachments: | ( 0) Analysis.xlsx ( 1) Apex-Analysis.xlsx ( 2) Container_Leak_Tracking.xlsx ( 3) cw1-memchart.png ( 4) Jan6-1401.csv ( 5) nw1-memchart.png ( 6) PW3-Memchart.png |
Description: | So we have around 100 instances of Asterisk 16.13.0 that have been running for over 2 months, normal load (small businesses with less than 30 users each), without issue.
We have another 350 instances of Asterisk 16.15.0 on which we've started seeing a very linear increase in memory consumption over time. Specifically, higher-load instances (150+ users) last only a few days before hitting our artificial 3GB ceiling and getting restarted by the OOM killer. There are very few differences between our 16.13 and 16.15 deployments. All instances are set up as follows:
- CentOS 7 64-bit
- Voicemail over ODBC
- unixODBC 2.3.1
- MariaDB Connector (instead of the crappy mysql connector)
- CDR over MySQL
- SIP trunks are registered every 2 minutes, qualified every 15 seconds.
- User devices register every 10 minutes, qualified every 15 seconds.
- User devices connect via TCP more often than UDP.
- I have NO pjsip threadpool configuration options defined. I think the default is 50 threads?

Here is what I am about to test within the next week:
1. unixODBC updated to 2.3.9
2. Longer SIP trunk registration period - maybe PJSIP is working too hard?
3. Longer qualify timeout - maybe PJSIP is working too hard?

One of my first questions: is it SAFE to compile Asterisk with MALLOC_DEBUG and just leave it on permanently? I am scared to enable it and suddenly have a bunch of users experiencing issues because I've enabled something that should only be enabled in dev.

Sorry for the length of the post; I'm trying to cover as much ground as possible. | ||
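The "very linear increase" toward the 3GB ceiling described above lends itself to a rough time-to-OOM projection. The sketch below is not from the ticket; the function name and sample values are hypothetical. It fits a least-squares line to (uptime, RSS) samples and extrapolates to the ceiling.

```python
# Hypothetical helper: given (uptime_seconds, rss_bytes) samples, fit a
# least-squares line and estimate when RSS would cross a ceiling (e.g. 3 GiB).
def seconds_until_ceiling(samples, ceiling_bytes):
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_r = sum(r for _, r in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_tr = sum(t * r for t, r in samples)
    # Ordinary least-squares slope/intercept for rss = slope * t + intercept.
    slope = (n * sum_tr - sum_t * sum_r) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_r - slope * sum_t) / n
    if slope <= 0:
        return None  # memory is flat or shrinking; no projected OOM
    return (ceiling_bytes - intercept) / slope

# Illustrative samples: +50 MB per hour starting from 200 MB.
samples = [(0, 200e6), (3600, 250e6), (7200, 300e6)]
eta = seconds_until_ceiling(samples, 3 * 1024 ** 3)  # seconds until 3 GiB
```

A projection like this makes it easy to compare leak rates between the 16.13 and 16.15 fleets without waiting for an actual OOM kill.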
Comments: | By: Asterisk Team (asteriskteam) 2021-01-04 12:42:20.561-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Luke Escude (lukeescude) 2021-01-04 12:52:54.890-0600

Added 3 charts showing memory consumption over the last 30 days. CW1 is running Asterisk 16.13 instances; NW1 and PW3 are running Asterisk 16.15 instances. Doesn't really help with the issue, but it's definitely interesting to look at.

By: Joshua C. Colp (jcolp) 2021-01-06 03:53:18.833-0600

MALLOC_DEBUG does incur a performance penalty, but it should not be large and should be fine. It's likely the only way to isolate this issue since you're using a literal TON of functionality, including off-nominal, rarely used stuff. The only other way would be to duplicate the environment and eliminate aspects bit by bit to further isolate.

By: Luke Escude (lukeescude) 2021-01-06 10:58:30.485-0600

Awesome, we'll enable that flag and distribute that image to only 1 of our worker nodes so we can contain the debugging to a limited set of customers. I'll update this ticket when I get some data - it may be about a week before we can begin, since we're having to go to one of our datacenters and perform hardware upgrades this weekend.

By: Luke Escude (lukeescude) 2021-01-06 14:04:30.376-0600

Attaching a CSV of "memory show summary" at the restart of a fresh (small) instance of Asterisk 16.15. Apparently the memory leak exists on these small ones too, but it takes about a week to notice. So, I will sample "memory show summary" in a week, with the same CSV syntax so it'll be easy to analyze/compare in Excel.

By: Luke Escude (lukeescude) 2021-01-06 15:04:09.966-0600

So for the last hour, the instance has had 0 calls, with around 14 or so devices registered, and here are the files with some significant allocation changes:
- ../src/pj/pool_policy_malloc.c +742592 bytes
- datastore.c +40590 bytes
- res_pjsip_pubsub.c +10720 bytes
- sorcery.c +2191629 bytes
There are other ones as well, but those are the largest. The total change in 1 hour has been +3034055 bytes.
(Roughly 3MB) Sorcery appears to be the worst, so here's an output of SOME allocations for it: {noformat} 11 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c 160 bytes allocated by sorcery_object_type_alloc() line 721 of sorcery.c 72 bytes allocated by sorcery_object_type_alloc() line 728 of sorcery.c 544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c 304 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 312 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 416 bytes allocated by sorcery_object_type_alloc() line 703 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 216 bytes allocated by ast_sorcery_object_fields_register() line 1165 of sorcery.c 304 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 6 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 66376 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 304 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 13 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 992 bytes allocated by __ast_sorcery_open() line 637 of sorcery.c 416 bytes allocated by sorcery_object_type_alloc() line 703 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 224 bytes allocated by 
__ast_sorcery_object_field_register() line 1205 of sorcery.c 544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c 6 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 72 bytes allocated by ast_sorcery_observer_add() line 2379 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 52 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 72 bytes allocated by ast_sorcery_observer_add() line 2379 of sorcery.c 48 bytes allocated by sorcery_object_type_alloc() line 735 of sorcery.c 544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c 48 bytes allocated by sorcery_object_type_alloc() line 735 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 240 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 6 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 160 bytes allocated by __ast_sorcery_wizard_register() line 451 of sorcery.c 72 bytes allocated by ast_sorcery_observer_add() line 2379 of sorcery.c 24 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 14 bytes allocated by ast_sorcery_alloc() line 1757 of 
sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 40 bytes allocated by sorcery_object_type_alloc() line 708 of sorcery.c 216 bytes allocated by ast_sorcery_object_fields_register() line 1165 of sorcery.c 1248 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 66376 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 4 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 4 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 216 bytes allocated by ast_sorcery_object_fields_register() line 1165 of sorcery.c 272 bytes allocated by ast_sorcery_lockable_alloc() line 1712 of sorcery.c 4 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 304 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c 40 bytes allocated by sorcery_object_type_alloc() line 708 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 160 bytes allocated by sorcery_object_type_alloc() line 721 of sorcery.c 11 bytes allocated by ast_sorcery_alloc() line 1757 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 416 bytes allocated by sorcery_object_type_alloc() line 703 of sorcery.c 240 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 240 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 224 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c 160 bytes allocated by sorcery_object_type_alloc() line 721 of sorcery.c 160 bytes allocated by sorcery_object_type_alloc() line 721 of sorcery.c 240 bytes allocated by __ast_sorcery_object_field_register() line 1205 of sorcery.c {noformat} By: Joshua C. Colp (jcolp) 2021-01-06 15:09:49.328-0600 Many of those allocations are normal. It's only ast_sorcery_generic_alloc calls which actually allocate objects. 
Your description doesn't detail how PJSIP configuration is stored and used. Is it in database? Configuration file? By: Luke Escude (lukeescude) 2021-01-06 15:10:16.916-0600 Here's res_pjsip_pubsub allocations too, in case Sorcery is a dead end (I didn't realize Sorcery was memory related, so it's possible it's there to assist with debugging). {noformat} 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 
of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 102 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 103 bytes allocated by 
allocate_subscription() line 1239 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated 
by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 712 bytes allocated by ast_sip_register_publish_handler() line 2726 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() 
line 653 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 103 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 712 bytes allocated by ast_sip_register_publish_handler() line 2726 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 13 bytes allocated by 
subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of 
res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 11 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 13 bytes allocated by 
subscription_persistence_create() line 653 of res_pjsip_pubsub.c 100 bytes allocated by allocate_subscription() line 1239 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 37 bytes allocated by subscription_persistence_create() line 655 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 176 bytes allocated by allocate_subscription_tree() line 1409 of res_pjsip_pubsub.c 13 bytes allocated by subscription_persistence_create() line 653 of res_pjsip_pubsub.c {noformat}

By: Luke Escude (lukeescude) 2021-01-06 15:14:39.409-0600

PJSIP is configured in a standard conf file. We're not using any Realtime functions (at least not from my understanding of what Realtime is). Besides trunk/endpoint definitions, the only other stuff in pjsip.conf is the following:
{noformat}
[global]
send_contact_status_on_update_registration=yes
user_agent=Asterisk-XXXYYY

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0
local_net=XXXYYY/16
tos=104
cos=3

[transport-tcp]
type=transport
protocol=tcp
bind=0.0.0.0
local_net=XXXYYY/16
tos=104
cos=3
{noformat}

By: Richard Mudgett (rmudgett) 2021-01-06 15:28:16.332-0600

Please attach logs to the issue. Pasting large logs into comments can wind up with your comment being deleted to get JIRA to become responsive while viewing the issue. The best information from MALLOC_DEBUG is the summary counts, as an individual allocation list is basically just useless spam. You can get a summary by the system and by specific module. System summaries point to which modules are consuming memory. Module summaries give you allocation counts by line.
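Mudgett's point that summary counts beat raw allocation lists can also be applied after the fact: the per-allocation dumps pasted above can be collapsed into per-call-site totals. This is a sketch (the `summarize` helper is hypothetical, not part of Asterisk); the line format it parses matches the MALLOC_DEBUG entries in this ticket.

```python
import re
from collections import Counter

# Collapse "N bytes allocated by func() line L of file.c" entries into
# per-call-site totals, mirroring what a module summary reports per line.
ALLOC_RE = re.compile(r"(\d+) bytes allocated by (\S+) line (\d+) of (\S+)")

def summarize(text):
    totals = Counter()
    for m in ALLOC_RE.finditer(text):
        size, func, lineno, filename = m.groups()
        totals[(filename, func, int(lineno))] += int(size)
    return totals.most_common()  # biggest aggregate consumers first

# Sample entries taken from the sorcery dump above (works even on the
# flattened, single-line form of the dump):
sample = (
    "544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c "
    "544 bytes allocated by sorcery_object_type_alloc() line 714 of sorcery.c "
    "304 bytes allocated by ast_sorcery_generic_alloc() line 1728 of sorcery.c"
)
totals = summarize(sample)
```

Diffing two such summaries taken an hour apart points directly at the call sites whose aggregate footprint is growing.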
By: Luke Escude (lukeescude) 2021-01-06 15:28:56.640-0600 This looks interesting: https://github.com/asterisk/asterisk/commit/e1fd51cd2c183a3eb61803f52277d1a7d7a18c9c#diff-be39ef79e09077566ea704aaf08be25863d2a3a3fd5a7820cffcf37bc2df16f0 I did notice res_pjsip_outbound_registration.c is growing in size too.... Outbound registrations makes up 100% of Asterisk's communication with the outside world for us - It registers to SBCs which have public IPs of course, which pass customer devices through those same sockets in the other direction. This enables us to run hundreds of Asterisk instances behind NAT without ever having to open ports. I'll stop blowing this ticket up, and will perform another comparison tomorrow after 12 hours. EDIT: Sorry Richard, will do! By: Joshua C. Colp (jcolp) 2021-01-07 08:44:23.942-0600 We are reverting the referenced outbound registration change due to other changes involved. If you revert the change as well, does that solve the issue? By: Luke Escude (lukeescude) 2021-01-07 10:23:09.960-0600 Ah, looking good so far. I'll report back in 12-24 hours but upon immediate boot-up, memory appears to be flat. Normally it starts rising pretty immediately. Will update soon. By: Benjamin Keith Ford (bford) 2021-01-07 10:26:05.341-0600 Putting in feedback for now. By: Luke Escude (lukeescude) 2021-01-08 12:39:37.169-0600 Ended up booting up Asterisk 16.14 for a few production customers instead of reverting that 1 file (to at least narrow down a general scope of Version over File) but memory is still rising. I'll know more concretely tomorrow if downgrading fixes it or doesn't, then that will help rule out that particular change with PJSIP registrations. There's a possibility these particular customers must have high mem usage anyways because of their sheer volume. By: Luke Escude (lukeescude) 2021-01-10 19:17:04.199-0600 Booting Asterisk 16.14 did not fix the memory leak. 
Booted the same customer (around 180 user devices registered) with Asterisk 16.13 roughly 8 hours ago, on a Sunday with minimal call volume, and RAM usage is already over 500MB. So, I think the outbound registration code I had mentioned may not be the culprit in this particular case. I may just need to bite the bullet and compile that same customer with MALLOC_DEBUG. Doing it in my low-volume dev environment didn't really give us much data to operate on... I'll formulate a plan with my team to accomplish this. By: Luke Escude (lukeescude) 2021-01-10 22:00:55.888-0600 Attaching a memory analysis over a 2-hour period, sampled every 20 minutes. The tracked output of "memory show summary" on an Asterisk 16.13 instance with 170+ customer devices registered over TCP, sampled every 20 minutes. There is a "Diff 20min" column between samplings showing the differential between samples, and a "Diff Total" column after each sampling showing the differential from the INITIAL boot-up. At the end of it all, the following files showed the most RAM consumption increase: astobj2_hash.c datastore.c pbx.c res_pjsip/pjsip_distributor.c res_pjsip/pjsip_transport_events.c res_pjsip_dialog_info_body_generator.c res_pjsip_exten_state.c res_pjsip_pubsub.c sorcery.c See attached - Thanks guys! By: Benjamin Keith Ford (bford) 2021-01-11 09:56:48.782-0600 We may just have to open this one up as a general investigation to see where we may have leaks, narrowing the scope down to those files you've listed. A couple of questions: # Do you have a list of standard operations being done for the customer, such as type of calls, ARI applications, fax, etc.? Might help narrow down the scope a bit further. # It may not be necessary, but when someone looks at the issue, would it be possible to gain access to the customer's machine if we cannot reproduce the issue on our end? By: Luke Escude (lukeescude) 2021-01-11 10:14:08.230-0600 1. Off the top of my head: - Calls come in and are immediately Answer()'d. 
- CallerID Number is run through a local PHP script via SHELL() to be sanitized, all symbols and non-digits stripped, with the number capped at 20 digits in length. - CallerID Name is fetched via a local PHP script via SHELL() and is also returned in a sanitized format - Number is run through a Blacklist check script (also PHP) via SHELL(), returns a 1 if the number is blocked by the global blacklist. - Number is run through a spam score check script (also PHP) via SHELL(), returns a spam probability score 0-100, and follows rules the user sets up. - Fax detection is enabled by the user's choice, and is sent to the fax destination, otherwise Goto() the standard destination. IVRs, ring groups, etc. Those are the steps that ALL inbound calls go through. Next, as far as functionality is concerned: 1. PHP scripts are used throughout dial plan to get various states from the database, like queue agent login/logout. call forwarding settings, etc. 2. We are not using any ARI stuff whatsoever. 3. We are using a lot of AMI stuff - One-directional AMI. Basically every time extensions register, or calls are processed, we receive those AMI events and stick them into the database so users can monitor their phone systems in real-time. We don't write to the AMI socket much, except maybe for a click-to-call command here and there. 4. Voicemail is configured using the conf file, but uses ODBC. The voicemail database node is always less than 5ms away from wherever the Asterisk instance might be running. I intend to replace this with Amazon S3 in an upcoming update. 5. CDR is also stored via ODBC. This will also be replaced with in-dialplan PHP scripts instead of using Asterisk's built-in CDR engine in a future release. 6. Essentially everything is configured via the Conf files - No "realtime" stuff. 2. 
It's possible to give someone access, but it would have to be on-demand - since these Asterisk instances can hop around different datacenters, there's no way for us to know where one might be in a week. (Example: it may be running in our Miami datacenter right now, but if Miami goes down for maintenance or whatever, it could end up in Chicago, or NYC, or Dallas, etc.)

I am going to keep trying to track it down. The instance in question is currently hovering around 796MB of RAM, but if it doesn't increase over the next 2 days, then there may not be an issue.

By: Luke Escude (lukeescude) 2021-01-11 10:25:40.028-0600

I wonder if there's a correlation to hints/BLFs. This particular customer has 863 dial plan hints. Some of these hints have up to 30 watchers. Additionally, our hints were recently made complex in order to support our legacy and new-gen SBCs. By "complex" I mean every single user device has a compound hint, configured as PJSIP/Device1&PJSIP/Device2&PJSIP/Device3 and so on. This isn't new, but the overall length of the hint definitions doubled. (If a user has 4 devices, the hint is actually an aggregation of 8 PJSIP/Device entries tied together with &.)

By: Luke Escude (lukeescude) 2021-01-13 16:42:11.886-0600

Hey guys, I believe I've figured it out. I manually checked about 30 instances of Asterisk, half of which are leaking memory and the other half are fine. The half that were fine were running a dial plan we last put into production on August 28th. The half that were leaking were running any dial plan updates we pushed after September 11th. The changes that were made on September 11th eliminated all of our cURL() calls to fetch CallerID, blacklisting, and eFax control in favor of local PHP scripts that run inside the instance.
Example below.

Former code:
{code}
same => n,Set(spammer=${SHELL(curl -m 1 "http://URL/CNAM/fetch.php?phone_number=${CALLERID(num)}")})
{code}

New code:
{code}
same => n,Set(spammer=${SHELL(php /instance_scripts/dialplan_get_spam_score.php "${CALLERID(num)}")})
{code}

Do you guys have any ideas on why switching to a local PHP script might cause a memory leak inside of Asterisk?

By: Luke Escude (lukeescude) 2021-01-14 15:41:10.732-0600

I've disproved my previous theories, and will go ahead and push MALLOC_DEBUG into production on a few customer systems for a full day. Will update when I have some data from that.

By: Asterisk Team (asteriskteam) 2021-01-29 12:00:26.713-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened, please create a new issue instead. If the new issue is related to this one, a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Luke Escude (lukeescude) 2021-11-02 10:55:35.728-0500

I am about to resume my memory leak investigation, and plan to build a monitoring daemon that samples the memory summary every 30 minutes and sticks it into a database for analysis over time. The output of "memory show summary" is human-readable but seems somewhat annoying to try and parse with code - is there a better way (ARI or a manager action or something?) to query the memory statistics?

By: Asterisk Team (asteriskteam) 2021-11-02 10:55:36.216-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Sean Bright (seanbright) 2021-11-02 14:19:09.728-0500

There is no better way, no.

By: George Joseph (gjoseph) 2021-11-03 09:25:22.937-0500

The only thing I can suggest is to use the AMI "Command" action to issue the memory CLI commands.
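A minimal sketch of that approach - issuing "memory show summary" through the AMI "Command" action and extracting per-file byte counts. The host, port, credentials, the {{ami_command}} helper itself, and the exact summary line layout are all assumptions here, not a tested recipe; adjust the regex to whatever your Asterisk build actually prints.

```python
import re
import socket

# Hypothetical AMI helper: log in, run one CLI command via the "Command"
# action, and return the raw response text. Credentials are placeholders
# that would come from manager.conf.
def ami_command(command, host="127.0.0.1", port=5038,
                user="monitor", secret="secret"):
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall((
            f"Action: Login\r\nUsername: {user}\r\nSecret: {secret}\r\n\r\n"
            f"Action: Command\r\nCommand: {command}\r\n\r\n"
            "Action: Logoff\r\n\r\n"
        ).encode())
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# Assumed summary line shape: "<bytes> bytes in <n> allocations in file <name>".
SUMMARY_LINE = re.compile(
    r"^\s*(\d+)\s+bytes\s+in\s+(\d+)\s+allocations?\s+in\s+file\s+(\S+)",
    re.MULTILINE,
)

def parse_summary(text):
    """Map each file name to its allocated byte count."""
    return {m.group(3): int(m.group(1)) for m in SUMMARY_LINE.finditer(text)}
```

Run on a schedule, {{parse_summary(ami_command("memory show summary"))}} would give snapshots whose per-file deltas can be stored and trended, roughly like the attached spreadsheets.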
It's the same format as the CLI output, but it may be easier to capture.

The other thing to try is the Leak Sanitizer. It doesn't run until asterisk exits, but it has very low runtime impact and I've found the output more useful.

https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer

There's an option in Compiler Flags to turn it on, but you shouldn't use it and MALLOC_DEBUG at the same time.

By: Luke Escude (lukeescude) 2021-11-03 10:11:10.179-0500

George, how is the output handled? If Asterisk exits and the Leak Sanitizer has something to say about it, will it output to standard syslog/asterisk logging? Right now if asterisk exits, we have a utility that automatically restarts the entire container, which means any files or logs will get wiped near-instantly. I can disable this, but would rather figure out how to pipe Leak Sanitizer output into a remote log server instead.

By: George Joseph (gjoseph) 2021-11-03 10:36:36.171-0500

There are a whole bunch of options you can pass at runtime with the LSAN_OPTIONS environment variable. One is "log_path"...
{code}
# LSAN_OPTIONS=log_path=/tmp/lsan /usr/sbin/asterisk -fg -C /etc/asterisk/asterisk.conf
{code}
This will write the log to {{/tmp/lsan.<pid>}}. If you have a host directory mounted in the container, point the log there. The Leak Sanitizer is part of the Address Sanitizer, so some of the Address Sanitizer flags also apply to the Leak Sanitizer.

https://github.com/google/sanitizers/wiki/AddressSanitizerFlags
https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer

I forgot to mention that you need libasan and liblsan installed to use the sanitizers.

By: Luke Escude (lukeescude) 2021-11-03 11:02:55.727-0500

Interestingly, liblsan isn't available on CentOS 7, and locating it is surprisingly difficult. Once I do locate it, I'll just enable the LEAK_SANITIZER flag when compiling Asterisk, then set the LSAN_OPTIONS log path to a host-mounted folder.
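Those steps might look roughly like the following sketch (the menuselect option id, image name, and paths are assumptions, not tested commands):
{code}
# enable the Leak Sanitizer compiler flag, then rebuild
make menuselect.makeopts
menuselect/menuselect --enable LEAK_SANITIZER menuselect.makeopts
make && make install

# run the container with a host directory mounted for the LSAN logs
docker run -v /var/log/lsan:/lsan-logs \
    -e LSAN_OPTIONS=log_path=/lsan-logs/lsan \
    my-asterisk-image
{code}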
I think this is more effective than monitoring "memory show summary" every X minutes.

EDIT: Located the package in the centos-release-scl-rh repo.

By: George Joseph (gjoseph) 2021-11-04 07:12:10.137-0500

[~lukeescude] I'm assigning this issue to you for further investigation. When you get some results, add them to the issue and mention me and I'll take a look. We can then decide where to go with it.

By: Luke Escude (lukeescude) 2021-11-04 15:37:29.867-0500

@George Joseph - For some reason, "mentions" have never worked for me on this Jira board. Nothing happens when I start typing @...

Anyway, please see the attached spreadsheet. The issue definitely exists in ../src/pj/pool_policy_malloc.c; it's growing very linearly over time. The customer's system was sampled every 2 hours for as long as it could last, and it lasted 12 hours before crashing again, so we got 6 memory samples and a nice Excel line graph.

By: Luke Escude (lukeescude) 2021-11-04 15:45:44.952-0500

Looking at the allocations, there is a MASSIVE number of these:
{code}
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
{code}
... and so on.

By: Luke Escude (lukeescude) 2021-11-04 18:14:51.055-0500

So I started looking into PJSIP, and our own SIP proxy servers, and noticed that our proxy is adding a Record-Route header to every SUBSCRIBE and NOTIFY (regardless of whether it's Asterisk > Endpoint or Endpoint > Asterisk).
This Record-Route header appears to be unnecessary, and matches the Contact header (which we override in the proxy to create a B2BUA setup). Is it at all possible that the presence of the unnecessary Record-Route could cause additional PJSIP buildup?

By: George Joseph (gjoseph) 2021-11-05 08:54:00.180-0500

The fact that pool_policy_malloc is the biggest allocator isn't surprising. Every new transaction and dialog gets a pool allocated to it, and the pool _should_ be released when the dialog or transaction is terminated. Normally, libc should release the pool back to the OS, but if the pages are very fragmented, it may not be able to do so efficiently. When things start going south, try running {{malloc trim}} from the CLI and see what that does to the memory used by the process. It tells libc to be more aggressive in its attempt to return memory to the OS.

Can we do a little recap?
- The latest test was on 16.20, yes?
- What was the usage profile over that period? Number of calls, endpoints, inbound/outbound registrations, subscriptions, etc.
- For the OOM killer... are you talking about the kernel's OOM killer or something you've implemented? How are you setting the limit, and on what parameter? Virtual or resident memory?
- How much memory is actually installed on the machines?

I don't _think_ adding the RR header is an issue, but I can run a few tests on Monday.

By: Luke Escude (lukeescude) 2021-11-05 10:22:14.638-0500

Correct, 16.20. The customer had about 300 SIP devices registered to Asterisk, and placed around 3,000 to 4,000 calls over the 12-hour period. I said "OOM killer" but I meant Docker's memory limit - it's set to a 3GB ceiling, and when the physically occupied memory reaches 3GB, Docker will kill the container and restart it. 16GB of memory is available on the Docker host. Typically Asterisk, even with 300 extensions registered, doesn't need more than like 800MB of memory in my experience. Even less, actually.
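If {{malloc trim}} does reclaim a meaningful amount, it can be automated while the investigation continues; for example, a hypothetical cron entry like this (the binary path and cron file name are assumptions):
{code}
# /etc/cron.d/asterisk-malloc-trim - run "malloc trim" at the top of every hour
0 * * * * root /usr/sbin/asterisk -rx "malloc trim"
{code}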
Looking at my git commits around the time I started noticing the issue, that was also when I refactored our phone-facing proxy servers. This refactor included the unnecessary record_route() in SUBSCRIBEs and NOTIFYs. Since no one else is reporting this in other Asterisk issues, I'm beginning to think I've caused this.

By: Luke Escude (lukeescude) 2021-11-05 10:51:18.169-0500

I do find that taskprocessor depth gets very high sometimes for our phone-facing proxy servers. Example:
{code}
pjsip/outreg/raleigh-proxy-00000069    42662    0    416    450    500
{code}
This is on a customer system that doesn't actually have any phones - Asterisk is simply registering to that proxy, but nothing's coming from it at all. It could be unrelated; maybe since it's TCP and a disconnect happens between Asterisk and the proxy, there can be a packet buildup in the buffer or something.

By: George Joseph (gjoseph) 2021-11-05 11:09:40.706-0500

Ooooohhhhhh, TCP. :) How many open sockets between the two? How's the DNS response time? Can you capture a few minutes of packets on the interface? From the CLI...
{code}
CLI> pjproject set log level 3
CLI> pjsip dump endpt details
{code}

By: Luke Escude (lukeescude) 2021-11-05 11:15:09.071-0500

Only 1 socket between Asterisk and the proxy (created by Asterisk's REGISTER to the proxy). Then phone dialogs come in through it - so 100 devices could easily register through that 1 socket. No DNS is used; we got rid of it forever ago in favor of raw IPs. Really cut down on basically all possible latency besides basic physics.
This is on my dev box, with only like 8 extensions on it:
{code}
pjsip dump endpt details
sip_endpoint.c Dumping endpoint 0x3521148:
cachpool Dumping caching pool:
cachpool Capacity=72192, max_capacity=1048576, used_cnt=183
cachpool Dumping all active pools:
cachpool pept0x35210a0: 26384464 of 26388096 (99%) used
cachpool tpmgr: 4824 of 5440 (88%) used
cachpool SIP: 240 of 1024 (23%) used
cachpool tsxlayer: 524600 of 525312 (99%) used
cachpool ua0x3500280: 529864 of 530432 (99%) used
cachpool tcptp: 1400 of 1536 (91%) used
cachpool glck0x39372c0: 584 of 1024 (57%) used
cachpool udp0x39388d0: 1320 of 1536 (85%) used
cachpool glck0x3938e10: 584 of 1024 (57%) used
cachpool rtd0x3939350: 66600 of 72096 (92%) used
cachpool resolver.c: 11696 of 16096 (72%) used
cachpool glck0x7fb4c4005860: 584 of 1024 (57%) used
cachpool evsub: 3696 of 5632 (65%) used
cachpool regc0x7fb4c4005b00: 5128 of 6144 (83%) used
cachpool regc0x7fb4c40036c0: 5128 of 6144 (83%) used
cachpool regc0x7fb4c4004de0: 4232 of 5120 (82%) used
cachpool regc0x7fb4c400b070: 4232 of 5120 (82%) used
cachpool auth_cli0x7fb4c4045a70: 192 of 1024 (18%) used
cachpool auth_cli0x7fb4c4046fb0: 192 of 1024 (18%) used
cachpool auth_cli0x7fb4c41e9c70: 184 of 1024 (17%) used
cachpool auth_cli0x7fb4c40f7c90: 184 of 1024 (17%) used
cachpool tcp: 69072 of 69120 (99%) used
cachpool glck0x7fb4c4223e90: 584 of 1024 (57%) used
cachpool rtd0x7fb4f018e730: 168 of 4096 (4%) used
cachpool tcp: 68720 of 69120 (99%) used
cachpool glck0x7fb52c09d620: 584 of 1024 (57%) used
cachpool rtd0x7fb4f0023a80: 168 of 4096 (4%) used
cachpool dlg0x7fb4c8000f90: 121928 of 131584 (92%) used
cachpool glck0x495b420: 584 of 1024 (57%) used
cachpool glck0x7fb4c4042650: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc1d8bb0: 121928 of 131584 (92%) used
cachpool glck0x7fb560014e80: 584 of 1024 (57%) used
cachpool glck0x7fb4c439b200: 584 of 1024 (57%) used
cachpool dlg0x7fb4c800ecd0: 121928 of 131584 (92%) used
cachpool glck0x7fb4c45d6ec0: 584 of 1024 (57%) used
cachpool glck0x7fb56000d2c0: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc3c0140: 121904 of 131584 (92%) used
cachpool glck0x7fb4c43bc1c0: 584 of 1024 (57%) used
cachpool glck0x7fb4c422b6a0: 584 of 1024 (57%) used
cachpool dlg0x7fb5540242e0: 121888 of 131584 (92%) used
cachpool glck0x7fb4e4004630: 584 of 1024 (57%) used
cachpool glck0x7fb52c09ecf0: 584 of 1024 (57%) used
cachpool dlg0x7fb4c416ba20: 121904 of 131584 (92%) used
cachpool glck0x7fb4c44eabe0: 584 of 1024 (57%) used
cachpool glck0x7fb4c467b300: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc0ceff0: 121952 of 131584 (92%) used
cachpool glck0x7fb4c427ac70: 584 of 1024 (57%) used
cachpool glck0x7fb4c44e6130: 584 of 1024 (57%) used
cachpool dlg0x7fb564402a60: 121952 of 131584 (92%) used
cachpool glck0x7fb4c422aa00: 584 of 1024 (57%) used
cachpool glck0x7fb54c069dd0: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc3135e0: 121936 of 131584 (92%) used
cachpool glck0x3b6ce70: 584 of 1024 (57%) used
cachpool glck0x7fb554000ff0: 584 of 1024 (57%) used
cachpool dlg0x7fb54c065920: 121952 of 131584 (92%) used
cachpool glck0x7fb5540343b0: 584 of 1024 (57%) used
cachpool glck0x7fb4c4116840: 584 of 1024 (57%) used
cachpool dlg0x7fb5540bf870: 121952 of 131584 (92%) used
cachpool glck0x7fb52c25af50: 584 of 1024 (57%) used
cachpool glck0x7fb4c419c330: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc057cc0: 121952 of 131584 (92%) used
cachpool glck0x7fb4c40fb410: 584 of 1024 (57%) used
cachpool glck0x7fb4c463a370: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0383b30: 121952 of 131584 (92%) used
cachpool glck0x7fb560007870: 584 of 1024 (57%) used
cachpool glck0x7fb4c40cb010: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc38c4b0: 121936 of 131584 (92%) used
cachpool glck0x7fb4dc3d7dd0: 584 of 1024 (57%) used
cachpool glck0x7fb4c4032c40: 584 of 1024 (57%) used
cachpool dlg0x7fb4f037d5c0: 121952 of 131584 (92%) used
cachpool glck0x7fb4c4506a20: 584 of 1024 (57%) used
cachpool glck0x7fb4c44ef680: 584 of 1024 (57%) used
cachpool dlg0x7fb4f048a410: 121952 of 131584 (92%) used
cachpool glck0x7fb4c42b7c20: 584 of 1024 (57%) used
cachpool glck0x7fb554078970: 584 of 1024 (57%) used
cachpool dlg0x7fb540058b80: 121952 of 131584 (92%) used
cachpool glck0x7fb4c44e5550: 584 of 1024 (57%) used
cachpool glck0x7fb4c4453bd0: 584 of 1024 (57%) used
cachpool dlg0x7fb52c0f2a90: 121952 of 131584 (92%) used
cachpool glck0x3fa6e40: 584 of 1024 (57%) used
cachpool glck0x7fb4c4254f30: 584 of 1024 (57%) used
cachpool dlg0x49c6c00: 121952 of 131584 (92%) used
cachpool glck0x7fb4c8004370: 584 of 1024 (57%) used
cachpool glck0x7fb4c417ebe0: 584 of 1024 (57%) used
cachpool dlg0x7fb5640764f0: 121952 of 131584 (92%) used
cachpool glck0x7fb4c43eac90: 584 of 1024 (57%) used
cachpool glck0x3face00: 584 of 1024 (57%) used
cachpool dlg0x7fb554031690: 121952 of 131584 (92%) used
cachpool glck0x7fb4c4348ab0: 584 of 1024 (57%) used
cachpool glck0x4938690: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0295550: 121936 of 131584 (92%) used
cachpool glck0x7fb4d400b460: 584 of 1024 (57%) used
cachpool glck0x7fb4f03cf540: 584 of 1024 (57%) used
cachpool dlg0x7fb554015460: 121952 of 131584 (92%) used
cachpool glck0x4a67690: 584 of 1024 (57%) used
cachpool glck0x7fb4f0381560: 584 of 1024 (57%) used
cachpool dlg0x7fb560006fd0: 121952 of 131584 (92%) used
cachpool glck0x7fb4c443f120: 584 of 1024 (57%) used
cachpool glck0x7fb4c4051760: 584 of 1024 (57%) used
cachpool dlg0x7fb4c40e4990: 121952 of 131584 (92%) used
cachpool glck0x7fb5643f27f0: 584 of 1024 (57%) used
cachpool glck0x7fb4c419e850: 584 of 1024 (57%) used
cachpool dlg0x7fb540084470: 208808 of 225792 (92%) used
cachpool glck0x7fb4c416aa60: 584 of 1024 (57%) used
cachpool glck0x7fb4c4344320: 584 of 1024 (57%) used
cachpool dlg0x7fb4c40d5660: 208808 of 225792 (92%) used
cachpool glck0x7fb4c4173e30: 584 of 1024 (57%) used
cachpool glck0x7fb4c474e1b0: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc00b510: 208808 of 225792 (92%) used
cachpool glck0x7fb5643dc040: 584 of 1024 (57%) used
cachpool glck0x7fb5641ca9c0: 584 of 1024 (57%) used
cachpool dlg0x7fb54c26d6a0: 121880 of 131584 (92%) used
cachpool glck0x495e9d0: 584 of 1024 (57%) used
cachpool glck0x7fb4c4376350: 584 of 1024 (57%) used
cachpool dlg0x7fb56425b920: 121880 of 131584 (92%) used
cachpool glck0x495fb80: 584 of 1024 (57%) used
cachpool glck0x7fb4c41721d0: 584 of 1024 (57%) used
cachpool dlg0x7fb4c42667f0: 121880 of 131584 (92%) used
cachpool glck0x7fb4c44a0c30: 584 of 1024 (57%) used
cachpool glck0x7fb4dc095010: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0270820: 121880 of 131584 (92%) used
cachpool glck0x7fb4c4200470: 584 of 1024 (57%) used
cachpool glck0x7fb4c414e790: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc0c61d0: 208648 of 225792 (92%) used
cachpool glck0x7fb4c4442630: 584 of 1024 (57%) used
cachpool glck0x7fb4c4016d90: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0287450: 208648 of 225792 (92%) used
cachpool glck0x7fb4c424d710: 584 of 1024 (57%) used
cachpool glck0x7fb4c4375ea0: 584 of 1024 (57%) used
cachpool dlg0x7fb4c4163fa0: 208648 of 225792 (92%) used
cachpool glck0x7fb4c40e5520: 584 of 1024 (57%) used
cachpool glck0x3fb2560: 584 of 1024 (57%) used
cachpool dlg0x7fb54c0b8e50: 208648 of 225792 (92%) used
cachpool glck0x7fb4c4413280: 584 of 1024 (57%) used
cachpool glck0x7fb4dc0caa40: 584 of 1024 (57%) used
cachpool dlg0x7fb4c4393520: 216840 of 235008 (92%) used
cachpool glck0x7fb4c402dd20: 584 of 1024 (57%) used
cachpool glck0x7fb4c40311e0: 584 of 1024 (57%) used
cachpool dlg0x7fb4c4396570: 217776 of 236032 (92%) used
cachpool glck0x7fb4c4194b30: 584 of 1024 (57%) used
cachpool glck0x7fb5643db860: 584 of 1024 (57%) used
cachpool dlg0x7fb554179100: 77176 of 83456 (92%) used
cachpool glck0x7fb4f037d320: 584 of 1024 (57%) used
cachpool glck0x7fb4c44e8c30: 584 of 1024 (57%) used
cachpool dlg0x7fb4f043d900: 17688 of 18944 (93%) used
cachpool glck0x7fb4c40ce350: 584 of 1024 (57%) used
cachpool glck0x7fb4c4362970: 584 of 1024 (57%) used
cachpool dlg0x7fb54c1bf880: 16744 of 17920 (93%) used
cachpool glck0x7fb4f0191250: 584 of 1024 (57%) used
cachpool glck0x7fb4c481a200: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0450cd0: 17696 of 18944 (93%) used
cachpool glck0x7fb4f007dc00: 584 of 1024 (57%) used
cachpool glck0x7fb4c41b6910: 584 of 1024 (57%) used
cachpool dlg0x7fb5540b25e0: 17688 of 18944 (93%) used
cachpool glck0x7fb4c40360f0: 584 of 1024 (57%) used
cachpool glck0x7fb4c405ef70: 584 of 1024 (57%) used
cachpool dlg0x7fb56407c9b0: 17688 of 18944 (93%) used
cachpool glck0x7fb4dc04a120: 584 of 1024 (57%) used
cachpool glck0x7fb4c4345e20: 584 of 1024 (57%) used
cachpool dlg0x7fb4f0458830: 17688 of 18944 (93%) used
cachpool glck0x7fb554059a10: 584 of 1024 (57%) used
cachpool glck0x7fb4c421db10: 584 of 1024 (57%) used
cachpool dlg0x7fb4c45ad270: 17672 of 18944 (93%) used
cachpool glck0x7fb4f04afff0: 584 of 1024 (57%) used
cachpool glck0x7fb5640c7530: 584 of 1024 (57%) used
cachpool dlg0x7fb4f03ccdd0: 17696 of 18944 (93%) used
cachpool glck0x4a124e0: 584 of 1024 (57%) used
cachpool glck0x7fb4c44f1ba0: 584 of 1024 (57%) used
cachpool dlg0x7fb56402d570: 17680 of 18944 (93%) used
cachpool glck0x7fb564189d90: 584 of 1024 (57%) used
cachpool glck0x7fb4f054aa80: 584 of 1024 (57%) used
cachpool dlg0x7fb4c44f9290: 17736 of 18944 (93%) used
cachpool glck0x7fb4f054f370: 584 of 1024 (57%) used
cachpool glck0x7fb55406f930: 584 of 1024 (57%) used
cachpool dlg0x7fb5540bc140: 17688 of 18944 (93%) used
cachpool glck0x7fb5540baa30: 584 of 1024 (57%) used
cachpool glck0x7fb4c44fc730: 584 of 1024 (57%) used
cachpool dlg0x7fb4c43d6a90: 17696 of 18944 (93%) used
cachpool glck0x7fb4c449e700: 584 of 1024 (57%) used
cachpool glck0x7fb4dc09c100: 584 of 1024 (57%) used
cachpool dlg0x7fb4dc317e50: 17680 of 18944 (93%) used
cachpool glck0x7fb4dc3c2ff0: 584 of 1024 (57%) used
cachpool glck0x7fb4dc22e630: 584 of 1024 (57%) used
cachpool Total 33487848 of 34028096 (98 %) used!
sip_endpoint.c Endpoint pool capacity=26388096, used_size=26384464
resolver.c Dumping resolver state:
resolver.c  Name servers:
resolver.c   NS 0: 208.67.222.222:53 (state=Active until -75763s, rtt=10000 ms)
resolver.c   NS 1: 8.8.8.8:53 (state=Active until -75763s, rtt=10000 ms)
resolver.c   NS 2: 1.1.1.1:53 (state=Active until -75763s, rtt=10000 ms)
resolver.c  Nb. of cached responses: 0
resolver.c  Nb. of pending queries: 0 (0)
resolver.c  Nb. of pending query free nodes: 0
resolver.c  Nb. of timer entries: 56
resolver.c  Pool capacity: 16096, used size: 11696
sip_transport.c Dumping listeners:
sip_transport.c  tcptp:5060 TCP:172.17.0.3:5060
sip_transport.c Dumping transports:
sip_transport.c  udp0x39388d0 x-ast-txp:transport-udp(refcnt=3)
sip_transport.c  tcpc0x7fb4c4382ca8 TCP to 207.246.88.196:4242 listener[tcptp:5060](refcnt=2)
sip_transport.c  tcpc0x7fb4dc146478 TCP to 104.238.165.75:4242 listener[tcptp:5060](refcnt=2)
timer.c Dumping timer heap:
timer.c  Cur size: 56 entries, max: 262142
timer.c  Entries:
timer.c   _id Id Elapsed Source
timer.c   ----------------------------------
timer.c   30 0 21.296 res_pjsip_outbound_registration.c:683
timer.c   82 0 103.700 res_pjsip_outbound_registration.c:683
timer.c   25 0 23.500 res_pjsip_outbound_registration.c:683
timer.c   53 0 104.764 res_pjsip_outbound_registration.c:683
timer.c   20 2 347.630 ../src/pjsip-simple/evsub.c:524
timer.c   87 2 343.934 ../src/pjsip-simple/evsub.c:524
timer.c   27 2 378.627 ../src/pjsip-simple/evsub.c:524
timer.c   2 2 362.433 ../src/pjsip-simple/evsub.c:524
timer.c   83 2 158.243 ../src/pjsip-simple/evsub.c:524
timer.c   42 2 370.628 ../src/pjsip-simple/evsub.c:524
timer.c   48 2 348.935 ../src/pjsip-simple/evsub.c:524
timer.c   84 2 374.130 ../src/pjsip-simple/evsub.c:524
timer.c   58 2 351.630 ../src/pjsip-simple/evsub.c:524
timer.c   38 2 380.135 ../src/pjsip-simple/evsub.c:524
timer.c   26 2 391.961 ../src/pjsip-simple/evsub.c:524
timer.c   45 2 379.634 ../src/pjsip-simple/evsub.c:524
timer.c   55 2 365.627 ../src/pjsip-simple/evsub.c:524
timer.c   79 2 365.129 ../src/pjsip-simple/evsub.c:524
timer.c   47 2 169.745 ../src/pjsip-simple/evsub.c:524
timer.c   65 2 376.626 ../src/pjsip-simple/evsub.c:524
timer.c   11 2 505.205 ../src/pjsip-simple/evsub.c:524
timer.c   77 2 375.629 ../src/pjsip-simple/evsub.c:524
timer.c   35 2 383.129 ../src/pjsip-simple/evsub.c:524
timer.c   1 2 376.131 ../src/pjsip-simple/evsub.c:524
timer.c   37 2 381.130 ../src/pjsip-simple/evsub.c:524
timer.c   54 2 360.628 ../src/pjsip-simple/evsub.c:524
timer.c   40 2 377.129 ../src/pjsip-simple/evsub.c:524
timer.c   46 2 561.104 ../src/pjsip-simple/evsub.c:524
timer.c   18 2 506.957 ../src/pjsip-simple/evsub.c:524
timer.c   22 2 405.544 ../src/pjsip-simple/evsub.c:524
timer.c   34 2 392.524 ../src/pjsip-simple/evsub.c:524
timer.c   70 2 391.465 ../src/pjsip-simple/evsub.c:524
timer.c   80 2 504.920 ../src/pjsip-simple/evsub.c:524
timer.c   51 2 505.455 ../src/pjsip-simple/evsub.c:524
timer.c   62 2 382.129 ../src/pjsip-simple/evsub.c:524
timer.c   72 2 407.355 ../src/pjsip-simple/evsub.c:524
timer.c   75 2 389.356 ../src/pjsip-simple/evsub.c:524
timer.c   9 2 388.963 ../src/pjsip-simple/evsub.c:524
timer.c   29 2 170.240 ../src/pjsip-simple/evsub.c:524
timer.c   12 2 473.356 ../src/pjsip-simple/evsub.c:524
timer.c   71 2 539.931 ../src/pjsip-simple/evsub.c:524
timer.c   5 2 505.721 ../src/pjsip-simple/evsub.c:524
timer.c   21 2 506.611 ../src/pjsip-simple/evsub.c:524
timer.c   68 2 504.058 ../src/pjsip-simple/evsub.c:524
timer.c   8 2 505.999 ../src/pjsip-simple/evsub.c:524
timer.c   69 2 387.630 ../src/pjsip-simple/evsub.c:524
timer.c   67 2 508.207 ../src/pjsip-simple/evsub.c:524
timer.c   4 2 503.815 ../src/pjsip-simple/evsub.c:524
timer.c   3 2 506.251 ../src/pjsip-simple/evsub.c:524
timer.c   49 2 382.628 ../src/pjsip-simple/evsub.c:524
timer.c   33 2 395.960 ../src/pjsip-simple/evsub.c:524
timer.c   36 2 504.649 ../src/pjsip-simple/evsub.c:524
timer.c   31 2 400.526 ../src/pjsip-simple/evsub.c:524
timer.c   52 2 504.335 ../src/pjsip-simple/evsub.c:524
timer.c   56 2 389.533 ../src/pjsip-simple/evsub.c:524
timer.c   6 2 567.253 ../src/pjsip-simple/evsub.c:524
{code}

By: Luke Escude (lukeescude) 2021-11-09 11:06:00.405-0600

I believe this issue has to do with our usage of TCP for endpoints. Out of 500+ customer systems, the only one that appears to be completely stable without a leak issue has all of its devices using UDP. I'm putting together a spreadsheet comparing various customers and their various dial plan features, and that's what pokes out the most so far. Call volume appears to have nothing to do with it, and neither do BLFs (SUBSCRIBE and NOTIFY). Is there additional direction you can give me for diagnosing TCP?

EDIT: Never mind, I came across 3 systems with 80+ extensions using less than 200MB of memory, all on TCP. This is absolutely stumping me, but there may be a correlation with the quantity of BLF subscriptions. I will add that to my spreadsheet.

By: Luke Escude (lukeescude) 2021-11-09 15:56:44.016-0600

Adding analysis of 18 customer systems. There is a strong correlation between compound hints (Asterisk hints joined together with &) and memory consumption. I changed the way our NOTIFYs and SUBSCRIBEs use Record-Route headers, and this did not fix anything.

By: George Joseph (gjoseph) 2021-11-10 06:34:53.648-0600

I can take a look this afternoon.

By: Luke Escude (lukeescude) 2021-11-10 10:15:03.984-0600

Thanks George! Here's what I am going to start doing today:
1. malloc trim hourly cronjob - malloc trim doesn't return all the leaked memory, but it does a good chunk.
2. Only create dial plan hints that will actually be used by customer devices (instead of creating hints for everything we monitor).
3. Move SUBSCRIBE/NOTIFY stuff to Kamailio, utilizing Asterisk's PUBLISH capabilities so Asterisk no longer has to handle subscriptions.
Besides enabling MALLOC_DEBUG in the compile process, I couldn't figure out how to do anything else for memory diagnostics, including asan/lsan - I got no output from it, and I have zero experience working with it. I know there's something called Valgrind, but I haven't looked into how to use it yet.

By: George Joseph (gjoseph) 2021-11-10 12:55:43.545-0600

bq. 1. malloc trim hourly cronjob - Malloc trim doesn't return all the leaked memory but it does a good chunk

Bingo! We ran across a similar issue last year where we were seeing pjproject memory pool fragmentation with TCP connections, and running malloc trim helped a lot. It was reported as part of a security issue that was fixed, but we were unsure whether it actually fixed the memory leak. Looking into it more.

By: George Joseph (gjoseph) 2021-11-10 13:18:12.988-0600

[~lukeescude] Would it be possible for you to get a coredump of a running asterisk instance that shows the issue? It would interrupt calls for a few seconds, but it would be very helpful. There don't have to be lots of calls in progress, so it could be done in a low-volume period as long as the memory usage was excessive. Grab the latest ast_coredumper script from https://github.com/asterisk/asterisk/raw/16/contrib/scripts/ast_coredumper and run it with {{sudo ./ast_coredumper --running --tarball-coredumps}}. The resulting tarball will be fairly large and contain sensitive info, so if you could upload it to Google Drive, Dropbox, etc. and email me the link, that would be best. When I grab the file, I'll let you know and you can delete it.

By: Luke Escude (lukeescude) 2021-11-10 13:20:38.546-0600

George, since this issue happens when the system is just sitting idle (regardless of call volume), would it be equally meaningful to do it after-hours when there is very little traffic? If it's more meaningful to grab it during peak call processing, that's fine - I'll just have to coordinate it with my customer.
By: Luke Escude (lukeescude) 2021-11-10 13:30:56.861-0600

Additionally, here's something interesting that may be unrelated (something I didn't notice back when we used UDP transport). When everything is running just fine (Asterisk making outbound registrations to our SIP proxies, and inbound registrations from devices coming into Asterisk like normal), and suddenly the connection with the SIP proxy gets broken/restarted, the following occurs:
{code}
WARNING[3360]: taskprocessor.c:1160 taskprocessor_push: The 'pjsip/outreg/dev-chicago-1-0000019f' task processor queue reached 500 scheduled tasks.
[Nov 10 19:27:47] DEBUG[6508]: res_pjsip_outbound_registration.c:860 reregister_immediately_cb: Outbound registration transport to server 'sip:IPADDR:4040' from client 'sip:999999@domain' shutdown
[Nov 10 19:27:47] DEBUG[6508]: res_pjsip_outbound_registration.c:608 handle_client_registration: Outbound REGISTER attempt 1 to 'sip:IPADDR:4040' with client 'sip:999999@domain'
{code}
Those last 2 messages repeat probably 200+ times. So, is it possible there's some build-up of events in PJSIP outbound registration?

By: George Joseph (gjoseph) 2021-11-11 08:34:25.843-0600

bq. George, since this issue happens when the system is just sitting idle (regardless of call volume) would it be equally meaningful to do it after-hours when there is very little traffic?

That's fine. As long as there has been a build-up and the asterisk instance is still consuming excessive memory, it doesn't matter whether there is currently a call in progress.

By: Luke Escude (lukeescude) 2021-11-11 15:01:46.273-0600

For some reason this keeps happening when I try to run it:
{code}
realpath: '/tmp/core-asterisk-running-2021-11-11T21-01-14Z': No such file or directory
{code}
It has permissions to access /tmp and whatnot...

By: Luke Escude (lukeescude) 2021-11-15 11:56:05.166-0600

It's possible I need to enable SYS_PTRACE on the container to allow gdb to function properly.
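For reference, granting ptrace inside a container usually looks something like this (flags shown for illustration; the image name is a placeholder, and the seccomp relaxation may or may not be needed depending on the Docker version):
{code}
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined my-asterisk-image
{code}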
I will need to perform some upgrades to our Docker clustering system to be able to do this, unfortunately. I'm hoping you guys have some previous tickets or knowledge that may help in the event that I can't get a core dump from a production customer.