
Summary: ASTERISK-29232: Memory Leak since 16.13.0
Reporter: Luke Escude (lukeescude)
Labels: fax
Date Opened: 2021-01-04 12:42:18.000-0600
Date Closed:
Priority: Major
Regression?:
Status: Open/New
Components: Core/PBX
Versions: 16.15.0, 16.20.0, 16.22.0
Frequency of Occurrence:
Related Issues:
Environment: CentOS 7 x64
Attachments:
(0) Analysis.xlsx
(1) Apex-Analysis.xlsx
(2) Container_Leak_Tracking.xlsx
(3) cw1-memchart.png
(4) Jan6-1401.csv
(5) nw1-memchart.png
(6) PW3-Memchart.png
Description: So we have around 100 instances of Asterisk 16.13.0 that have been running for over 2 months, normal load (small businesses with less than 30 users each), without issue.

We have another 350 instances of Asterisk 16.15.0 on which we've started seeing a very linear increase in memory consumption over time. Specifically, higher-load instances (150+ users) last only a few days before hitting our artificial 3GB ceiling and getting restarted by the OOM killer.

There are very few differences between our 16.13 and 16.15 deployments. All instances are set up as follows:

- CentOS 7 64-bit
- Voicemail over ODBC
- unixODBC 2.3.1
- MariaDB Connector (instead of the crappy mysql connector)
- CDR over MySQL
- SIP Trunks are registered every 2 minutes, qualified every 15 seconds.
- User devices register every 10 minutes, qualified every 15 seconds.
- User devices connect via TCP more often than UDP.
- I have NO pjsip threadpool configuration options defined. I think the default is 50 threads?
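
For reference, the registration and qualify intervals described above map to pjsip.conf options roughly like this. This is only a sketch; the section names are placeholders, and the exact option placement should be checked against your pjsip.conf:

{noformat}
[sbc-registration]
type=registration
expiration=120            ; SIP trunk re-registers every 2 minutes

[sbc-aor]
type=aor
qualify_frequency=15      ; OPTIONS qualify every 15 seconds

[user-device-aor]
type=aor
default_expiration=600    ; user devices register every 10 minutes
maximum_expiration=600
qualify_frequency=15
{noformat}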

Here is what I am about to test within the next week:
1. unixODBC updated to 2.3.9
2. Longer SIP Trunk Registration period - Maybe PJSIP is working too hard?
3. Longer qualify timeout - Maybe PJSIP is working too hard?

One of my first questions: Is it SAFE to compile asterisk with MALLOC_DEBUG and just leave it on permanently? I am scared to enable it, and suddenly have a bunch of users that are experiencing issues because I've enabled something that should only be enabled in Dev.

Sorry for the length of the post, trying to cover as much ground as possible.
Comments:
By: Asterisk Team (asteriskteam) 2021-01-04 12:42:20.561-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Luke Escude (lukeescude) 2021-01-04 12:52:54.890-0600

Added 3 charts showing memory consumption over the last 30 days.

CW1 running Asterisk 16.13 instances.

NW1 and PW3 running Asterisk 16.15 instances.

Doesn't really help with the issue, but it's definitely interesting to look at.

By: Joshua C. Colp (jcolp) 2021-01-06 03:53:18.833-0600

MALLOC_DEBUG does incur a performance penalty, but it should not be large and should be fine. It's likely the only way to isolate this issue since you're using a literal TON of functionality, including off-nominal rarely used stuff. The only other way would be to duplicate the environment and eliminate aspects bit by bit to further isolate.
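
For anyone following along, MALLOC_DEBUG is a compile-time menuselect option. A typical enable-and-sample sequence looks like the following (the source path is an assumption):

{noformat}
cd /usr/src/asterisk
menuselect/menuselect --enable MALLOC_DEBUG menuselect.makeopts
make && make install
# after restarting Asterisk, sample from the CLI:
asterisk -rx "memory show summary"
{noformat}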

By: Luke Escude (lukeescude) 2021-01-06 10:58:30.485-0600

Awesome, we'll enable that flag and distribute that image to only 1 of our worker nodes so we can contain the debugging to a limited set of customers.

I'll update this ticket when I get some data - It may be about a week before we can begin, since we're having to go to one of our datacenters and perform hardware upgrades this weekend.

By: Luke Escude (lukeescude) 2021-01-06 14:04:30.376-0600

Attaching a CSV of memory show summary at the restart of a fresh (small) instance of Asterisk 16.15

Apparently the memory leak exists on these small ones too, but it takes about a week to notice.

So, I will sample memory show summary in a week, with the same CSV syntax so it'll be easy to analyze/compare in Excel.

By: Luke Escude (lukeescude) 2021-01-06 15:04:09.966-0600

So for the last hour, the instance has had 0 calls, with around 14 or so devices registered, and here are the files with some significant allocation changes:

../src/pj/pool_policy_malloc.c +742592 bytes
datastore.c +40590 bytes
res_pjsip_pubsub.c +10720 bytes
sorcery.c +2191629 bytes


There are other ones as well, but those are the largest. The total change in 1 hour has been +3034055 bytes. (Roughly 3MB)

Sorcery appears to be the worst, so here's an output of SOME allocations for it:

{noformat}
11 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      544 bytes allocated by sorcery_object_type_alloc() line   714 of sorcery.c
      160 bytes allocated by sorcery_object_type_alloc() line   721 of sorcery.c
       72 bytes allocated by sorcery_object_type_alloc() line   728 of sorcery.c
      544 bytes allocated by sorcery_object_type_alloc() line   714 of sorcery.c
      304 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      312 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      416 bytes allocated by sorcery_object_type_alloc() line   703 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      216 bytes allocated by ast_sorcery_object_fields_register() line  1165 of sorcery.c
      304 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
        6 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
    66376 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      304 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
       13 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      992 bytes allocated by   __ast_sorcery_open() line   637 of sorcery.c
      416 bytes allocated by sorcery_object_type_alloc() line   703 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      544 bytes allocated by sorcery_object_type_alloc() line   714 of sorcery.c
        6 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
       72 bytes allocated by ast_sorcery_observer_add() line  2379 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
       52 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
       72 bytes allocated by ast_sorcery_observer_add() line  2379 of sorcery.c
       48 bytes allocated by sorcery_object_type_alloc() line   735 of sorcery.c
      544 bytes allocated by sorcery_object_type_alloc() line   714 of sorcery.c
       48 bytes allocated by sorcery_object_type_alloc() line   735 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      240 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
        6 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      160 bytes allocated by __ast_sorcery_wizard_register() line   451 of sorcery.c
       72 bytes allocated by ast_sorcery_observer_add() line  2379 of sorcery.c
       24 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
       14 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
       40 bytes allocated by sorcery_object_type_alloc() line   708 of sorcery.c
      216 bytes allocated by ast_sorcery_object_fields_register() line  1165 of sorcery.c
     1248 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
    66376 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
        4 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
        4 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      216 bytes allocated by ast_sorcery_object_fields_register() line  1165 of sorcery.c
      272 bytes allocated by ast_sorcery_lockable_alloc() line  1712 of sorcery.c
        4 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      304 bytes allocated by ast_sorcery_generic_alloc() line  1728 of sorcery.c
       40 bytes allocated by sorcery_object_type_alloc() line   708 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      160 bytes allocated by sorcery_object_type_alloc() line   721 of sorcery.c
       11 bytes allocated by    ast_sorcery_alloc() line  1757 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      416 bytes allocated by sorcery_object_type_alloc() line   703 of sorcery.c
      240 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      240 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      224 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
      160 bytes allocated by sorcery_object_type_alloc() line   721 of sorcery.c
      160 bytes allocated by sorcery_object_type_alloc() line   721 of sorcery.c
      240 bytes allocated by __ast_sorcery_object_field_register() line  1205 of sorcery.c
{noformat}

By: Joshua C. Colp (jcolp) 2021-01-06 15:09:49.328-0600

Many of those allocations are normal. It's only ast_sorcery_generic_alloc calls which actually allocate objects.

Your description doesn't detail how PJSIP configuration is stored and used. Is it in a database? A configuration file?

By: Luke Escude (lukeescude) 2021-01-06 15:10:16.916-0600

Here's res_pjsip_pubsub allocations too, in case Sorcery is a dead end (I didn't realize Sorcery was memory related, so it's possible it's there to assist with debugging).

{noformat}
11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      102 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      103 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      712 bytes allocated by ast_sip_register_publish_handler() line  2726 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      103 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      712 bytes allocated by ast_sip_register_publish_handler() line  2726 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       11 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
      100 bytes allocated by allocate_subscription() line  1239 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       37 bytes allocated by subscription_persistence_create() line   655 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
      176 bytes allocated by allocate_subscription_tree() line  1409 of res_pjsip_pubsub.c
       13 bytes allocated by subscription_persistence_create() line   653 of res_pjsip_pubsub.c
{noformat}

By: Luke Escude (lukeescude) 2021-01-06 15:14:39.409-0600

PJSIP is configured in a standard conf file. We're not using any Realtime functions (at least not from my understanding of what Realtime is).

Besides trunk/endpoint definitions, the only other stuff in pjsip.conf is the following:

{noformat}
[global]
send_contact_status_on_update_registration=yes
user_agent=Asterisk-XXXYYY

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0
local_net=XXXYYY/16
tos=104
cos=3

[transport-tcp]
type=transport
protocol=tcp
bind=0.0.0.0
local_net=XXXYYY/16
tos=104
cos=3
{noformat}

By: Richard Mudgett (rmudgett) 2021-01-06 15:28:16.332-0600

Please attach logs to the issue.  Pasting large logs into comments can result in your comment being deleted to keep JIRA responsive while viewing the issue.

The best information from MALLOC_DEBUG is the summary counts as an individual allocation list is basically just useless spam.  You can get a summary by the system and by specific module.  System summaries point to which modules are consuming memory.  Module summaries give you allocation counts by line.
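
Concretely, the relevant CLI commands (filenames here are just examples) are:

{noformat}
*CLI> memory show summary                 ; totals per source file, system-wide
*CLI> memory show summary sorcery.c       ; per-line summary counts within one file
*CLI> memory show allocations sorcery.c   ; individual allocations (very verbose)
{noformat}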

By: Luke Escude (lukeescude) 2021-01-06 15:28:56.640-0600

This looks interesting: https://github.com/asterisk/asterisk/commit/e1fd51cd2c183a3eb61803f52277d1a7d7a18c9c#diff-be39ef79e09077566ea704aaf08be25863d2a3a3fd5a7820cffcf37bc2df16f0

I did notice res_pjsip_outbound_registration.c is growing in size too.... Outbound registrations make up 100% of Asterisk's communication with the outside world for us - It registers to SBCs which have public IPs of course, which pass customer devices through those same sockets in the other direction. This enables us to run hundreds of Asterisk instances behind NAT without ever having to open ports.

I'll stop blowing this ticket up, and will perform another comparison tomorrow after 12 hours.

EDIT: Sorry Richard, will do!

By: Joshua C. Colp (jcolp) 2021-01-07 08:44:23.942-0600

We are reverting the referenced outbound registration change due to other changes involved. If you revert the change as well, does that solve the issue?

By: Luke Escude (lukeescude) 2021-01-07 10:23:09.960-0600

Ah, looking good so far. I'll report back in 12-24 hours but upon immediate boot-up, memory appears to be flat. Normally it starts rising pretty immediately.

Will update soon.

By: Benjamin Keith Ford (bford) 2021-01-07 10:26:05.341-0600

Putting in feedback for now.

By: Luke Escude (lukeescude) 2021-01-08 12:39:37.169-0600

Ended up booting up Asterisk 16.14 for a few production customers instead of reverting that 1 file (to at least narrow the scope to a version rather than a single file), but memory is still rising.

I'll know more concretely tomorrow whether downgrading fixes it or not, which will help rule out that particular change with PJSIP registrations. It's also possible these particular customers simply have high memory usage anyway because of their sheer volume.

By: Luke Escude (lukeescude) 2021-01-10 19:17:04.199-0600

Booting Asterisk 16.14 did not fix the memory leak.

Booted the same customer (around 180 user devices registered) with Asterisk 16.13 roughly 8 hours ago, on a Sunday with minimal call volume, and RAM usage is already over 500MB. So, I think the outbound registration code I had mentioned may not be the culprit in this particular case.

I may just need to bite the bullet and compile that same customer with MALLOC_DEBUG. Doing it in my low-volume dev environment didn't really give us much data to operate on... I'll formulate a plan with my team to accomplish this.

By: Luke Escude (lukeescude) 2021-01-10 22:00:55.888-0600

Attaching a memory analysis over a 2-hour period: the tracked output of "memory show summary" on an Asterisk 16.13 instance with 170+ customer devices registered over TCP, sampled every 20 minutes.

There is a "Diff 20min" column between samplings showing the differential between samples, and a "Diff Total" column after each sampling showing the differential from the INITIAL boot-up.

At the end of it all, the following files showed the most RAM consumption increase:

astobj2_hash.c
datastore.c
pbx.c
res_pjsip/pjsip_distributor.c
res_pjsip/pjsip_transport_events.c
res_pjsip_dialog_info_body_generator.c
res_pjsip_exten_state.c
res_pjsip_pubsub.c
sorcery.c

See attached - Thanks guys!

By: Benjamin Keith Ford (bford) 2021-01-11 09:56:48.782-0600

We may just have to open this one up as a general investigation to see where leaks might exist, narrowing the scope down to those files you've listed. A couple of questions:
# Do you have a list of standard operations being done for the customer, such as type of calls, ARI applications, fax, etc.? Might help narrow down the scope a bit further.
# It may not be necessary, but when someone looks at the issue, would it be possible to gain access to the customer's machine if we cannot reproduce the issue on our end?

By: Luke Escude (lukeescude) 2021-01-11 10:14:08.230-0600

1. Off the top of my head:
- Calls come in and are immediately Answer()'d.
- CallerID Number is run through a local PHP script via SHELL() to be sanitized, all symbols and non-digits stripped, with the number capped at 20 digits in length.
- CallerID Name is fetched via a local PHP script via SHELL() and is also returned in a sanitized format
- Number is run through a Blacklist check script (also PHP) via SHELL(), returns a 1 if the number is blocked by the global blacklist.
- Number is run through a spam score check script (also PHP) via SHELL(), returns a spam probability score 0-100, and follows rules the user sets up.
- Fax detection is optionally enabled per user; detected faxes are sent to the fax destination, otherwise the call Goto()s the standard destination (IVRs, ring groups, etc.).

Those are the steps that ALL inbound calls go through. Next, as far as functionality is concerned:

1. PHP scripts are used throughout the dial plan to get various states from the database, like queue agent login/logout, call forwarding settings, etc.
2. We are not using any ARI stuff whatsoever.
3. We are using a lot of AMI stuff - One-directional AMI. Basically every time extensions register, or calls are processed, we receive those AMI events and stick them into the database so users can monitor their phone systems in real-time. We don't write to the AMI socket much, except maybe for a click-to-call command here and there.
4. Voicemail is configured using the conf file, but uses ODBC. The voicemail database node is always less than 5ms away from wherever the Asterisk instance might be running. I intend to replace this with Amazon S3 in an upcoming update.
5. CDR is also stored via ODBC. This will also be replaced with in-dialplan PHP scripts instead of using Asterisk's built-in CDR engine in a future release.
6. Essentially everything is configured via the Conf files - No "realtime" stuff.

2. It's possible to give someone access, but it would have to be on-demand - Since these Asterisk instances can hop around different datacenters, there's no way for us to know where one might be in a week. (Example: It may be running in our Miami datacenter right now, but if Miami goes down for maintenance or whatever, it could end up in Chicago, or NYC, or Dallas, etc.)

I am going to keep trying to track it down. The instance in question is currently hovering around 796MB of RAM, but if it doesn't increase over the next 2 days, then there may not be an issue.

By: Luke Escude (lukeescude) 2021-01-11 10:25:40.028-0600

I wonder if there's a correlation to hints/BLFs. This particular customer has 863 dial plan hints. Some of these hints have up to 30 watchers.

Additionally, our hints were recently made complex in order to support our legacy and new-gen SBCs. By "complex" I mean every single user device has a compound hint, configured in the PJSIP/Device1&PJSIP/Device2&PJSIP/Device3 form. This isn't new, but the overall length of the hint definitions doubled. (If a user has 4 devices, the hint is actually an aggregation of 8 PJSIP/Device entries tied together with &.)
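For reference, a compound hint of the shape described above looks something like this in extensions.conf (the extension and device names here are made-up placeholders, not from the actual deployment):

```
; Hypothetical compound hint: one extension watching a user's four devices,
; doubled to eight PJSIP/Device entries to cover both legacy and new-gen SBC paths.
exten => 1001,hint,PJSIP/Dev1a&PJSIP/Dev1b&PJSIP/Dev2a&PJSIP/Dev2b&PJSIP/Dev3a&PJSIP/Dev3b&PJSIP/Dev4a&PJSIP/Dev4b
```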

By: Luke Escude (lukeescude) 2021-01-13 16:42:11.886-0600

Hey guys, I believe I've figured it out.

I manually checked about 30 instances of Asterisk, half of which are leaking memory and the other half are fine.

The half that were fine were running a dial plan we last put into production on August 28th. The half that were leaking were running any dial plan updates we pushed after September 11th.

The changes made on September 11th eliminated all of our CURL() calls to fetch CallerID, blacklisting, and eFax control in favor of local PHP scripts that run inside the instance. Example below:

Former code:
{code}
same => n,Set(spammer=${SHELL(curl -m 1 "http://URL/CNAM/fetch.php?phone_number=${CALLERID(num)}")})
{code}

New code:
{code}
same => n,Set(spammer=${SHELL(php /instance_scripts/dialplan_get_spam_score.php "${CALLERID(num)}")})
{code}

Do you guys have any ideas on why switching to a local PHP script might cause a memory leak inside Asterisk?

By: Luke Escude (lukeescude) 2021-01-14 15:41:10.732-0600

I've disproved my previous theories, and will go ahead and push MALLOC_DEBUG into production on a few customer systems for a full day.

Will update when I have some data from that.

By: Asterisk Team (asteriskteam) 2021-01-29 12:00:26.713-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Luke Escude (lukeescude) 2021-11-02 10:55:35.728-0500

I am about to resume my memory leak investigation, and plan to build a monitoring daemon that samples the memory summary every 30 minutes and sticks it into a database for analysis over time.

The output of "memory show summary" is human-readable but seems somewhat annoying to parse with code - Is there a better way (ARI or a manager action or something?) to query the memory statistics?
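For what it's worth, parsing the human-readable output is workable. A minimal sketch of per-file diffing between two samples; the line format in the regex below is an assumption (the exact column layout of "memory show summary" varies by version), so adjust it to match your build's output:

```python
# Sketch: diff two "memory show summary" samples per source file.
# The "<N> bytes in <M> allocations in file <name>" format is an ASSUMPTION;
# adjust the regex to whatever your Asterisk version actually prints.
import re

LINE_RE = re.compile(r"^\s*(\d+)\s+bytes\s+in\s+(\d+)\s+allocations?\s+in\s+file\s+(\S+)")

def parse_summary(text):
    """Return {filename: bytes} from one 'memory show summary' sample."""
    usage = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            usage[m.group(3)] = int(m.group(1))
    return usage

def diff_samples(earlier, later):
    """Per-file byte growth between two samples (files absent in one count as 0)."""
    files = set(earlier) | set(later)
    return {f: later.get(f, 0) - earlier.get(f, 0) for f in files}

# Hypothetical samples taken 30 minutes apart:
t0 = "1024 bytes in 4 allocations in file pbx.c\n2048 bytes in 8 allocations in file sorcery.c"
t1 = "1536 bytes in 6 allocations in file pbx.c\n2048 bytes in 8 allocations in file sorcery.c"
growth = diff_samples(parse_summary(t0), parse_summary(t1))
# growth["pbx.c"] is 512 (leaking candidate); growth["sorcery.c"] is 0 (flat)
```

Rows of `growth` over time are what would go into the database for the line-graph analysis.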

By: Asterisk Team (asteriskteam) 2021-11-02 10:55:36.216-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Sean Bright (seanbright) 2021-11-02 14:19:09.728-0500

There is no better way, no.

By: George Joseph (gjoseph) 2021-11-03 09:25:22.937-0500

The only thing I can suggest is to use the AMI "Command" command to issue the memory CLI commands.  It's the same format as the CLI output but it may be easier to capture.
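A minimal sketch of capturing that CLI output over the manager interface with the "Command" action (host, port, and credentials below are placeholders for whatever is in manager.conf; error handling and full response framing are omitted):

```python
# Sketch: capture "memory show summary" over AMI using the "Command" action.
# Host/port/user/secret are PLACEHOLDERS for your manager.conf settings.
import socket

def ami_action(name, **headers):
    """Serialize one AMI action: CRLF-delimited headers, blank-line terminated."""
    lines = ["Action: " + name]
    lines += ["%s: %s" % (k, v) for k, v in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode()

def fetch_memory_summary(host="127.0.0.1", port=5038, user="admin", secret="secret"):
    """Login, run the CLI command, log off, and return the raw response text."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(ami_action("Login", Username=user, Secret=secret))
        s.sendall(ami_action("Command", Command="memory show summary"))
        s.sendall(ami_action("Logoff"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")
```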

The other thing to try is the Leak Sanitizer.  It doesn't run until asterisk exits but it's very low runtime impact and I've found the output more useful.
https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer

There's an option in Compiler Flags to turn it on but you shouldn't use it and MALLOC DEBUG at the same time.


By: Luke Escude (lukeescude) 2021-11-03 10:11:10.179-0500

George, how is the output handled? So if Asterisk exits and Leak Sanitizer has something to say about it, will it output to standard syslog/asterisk logging?

Right now if asterisk exits, we have a utility that automatically restarts the entire container which means any files or logs will get wiped near-instantly. I can disable this, but would rather figure out how to pipe Leak Sanitizer output into a remote log server instead.

By: George Joseph (gjoseph) 2021-11-03 10:36:36.171-0500

There are a whole bunch of options you can pass at runtime with the LSAN_OPTIONS environment variable.  One is "log_path"...
{code}
# LSAN_OPTIONS=log_path=/tmp/lsan /usr/sbin/asterisk -fg -C /etc/asterisk/asterisk.conf
{code}
This will write the log to {{/tmp/lsan.<pid>}}
If you have a host directory mounted in the container, point the log there.

The Leak Sanitizer is part of the Address Sanitizer so some of the Address Sanitizer flags also apply to the Leak Sanitizer.
https://github.com/google/sanitizers/wiki/AddressSanitizerFlags
https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer

I forgot to mention that you need libasan and liblsan installed to use the sanitizers.



By: Luke Escude (lukeescude) 2021-11-03 11:02:55.727-0500

Interestingly, liblsan isn't available on CentOS 7 and locating it is surprisingly difficult.

Once I do locate it, I just enable the LEAK_SANITIZER flag when compiling Asterisk, then set the LSAN_OPTIONS log_path to a host-mounted folder. I think this is more effective than monitoring "memory show summary" every X minutes.

EDIT: Located the package in the centos-release-scl-rh repo

By: George Joseph (gjoseph) 2021-11-04 07:12:10.137-0500

[~lukeescude] I'm assigning this issue to you for further investigation.  When you get some results, add them to the issue and mention me and I'll take a look.  We can then decide where to go with it.


By: Luke Escude (lukeescude) 2021-11-04 15:37:29.867-0500

@George Joseph - For some reason, "mentions" have never worked for me on this Jira board. Nothing happens when I start typing @...

Anyways, please see the attached spreadsheet.

The issue definitely exists in ../src/pj/pool_policy_malloc.c; its allocations grow very linearly over time.

The customer's system was sampled every 2 hours for as long as it could last, and it lasted 12 hours before crashing again, so we got 6 memory samples and a nice Excel line graph.

By: Luke Escude (lukeescude) 2021-11-04 15:45:44.952-0500

Looking at the allocations, there is a MASSIVE amount of these:

{code}
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
512 bytes line 46 of ../src/pj/pool_policy_malloc.c
{code}


... and so on.

By: Luke Escude (lukeescude) 2021-11-04 18:14:51.055-0500

So I started looking into PJSIP and our own SIP proxy servers, and noticed that our proxy is adding a Record-Route header to every SUBSCRIBE and NOTIFY (regardless of whether it's Asterisk > Endpoint or Endpoint > Asterisk).

This record-route header appears to be unnecessary, and matches the Contact header (which we override in the proxy to create a B2BUA setup).

Is it at all possible that the presence of the unnecessary Record-Route could cause additional PJSIP buildup?

By: George Joseph (gjoseph) 2021-11-05 08:54:00.180-0500

The fact that pool_policy_malloc is the biggest allocator isn't surprising.  Every new transaction and dialog gets a pool allocated to it, and the pool _should_ be released when the dialog or transaction is terminated.  Normally, libc should release the pool back to the OS, but if the pages are very fragmented, it may not be able to do so efficiently.  When things start going south, try running {{malloc trim}} from the CLI and see what that does to the memory used by the process.  It tells libc to be more aggressive in its attempt to return memory to the OS.

Can we do a little recap?

The latest test was on 16.20 yes?
What was the usage profile over that period?  Number of calls, endpoints, inbound/outbound registrations, subscriptions, etc.
For the OOM killer...are you talking about the kernel's OOM killer or something you've implemented?
How are you setting the limit and on what parameter?  virtual or resident memory?
How much memory is actually installed on the machines?

I don't _think_ adding the RR header is an issue but I can run a few tests on Monday.




By: Luke Escude (lukeescude) 2021-11-05 10:22:14.638-0500

Correct, 16.20.

The customer had about 300 SIP devices registered to Asterisk, and placed around 3,000 to 4,000 calls over the 12-hour period.

I said "OOM killer" but I meant Docker's memory limit - It's set to a 3GB ceiling, and when the physically occupied memory reaches 3GB, Docker will kill the container and restart it.

16GB of memory is available on the Docker host. Typically Asterisk, even with 300 extensions registered, doesn't need more than like 800MB of memory in my experience. Even less actually.

Looking at my git commits around the time I started noticing the issue, that was also when I refactored our phone-facing proxy servers. This refactor included the unnecessary record_route() in SUBSCRIBEs and NOTIFYs. Since no one else is reporting this in other Asterisk issues, I'm beginning to think I've caused this.

By: Luke Escude (lukeescude) 2021-11-05 10:51:18.169-0500

I do find that taskprocessor depth gets very high sometimes for our phone-facing proxy servers. Example:

{code}
pjsip/outreg/raleigh-proxy-00000069                                     42662          0        416        450        500
{code}

This is on a customer system that doesn't actually have any phones - Asterisk is simply registering to that proxy, but nothing's coming from it at all. It could be unrelated; maybe, since it's TCP, a disconnect between Asterisk and the proxy leaves a packet buildup in the buffer or something.
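A tiny parser that flags taskprocessors running near their high-water mark could watch for this automatically. The column order (processed, in queue, max depth, low water, high water) is assumed from the sample line above:

```python
# Sketch: flag taskprocessors whose max queue depth nears the high-water mark.
# Column order (processed, in-queue, max-depth, low-water, high-water) is
# ASSUMED from the "core show taskprocessors" sample shown above.
def parse_taskprocessor(line):
    parts = line.split()
    processed, in_queue, max_depth, low, high = map(int, parts[1:6])
    return {"name": parts[0], "processed": processed, "in_queue": in_queue,
            "max_depth": max_depth, "low": low, "high": high}

def near_high_water(tp, ratio=0.8):
    """True when max depth reached >= ratio * high-water mark."""
    return tp["max_depth"] >= ratio * tp["high"]

line = "pjsip/outreg/raleigh-proxy-00000069  42662  0  416  450  500"
tp = parse_taskprocessor(line)
# max_depth 416 against a high-water mark of 500 (83%) would be flagged here
```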

By: George Joseph (gjoseph) 2021-11-05 11:09:40.706-0500

Ooooohhhhhh TCP. :)
How many open sockets between the two?
How's the DNS response time?
Can you capture a few minutes of packets on the interface?
From the CLI...
{code}
CLI> pjproject set log level 3
CLI> pjsip dump endpt details
{code}





By: Luke Escude (lukeescude) 2021-11-05 11:15:09.071-0500

Only 1 socket between Asterisk and the proxy (created by Asterisk's REGISTER to the proxy). Then phone dialogs come in through it - So 100 devices could easily register through that 1 socket.

No DNS is used - we got rid of it forever ago in favor of raw IPs, which really cut down on basically all possible latency besides basic physics.

This is on my dev box, with only like 8 extensions on it:

{code}
pjsip dump endpt details
       sip_endpoint.c Dumping endpoint 0x3521148:
             cachpool  Dumping caching pool:
             cachpool    Capacity=72192, max_capacity=1048576, used_cnt=183
             cachpool   Dumping all active pools:
             cachpool       pept0x35210a0: 26384464 of 26388096 (99%) used
             cachpool               tpmgr:     4824 of     5440 (88%) used
             cachpool                 SIP:      240 of     1024 (23%) used
             cachpool            tsxlayer:   524600 of   525312 (99%) used
             cachpool         ua0x3500280:   529864 of   530432 (99%) used
             cachpool               tcptp:     1400 of     1536 (91%) used
             cachpool       glck0x39372c0:      584 of     1024 (57%) used
             cachpool        udp0x39388d0:     1320 of     1536 (85%) used
             cachpool       glck0x3938e10:      584 of     1024 (57%) used
             cachpool        rtd0x3939350:    66600 of    72096 (92%) used
             cachpool          resolver.c:    11696 of    16096 (72%) used
             cachpool    glck0x7fb4c4005860:      584 of     1024 (57%) used
             cachpool               evsub:     3696 of     5632 (65%) used
             cachpool    regc0x7fb4c4005b00:     5128 of     6144 (83%) used
             cachpool    regc0x7fb4c40036c0:     5128 of     6144 (83%) used
             cachpool    regc0x7fb4c4004de0:     4232 of     5120 (82%) used
             cachpool    regc0x7fb4c400b070:     4232 of     5120 (82%) used
             cachpool    auth_cli0x7fb4c4045a70:      192 of     1024 (18%) used
             cachpool    auth_cli0x7fb4c4046fb0:      192 of     1024 (18%) used
             cachpool    auth_cli0x7fb4c41e9c70:      184 of     1024 (17%) used
             cachpool    auth_cli0x7fb4c40f7c90:      184 of     1024 (17%) used
             cachpool                 tcp:    69072 of    69120 (99%) used
             cachpool    glck0x7fb4c4223e90:      584 of     1024 (57%) used
             cachpool    rtd0x7fb4f018e730:      168 of     4096 (4%) used
             cachpool                 tcp:    68720 of    69120 (99%) used
             cachpool    glck0x7fb52c09d620:      584 of     1024 (57%) used
             cachpool    rtd0x7fb4f0023a80:      168 of     4096 (4%) used
             cachpool    dlg0x7fb4c8000f90:   121928 of   131584 (92%) used
             cachpool       glck0x495b420:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4042650:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc1d8bb0:   121928 of   131584 (92%) used
             cachpool    glck0x7fb560014e80:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c439b200:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c800ecd0:   121928 of   131584 (92%) used
             cachpool    glck0x7fb4c45d6ec0:      584 of     1024 (57%) used
             cachpool    glck0x7fb56000d2c0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc3c0140:   121904 of   131584 (92%) used
             cachpool    glck0x7fb4c43bc1c0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c422b6a0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb5540242e0:   121888 of   131584 (92%) used
             cachpool    glck0x7fb4e4004630:      584 of     1024 (57%) used
             cachpool    glck0x7fb52c09ecf0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c416ba20:   121904 of   131584 (92%) used
             cachpool    glck0x7fb4c44eabe0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c467b300:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc0ceff0:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c427ac70:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c44e6130:      584 of     1024 (57%) used
             cachpool    dlg0x7fb564402a60:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c422aa00:      584 of     1024 (57%) used
             cachpool    glck0x7fb54c069dd0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc3135e0:   121936 of   131584 (92%) used
             cachpool       glck0x3b6ce70:      584 of     1024 (57%) used
             cachpool    glck0x7fb554000ff0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb54c065920:   121952 of   131584 (92%) used
             cachpool    glck0x7fb5540343b0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4116840:      584 of     1024 (57%) used
             cachpool    dlg0x7fb5540bf870:   121952 of   131584 (92%) used
             cachpool    glck0x7fb52c25af50:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c419c330:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc057cc0:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c40fb410:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c463a370:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0383b30:   121952 of   131584 (92%) used
             cachpool    glck0x7fb560007870:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c40cb010:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc38c4b0:   121936 of   131584 (92%) used
             cachpool    glck0x7fb4dc3d7dd0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4032c40:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f037d5c0:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c4506a20:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c44ef680:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f048a410:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c42b7c20:      584 of     1024 (57%) used
             cachpool    glck0x7fb554078970:      584 of     1024 (57%) used
             cachpool    dlg0x7fb540058b80:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c44e5550:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4453bd0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb52c0f2a90:   121952 of   131584 (92%) used
             cachpool       glck0x3fa6e40:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4254f30:      584 of     1024 (57%) used
             cachpool        dlg0x49c6c00:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c8004370:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c417ebe0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb5640764f0:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c43eac90:      584 of     1024 (57%) used
             cachpool       glck0x3face00:      584 of     1024 (57%) used
             cachpool    dlg0x7fb554031690:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c4348ab0:      584 of     1024 (57%) used
             cachpool       glck0x4938690:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0295550:   121936 of   131584 (92%) used
             cachpool    glck0x7fb4d400b460:      584 of     1024 (57%) used
             cachpool    glck0x7fb4f03cf540:      584 of     1024 (57%) used
             cachpool    dlg0x7fb554015460:   121952 of   131584 (92%) used
             cachpool       glck0x4a67690:      584 of     1024 (57%) used
             cachpool    glck0x7fb4f0381560:      584 of     1024 (57%) used
             cachpool    dlg0x7fb560006fd0:   121952 of   131584 (92%) used
             cachpool    glck0x7fb4c443f120:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4051760:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c40e4990:   121952 of   131584 (92%) used
             cachpool    glck0x7fb5643f27f0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c419e850:      584 of     1024 (57%) used
             cachpool    dlg0x7fb540084470:   208808 of   225792 (92%) used
             cachpool    glck0x7fb4c416aa60:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4344320:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c40d5660:   208808 of   225792 (92%) used
             cachpool    glck0x7fb4c4173e30:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c474e1b0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc00b510:   208808 of   225792 (92%) used
             cachpool    glck0x7fb5643dc040:      584 of     1024 (57%) used
             cachpool    glck0x7fb5641ca9c0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb54c26d6a0:   121880 of   131584 (92%) used
             cachpool       glck0x495e9d0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4376350:      584 of     1024 (57%) used
             cachpool    dlg0x7fb56425b920:   121880 of   131584 (92%) used
             cachpool       glck0x495fb80:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c41721d0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c42667f0:   121880 of   131584 (92%) used
             cachpool    glck0x7fb4c44a0c30:      584 of     1024 (57%) used
             cachpool    glck0x7fb4dc095010:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0270820:   121880 of   131584 (92%) used
             cachpool    glck0x7fb4c4200470:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c414e790:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc0c61d0:   208648 of   225792 (92%) used
             cachpool    glck0x7fb4c4442630:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4016d90:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0287450:   208648 of   225792 (92%) used
             cachpool    glck0x7fb4c424d710:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4375ea0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c4163fa0:   208648 of   225792 (92%) used
             cachpool    glck0x7fb4c40e5520:      584 of     1024 (57%) used
             cachpool       glck0x3fb2560:      584 of     1024 (57%) used
             cachpool    dlg0x7fb54c0b8e50:   208648 of   225792 (92%) used
             cachpool    glck0x7fb4c4413280:      584 of     1024 (57%) used
             cachpool    glck0x7fb4dc0caa40:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c4393520:   216840 of   235008 (92%) used
             cachpool    glck0x7fb4c402dd20:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c40311e0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c4396570:   217776 of   236032 (92%) used
             cachpool    glck0x7fb4c4194b30:      584 of     1024 (57%) used
             cachpool    glck0x7fb5643db860:      584 of     1024 (57%) used
             cachpool    dlg0x7fb554179100:    77176 of    83456 (92%) used
             cachpool    glck0x7fb4f037d320:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c44e8c30:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f043d900:    17688 of    18944 (93%) used
             cachpool    glck0x7fb4c40ce350:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4362970:      584 of     1024 (57%) used
             cachpool    dlg0x7fb54c1bf880:    16744 of    17920 (93%) used
             cachpool    glck0x7fb4f0191250:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c481a200:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0450cd0:    17696 of    18944 (93%) used
             cachpool    glck0x7fb4f007dc00:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c41b6910:      584 of     1024 (57%) used
             cachpool    dlg0x7fb5540b25e0:    17688 of    18944 (93%) used
             cachpool    glck0x7fb4c40360f0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c405ef70:      584 of     1024 (57%) used
             cachpool    dlg0x7fb56407c9b0:    17688 of    18944 (93%) used
             cachpool    glck0x7fb4dc04a120:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c4345e20:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f0458830:    17688 of    18944 (93%) used
             cachpool    glck0x7fb554059a10:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c421db10:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c45ad270:    17672 of    18944 (93%) used
             cachpool    glck0x7fb4f04afff0:      584 of     1024 (57%) used
             cachpool    glck0x7fb5640c7530:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4f03ccdd0:    17696 of    18944 (93%) used
             cachpool       glck0x4a124e0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c44f1ba0:      584 of     1024 (57%) used
             cachpool    dlg0x7fb56402d570:    17680 of    18944 (93%) used
             cachpool    glck0x7fb564189d90:      584 of     1024 (57%) used
             cachpool    glck0x7fb4f054aa80:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c44f9290:    17736 of    18944 (93%) used
             cachpool    glck0x7fb4f054f370:      584 of     1024 (57%) used
             cachpool    glck0x7fb55406f930:      584 of     1024 (57%) used
             cachpool    dlg0x7fb5540bc140:    17688 of    18944 (93%) used
             cachpool    glck0x7fb5540baa30:      584 of     1024 (57%) used
             cachpool    glck0x7fb4c44fc730:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4c43d6a90:    17696 of    18944 (93%) used
             cachpool    glck0x7fb4c449e700:      584 of     1024 (57%) used
             cachpool    glck0x7fb4dc09c100:      584 of     1024 (57%) used
             cachpool    dlg0x7fb4dc317e50:    17680 of    18944 (93%) used
             cachpool    glck0x7fb4dc3c2ff0:      584 of     1024 (57%) used
             cachpool    glck0x7fb4dc22e630:      584 of     1024 (57%) used
             cachpool   Total  33487848 of  34028096 (98 %) used!
       sip_endpoint.c  Endpoint pool capacity=26388096, used_size=26384464
           resolver.c  Dumping resolver state:
           resolver.c   Name servers:
           resolver.c    NS 0: 208.67.222.222:53 (state=Active until -75763s, rtt=10000 ms)
           resolver.c    NS 1: 8.8.8.8:53 (state=Active until -75763s, rtt=10000 ms)
           resolver.c    NS 2: 1.1.1.1:53 (state=Active until -75763s, rtt=10000 ms)
           resolver.c   Nb. of cached responses: 0
           resolver.c   Nb. of pending queries: 0 (0)
           resolver.c   Nb. of pending query free nodes: 0
           resolver.c   Nb. of timer entries: 56
           resolver.c   Pool capacity: 16096, used size: 11696
      sip_transport.c  Dumping listeners:
      sip_transport.c   tcptp:5060 TCP:172.17.0.3:5060
      sip_transport.c  Dumping transports:
      sip_transport.c   udp0x39388d0 x-ast-txp:transport-udp(refcnt=3)
      sip_transport.c   tcpc0x7fb4c4382ca8 TCP to 207.246.88.196:4242 listener[tcptp:5060](refcnt=2)
      sip_transport.c   tcpc0x7fb4dc146478 TCP to 104.238.165.75:4242 listener[tcptp:5060](refcnt=2)
              timer.c Dumping timer heap:
              timer.c   Cur size: 56 entries, max: 262142
              timer.c   Entries:
              timer.c     _id Id Elapsed Source
              timer.c     ----------------------------------
              timer.c     30 0 21.296 res_pjsip_outbound_registration.c:683
              timer.c     82 0 103.700 res_pjsip_outbound_registration.c:683
              timer.c     25 0 23.500 res_pjsip_outbound_registration.c:683
              timer.c     53 0 104.764 res_pjsip_outbound_registration.c:683
              timer.c     20 2 347.630 ../src/pjsip-simple/evsub.c:524
              timer.c     87 2 343.934 ../src/pjsip-simple/evsub.c:524
              timer.c     27 2 378.627 ../src/pjsip-simple/evsub.c:524
              timer.c     2 2 362.433 ../src/pjsip-simple/evsub.c:524
              timer.c     83 2 158.243 ../src/pjsip-simple/evsub.c:524
              timer.c     42 2 370.628 ../src/pjsip-simple/evsub.c:524
              timer.c     48 2 348.935 ../src/pjsip-simple/evsub.c:524
              timer.c     84 2 374.130 ../src/pjsip-simple/evsub.c:524
              timer.c     58 2 351.630 ../src/pjsip-simple/evsub.c:524
              timer.c     38 2 380.135 ../src/pjsip-simple/evsub.c:524
              timer.c     26 2 391.961 ../src/pjsip-simple/evsub.c:524
              timer.c     45 2 379.634 ../src/pjsip-simple/evsub.c:524
              timer.c     55 2 365.627 ../src/pjsip-simple/evsub.c:524
              timer.c     79 2 365.129 ../src/pjsip-simple/evsub.c:524
              timer.c     47 2 169.745 ../src/pjsip-simple/evsub.c:524
              timer.c     65 2 376.626 ../src/pjsip-simple/evsub.c:524
              timer.c     11 2 505.205 ../src/pjsip-simple/evsub.c:524
              timer.c     77 2 375.629 ../src/pjsip-simple/evsub.c:524
              timer.c     35 2 383.129 ../src/pjsip-simple/evsub.c:524
              timer.c     1 2 376.131 ../src/pjsip-simple/evsub.c:524
              timer.c     37 2 381.130 ../src/pjsip-simple/evsub.c:524
              timer.c     54 2 360.628 ../src/pjsip-simple/evsub.c:524
              timer.c     40 2 377.129 ../src/pjsip-simple/evsub.c:524
              timer.c     46 2 561.104 ../src/pjsip-simple/evsub.c:524
              timer.c     18 2 506.957 ../src/pjsip-simple/evsub.c:524
              timer.c     22 2 405.544 ../src/pjsip-simple/evsub.c:524
              timer.c     34 2 392.524 ../src/pjsip-simple/evsub.c:524
              timer.c     70 2 391.465 ../src/pjsip-simple/evsub.c:524
              timer.c     80 2 504.920 ../src/pjsip-simple/evsub.c:524
              timer.c     51 2 505.455 ../src/pjsip-simple/evsub.c:524
              timer.c     62 2 382.129 ../src/pjsip-simple/evsub.c:524
              timer.c     72 2 407.355 ../src/pjsip-simple/evsub.c:524
              timer.c     75 2 389.356 ../src/pjsip-simple/evsub.c:524
              timer.c     9 2 388.963 ../src/pjsip-simple/evsub.c:524
              timer.c     29 2 170.240 ../src/pjsip-simple/evsub.c:524
              timer.c     12 2 473.356 ../src/pjsip-simple/evsub.c:524
              timer.c     71 2 539.931 ../src/pjsip-simple/evsub.c:524
              timer.c     5 2 505.721 ../src/pjsip-simple/evsub.c:524
              timer.c     21 2 506.611 ../src/pjsip-simple/evsub.c:524
              timer.c     68 2 504.058 ../src/pjsip-simple/evsub.c:524
              timer.c     8 2 505.999 ../src/pjsip-simple/evsub.c:524
              timer.c     69 2 387.630 ../src/pjsip-simple/evsub.c:524
              timer.c     67 2 508.207 ../src/pjsip-simple/evsub.c:524
              timer.c     4 2 503.815 ../src/pjsip-simple/evsub.c:524
              timer.c     3 2 506.251 ../src/pjsip-simple/evsub.c:524
              timer.c     49 2 382.628 ../src/pjsip-simple/evsub.c:524
              timer.c     33 2 395.960 ../src/pjsip-simple/evsub.c:524
              timer.c     36 2 504.649 ../src/pjsip-simple/evsub.c:524
              timer.c     31 2 400.526 ../src/pjsip-simple/evsub.c:524
              timer.c     52 2 504.335 ../src/pjsip-simple/evsub.c:524
              timer.c     56 2 389.533 ../src/pjsip-simple/evsub.c:524
              timer.c     6 2 567.253 ../src/pjsip-simple/evsub.c:524
{code}

By: Luke Escude (lukeescude) 2021-11-09 11:06:00.405-0600

I believe this issue has to do with our usage of TCP for endpoints.

Out of 500+ customer systems, the only one that appears to be completely stable without a leak issue has all of its devices on UDP. I'm putting together a spreadsheet comparing customers and their dialplan features, and so far that's what stands out the most.

Neither call volume nor BLF activity (SUBSCRIBE and NOTIFY) appears to correlate with the leak.

Is there additional direction you can give me for diagnosing TCP?

EDIT: Never mind - I came across 3 systems with 80+ extensions each using less than 200MB of memory, all on TCP. This absolutely stumps me, but there may be a correlation with the quantity of BLF subscriptions. I will add that to my spreadsheet.

By: Luke Escude (lukeescude) 2021-11-09 15:56:44.016-0600

Adding analysis of 18 customer systems.

There is a strong correlation between compound hints (Asterisk hints joined together with &) and memory consumption.
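For reference, a compound hint watches several device states under one extension, joined with & in the dialplan; the extension and endpoint names below are hypothetical, not taken from this system:

{code}
; extensions.conf - one hint aggregating the state of three PJSIP endpoints
exten => 6001,hint,PJSIP/6001&PJSIP/6002&PJSIP/6003
{code}

Each subscription to such a hint involves state tracking for every member device, so compound hints multiply the subscription bookkeeping relative to simple hints.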

I changed the way our NOTIFYs and SUBSCRIBEs use record-route headers, and this did not fix anything.

By: George Joseph (gjoseph) 2021-11-10 06:34:53.648-0600

I can take a look this afternoon.


By: Luke Escude (lukeescude) 2021-11-10 10:15:03.984-0600

Thanks George!

Here's what I am going to start doing today:
1. malloc_trim hourly cronjob - malloc_trim doesn't return all of the leaked memory, but it does return a good chunk
2. Only create dial plan hints that will actually be used by customer devices (instead of creating hints for everything we monitor)
3. Move SUBSCRIBE/NOTIFY stuff to Kamailio, utilizing Asterisk's PUBLISH capabilities so Asterisk no longer has to handle subscriptions.
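For item 1, here's a sketch of the cron approach, assuming gdb is installed and Asterisk runs as a single process named {{asterisk}} (the schedule and redirection are illustrative):

{code}
# root crontab entry: hourly, attach gdb in batch mode and ask glibc
# to return freed-but-retained heap pages to the kernel
0 * * * * gdb -p "$(pgrep -x asterisk)" --batch -ex 'call (int) malloc_trim(0)' >/dev/null 2>&1
{code}

The attach pauses the process only for the duration of the call, which is typically well under a second.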

Besides enabling MALLOC_DEBUG at compile time, I couldn't figure out any other memory diagnostics. I tried ASan/LSan but got no output from them, and I have zero experience working with those tools. I know there's also something called Valgrind, but I haven't looked into how to use it yet.

By: George Joseph (gjoseph) 2021-11-10 12:55:43.545-0600

bq. 1. malloc trim hourly cronjob - Malloc trim doesn't return all the leaked memory but it does a good chunk

Bingo!  We ran across a similar issue last year where we were seeing pjproject memory pool fragmentation with TCP connections, and running malloc_trim helped a lot. It was reported as part of a security issue that was fixed, but we were unsure whether the fix actually resolved the memory leak. Looking into it more.


By: George Joseph (gjoseph) 2021-11-10 13:18:12.988-0600

[~lukeescude] Would it be possible for you to get a coredump of a running Asterisk instance that shows the issue? It would interrupt calls for a few seconds, but it would be very helpful. There don't need to be many calls in progress, so it could be done during a low-volume period as long as memory usage is still excessive.

Grab the latest ast_coredumper script from https://github.com/asterisk/asterisk/raw/16/contrib/scripts/ast_coredumper
and run it with {{sudo ./ast_coredumper --running --tarball-coredumps}}.
The resulting tarball will be fairly large and contain sensitive info so if you could upload it to Google Drive, Dropbox, etc and email me the link, that would be best.
When I grab the file, I'll let you know and you can delete it.




By: Luke Escude (lukeescude) 2021-11-10 13:20:38.546-0600

George, since this issue happens when the system is just sitting idle (regardless of call volume) would it be equally meaningful to do it after-hours when there is very little traffic?

If it's more meaningful to grab it during peak call processing, that's fine I'll just have to coordinate it with my customer.

By: Luke Escude (lukeescude) 2021-11-10 13:30:56.861-0600

Additionally, here's something interesting that may be unrelated (something I didn't notice back when we used UDP transport).

When everything is running fine (Asterisk is making outbound registrations to our SIP proxies, and inbound registrations from devices are arriving normally) and the connection to the SIP proxy suddenly gets broken or restarted, the following occurs:

WARNING[3360]: taskprocessor.c:1160 taskprocessor_push: The 'pjsip/outreg/dev-chicago-1-0000019f' task processor queue reached 500 scheduled tasks.
[Nov 10 19:27:47] DEBUG[6508]: res_pjsip_outbound_registration.c:860 reregister_immediately_cb: Outbound registration transport to server 'sip:IPADDR:4040' from client 'sip:999999@domain' shutdown
[Nov 10 19:27:47] DEBUG[6508]: res_pjsip_outbound_registration.c:608 handle_client_registration: Outbound REGISTER attempt 1 to 'sip:IPADDR:4040' with client 'sip:999999@domain'

Those last 2 messages repeat probably 200+ times.

So, is it possible there's some build-up of events in PJSIP outbound registration?
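If the retry storm turns out to matter, outbound registration retry pacing is tunable in pjsip.conf. The section name and URIs below mirror the log messages above, but the values are only illustrative:

{code}
[dev-chicago-1]
type = registration
server_uri = sip:IPADDR:4040
client_uri = sip:999999@domain
retry_interval = 30   ; seconds to wait before retrying a failed REGISTER
max_retries = 10      ; give up after this many consecutive failures
{code}

A longer retry_interval spreads out the REGISTER attempts after a transport shutdown instead of letting them pile up on the task processor queue.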

By: George Joseph (gjoseph) 2021-11-11 08:34:25.843-0600

bq. George, since this issue happens when the system is just sitting idle (regardless of call volume) would it be equally meaningful to do it after-hours when there is very little traffic?

That's fine.  As long as there has been a build-up and the asterisk instance is still consuming excessive memory, it doesn't matter if there is currently a call in progress.


By: Luke Escude (lukeescude) 2021-11-11 15:01:46.273-0600

For some reason this keeps happening when I try to run it:

realpath: '/tmp/core-asterisk-running-2021-11-11T21-01-14Z': No such file or directory

It has permissions to access /tmp and whatnot....
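One quick thing to rule out (a sketch; the paths are standard Linux, nothing system-specific assumed): {{ast_coredumper --running}} relies on gdb's gcore, and if ptrace is restricted the dump file is never created, which would produce exactly this realpath error.

```shell
# Check whether ptrace attachment is restricted; gcore cannot attach
# to the asterisk process if it is blocked.
if [ -r /proc/sys/kernel/yama/ptrace_scope ]; then
    scope=$(cat /proc/sys/kernel/yama/ptrace_scope)
    echo "yama ptrace_scope: $scope (0 = unrestricted attach)"
else
    echo "yama LSM not present; ptrace not restricted by yama"
fi
```

A nonzero ptrace_scope (or a container without the ptrace capability) would silently break gcore.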

By: Luke Escude (lukeescude) 2021-11-15 11:56:05.166-0600

It's possible I need to enable SYS_PTRACE on the container to allow gdb to function properly. I will need to perform some upgrades to our Docker clustering system to be able to do this, unfortunately.
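If it helps, in a Compose-managed cluster the capability can be granted with a one-line addition (the service name here is hypothetical):

{code}
# docker-compose.yml fragment
services:
  asterisk:
    cap_add:
      - SYS_PTRACE   # lets gdb/ast_coredumper attach to processes inside the container
{code}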

I'm hoping you guys have some previous tickets or knowledge that may help in the event that I can't get a core dump from a production customer.