
Summary: ASTERISK-26836: res_pjsip: Memory corruption of endpoint
Reporter: Daniel Journo (journo)
Labels:
Date Opened: 2017-03-03 05:46:44.000-0600
Date Closed: 2017-04-05 10:00:34
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: 13.11.2, 13.14.0
Frequency of Occurrence: Frequent
Related Issues:
Environment: CentOS 6, PJSIP 2.4.5
Attachments:
(0) cli.txt
(1) pjsipconfig.txt
(2) pjsip-reload.txt
Description: This is a recurring issue on all my production servers. When it happens, Asterisk stops serving SIP clients properly.
It reads to me as a memory leak, as it's just one character incrementing.

The latest time, it occurred shortly (but not immediately) after I issued a 'pjsip reload', but it may have just been a coincidence. I've included the timing in the CLI log.

It seems to happen every 4 or 5 days, requiring a 'killall -9 asterisk' and 'database deltree registrar' to get working again. If I remember correctly, the first time it happened Asterisk would not respond to 'core stop now' or 'core restart now', so I had to use killall. I haven't tried 'core stop now' during subsequent occurrences.

This issue stops all endpoints from communicating with Asterisk, even ones that have already registered.

Comments:By: Asterisk Team (asteriskteam) 2017-03-03 05:46:46.028-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2017-03-03 08:42:06.569-0600

This appears to be memory corruption.

What is the environment? Are you strictly using .conf files? Are you using realtime at all? What's the configuration? What are the usage patterns (are there lots of calls going on at the same time)?

By: Daniel Journo (journo) 2017-03-03 08:53:49.453-0600

What is the environment?
CentOS 6.8

Are you strictly using .conf files?
For pjsip config, I'm using static files only.

Are you using realtime at all?
Yes, for queues, queue members, queue logs, voicemail and musiconhold.

What's the configuration?
Attached is my pjsip transports and an example of how my endpoints are all defined. They are identical apart from the username/password and mailbox.

What are the usage patterns (are there lots of calls going on at the same time)?
Average usage at the time. Peak usage is around 40 calls. When this last one occurred, there were about 10 calls.
I can't spot an obvious way to repeat the issue but it does repeat.


By: Joshua C. Colp (jcolp) 2017-03-03 08:55:21.963-0600

And how often/frequently are you doing reloads? Is it possible you were doing two at once?

By: Daniel Journo (journo) 2017-03-03 09:02:48.739-0600

pjsip reloads are done automatically if some of the pjsip config has been changed.
I can add some logging to my application so I can confirm what commands are being sent.
But looking at the CLI, I can't see a double reload being handled by Asterisk.
CLI output attached from just before the latest issue.

By: Daniel Journo (journo) 2017-03-04 13:45:11.497-0600

I've reviewed the application that issues the 'pjsip reload' and it isn't possible for it to be issuing multiple reloads.

However, I've just done a test using AMI to send the 'pjsip reload' command 5 times at once.
Asterisk was able to deal with it without any problems.

I then did it again, and at the same time submitted a 'pjsip reload' from the CLI, and got this output:
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
A module reload request is already in progress; please be patient
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet
   -- The previous reload command didn't finish yet

So the memory corruption doesn't appear to be related to multiple reloads occurring at the same time.
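
The concurrent-reload test described above could be reproduced with something like the following. This is only a rough sketch, not the script that was actually used: it assumes AMI is listening on the default port 5038, and the helper names, the username and the secret are hypothetical placeholders. Each worker opens its own AMI session, logs in, and sends 'pjsip reload' through the Command action.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

CRLF = "\r\n"


def ami_message(**headers):
    """Serialize headers as 'Key: value' lines followed by the blank line that ends an AMI message."""
    return "".join(f"{key}: {value}{CRLF}" for key, value in headers.items()) + CRLF


def login_message(username, secret):
    """Build the AMI Login action (credentials are placeholders, not from the report)."""
    return ami_message(Action="Login", Username=username, Secret=secret)


def reload_message(action_id):
    """Build an AMI Command action that runs the CLI command 'pjsip reload'."""
    return ami_message(Action="Command", Command="pjsip reload", ActionID=action_id)


def send_reload(host, port, username, secret, action_id):
    """Open one AMI session, log in, send a single reload and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.recv(1024)  # AMI greeting banner
        sock.sendall(login_message(username, secret).encode())
        sock.recv(4096)  # login response (not parsed in this sketch)
        sock.sendall(reload_message(action_id).encode())
        return sock.recv(4096).decode(errors="replace")


def fire_concurrent_reloads(count=5, **conn):
    """Send `count` reloads at roughly the same time, one AMI session per thread."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [
            pool.submit(send_reload, action_id=f"reload-{i}", **conn)
            for i in range(count)
        ]
        return [future.result() for future in futures]
```

Calling fire_concurrent_reloads(5, host="127.0.0.1", port=5038, username="admin", secret="changeme") would fire five reloads in parallel; the replies are returned unparsed, so the Asterisk CLI or log still has to be checked for the "already in progress" messages.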

Also note that none of my endpoints have 'transport=asterisk' so I'm not sure where that is coming from.
I have not set the transport on the endpoints as they can use any of the available transports.

By: Richard Mudgett (rmudgett) 2017-03-16 12:31:13.012-0500

Memory corruption issues are very difficult to find and fix without a starting place of what is being misused.  I'm a little surprised that you aren't getting a crash because of the memory corruption.

Please enable MALLOC_DEBUG with DONT_OPTIMIZE.  These two options do not affect performance too much so you can enable them on production systems while looking for the memory corruption.

Are you using bundled pjproject?

Some useful debugging links:
https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines
https://wiki.asterisk.org/wiki/display/AST/Debugging
https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag


By: Daniel Journo (journo) 2017-03-16 12:52:37.815-0500

Since I've upgraded to the latest code and bundled pjsip 2.6, the issue hasn't repeated.
It normally happens every 4 to 5 days, but it hasn't happened since I updated.

I'll update this issue in 14 days.

By: Daniel Journo (journo) 2017-04-05 04:42:07.512-0500

You can close this, as it hasn't recurred since I upgraded to bundled pjproject v2.6.