Summary: ASTERISK-20227: Segfault (possible memory corruption?)
Reporter: Jared Smith (jsmith)
Labels:
Date Opened: 2012-08-13 17:22:35
Date Closed: 2012-12-19 18:36:09.000-0600
Priority: Major
Regression?
Status: Closed/Complete
Components: General
Versions: 1.8.15.0
Frequency of Occurrence: Occasional
Related Issues: is related to ASTERISK-20226 Segfault in chan_sip while performing connected line update
Environment: Linux
Attachments:
( 0) another_backtrace.20120820
( 1) asterisk_backtrace_09032012.txt
( 2) asterisk_configs.tgz
( 3) backtrace_20227.txt
( 4) backtrace.3975
( 5) malloc_backtrace.txt
( 6) malloc-enhancements-1.8.15.0.diff
( 7) test_configs.tgz
Description: Another segfault I'm seeing (not the same one as ASTERISK-20226). Opening this bug at the request of mjordan.

[Edit by Rusty Newton - removed older backtrace *from description* and attached as backtrace_20227.txt]
Comments:

By: Jared Smith (jsmith) 2012-08-13 17:47:24.059-0500

This is an updated backtrace, with more debugging symbols for glibc installed.

By: Rusty Newton (rnewton) 2012-08-16 18:24:36.587-0500

Jared, I see a lot of values optimized out. Can you get another backtrace with Asterisk compiled with DONT_OPTIMIZE and BETTER_BACKTRACES?

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Jared Smith (jsmith) 2012-08-20 16:47:02.803-0500

This is another backtrace with a very similar crash

By: Matt Jordan (mjordan) 2012-08-21 11:15:23.361-0500

Please make sure you hit the Send Back button once you've provided feedback - otherwise it may drift off the Triage radar.

Are they using chan_agent by any chance?

By: Rusty Newton (rnewton) 2012-09-19 20:32:08.877-0500

Jared, were you able to obtain a backtrace after recompiling with the options mentioned above?



By: Jared Smith (jsmith) 2012-09-20 02:34:09.786-0500

Sure, I have a bunch of them.  I've attached another example here, this one should have both DONT_OPTIMIZE and BETTER_BACKTRACES turned on.

For the record, we're in the process of swapping out the hardware as well, just to verify that it's not a hardware issue.

By: Matt Jordan (mjordan) 2012-10-17 08:35:17.371-0500

Hey Jared -

I'm pretty sure this is a memory corruption of some sort.  Can you provide the .conf files for the system(s) affected?  

Ideally if this can be reproduced in a lab environment, a valgrind trace would also be hugely useful.

By: Matt Jordan (mjordan) 2012-10-17 08:45:31.031-0500

(Also: please remember to hit "Send Back" when you've provided feedback, otherwise it doesn't always show up in the Triage filters)

By: Jared Smith (jsmith) 2012-10-17 15:27:17.629-0500

I've attached my configs.  It should be a fairly ordinary Asterisk install (using FreePBX as the front-end to generate the configs).  

The only thing unusual about this server is the high number of queues in use on this system.  Last I checked, there were somewhere around 180 queues in use at any given time on this system.

By: Rusty Newton (rnewton) 2012-10-19 10:20:57.042-0500

Thanks for the additional info. Will you be able to get valgrind output?

By: Jared Smith (jsmith) 2012-10-19 10:35:59.330-0500

No, I won't be able to get valgrind output.  This is a production system handling tens of thousands of calls per day.  Sorry :-(

By: Deniz (deniz) 2012-11-06 06:10:34.338-0600

I'm seeing the same segfault running 1.8.17 in production.

By: Richard Mudgett (rmudgett) 2012-11-06 16:56:17.365-0600

The reviewboard patch for MALLOC_DEBUG enhancements should help locate the possible memory corruption.
https://reviewboard.asterisk.org/r/2182/

MALLOC_DEBUG logs its output to stderr and to the /var/log/asterisk/mmlog file by default.

By: Matt Jordan (mjordan) 2012-11-08 11:12:43.533-0600

Any luck on running the production system with Richard's patches?

By: Matt Jordan (mjordan) 2012-11-08 15:17:33.678-0600

Attaching a patch (malloc-enhancements-1.8.15.0.diff) that provides Richard's MALLOC_DEBUG enhancements for this version.

There are two ways to use this patch:
1) Enable MALLOC_DEBUG.  This will create an mmlog file that records information about a memory corruption if one happens.
2) Along with MALLOC_DEBUG, enable DO_CRASH.  This will cause Asterisk to immediately crash when a memory corruption is detected, as opposed to waiting for something to access the now corrupted memory.  If you can tolerate what may potentially be a 'quicker' crash, this would help as well.


By: Jared Smith (jsmith) 2012-11-08 17:52:40.876-0600

We tested the patch in the lab today, and were easily able to crash the lab system.  I think the patch does more harm than help.  

I'll post the backtrace from the crash on the lab system here shortly.

By: Jared Smith (jsmith) 2012-11-08 18:00:25.389-0600

This is the backtrace from a crash on the very latest 1.8 from SVN (revision 376029) on my lab system.

By: Matt Jordan (mjordan) 2012-11-08 20:59:59.073-0600

Nothing logged to mmlog?

By: Jared Smith (jsmith) 2012-11-09 08:33:13.286-0600

Nothing interesting logged to mmlog -- it looks like this:

1352409852 - New session
1352409921 - New session
1352409989 - New session
1352411167 - New session
1352417026 - New session
1352419089 - New session
1352420047 - New session
1352428444 - New session
1352428951 - New session
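
For reading an excerpt like the one above, here is a minimal Python sketch that converts the leading numbers to UTC times and shows the gap between consecutive entries. It assumes the leading number is a Unix epoch timestamp in seconds and that every line has the shape `<epoch> - <message>`; the helper names `parse_mmlog` and `session_gaps` are made up for this illustration and are not part of Asterisk.

```python
from datetime import datetime, timezone

# Lines copied from the mmlog excerpt above.
MMLOG_LINES = [
    "1352409852 - New session",
    "1352409921 - New session",
    "1352409989 - New session",
    "1352411167 - New session",
    "1352417026 - New session",
    "1352419089 - New session",
    "1352420047 - New session",
    "1352428444 - New session",
    "1352428951 - New session",
]


def parse_mmlog(lines):
    """Return (utc_datetime, message) pairs for lines shaped '<epoch> - <message>'."""
    entries = []
    for line in lines:
        stamp, sep, message = line.partition(" - ")
        if not sep or not stamp.strip().isdigit():
            continue  # skip anything that does not match the assumed layout
        when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
        entries.append((when, message.strip()))
    return entries


def session_gaps(entries):
    """Seconds elapsed between each entry and the one before it."""
    return [int((b[0] - a[0]).total_seconds()) for a, b in zip(entries, entries[1:])]


if __name__ == "__main__":
    parsed = parse_mmlog(MMLOG_LINES)
    for when, message in parsed:
        print(when.isoformat(), message)
    print("gaps in seconds:", session_gaps(parsed))
```

Under that assumption, the first entry falls on 2012-11-08 (UTC), the day of the lab test, and short gaps between entries would point to runs that ended quickly.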


By: Richard Mudgett (rmudgett) 2012-11-09 16:35:54.610-0600

The new backtrace does not show a crash that I would expect with MALLOC_DEBUG enabled.  The MALLOC_DEBUG code assumes that all allocations go through it.  I would expect to see a __ast_alloc_region() in that backtrace.

The MALLOC_DEBUG code wipes the contents of a released block with the 0xdeaddead value and delays actually freeing the memory.

The debug code will prevent memory corruption writes from causing a crash because the freeing of a block is delayed.  When the block is rotated back to the heap, it is checked to see if the memory has been changed from 0xdeaddead.

The debug code should cause a crash if a pointer stored in a released block is dereferenced, because a released block is wiped with the 0xdeaddead value.  Therefore, dereferencing a pointer read from freed memory will attempt to access the address 0xdeaddead, which is usually an invalid memory address.

If you also enable the DO_CRASH option, a crash will be forced if an assertion fails or MALLOC_DEBUG reports a warning.
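
The poison-and-delay behaviour described above can be modelled with a short Python sketch. This is a toy simulation of the idea only, not the MALLOC_DEBUG code: the class `QuarantineAllocator`, its `max_quarantined` limit, and the warning text are all invented for illustration, and real memory layout (alignment, byte order, block headers) is ignored.

```python
from collections import deque

POISON = bytes.fromhex("deaddead")  # fill pattern named in the comment above


def poison_fill(size):
    """Return `size` bytes of the repeating poison pattern."""
    return (POISON * (size // len(POISON) + 1))[:size]


class QuarantineAllocator:
    """Toy model: freed blocks are poisoned and held back before real release."""

    def __init__(self, max_quarantined=2):
        self.max_quarantined = max_quarantined  # hypothetical tuning knob
        self.quarantine = deque()
        self.warnings = []

    def alloc(self, size):
        return bytearray(size)

    def free(self, block, label):
        # Wipe the released block, then delay its actual release.
        block[:] = poison_fill(len(block))
        self.quarantine.append((label, block))
        # Rotate the oldest blocks out once the quarantine is over its limit.
        while len(self.quarantine) > self.max_quarantined:
            self._release(*self.quarantine.popleft())

    def _release(self, label, block):
        # A block that no longer holds the poison pattern was written after free.
        if bytes(block) != poison_fill(len(block)):
            self.warnings.append(f"{label}: modified after free")


if __name__ == "__main__":
    heap = QuarantineAllocator(max_quarantined=1)
    a = heap.alloc(8)
    heap.free(a, "a")
    a[0] = 0x41                        # stale write through a dangling reference
    heap.free(heap.alloc(8), "b")      # pushes "a" out of quarantine and checks it
    print(heap.warnings)
```

In this model the stale write does no immediate harm because the block is still quarantined; it is only reported when the block rotates out, which mirrors the delayed detection described in the comment.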

By: Jared Smith (jsmith) 2012-11-09 18:05:35.630-0600

Right -- I understand that the patch wouldn't catch this type of crash.  What I'm saying is that the patch appears to be *causing* this type of crash, at least in our lab testing.  With the patch, we can easily crash the system with backtraces similar to the latest one attached to this ticket -- without it, we can go much longer without a segfault.

By: Jared Smith (jsmith) 2012-11-13 14:51:18.486-0600

After chatting with mjordan on IRC, he asked that I attach the configs from the test run with 1.8 (from SVN) where we were seeing problems with the patch.  I've attached the relevant configs -- everything else is a stock config from "make samples" in Asterisk.

By: Matt Jordan (mjordan) 2012-11-15 16:47:01.727-0600

Did some testing of this tonight by logging in two agents (jared/chris) and using a third SIP phone to dial into Queue 302.  No crashes or memory reports kicked back yet.

Does this typically crash quickly, or do you usually script something to simulate a large number of calls?

By: Jared Smith (jsmith) 2012-11-15 21:36:02.271-0600

I could usually trigger the crash with a few dozen calls.  I'll keep pounding on it in the lab and get some additional backtraces, if that's helpful.  If I can get it to crash reliably again in the lab, I'll give you access to the box and let you work your magic.

By: Matt Jordan (mjordan) 2012-11-16 08:47:37.369-0600

Little bit more info on what I was testing:

I changed jared/chris into two local D40 SIP phones that I have (digium01/digium02).  Otherwise, the dialplan/config is the same:

{noformat:title=agents.conf}
agent => digium01,4321,Jared Smith
agent => digium02,4321
{noformat}

{noformat:title=queues.conf}
[sales]
strategy=ringall
announce=sales
musicclass = default
;member => Agent/chris
;member => Agent/larry
;member => Agent/paul
;member => Agent/patrick
;member => Agent/anthony
;member => Agent/derek
;member => Agent/jared
;member => Agent/olle
member => Local/digium01@agents
member => Local/digium02@agents
{noformat}

I then used call files to spam calls into the sales queue (extension 300) - the bash script creates 10 calls at a time.  Randomly, at each phone I either ignore the call (which puts it back into the Queue) or Answer it, wait a bit, and hang up.  So far I've processed about 100 calls without a crash.
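
The bash script itself is not attached to the ticket. As a rough stand-in, the Python sketch below writes a batch of 10 call files aimed at extension 300. The channel `Local/caller@load-test`, the context `default`, the file names, and the staging directory are assumptions made for this illustration; only the extension and the batch size come from the comment above. Each file is written in a staging directory and then renamed into the spool directory, so a half-written file is never picked up; this requires both directories to be on the same filesystem.

```python
import os
import tempfile


def write_call_files(staging_dir, spool_dir, count=10,
                     channel="Local/caller@load-test",  # hypothetical channel
                     context="default",                 # hypothetical context
                     extension="300"):
    """Write `count` call files into spool_dir and return their final paths."""
    paths = []
    for i in range(count):
        body = (
            f"Channel: {channel}\n"
            f"Context: {context}\n"
            f"Extension: {extension}\n"
            "Priority: 1\n"
        )
        # Write the file outside the spool directory first...
        fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        # ...then move it into place in a single rename.
        final_path = os.path.join(spool_dir, f"loadtest-{i:03d}.call")
        os.replace(tmp_path, final_path)
        paths.append(final_path)
    return paths


if __name__ == "__main__":
    # Demo against throwaway directories rather than a live Asterisk spool.
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "staging")
        spool = os.path.join(root, "outgoing")
        os.mkdir(staging)
        os.mkdir(spool)
        created = write_call_files(staging, spool)
        print(len(created), "call files written to", spool)
```

On a real system the spool directory is typically `/var/spool/asterisk/outgoing`, and the channel and context would need to match the dialplan under test.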


By: Rusty Newton (rnewton) 2012-12-19 18:17:30.955-0600

Jared, can you give us any further guidance on reproducing the crash?

By: Jared Smith (jsmith) 2012-12-19 18:34:11.647-0600

I think this bug can be safely closed.  After applying the patch in ASTERISK-20226 (which I didn't think was related to this issue), we haven't had any more crashes in the past 3 weeks, 6 days, 19 hours, 10 minutes, 33 seconds.  During that time, we've put 922,685 calls through the system.

By: Jared Smith (jsmith) 2012-12-19 18:36:09.073-0600

Assuming the patch from bug ASTERISK-20226 is added to the 1.8.20.0 release, I don't see any reason to keep this bug open.  The problem doesn't seem to happen any more after applying the patch.

By: Matt Jordan (mjordan) 2012-12-20 08:28:58.258-0600

Yup, that patch is in 1.8.20.0-rc1. I'll close this out for now and if it rears its ugly head again, we'll reopen.