ASTERISK-20227: Segfault (possible memory corruption?)

Reporter: Jared Smith (jsmith)
Labels:
Date Opened: 2012-08-13 17:22:35
Date Closed: 2012-12-19 18:36:09.000-0600
Priority: Major
Regression?
Status: Closed/Complete
Components: General
Versions: 1.8.15.0
Frequency of Occurrence: Occasional
Related Issues: is related to ASTERISK-20226 Segfault in chan_sip while performing connected line update
Environment: Linux

Attachments:
( 0) another_backtrace.20120820
( 1) asterisk_backtrace_09032012.txt
( 2) asterisk_configs.tgz
( 3) backtrace_20227.txt
( 4) backtrace.3975
( 5) malloc_backtrace.txt
( 6) malloc-enhancements-1.8.15.0.diff
( 7) test_configs.tgz

Description: Another segfault I'm seeing (not the same one as ASTERISK-20226). Opening this bug at the request of mjordan.

[Edit by Rusty Newton - removed older backtrace *from description* and attached as backtrace_20227.txt]

Comments:

By: Jared Smith (jsmith) 2012-08-13 17:47:24.059-0500

This is an updated backtrace, with more debugging symbols for glibc installed.
By: Rusty Newton (rnewton) 2012-08-16 18:24:36.587-0500

Jared, I see a lot of values optimized out. Can you get another backtrace with Asterisk compiled with DONT_OPTIMIZE and BETTER_BACKTRACES?

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
By: Jared Smith (jsmith) 2012-08-20 16:47:02.803-0500

This is another backtrace with a very similar crash.
By: Matt Jordan (mjordan) 2012-08-21 11:15:23.361-0500

Please make sure you hit the Send Back button once you've provided feedback - otherwise it may drift off the Triage radar.

Are they using chan_agent by any chance?
By: Rusty Newton (rnewton) 2012-09-19 20:32:08.877-0500

Jared, were you able to obtain a backtrace after recompiling with the options mentioned above?

By: Jared Smith (jsmith) 2012-09-20 02:34:09.786-0500

Sure, I have a bunch of them. I've attached another example here, this one should have both DONT_OPTIMIZE and BETTER_BACKTRACES turned on.

For the record, we're in the process of swapping out the hardware as well, just to verify that it's not a hardware issue.
By: Matt Jordan (mjordan) 2012-10-17 08:35:17.371-0500

Hey Jared -

I'm pretty sure this is a memory corruption of some sort. Can you provide the .conf files for the system(s) affected?

Ideally if this can be reproduced in a lab environment, a valgrind trace would also be hugely useful.
By: Matt Jordan (mjordan) 2012-10-17 08:45:31.031-0500

(Also: please remember to hit "Send Back" when you've provided feedback, otherwise it doesn't always show up in the Triage filters)
By: Jared Smith (jsmith) 2012-10-17 15:27:17.629-0500

I've attached my configs. It should be a fairly ordinary Asterisk install (using FreePBX as the front-end to generate the configs).

The only thing unusual about this server is the high number of queues in use on this system. Last I checked, there were somewhere around 180 queues in use at any given time on this system.
By: Rusty Newton (rnewton) 2012-10-19 10:20:57.042-0500

Thanks for the additional info. Will you be able to get valgrind output?
By: Jared Smith (jsmith) 2012-10-19 10:35:59.330-0500

No, I won't be able to get valgrind output. This is a production system handling tens of thousands of calls per day. Sorry :-(
By: Deniz (deniz) 2012-11-06 06:10:34.338-0600

Having the same segfault running 1.8.17 in production.
By: Richard Mudgett (rmudgett) 2012-11-06 16:56:17.365-0600

The reviewboard patch for MALLOC_DEBUG enhancements should help locate the possible memory corruption.
https://reviewboard.asterisk.org/r/2182/

MALLOC_DEBUG logs its output to stderr and to the /var/log/asterisk/mmlog file by default.
By: Matt Jordan (mjordan) 2012-11-08 11:12:43.533-0600

Any luck on running the production system with Richard's patches?
By: Matt Jordan (mjordan) 2012-11-08 15:17:33.678-0600

Attaching a patch (malloc-enhancements-1.8.15.0.diff) that provides Richard's MALLOC_DEBUG enhancements for this version.

There are two ways to use this patch:
1) Enable MALLOC_DEBUG. This will create a mmlog file that records information about any memory corruption it detects, which will be useful if one happens.
2) Along with MALLOC_DEBUG, enable DO_CRASH. This will cause Asterisk to immediately crash when a memory corruption is detected, as opposed to waiting for something to access the now corrupted memory. If you can tolerate what may potentially be a 'quicker' crash, this would help as well.

By: Jared Smith (jsmith) 2012-11-08 17:52:40.876-0600

We tested the patch in the lab today, and were easily able to crash the lab system. I think the patch does more harm than help.

I'll post the backtrace from the crash on the lab system here shortly.
By: Jared Smith (jsmith) 2012-11-08 18:00:25.389-0600

This is the backtrace from a crash on the very latest 1.8 from SVN (revision 376029) on my lab system.
By: Matt Jordan (mjordan) 2012-11-08 20:59:59.073-0600

Nothing logged to mmlog?
By: Jared Smith (jsmith) 2012-11-09 08:33:13.286-0600

Nothing interesting logged to mmlog -- it looks like this:

1352409852 - New session
1352409921 - New session
1352409989 - New session
1352411167 - New session
1352417026 - New session
1352419089 - New session
1352420047 - New session
1352428444 - New session
1352428951 - New session

By: Richard Mudgett (rmudgett) 2012-11-09 16:35:54.610-0600

The new backtrace does not show a crash that I would expect with MALLOC_DEBUG enabled. The MALLOC_DEBUG code assumes that all allocations go through it. I would expect to see a __ast_alloc_region() in that backtrace.

The MALLOC_DEBUG code wipes the contents of a released block with the 0xdeaddead value and delays actually freeing the memory.

The debug code will prevent memory corruption writes from causing a crash because the freeing of a block is delayed. When the block is rotated back to the heap, it is checked to see if the memory has been changed from 0xdeaddead.

The debug code should cause a crash if code dereferences a pointer read from a released block, because a released block is wiped with the 0xdeaddead value. Therefore, a dereference of a pointer taken from freed memory will attempt to dereference the address 0xdeaddead, which is usually an invalid memory address.

If you also enable the DO_CRASH option, a crash will be forced if an assertion fails or MALLOC_DEBUG reports a warning.
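
The delayed-free check described in this comment can be modeled with a small toy allocator. This is only an illustrative sketch in Python, not the Asterisk MALLOC_DEBUG code: the class name, the `delay_slots` queue depth, and the warning text are all hypothetical, and the byte order of the fill pattern is simplified.

```python
FREED_FILL = bytes.fromhex("deaddead")  # wipe pattern named in the comment above


class DelayedFreeHeap:
    """Toy model: freed blocks are wiped and parked before they are reused."""

    def __init__(self, delay_slots=4):
        self.delay_slots = delay_slots  # hypothetical depth of the delay queue
        self.parked = []                # freed blocks awaiting real release
        self.warnings = []              # reports of writes to freed memory

    def alloc(self, size):
        return bytearray(size)

    def free(self, block):
        # Wipe the released block with the fill pattern.
        for i in range(len(block)):
            block[i] = FREED_FILL[i % len(FREED_FILL)]
        self.parked.append(block)
        # Once the queue is over its limit, rotate the oldest block out.
        if len(self.parked) > self.delay_slots:
            self._release(self.parked.pop(0))

    def _release(self, block):
        # Any byte that no longer matches the pattern was written after free.
        for i, value in enumerate(block):
            if value != FREED_FILL[i % len(FREED_FILL)]:
                self.warnings.append(f"write after free at offset {i}")
                break
```

In this model a stale write does not crash anything by itself; it is only noticed when the block leaves the delay queue, which matches the behavior described above. Forcing a crash at that point would correspond to what DO_CRASH adds.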
By: Jared Smith (jsmith) 2012-11-09 18:05:35.630-0600

Right -- I understand that the patch wouldn't catch this type of crash. What I'm saying is that the patch appears to be *causing* this type of crash, at least in our lab testing. With the patch, we can easily crash the system with backtraces similar to the latest one attached to this ticket -- without it, we can go much longer without a segfault.
By: Jared Smith (jsmith) 2012-11-13 14:51:18.486-0600

After chatting with mjordan on IRC, he asked that I attach the configs from the test run with 1.8 (from SVN) where we were seeing problems with the patch. I've attached the relevant configs -- everything else is a stock config from "make samples" in Asterisk.
By: Matt Jordan (mjordan) 2012-11-15 16:47:01.727-0600

Did some testing of this tonight by logging in two agents (jared/chris) and using a third SIP phone to dial into Queue 302. No crashes or memory reports kicked back yet.

Does this typically crash quickly, or do you usually script something to simulate a large number of calls?
By: Jared Smith (jsmith) 2012-11-15 21:36:02.271-0600

I could usually trigger the crash with a few dozen calls. I'll keep pounding on it in the lab and get some additional backtraces, if that's helpful. If I can get it to reliably crash again in the lab, I'll give you access to the box and let you work your magic.
By: Matt Jordan (mjordan) 2012-11-16 08:47:37.369-0600

A little more info on what I was testing:

I changed jared/chris into two local D40 SIP phones that I have (digium01/digium02). Otherwise, the dialplan/config is the same:

{noformat:title=agents.conf}
agent => digium01,4321,Jared Smith
agent => digium02,4321
{noformat}

{noformat:title=queues.conf}
[sales]
strategy=ringall
announce=sales
musicclass = default
;member => Agent/chris
;member => Agent/larry
;member => Agent/paul
;member => Agent/patrick
;member => Agent/anthony
;member => Agent/derek
;member => Agent/jared
;member => Agent/olle
member => Local/digium01@agents
member => Local/digium02@agents
{noformat}

I then used call files to spam calls into the sales queue (extension 300) - the bash script creates 10 calls at a time. At each phone, I randomly either ignore the call (which puts it back into the Queue) or answer it, wait a bit, and hang up. So far I've processed about 100 calls without a crash.
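
The bash script itself is not attached to this issue. As a rough stand-in, the following Python sketch shows one way a batch of 10 call files could be generated. The `default` context, the `Wait` application with a 30 second argument, and the function names are assumptions made for illustration; only the `Channel`, `Application`, and `Data` call file fields and extension 300 come from Asterisk and the description above.

```python
import os
import tempfile


def build_call_file(channel, application, data):
    """Return the text of a minimal Asterisk call file."""
    return (
        f"Channel: {channel}\n"
        f"Application: {application}\n"
        f"Data: {data}\n"
    )


def spool_calls(staging_dir, spool_dir, count=10,
                channel="Local/300@default",  # assumed dialplan context
                application="Wait", data="30"):
    """Write `count` call files in staging_dir, then move each into spool_dir."""
    body = build_call_file(channel, application, data)
    moved = []
    for _ in range(count):
        # Write the complete file outside the spool directory first.
        fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        # Move it into place in one step so a partial file is never picked up.
        # os.replace is atomic only when both paths are on the same filesystem.
        final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
        os.replace(tmp_path, final_path)
        moved.append(final_path)
    return moved
```

On a default install the spool directory would typically be /var/spool/asterisk/outgoing, with the staging directory placed on the same filesystem; both paths are left as parameters here because the layout of the test box is not given in this issue.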

By: Rusty Newton (rnewton) 2012-12-19 18:17:30.955-0600

Jared, can you give us any further guidance on reproducing the crash?
By: Jared Smith (jsmith) 2012-12-19 18:34:11.647-0600

I think this bug can be safely closed. After applying the patch in ASTERISK-20226 (which I didn't think was related to this issue), we haven't had any more crashes in the past 3 weeks, 6 days, 19 hours, 10 minutes, 33 seconds. During that time, we've put 922,685 calls through the system.
By: Jared Smith (jsmith) 2012-12-19 18:36:09.073-0600

Assuming the patch from bug ASTERISK-20226 is added to the 1.8.20.0 release, I don't see any reason to keep this bug open. The problem doesn't seem to happen any more after applying the patch.
By: Matt Jordan (mjordan) 2012-12-20 08:28:58.258-0600

Yup, that patch is in 1.8.20.0-rc1. I'll close this out for now and if it rears its ugly head again, we'll reopen.