Summary: | ASTERISK-20227: Segfault (possible memory corruption?) | ||||
Reporter: | Jared Smith (jsmith) | Labels: | |||
Date Opened: | 2012-08-13 17:22:35 | Date Closed: | 2012-12-19 18:36:09.000-0600 | ||
Priority: | Major | Regression? | |||
Status: | Closed/Complete | Components: | General | ||
Versions: | 1.8.15.0 | Frequency of Occurrence | Occasional | ||
Related Issues: |
| ||||
Environment: | Linux | Attachments: | ( 0) another_backtrace.20120820 ( 1) asterisk_backtrace_09032012.txt ( 2) asterisk_configs.tgz ( 3) backtrace_20227.txt ( 4) backtrace.3975 ( 5) malloc_backtrace.txt ( 6) malloc-enhancements-1.8.15.0.diff ( 7) test_configs.tgz | ||
Description: | Another segfault I'm seeing (not the same one as ASTERISK-20226). Opening this bug at the request of mjordan. [Edit by Rusty Newton - removed older backtrace *from description* and attached as backtrace_20227.txt] | ||||
Comments: |
By: Jared Smith (jsmith) 2012-08-13 17:47:24.059-0500
This is an updated backtrace, with more debugging symbols for glibc installed.

By: Rusty Newton (rnewton) 2012-08-16 18:24:36.587-0500
Jared, I see a lot of values optimized out. Can you get another backtrace with Asterisk compiled with DONT_OPTIMIZE and BETTER_BACKTRACES? https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Jared Smith (jsmith) 2012-08-20 16:47:02.803-0500
This is another backtrace with a very similar crash.

By: Matt Jordan (mjordan) 2012-08-21 11:15:23.361-0500
Please make sure you hit the Send Back button once you've provided feedback - otherwise it may drift off the Triage radar. Are they using chan_agent by any chance?

By: Rusty Newton (rnewton) 2012-09-19 20:32:08.877-0500
Jared, were you able to obtain a backtrace after recompiling with the options mentioned above?

By: Jared Smith (jsmith) 2012-09-20 02:34:09.786-0500
Sure, I have a bunch of them. I've attached another example here, this one should have both DONT_OPTIMIZE and BETTER_BACKTRACES turned on. For the record, we're in the process of swapping out the hardware as well, just to verify that it's not a hardware issue.

By: Matt Jordan (mjordan) 2012-10-17 08:35:17.371-0500
Hey Jared - I'm pretty sure this is a memory corruption of some sort. Can you provide the .conf files for the system(s) affected? Ideally if this can be reproduced in a lab environment, a valgrind trace would also be hugely useful.

By: Matt Jordan (mjordan) 2012-10-17 08:45:31.031-0500
(Also: please remember to hit "Send Back" when you've provided feedback, otherwise it doesn't always show up in the Triage filters)

By: Jared Smith (jsmith) 2012-10-17 15:27:17.629-0500
I've attached my configs. It should be a fairly ordinary Asterisk install (using FreePBX as the front-end to generate the configs). The only thing unusual about this server is the high number of queues in use on this system.
Last I checked, there were somewhere around 180 queues in use at any given time on this system.

By: Rusty Newton (rnewton) 2012-10-19 10:20:57.042-0500
Thanks for additional info. Will you be able to get valgrind output?

By: Jared Smith (jsmith) 2012-10-19 10:35:59.330-0500
No, I won't be able to get valgrind output. This is a production system handling tens of thousands of calls per day. Sorry :-(

By: Deniz (deniz) 2012-11-06 06:10:34.338-0600
having the same segfault running 1.8.17 within production....

By: Richard Mudgett (rmudgett) 2012-11-06 16:56:17.365-0600
The reviewboard patch for MALLOC_DEBUG enhancements should help locate the possible memory corruption. https://reviewboard.asterisk.org/r/2182/
MALLOC_DEBUG logs its output to stderr and to the /var/log/asterisk/mmlog file by default.

By: Matt Jordan (mjordan) 2012-11-08 11:12:43.533-0600
Any luck on running the production system with Richard's patches?

By: Matt Jordan (mjordan) 2012-11-08 15:17:33.678-0600
Attaching a patch (malloc-enhancements-1.8.15.0.diff) that provides Richard's MALLOC_DEBUG enhancements for this version. There are two ways to use this patch:
1) Enable MALLOC_DEBUG. This will create a mmlog file that will log out information related to a memory corruption that will be useful in the case that one happens.
2) Along with MALLOC_DEBUG, enable DO_CRASH. This will cause Asterisk to immediately crash when a memory corruption is detected, as opposed to waiting for something to access the now corrupted memory. If you can tolerate what may potentially be a 'quicker' crash, this would help as well.

By: Jared Smith (jsmith) 2012-11-08 17:52:40.876-0600
We tested the patch in the lab today, and were easily able to crash the lab system. I think the patch does more harm than help. I'll post the backtrace from the crash on the lab system here shortly.
By: Jared Smith (jsmith) 2012-11-08 18:00:25.389-0600
This is the backtrace from a crash on the very latest 1.8 from SVN (revision 376029) on my lab system.

By: Matt Jordan (mjordan) 2012-11-08 20:59:59.073-0600
Nothing logged to mmlog?

By: Jared Smith (jsmith) 2012-11-09 08:33:13.286-0600
Nothing interesting logged to mmlog -- it looks like this:
1352409852 - New session
1352409921 - New session
1352409989 - New session
1352411167 - New session
1352417026 - New session
1352419089 - New session
1352420047 - New session
1352428444 - New session
1352428951 - New session

By: Richard Mudgett (rmudgett) 2012-11-09 16:35:54.610-0600
The new backtrace does not show a crash that I would expect with MALLOC_DEBUG enabled. The MALLOC_DEBUG code assumes that all allocations go through it. I would expect to see a __ast_alloc_region() in that backtrace.
The MALLOC_DEBUG code wipes the contents of a released block with the 0xdeaddead value and delays actually freeing the memory. The debug code will prevent memory corruption writes from causing a crash because the freeing of a block is delayed. When the block is rotated back to the heap, it is checked to see if the memory has been changed from 0xdeaddead.
The debug code should cause a crash if code attempts to dereference a pointer read from a released block, because a released block is wiped with the 0xdeaddead value. Therefore, a dereference of a freed pointer will attempt to dereference the address 0xdeaddead, which is usually an invalid memory address.
If you also enable the DO_CRASH option, a crash will be forced if an assertion fails or MALLOC_DEBUG reports a warning.

By: Jared Smith (jsmith) 2012-11-09 18:05:35.630-0600
Right -- I understand that the patch wouldn't catch this type of crash. What I'm saying is that the patch appears to be *causing* this type of crash, at least in our lab testing.
With the patch, we can easily crash the system with backtraces similar to the latest one attached to this ticket -- without it, we can go much longer without a segfault.

By: Jared Smith (jsmith) 2012-11-13 14:51:18.486-0600
After chatting with mjordan on IRC, he asked that I attach the configs from the test run with 1.8 (from SVN) where we were seeing problems with the patch. I've attached the relevant configs -- everything else is a stock config from "make samples" in Asterisk.

By: Matt Jordan (mjordan) 2012-11-15 16:47:01.727-0600
Did some testing of this tonight by logging in two agents (jared/chris) and using a third SIP phone to dial into Queue 302. No crashes or memory reports kicked back yet. Does this typically crash quickly, or do you usually script something to simulate a large number of calls?

By: Jared Smith (jsmith) 2012-11-15 21:36:02.271-0600
I could usually trigger the crash with a few dozen calls. I'll keep pounding on it in the lab and get some additional backtraces, if that's helpful. If I can get it to reliably crash again in the lab, I'll give you access to the box and let you work your magic.

By: Matt Jordan (mjordan) 2012-11-16 08:47:37.369-0600
Little bit more info on what I was testing: I changed jared/chris into two local D40 SIP phones that I have (digium01/digium02). Otherwise, the dialplan/config is the same:
{noformat:title=agents.conf}
agent => digium01,4321,Jared Smith
agent => digium02,4321
{noformat}
{noformat:title=queues.conf}
[sales]
strategy=ringall
announce=sales
musicclass = default
;member => Agent/chris
;member => Agent/larry
;member => Agent/paul
;member => Agent/patrick
;member => Agent/anthony
;member => Agent/derek
;member => Agent/jared
;member => Agent/olle
member => Local/digium01@agents
member => Local/digium02@agents
{noformat}
I then used call files to spam calls into the sales queue (extension 300) - the bash script creates 10 calls at a time.
Randomly, at each phone I either ignore the call (which puts it back into the Queue) or Answer it, wait a bit, and hang up. So far I've processed about 100 calls without a crash.

By: Rusty Newton (rnewton) 2012-12-19 18:17:30.955-0600
Jared, can you give us any further guidance on reproducing the crash?

By: Jared Smith (jsmith) 2012-12-19 18:34:11.647-0600
I think this bug can be safely closed. After applying the patch in ASTERISK-20226 (which I didn't think was related to this issue), we haven't had any more crashes in the past 3 weeks, 6 days, 19 hours, 10 minutes, 33 seconds. During that time, we've put 922,685 calls through the system.

By: Jared Smith (jsmith) 2012-12-19 18:36:09.073-0600
Assuming the patch from bug ASTERISK-20226 is added to the 1.8.20.0 release, I don't see any reason to keep this bug open. The problem doesn't seem to happen any more after applying the patch.

By: Matt Jordan (mjordan) 2012-12-20 08:28:58.258-0600
Yup, that patch is in 1.8.20.0-rc1. I'll close this out for now and if it rears its ugly head again, we'll reopen.
|
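Editor's illustration: the MALLOC_DEBUG behaviour Richard Mudgett describes in the thread (a released block is wiped with 0xdeaddead, the real free is delayed, and the wipe pattern is re-checked when the block is rotated back to the heap) can be sketched roughly as below. This is a minimal sketch, not Asterisk's actual implementation. The names `dbg_alloc`, `dbg_free`, `struct region`, `QUARANTINE_SLOTS`, and `demo_use_after_free` are hypothetical, and the quarantine size of 4 is an arbitrary assumption. Only the 0xdeaddead value and the wipe/delay/check sequence come from the comments.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FREED_MAGIC 0xdeaddeadu   /* wipe value named in the thread */
#define QUARANTINE_SLOTS 4        /* hypothetical: number of delayed frees */

/* Hypothetical bookkeeping header placed in front of each allocation.
 * Payload alignment is simplified for the sake of the sketch. */
struct region {
    size_t words;      /* payload length in 32-bit words */
    uint32_t data[];   /* payload handed to the caller */
};

static struct region *quarantine[QUARANTINE_SLOTS];
static size_t next_slot;
static int corruption_reports;

/* Allocate through the debug layer so every block carries a header. */
static void *dbg_alloc(size_t bytes)
{
    size_t words = (bytes + 3) / 4;
    struct region *r = malloc(sizeof(*r) + words * sizeof(uint32_t));
    if (!r)
        return NULL;
    r->words = words;
    return r->data;
}

/* A quarantined block is intact if every word still holds the wipe value. */
static int region_is_intact(const struct region *r)
{
    for (size_t i = 0; i < r->words; i++)
        if (r->data[i] != FREED_MAGIC)
            return 0;
    return 1;
}

/* Wipe the block, park it in the quarantine, and only really free the
 * block that it displaces, after checking that block for stray writes. */
static void dbg_free(void *ptr)
{
    if (!ptr)
        return;
    struct region *r =
        (struct region *)((char *)ptr - offsetof(struct region, data));
    for (size_t i = 0; i < r->words; i++)
        r->data[i] = FREED_MAGIC;

    struct region *old = quarantine[next_slot];
    quarantine[next_slot] = r;
    next_slot = (next_slot + 1) % QUARANTINE_SLOTS;

    if (old) {
        if (!region_is_intact(old)) {
            corruption_reports++;
            fprintf(stderr, "WARNING: freed block %p was modified after free\n",
                    (void *)old->data);
            /* A DO_CRASH-style build could call abort() here instead. */
        }
        free(old);
    }
}

/* Simulate a write through a stale pointer and count the reports. */
static int demo_use_after_free(void)
{
    uint32_t *stale = dbg_alloc(4 * sizeof(uint32_t));
    if (!stale)
        return -1;
    dbg_free(stale);
    stale[1] = 42;  /* the bug: write to a block that was already released */

    /* Push enough frees through to rotate the stale block out. */
    for (int i = 0; i < QUARANTINE_SLOTS; i++)
        dbg_free(dbg_alloc(16));
    return corruption_reports;
}
```

In this sketch the stray write does not crash by itself, because the underlying memory is still owned by the quarantine. It is only reported later, when the block is rotated out. That matches the trade-off discussed in the thread: delayed freeing hides the immediate crash, and a DO_CRASH-style abort at the reporting point would turn the detection into an immediate failure.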