ASTERISK-16460: [patch] Asterisk crashes in less than 1 hour (around every 20 min) when using manager and HTTP manager with asterisk-1.4.35-rc1

Summary: ASTERISK-16460: [patch] Asterisk crashes in less than 1 hour (around every 20 min) when using manager and HTTP manager with asterisk-1.4.35-rc1

Reporter: Ravelomanantsoa Hoby (hoby)
Labels:
Date Opened: 2010-07-29 07:38:59
Date Closed: 2011-06-15 10:35:06
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Core/ManagerInterface
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 1.6.2.14-rc1_manager.c.diff
( 1) backtrace_1.4.41.txt
( 2) backtrace_1.6.2.14-rc1.txt
( 3) backtrace.txt
( 4) backtrace-1.4.35.txt
( 5) backtrace-1.4.35-rc1.txt
( 6) backtrace2.txt
( 7) backtrace3.txt
( 8) core_show_locks.txt
( 9) full_backtrace-1.4.34.txt
(10) issue17747_1.4_svn_markII.patch
(11) issue17747_1.4_svn.patch
(12) issue17747_1.6.2_svn.patch
(13) manager.c-1.4.35-rc1.diff

Description: Under heavy load (60 - 160 simultaneous calls), using the HTTP manager and the manager crashes Asterisk. Asterisk is used for a call-center solution using Queue and Agent.
We used 1.4.21.2 before and never saw a crash.
This issue affects 1.4.34 and 1.4.35-rc1.

****** ADDITIONAL INFORMATION ******

Before, we used 1.4.21.2 and never hit this issue.
Now, with the latest 1.4 releases (1.4.34 and 1.4.35-rc1), we face this crash.
When we run with the manager disabled, things go fine.
When we use the HTTP manager and the manager to retrieve real-time information, Asterisk crashes every 20 minutes.

Edit: Removed inline backtrace - pabelanger

Comments: By: Paul Belanger (pabelanger) 2010-07-29 08:02:43

Please do not post your backtrace as a note; uploading is better. Also, your trace is optimized (see below), so we'll need a new one.
---
Thank you for your bug report. In order to move your issue forward, we require a backtrace from the core file produced after the crash. Please see the doc/backtrace.txt file in your Asterisk source directory.

Also, be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

after enabling, reproduce the crash, and then execute the instructions in doc/backtrace.txt.

When complete, attach that file to this issue report. Thanks!
By: Ravelomanantsoa Hoby (hoby) 2010-07-29 08:20:09

pabelanger, the backtrace I uploaded was already obtained by following the instructions in doc/backtrace.txt, with DONT_OPTIMIZE and DEBUG_THREADS enabled in make menuselect.

Additional information: a blank user seems to be registered with the manager when I run "manager show connected":
/var/log/asterisk/cdr-csv# asterisk -rx"manager show connected"
Username IP Address
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
oxygen 127.0.0.1
127.0.0.1
127.0.0.1
127.0.0.1
127.0.0.1
oxygen 127.0.0.1
127.0.0.1
127.0.0.1
127.0.0.1
127.0.0.1
oxygen 127.0.0.1
127.0.0.1
127.0.0.1
This may be helpful too.
By: Ravelomanantsoa Hoby (hoby) 2010-07-29 08:37:02

I have uploaded everything that gdb reported for the backtrace. Disregard the [New Thread XXXXX] lines; the interesting information starts at

Core was generated by `/usr/sbin/asterisk -f -p -g -vvvg'.
Program terminated with signal 7, Bus error.

so just search for the string "Core" in the backtrace to find it.
By: Paul Belanger (pabelanger) 2010-07-29 08:54:44

These are still not correct. If you see '<value optimized out>', then you did not follow doc/backtrace.txt properly.
By: Ravelomanantsoa Hoby (hoby) 2010-07-29 10:32:18

Thanks, I'll recompile.
Meanwhile I have new information.
In manager.conf I set httptimeout=60 (the old value was 20), and it seems that no blank user appears when I run "manager show connected" afterwards.
But after the application that uses the HTTP manager and the manager disconnects,
I get
[2010-07-29 18:00:44] ERROR[1934] utils.c: write() returned error: Broken pipe
[2010-07-29 18:00:52] ERROR[1934] utils.c: write() returned error: Broken pipe

The number of these errors matches the number of users that were connected to the manager. Even though this does not crash Asterisk, it seems that the manager has a problem with expired sessions.
I also noticed that Asterisk now crashes less often (every hour) than before.

By: Leif Madsen (lmadsen) 2010-07-29 11:19:37

Marked as Feedback while we await an unoptimized backtrace.
By: Ravelomanantsoa Hoby (hoby) 2010-07-30 01:22:29

I ran make menuselect with DONT_OPTIMIZE selected but still get '<value optimized out>'. Any idea?
By: Ravelomanantsoa Hoby (hoby) 2010-08-03 01:56:45

I recompiled with asterisk-1.4.34 and now have an unoptimized backtrace.
Please see my newly attached full backtrace. Hope this helps.

Program terminated with signal 11, Segmentation fault.
#0 0x080d0835 in generic_http_callback (format=0, requestor=0x42e9b030,
uri=0x42ce2400 "", params=0x86480d8, status=0x42ce12e0, title=0x42ce12e4,
contentlength=0x42ce12dc) at manager.c:2936
2936 if ((retval = malloc((wlen = strlen(workspace)) + (tlen = strlen(tmpbuf)) + 128))) {

By: Ravelomanantsoa Hoby (hoby) 2010-08-03 10:44:42

I also recompiled with asterisk-1.4.35-rc1 and got a new full backtrace. Hope this helps.

Program terminated with signal 7, Bus error.
#0 0x080d07f9 in generic_http_callback (format=0, requestor=0x436a1ff0, uri=0x43f76400 "", params=0x8f6db78, status=0x43f752e0, title=0x43f752e4,
contentlength=0x43f752dc) at manager.c:2925
2925 buf[l] = '\0';
By: Ravelomanantsoa Hoby (hoby) 2010-08-10 22:27:00

After walking through ASTERISK-15359 I tried patching manager.c for 1.4.35-rc1, and so far there have been no more crashes, but we sometimes observe a brief freeze (with verbosity on, the verbose output sometimes stops for a moment).

I have uploaded the diff file manager.c-1.4.35-rc1.diff
By: Peter Fern (pdf) 2010-08-26 00:10:13

Issue confirmed here.
By: David Purdue (davidp) 2010-08-26 01:01:30

Seeing the same crash here in 1.4.35. Backtrace uploaded as backtrace-1.4.35.txt.
By: Ravelomanantsoa Hoby (hoby) 2010-08-29 01:45:53

davidp > have you tried the patch I uploaded? It has worked fine for me so far with up to 100 simultaneous calls using Queue/Agent.
By: Peter Fern (pdf) 2010-09-07 20:10:21

The patch is not elegant, but will test and report back.
By: David Purdue (davidp) 2010-09-16 19:43:41

We have run this patch on our production server for 1 week, and the Asterisk crashes have stopped.
By: Csaba Lack (csabka) 2010-11-30 02:39:16.000-0600

In the official 1.4.36 release (without the patch) this crash still seems to exist:

Program terminated with signal 7, Bus error.
#0 0x080c0f9a in ?? ()
By: Peter Fern (pdf) 2010-11-30 04:01:12.000-0600

Can we get this into reviewboard please?
By: Sean Bright (seanbright) 2011-01-20 09:33:46.000-0600

leurk, backtrace_1.6.2.14-rc1.txt was generated with an unmodified manager.c?
By: leurk (leurk) 2011-01-20 15:01:28.000-0600

seanbright,

yes, backtrace_1.6.2.14-rc1.txt is for the unmodified version (which obviously writes past the mmapped memory when ftell returns a value of l on a page boundary - this should SIGBUS per POSIX). With 1.6.2.14-rc1_manager.c.diff AJAM is rock stable.
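
The page-boundary condition mentioned here can be illustrated with a minimal sketch in C. It is not taken from manager.c: the function name terminator_needs_new_page and the explicit page-size parameter are hypothetical, and it assumes the code writes a terminating NUL at offset l of a mapping whose backing file is exactly l bytes long. Mappings are made of whole pages, so when l is not a multiple of the page size, offset l still falls in the zero-filled tail of the last file-backed page; when l is an exact multiple, offset l is the first byte of a page with no file data behind it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 when the byte at offset file_len (where a terminating NUL
 * would be written) lies on a page that holds no file data, and 0 when
 * it still lies in the slack of the last file-backed page.
 */
static int terminator_needs_new_page(size_t file_len, size_t page_size)
{
    /* An exact multiple of the page size (including 0) leaves no slack. */
    return file_len % page_size == 0;
}
```

With a 4096-byte page, a 4095-byte or 4097-byte response is harmless under this model, while a 4096-byte or 8192-byte response is not, which would be consistent with crashes that only appear occasionally under load.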
By: Sean Bright (seanbright) 2011-01-23 16:02:40.000-0600

Alright. I don't like the patch because it shouldn't be necessary. I feel like rolling back the patch from 16506 is the correct approach here but I have no way to test it right now.
By: Peter Fern (pdf) 2011-01-23 17:26:43.000-0600

@seanbright - I'm not sure I agree with you there. I don't particularly like this patch, but ASTERISK-15359 was not erroneous. This patch does fix the current problem, and I can confirm that ASTERISK-15359 was also a real issue. Either this code needs to be committed, or the whole code-path involved needs a really serious block of time dedicated to making it sane. I've got this patch deployed on dozens of systems currently, because some of our applications make heavy use of AMI/AJAM and we simply couldn't wait.

Perhaps commit this now, then review the whole lot. This is the third bug in closely related code that's hurt us.
By: Sean Bright (seanbright) 2011-01-23 20:05:05.000-0600

This patch masks the problem, it doesn't fix anything. The entire point of mmap()ing the temporary file is that we don't want to allocate a big chunk of memory on the heap each time through - this patch does exactly that. So sure, we don't crash, and that's a Good Thing, but that doesn't mean it's the right way to fix the underlying problem.

I agree that the entire block needs some TLC, which is on my personal TODO.
By: Leif Madsen (lmadsen) 2011-01-24 14:37:56.000-0600

Status dropped to Confirmed based on feedback from seanbright.
By: Sean Bright (seanbright) 2011-05-05 12:33:58

I've attached 2 patches that should apply cleanly to the latest 1.4 and 1.6.2 from SVN.

Is there anyone that would be willing to test these and report results? I can whip one up for 1.8 and/or trunk if necessary.
By: Peter Fern (pdf) 2011-05-05 18:51:20

Will test and report back, testing will likely take until mid next week.
By: Sean Bright (seanbright) 2011-05-12 09:07:49

Any update on your testing? No rush, I just want to close this one out for good.
By: Peter Fern (pdf) 2011-05-12 18:23:08

Had some delays - revamped our deployment/test process and needed to release 1.4.41, which is going out today. Testing on this will start first thing next week, will keep this tracker informed.
By: Peter Fern (pdf) 2011-05-22 22:58:21

Has been running in our test env for the past week, will push to some production systems today for confirmation.
By: David Purdue (davidp) 2011-05-24 01:31:35

When we moved the patch in to production we started getting Asterisk crashes again, this time in xml_translate.

See uploaded backtrace_1.4.41.txt.
By: Walter Doekes (wdoekes) 2011-05-24 02:08:00

davidp: you should DONT_OPTIMIZE for a backtrace to be really useful. Don't worry about performance, the loss of it is minimal.
By: Peter Fern (pdf) 2011-05-24 04:03:00

@wdoekes: That creates some problems for our build and release process, as we build packages for deployment. It should also be noted that generally our line numbers may not precisely match SVN since we include additional patches required for our environment, and we are unable to deploy to production without them.

Some further information that may be relevant - the trace above shows a regression to the behaviour 0016506 was meant to fix (almost - it runs off the end of the memory in the next loop this time).

If you look at 0015495, you'll see that we essentially did the same thing as this patch way back then (it had slipped my memory), but then ran into 0016506 down the track.

All we've really done with the last patch here is revert 0016506, though looking again at 0016506 I'm starting to agree that it was a poor idea, and certainly the cause of the SIGBUS.

So, what we really need to do here is resolve 0016506 correctly.

By: Leif Madsen (lmadsen) 2011-05-24 07:08:28

Is this an issue in Asterisk 1.8 as well?

~~~~~

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch.

For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions
By: Peter Fern (pdf) 2011-05-24 07:21:59

Looking at the code, it would appear likely for 1.8 to SIGBUS under the right conditions currently as well, and the previous patches to this problematic code have been applied to all branches, so I would expect this to be relevant.

On a policy note, 'Security Fix Only' doesn't include remote crash of Asterisk?! I haven't been able to yet, but I'm sure some industrious fellow could reproduce this with correctly crafted payload for DoS purposes.

Even further OT, in the past new Asterisk versions haven't had the major bugs ironed out until the micro rev reached double-digits, so I've been reluctant to migrate. I know the process has improved significantly since then, but it's still a big move. Anyway, sorry for the tracker noise with this last.
By: Sean Bright (seanbright) 2011-05-24 08:41:17

Let me chime in here - this affects everything from asterisk 1.4 through trunk.

The fix in ASTERISK-15359 was wrong - which is what I reverted. What I believe is happening is that there is a discrepancy between the fd and the FILE * that is used to write data to the same underlying file. We write to the FILE *, but it is not flushed to disk, we mmap() the fd and read/write out of bounds.

Adding 1 when passing the size of the region to mmap() doesn't change the length of the underlying file and will ultimately result in a SIGBUS or SIGSEGV depending on the state of the underlying file descriptor.
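
The suspected discrepancy between the FILE * and the fd can be reproduced in isolation. The following is a minimal sketch, not code from manager.c: the helper names kernel_size and demo_unflushed_write and the temporary file template are hypothetical, and it assumes stdio fully buffers a stream opened on a regular file (the usual behaviour) and that /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of the file behind fd as the kernel sees it, or -1 on error. */
static off_t kernel_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}

/*
 * Writes msg through a FILE * layered on a temporary file and records
 * the file size seen through the fd before and after fflush().
 * Returns 0 on success, -1 on error.
 */
static int demo_unflushed_write(const char *msg, off_t *before, off_t *after)
{
    char path[] = "/tmp/ajam_demo_XXXXXX"; /* hypothetical template */
    int fd = mkstemp(path);
    FILE *f;

    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once it is closed */
    if (!(f = fdopen(fd, "w+"))) {
        close(fd);
        return -1;
    }

    fputs(msg, f);             /* data sits in the stdio buffer */
    *before = kernel_size(fd); /* the file itself is still empty */
    fflush(f);                 /* push the buffer down to the fd */
    *after = kernel_size(fd);  /* now the file has the full length */

    fclose(f);                 /* also closes fd */
    return 0;
}
```

For a short message, the size seen through the fd before the flush is 0 even though the stream has accepted every byte. Any mmap() of the fd at that point would cover a file shorter than the stream position suggests.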

By: Sean Bright (seanbright) 2011-05-24 15:12:54

I just posted a new patch to 1.4 (issue17747_1.4_svn_markII.patch) that removes all uses of the FILE *. All of the remaining operations act on the fd alone.
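
As an illustration of working on the fd alone, here is a minimal sketch in C. It is not the attached patch and makes no claim about how manager.c is structured: the helpers write_all, count_newlines and demo_fd_only are hypothetical. It writes with write(), takes the length from lseek(), maps exactly that many bytes, and only ever reads inside the mapping using the known length instead of relying on a terminating NUL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* write() may write fewer bytes than requested, so loop until done. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Counts '\n' bytes in the file behind fd through a mapping of exactly
 * the file's length. Returns -1 on error.
 */
static long count_newlines(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END); /* length straight from the fd */
    long count = 0;
    char *map;
    off_t i;

    if (len < 0)
        return -1;
    if (len == 0)
        return 0; /* mmap() rejects zero-length mappings */

    map = mmap(NULL, (size_t) len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    for (i = 0; i < len; i++) { /* never touch map[len] */
        if (map[i] == '\n')
            count++;
    }
    munmap(map, (size_t) len);
    return count;
}

/* Writes text to a temporary file and counts its newlines, fd only. */
static long demo_fd_only(const char *text)
{
    char path[] = "/tmp/ajam_fd_XXXXXX"; /* hypothetical template */
    int fd = mkstemp(path);
    long n;

    if (fd < 0)
        return -1;
    unlink(path);
    n = write_all(fd, text, strlen(text)) == 0 ? count_newlines(fd) : -1;
    close(fd);
    return n;
}
```

Because no stdio buffer sits between the writer and the file, the length reported by lseek() and the data visible through the mapping cannot drift apart in this sketch.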

If this still results in a crash (but not a bus error), I can feel confident that the off-by-one error was a red herring and we can move on to search for other solutions.

By: David Purdue (davidp) 2011-05-25 03:26:12

Latest patch has been applied to our production server - so far so good but at 6:00pm our server is not at peak load.

Will leave it in place for a week then report back (unless we get crashes earlier).
By: David Purdue (davidp) 2011-05-26 00:01:54

Just had a look at manager.c in asterisk 1.8.4.1, and the problematic code is present there.
By: Sean Bright (seanbright) 2011-05-27 09:03:34

pdf and davidp - you work together or are these separate installations?
By: David Purdue (davidp) 2011-05-29 19:46:24

pdf and I work for the same company - Peter manages R&D and I manage Support.
By: Sean Bright (seanbright) 2011-05-30 11:43:32

No crashes with the latest patch so far?
By: David Purdue (davidp) 2011-05-30 17:34:08

Not so far, but we are giving it until Friday before we call it. The heaviest user of the main piece of software that triggered the crash for us is out of the office this week, so we want to give everyone else a chance to crash it.
By: Sean Bright (seanbright) 2011-05-31 13:55:32

Agreed. Thanks for the feedback thus far.
By: Sean Bright (seanbright) 2011-06-08 07:32:32.650-0500

Do I assume that the lack of feedback means that the patch has fixed the crashing?
By: Peter Fern (pdf) 2011-06-09 02:10:52.098-0500

We advanced this patch on Tuesday, and pushed it to our repositories. So far we have not deployed this version to any clients, however our internal production server has now been under standard load for a week with no related crashes (we did experience a SIGABRT, but I believe that's unrelated, though haven't had sufficient time to do further analysis).

Based on our current assessment, this patch appears to have fixed the crashing, and the assertions it's based on appear sound.

EDIT: Notifications appear to have been lost during the Jira migration, have now subscribed to this issue.
By: James Van Vleet (jvanvleet) 2011-08-19 13:50:55.777-0500

I have found a bug related to this issue - please see ASTERISK-18223