ASTERISK-22875: CLONE - Segfault in __ao2_find ()

Summary: ASTERISK-22875: CLONE - Segfault in __ao2_find ()

Reporter: David Brillert (aragon)
Labels:
Date Opened: 2013-11-21 08:10:55.000-0600
Date Closed: 2014-01-06 20:54:53.000-0600
Priority: Critical
Regression?
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions: 11.6.0
Frequency of Occurrence: Occasional

Related Issues:
is the original version of this clone: ASTERISK-22763 Segfault in __ao2_find ()
is a clone of ASTERISK-23339 Segfault in __ao2_find at astobj2.c, in find_interface at format.c

Environment: centos 5.9 64bit
Attachments:
( 0) backtrace.txt
( 1) backtrace3.txt
( 2) backtrace_unoptimized_feb_20_2014.txt
( 3) gdb_thread_apply_all.txt
( 4) gdb_trace.txt

Description: Segfault. Backtrace attached.
Asterisk was compiled with DONT_OPTIMIZE and BETTER_BACKTRACES

Comments:
By: David Brillert (aragon) 2013-11-21 08:11:40.863-0600

I cloned my old bug report to get this re-opened. I was too hasty in closing the original report.
By: David Brillert (aragon) 2013-11-21 08:15:43.714-0600

backtrace.txt looks a lot more useful than the traces from the 13th.
By: David Brillert (aragon) 2013-11-21 08:40:36.013-0600

I am going to rebuild Asterisk until I can get an unoptimized backtrace.
I can't let this bug live, it has already crashed production servers 5 times.
By: Rusty Newton (rnewton) 2013-11-21 19:15:01.089-0600

Did this one start occurring after an upgrade?
Edit: Never mind, I see the old issue now.
By: Rusty Newton (rnewton) 2013-11-21 19:18:23.114-0600

Other than the optimizations, looks like you are still using binaries without all the debug symbols. That will have to be remedied before a new backtrace will be helpful. See Matt's comments on the old issue.
By: Jeremy Lainé (sharky) 2013-11-26 08:55:33.624-0600

I don't know if it's the same bug, but I am also seeing a segfault in __ao2_find() using Asterisk 11.5.1, backtrace attached.
By: David Brillert (aragon) 2013-11-26 09:38:05.232-0600

Jeremy:
Your backtrace is optimized out too, which means Digium can't help us with a patch.
I'm having serious issues getting Asterisk to spit out an unoptimized backtrace but maybe you'll have better luck.
Try to follow the instructions at https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace and compile Asterisk with menuselect options for DONT_OPTIMIZE and BETTER_BACKTRACES

Then wait for another crash and check that your gdb output no longer shows <optimized out>.
That should mean you can provide a usable backtrace and help move this issue forward.
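
A rough way to check a saved backtrace before attaching it is sketched below. This is only an illustration: the function name and the marker list are assumptions, with the marker strings taken from the gdb output quoted in this thread.

```python
# Strings gdb prints when a value or frame lacks usable debug information.
# The list is an assumption based on the traces quoted in this issue.
UNUSABLE_MARKERS = (
    "<optimized out>",
    "<value optimized out>",
    "No symbol table info available",
)


def find_unusable_lines(backtrace_text):
    """Return (line_number, line) pairs that contain one of the markers."""
    hits = []
    for number, line in enumerate(backtrace_text.splitlines(), start=1):
        if any(marker in line for marker in UNUSABLE_MARKERS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical file name; use whatever file the gdb output was saved to.
    with open("backtrace.txt") as handle:
        for number, line in find_unusable_lines(handle.read()):
            print(f"line {number}: {line}")
```

An empty result does not guarantee the trace is complete, but any hit suggests the build still lacks DONT_OPTIMIZE or debug symbols.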
By: Jeremy Lainé (sharky) 2013-11-27 06:42:34.541-0600

OK, I have compiled asterisk with both options, we'll see if the crash triggers again.

I co-maintain Debian's asterisk packages, and was wondering whether it would be sane to build our packages with DONT_OPTIMIZE and BETTER_BACKTRACES? The documentation says performance impact is negligible, do you have any figures to quantify this?
By: David Brillert (aragon) 2013-11-27 08:07:18.333-0600

That is a tough decision.
We also manage our own rpms through a build system. Back in the 1.4 days it was an easy decision to make: we built our rpms with all debug options on, and the performance hit did not outweigh the need for backtraces good enough for problem resolution. Since then the performance hit in 1.8 and 11 has become more noticeable. I definitely wouldn't compile with DEBUG_THREADS since that really affects performance.
I would do some benchmarking:
1. Compare CPU load and voice quality with and without the standard backtrace requirements.
2. Test the gdb output on multiple servers and OS versions to see if your .deb versions provide the necessary unoptimized output.

Perhaps you should build two .deb versions; Production Vs. Debug and put them in your repos and only allow one or the other to be installed by the package manager.
By: Jeremy Lainé (sharky) 2013-11-29 09:00:52.968-0600

Here goes without optimisations.
By: David Brillert (aragon) 2013-11-29 09:24:07.147-0600

It looks like I'm not the only one having trouble sending an unoptimized backtrace... ;)
By: Matt Jordan (mjordan) 2013-12-07 20:34:42.229-0600

So, backtrace3.txt does contain a crash that occurred due to something being off in a packet-to-packet (P2P) bridge. I'm assuming this is what Jeremy Lainé attached.

This is quite strange. From this, codecs->payloads - which should be an ao2 container allocated on the codecs object when the instance object is created - is NULL while the codecs object has a valid address.

{noformat}
#0 0x0000000000452bb8 in __ao2_find (c=0x0, arg=0x7f3e5cd068d4, flags=96) at astobj2.c:1237
#1 0x0000000000552581 in ast_rtp_codecs_find_payload_code (codecs=0x24f1430, code=0) at rtp_engine.c:771
#2 0x00007f3e6bfaf04c in bridge_p2p_rtp_write (len=172, rtpheader=0x2026e38, instance=0x215ea58, hdrlen=<optimized out>) at res_rtp_asterisk.c:3413
{noformat}
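
The NULL container in frame #0 (c=0x0) can also be picked out mechanically. The sketch below is a hypothetical helper, not part of Asterisk or gdb; it assumes frame lines shaped like the ones quoted above and only flags arguments whose value is literally 0x0.

```python
import re

# Matches gdb frame lines such as:
#   #0 0x... in __ao2_find (c=0x0, arg=0x..., flags=96) at astobj2.c:1237
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s+\((.*?)\)"
    r"(?:\s+at\s+(\S+))?\s*$"
)


def null_pointer_args(frame_line):
    """Return (frame_number, function, [names of args equal to 0x0]).

    Returns None when the line does not look like a gdb frame line.
    """
    match = FRAME_RE.match(frame_line.strip())
    if match is None:
        return None
    number, function, args, _location = match.groups()
    nulls = []
    for arg in (part.strip() for part in args.split(",")):
        if not arg:
            continue  # frame printed with an empty argument list
        name, _, value = arg.partition("=")
        if value.strip() == "0x0":
            nulls.append(name.strip())
    return int(number), function, nulls
```

Applied to the three frames above, only frame #0 reports a NULL argument, which matches the observation that codecs itself holds a valid address while the container passed to __ao2_find does not.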

As an aside, getting a {{bt full}} is quite helpful here, as it would show more of these values. Right now we have the backtrace from the seg faulting thread, but not the values in the stack trace or all of the values of the other threads. The wiki has more information on correctly obtaining a backtrace - see https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
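
For illustration, one way to collect both the full backtrace of the crashing thread and the stacks of all other threads from a core file is sketched below. The binary path, core path, and output file name are assumptions; the wiki page linked above remains the reference procedure.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb batch invocation that prints 'bt full' for the crashing
    thread, followed by a backtrace of every thread."""
    return [
        "gdb", "--batch",
        "-ex", "bt full",
        "-ex", "thread apply all bt",
        binary, core,
    ]


def collect_backtrace(binary, core, output_path):
    """Run gdb and save its output; returns gdb's exit status."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True, text=True, check=False,
    )
    with open(output_path, "w") as handle:
        handle.write(result.stdout)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical locations; adjust to the actual binary and core file.
    collect_backtrace("/usr/sbin/asterisk", "/tmp/core", "/tmp/backtrace.txt")
```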

The really odd part is {{codecs->payloads}} should never be NULL. This object should be set so long as the {{ast_codecs}} object is valid. Even if {{codecs}} was already freed, I'd expect this to be pointing to invalid memory, not NULL - which makes me wonder what is going on here.

Note that this is also a completely different issue from what David reported, which occurred when adding a hint:

{noformat}
#0 0x08089bb0 in __ao2_find ()
#0 0x08089bb0 in __ao2_find ()
No symbol table info available.
#1 0x08157df0 in ast_add_hint ()
No symbol table info available.
#2 0x08161165 in ast_add_extension2_lockopt ()
No symbol table info available.
#3 0x003d9a33 in pbx_load_config () at pbx_config.c:1644
__PRETTY_FUNCTION__ = "pbx_load_config"
#4 pbx_load_module () at pbx_config.c:1848
con = <value optimized out>
__PRETTY_FUNCTION__ = "pbx_load_module"
{noformat}

In neither case, however, am I sure how the system got in this state, or how to reproduce these errors.

I think that in both cases, you (either of you :-) ) will need to provide some information on what led up to the crash and a method to reproduce the errors. These are strange enough that just looking at the somewhat incomplete backtraces isn't sufficient for someone to reproduce or assist on these problems.

By: Jeremy Lainé (sharky) 2013-12-18 02:59:18.537-0600

For the sake of clarity: forget about the backtraces I added, it sounds as though I hit a different problem than the one reported by David (mine was not triggered by a reload from the CLI). If the problem occurs again I will open a different issue.
By: Rusty Newton (rnewton) 2014-01-06 20:54:43.805-0600

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested. Further information can be found at http://www.asterisk.org/developers/bug-guidelines

Suspending since we don't have any further information to lead this investigation.
By: David Brillert (aragon) 2014-02-20 17:13:24.941-0600

Can I get this reopened?
I have seen the segfault again and this time I have an unoptimized backtrace attached.
If not I'll open a new ticket.
By: David Brillert (aragon) 2014-02-21 08:56:26.195-0600

Never mind, I opened a new report and attached the backtrace there.