[Home]

Summary:ASTERISK-22960: [patch] Segfaults in res_musiconhold.c:moh_release
Reporter:Ben Smithurst (bensmithurst)Labels:patch
Date Opened:2013-12-10 16:13:59.000-0600Date Closed:2018-01-02 08:30:36.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_musiconhold
Versions:11.6.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Debian, using Realtime MusicOnHold with a MySQL databaseAttachments:( 0) ASTERISK-22960-res_musiconhold_moh_release.diff
( 1) backtrace.txt
( 2) details.txt
Description:We have been seeing a reasonably frequent segfault in res_musiconhold, in the moh_release() function.  This was with realtime musiconhold, stores in mysql, with cachertclasses disabled (potentially relevant to the fix).

After a bit of debugging and testing we have come up with the attached patch.  There are two parts to this fix.

ast_activate_generator is passed a class and a channel, to activate MOH on that channel. If the channel already has MOH, it is disabled, before allocating the new instance. However what I think can happen is that if the channel's instances is the same as the one to activate for some reason, and the channel is the last reference to that class, the object will be destroyed, and allocating a new instance of it is then invalid.  Using MALLOC_DEBUG makes this more obvious, as the object is full of {{0xdeaddead}}.

The fix changes the order to allocate a new instance first, then unallocate from the channel, before allocating the new instance to the channel.

local_ast_moh_start checks if the class already has a MOH class saved via state->class, and if so, it decrements the reference of the local variable mohclass, before assigning mohclass = state->class. However this function will then unref mohclass at the end, I think this can take the ref count of state->class (which was assigned to mohclass) too low, so I have changed it to increment the ref count of state->class when doing that assignment.  This fixes it for us but I may have misunderstood something to do with reference counting in this code.

These appear to work, and running with REF_DEBUG enabled didn't show reference numbers going crazy, and when I terminated my test calls I saw MOH objects being destroyed as expected.

We have been able to reproduce this by creating a simple queue with a few extensions as members, and then using AMI to create a load of calls into the queue, and just dumping the other side of the call into something like echo test.  Normally with enough calls this will make Asterisk crash within a few minutes.

We haven't yet tested with cachertclasses enabled, looking at the code, that may make a difference, but we decided to chase the bug rather than try to hide it by changing settings.
Comments:By: Rusty Newton (rnewton) 2013-12-10 19:31:07.492-0600

bq. We have been able to reproduce this by creating a simple queue with a few extensions as members, and then using AMI to create a load of calls into the queue, and just dumping the other side of the call into something like echo test. Normally with enough calls this will make Asterisk crash within a few minutes.

Can you grab a backtrace from the core resulting from one of your crashes? https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationAfterACrash

How many calls are generally "enough"?

By: Michael L. Young (elguero) 2013-12-10 20:15:05.505-0600

I would be curious to see that backtrace as well.  Especially, since in the attached patch I see the reference count being incremented on state->class which is something that was noticed when working on ASTERISK-21775.  There is a patch on that issue (ASTERISK-21775) which added a reference to this object but it didn't seem to solve that issue for the reporter.

By: Ben Smithurst (bensmithurst) 2013-12-11 03:15:39.432-0600

Attached a backtrace of all threads in [^backtrace.txt] and then some more details about the crashing thread in [^details.txt].  As you can see the data in moh->parent is corrupted, most obviously name/dir.

Rusty - in this case there were 9 calls in the queue at the time of the crash.

By: Joshua C. Colp (jcolp) 2017-12-18 11:14:35.833-0600

Have you experienced this under recent versions of Asterisk? I do know there has been a little bit of work in this area, but based on at least one recently filed issue there are still issues. Just trying to see the scope and if they are related.

By: Asterisk Team (asteriskteam) 2018-01-02 08:30:36.073-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].
[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines