
Summary: ASTERISK-25663: app_queue: Segfault in queue during playback of periodic announcement - filename address out of bounds
Reporter: Conrad de Wet (Conrad)
Labels:
Date Opened: 2016-01-07 02:22:59.000-0600
Date Closed:
Priority: Major
Regression?:
Status: Open/New
Components: Applications/app_queue
Versions: 11.19.0 13.18.4
Frequency of Occurrence: Occasional
Related Issues:
Environment: CentOS
Attachments: (0) 2016-01-06.txt
Description: One Asterisk box crashed yesterday with a segfault. According to the backtrace, the fault was generated in "ast_openstream_full".

Yesterday was a particularly busy day (though not the busiest). This queue could have had up to 300 people waiting in it for up to 1 hour.

Also of interest: the core dump was not in /tmp/ but in the moh folder (/var/lib/asterisk/moh/..../core.31562). This was no big deal (other than being tricky to find), but it seemed to point to the problem being part of moh or some sound file playback issue.

The last time this happened (about 3 months ago), the exact same thing occurred. It was then that I converted all the moh files from .wav to .g729, in an attempt to avoid transcoding (if any). Clearly this didn't help. In this case I also see "say_periodic_announcement" in the bt output. Would it help to convert the announcements to .g729 as well?

I know some of the values are "<value optimized out>", but could you please look at this anyway? It's really the only issue we have left with Asterisk. Alternatively, please give me some guidance on getting the queues to operate correctly under high load. The issue does seem to be with "ast_openstream_full"; I'm just not sure what it is.
Comments:

By: Asterisk Team (asteriskteam) 2016-01-07 02:23:01.237-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Conrad de Wet (Conrad) 2016-01-07 02:25:16.649-0600

Backtrace - but with some values optimized out

By: Mark Michelson (mmichelson) 2016-01-11 13:09:21.895-0600

I had a look at the backtrace. It's hard to tell anything concrete, but something that stands out in the full backtrace is:

{noformat}
#1  0x00000000004d6b68 in ast_openstream_full (chan=0x7fc5a45775c8, filename=0x5f746e6567412f31 <Address 0x5f746e6567412f31 out of bounds>,
{noformat}

Note that the address of the filename is out of bounds. That makes it seem like Asterisk is attempting to access memory it should not be. The address of the filename is part of the queue object itself, which may mean the queue is undergoing some sort of change (possibly due to a reload, shutdown, or restart) at the time when the crash occurred. I suspect that the problem here is that access to the sound file strings is not being properly lock-protected, but I'm not 100% sure. Do you happen to know if there were any reloads or anything similar happening at the time of the crash?
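
As a side check on the out-of-bounds address, the sketch below reinterprets the pointer value from frame #1 as eight raw bytes. It assumes a little-endian x86-64 host (an assumption; the report only says CentOS). If the bytes turn out to be printable ASCII, that would suggest the pointer slot was overwritten with string data, which fits the idea of reading memory that was freed or reused while another thread changed it. Treat this as a diagnostic aid, not a confirmed root cause.

```python
# Reinterpret the bogus filename pointer from frame #1 as raw bytes.
# Assumption: little-endian x86-64, so the lowest byte of the 64-bit value
# is the first byte in memory.
bogus_pointer = 0x5f746e6567412f31

raw = bogus_pointer.to_bytes(8, byteorder="little")
decoded = raw.decode("ascii")  # raises UnicodeDecodeError if not plain ASCII

print(raw.hex())      # the eight bytes in memory order
print(repr(decoded))  # the same bytes read as text
```

Under that assumption the value reads as the text `1/Agent_`, which looks like a fragment of a string rather than a valid address.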

By: Conrad de Wet (Conrad) 2016-01-11 14:05:31.083-0600

Right, some background:
1 - The queues run from the MySQL real time system. So they don't go through a reload at any point.
2 - There are about 1300 queues in the database.
3 - The music on hold files have not changed in months.
4 - It's possible (but quite unlikely) that the periodic announcements were being fiddled with at the time of the crash. I notice that the bt also mentions "say_periodic_announcement", so maybe "ast_openstream_full" was trying to play a periodic announcement at the time of the crash.
5 - The only other handling of the media files is that they (both announcements and moh) are synced from a src folder. This is a kind of "staged" and "live" system meant to avoid these kinds of issues. The command is: rsync -a --delete --bwlimit=500 /var/lib/asterisk/sounds-src/ /var/lib/asterisk/sounds/. All uploads get put in sounds-src, and the rsync runs every few minutes.

So maybe rsync is getting a lock on the file at exactly the same time that Asterisk is attempting to read it, and that is causing an issue. However, I don't see how this would cause "the address of the filename is out of bounds". Then again, I have no idea how rsync handles file locks.

That's all I can think of for now.
Thanks for the feedback.

By: Mark Michelson (mmichelson) 2016-01-11 14:35:47.942-0600

Thanks for the information. In this case, I don't suspect the filesystem to be the problem, but literally the name of the file as stored in Asterisk's memory. You are correct that the problem occurred when attempting to play a periodic announcement.  If the periodic announcements were being changed in the database, that could be a possible explanation for the issue.

The queue contains an array of strings, which are the names of files to play back to a caller when waiting in a queue. I suspect that if the database's periodic announcements were changed, then that may mean that one thread in Asterisk was attempting to modify the array of file names while a separate thread was trying to access the array. Those operations should be mutually exclusive, but since they are not, it leads to the possibility of attempting to read freed/corrupted/incorrect memory from the array. I think the proper code fix within Asterisk is to ensure mutual exclusion on the operations involving the periodic announcement filenames.
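
The mutual exclusion described above can be sketched as follows. Asterisk itself is written in C and this is not its code; the class name, method names, and file names below are hypothetical, and Python is used only to illustrate the pattern. The writer swaps the list while holding a lock, and the reader copies the name out while holding the same lock, so playback never works from a reference into shared state that another thread may be replacing.

```python
import threading


class QueueAnnouncements:
    """Hypothetical stand-in for a queue's list of announcement filenames."""

    def __init__(self, filenames):
        self._lock = threading.Lock()
        self._filenames = list(filenames)

    def replace_all(self, filenames):
        # Writer path (for example, a realtime refresh): swap under the lock.
        with self._lock:
            self._filenames = list(filenames)

    def get(self, index):
        # Reader path (for example, playback): copy the name out while the
        # lock is held, then use only the private copy.
        with self._lock:
            if not self._filenames:
                return None
            return str(self._filenames[index % len(self._filenames)])


announcements = QueueAnnouncements(["announce-initial"])  # placeholder name
results = []


def writer():
    for i in range(1000):
        announcements.replace_all([f"announce-{i % 3}"])


def reader():
    for i in range(1000):
        results.append(announcements.get(i))


threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

valid = {"announce-initial", "announce-0", "announce-1", "announce-2"}
print(all(name in valid for name in results))
```

In C the same idea would mean holding the queue's lock both where the filename array is rebuilt and where a filename is read for playback, and duplicating the string before the lock is released.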

As far as a workaround until the code is fixed, the only thing that I can think of is to perform database maintenance/alterations/etc. for the queue during times of inactivity if at all possible.

By: Conrad de Wet (Conrad) 2016-01-14 02:42:11.661-0600

Thank you for assisting with this. In the meantime I'll configure our management environment to prevent any changes to the queue configurations while callers are waiting in the queue. A simple AMI call should be able to tell me how many callers there are, and based on that the update button could be disabled.
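
A minimal sketch of that check, assuming the AMI `QueueSummary` action is used and that its `QueueSummary` event carries a `Callers` field (verify the field names against your Asterisk version). The helper names, the queue name, and the numbers in the sample event are made up for illustration, and the code only parses event text; connecting and logging in to AMI is left out.

```python
def parse_ami_message(text):
    """Parse one AMI message ('Key: Value' lines) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def queue_is_idle(event_text):
    """Return True when a QueueSummary event reports zero waiting callers."""
    fields = parse_ami_message(event_text)
    if fields.get("Event") != "QueueSummary":
        raise ValueError("not a QueueSummary event")
    return int(fields["Callers"]) == 0


# Illustrative event text only; not captured from a real system.
sample_event = (
    "Event: QueueSummary\r\n"
    "Queue: support\r\n"
    "LoggedIn: 12\r\n"
    "Available: 0\r\n"
    "Callers: 37\r\n"
    "HoldTime: 410\r\n"
    "TalkTime: 180\r\n"
    "LongestHoldTime: 3600\r\n"
    "\r\n"
)

allow_update = queue_is_idle(sample_event)
print(allow_update)  # False: callers are waiting, so keep the button disabled
```

A caller can join the queue between the check and the database update, so this only narrows the window for the race; it does not close it.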