[Home]

Summary:ASTERISK-28625: Playback of local files impacted by large media cache
Reporter:Kevin Reeves (kevinreeves)Labels:patch
Date Opened:2019-11-20 15:29:45.000-0600Date Closed:2019-12-19 11:03:08.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/Streams
Versions:16.6.1 Frequency of
Occurrence
Constant
Related
Issues:
Environment:CentOS Linux release 7.6.1810Attachments:( 0) ASTERISK-28625.patch
Description:Playback of local files is impacted by media cache.  As the cache grows, performance of common Read and Playback (and presumably other) operations for local files is impacted.

For example, Playback command of a local file like this:
exten => s,n,Playback(thankyou)

Even though it's playing a local file, those operations are searching the media cache. As the cache size grows, the performance of that type of call is impacted.  In fact, we've seen a multi-second delay from the time the Playback command is executed to when the caller hears the audio.

The delay grows significantly over time as the cache grows bigger.  With thousands of files in the media cache while the server is under heavy load, the delay has been upwards of 15-30 seconds.  Without files in the media cache there is zero delay.

Here's a debug output snippet that shows the issue.  The Read application is playing a local file named 'pb/ac-enter-pin'.  Notice that media_cache.c line 194 fails to locate the media.

   -- Executing [s@nx-dial-session:20] Read("SIP/vitel-inbound2-00000002", "__DS_PIN,/var/lib/asterisk/sounds/pb/ac-enter-pin") in new stack
[2019-11-20 15:22:06] DEBUG[30329][C-00000003]: media_cache.c:194 ast_media_cache_retrieve: Failed to obtain media at '/var/lib/asterisk/sounds/pb/ac-enter-pin'
[2019-11-20 15:22:06] DEBUG[30329][C-00000003]: media_cache.c:194 ast_media_cache_retrieve: Failed to obtain media at '/var/lib/asterisk/sounds/pb/ac-enter-pin'

This section of the code should not be used in cases where the file is a local file.
Comments:By: Asterisk Team (asteriskteam) 2019-11-20 15:29:46.491-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Kevin Reeves (kevinreeves) 2019-11-20 15:48:31.962-0600

This file has a patch to main/file.c that checks if the filename is an absolute path and ignores the media cache check.

By: Kevin Harwell (kharwell) 2019-11-20 17:03:44.781-0600

Oddly I do not see the attachment? I can see under the history "tab" that it was added, but nothing about removal. Have you signed a license agreement [1]? If not maybe that is why? If so, we require the agreement to be signed before a patch can be attached (or submitted to gerrit). See the contribution process [2] for more details.

Thanks!

[1] https://issues.asterisk.org/jira/secure/DigiumLicense.jspa
[2] https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process

By: Kevin Harwell (kharwell) 2019-11-20 17:04:30.407-0600

Community post link for reference:

https://community.asterisk.org/t/media-cache-performance-impact-on-all-files/81720

By: Kevin Reeves (kevinreeves) 2019-11-20 17:13:35.288-0600

Kevin - I signed the agreement but I think it needs to be approved.  I haven't heard back yet.

By: Kevin Reeves (kevinreeves) 2019-11-21 09:35:24.510-0600

My contributor agreement has been approved.  Can you see the patch now?

By: Kevin Harwell (kharwell) 2019-11-21 09:49:11.256-0600

Yes, I can now see the patch.

Thanks for the contribution! If you'd like your contribution to be included faster, you should submit your patch for code review by the Asterisk Developer Community. To do so, please follow the Code Review [1] instructions on the wiki. Be sure to:
* Verify that your patch conforms to the Coding Guidelines [2]
* Review the Code Review Checklist [3] for common items reviewers will look for
* If necessary, provide tests for the Asterisk Test Suite that verify the correctness of your patch [4]

When ready, submit your patch and any tests to Gerrit [5] for code review.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Code+Review
[2] https://wiki.asterisk.org/wiki/display/AST/Coding+Guidelines
[3] https://wiki.asterisk.org/wiki/display/AST/Code+Review+Checklist
[4] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Test+Suite+Documentation
[5] https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

By: Kevin Reeves (kevinreeves) 2019-11-21 12:58:01.162-0600

Thanks.  I've packaged my changes up (different approach than the patch I attached to this) and submitted it for review.

https://gerrit.asterisk.org/c/asterisk/+/13240

By: Sean Bright (seanbright) 2019-11-21 16:01:12.471-0600

[~kevinreeves],

bq. With thousands of files in the media cache while the server is under heavy load, the delay has been upwards of 15-30 seconds.

Can you quantify "thousands of files?" 2k? 10k? more?

Are your media files on the local disk or are they stored remotely on something like an NFS share?

By: Kevin Reeves (kevinreeves) 2019-11-21 16:18:59.851-0600

I experienced the issue as the total cached media files hovered around 1,500+.  The files are stored in the local /tmp directory.  I have multiple standalone CentOS Linux release 7.5 servers with multiple CPU and 2G+ memory.  Each of them experienced some amount of the same symptoms.  In general, each one had active calls in the 100-200 range.  Load average hovering around 20-30.


By: Sean Bright (seanbright) 2019-11-21 16:24:11.239-0600

Thanks [~kevinreeves]. As a point of comparison, I populated the media cache in my development environment with ~25k items and calling {{Playback()}} on randomly chosen (local) files was pretty snappy. Granted this environment is not under any type of load.

bq. Load average hovering around 20-30

To make sure that I am clear: the 1 minute load average on these machines is between 20 and 30? And specifically how many cores are in these machines?

By: Kevin Reeves (kevinreeves) 2019-11-22 10:36:46.646-0600

Sean -

To clarify - we have 24 cores in each server.  The 1 minute load average was hovering at 20-30 but at times would spike 200+.  That load average spike didn't seem to correlate to the delayed playback times.  These servers are dedicated, non virtualized hardware.  Meaning they are not sharing their CPU with any other processes.

My team was actively debugging during this impacted time, so I went back to the logs from the incident and here's an example of what we were seeing.
[2019-11-19 14:45:24] VERBOSE[14196][C-000228fa] pbx.c: Executing [s@nx-dial-session:20] Read("SIP/aaa-002f6456", "__DS_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 14:45:31] VERBOSE[14196][C-000228fa] file.c: <SIP/aaa-002f6456> Playing 'pb/ac-enter-pin.slin' (language 'en')

Nothing else was being done on that channel in-between those 2 lines.

Notice the timing of these 2 lines.  First line shows the Read command from the dialplan. The 2nd line shows the time that the audio started playing.  Notice how there's a 7-second delay between them.  The experience to the caller is dead air during that 7-seconds.  That file pb/ac-enter-pin is on a local disk - not NFS.

Here's another example on the same server later in the day:
[2019-11-19 16:20:02] VERBOSE[16688][C-000002dc] pbx.c: Executing [s@nx-dial-session:20] Read("SIP/aaa-00001905", "__DS_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 16:20:11] VERBOSE[16688][C-000002dc] file.c: <SIP/aaa-00001905> Playing 'pb/ac-enter-pin.slin' (language 'en')

You see a similar delay.

I have lots of examples like this.

At the time, we had probably 1,500+ entries in the media cache.   Not nearly as massive a list as what you entered in your development environment, but it was definitely impacting our servers.

Just to complete the picture, here's an example of the same file being played instantaneously - this is good and what we want to see:
[2019-11-19 16:24:24] VERBOSE[23354][C-000002f7] pbx.c: Executing [s@nx-dial-session-ac:10] Read("SIP/aaa-00001b05", "AC_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 16:24:24] VERBOSE[23354][C-000002f7] file.c: <SIP/aaa-00001b05> Playing 'pb/ac-enter-pin.slin' (language 'en')

It happened during the same timeframe as the others, which highlights the randomness of the issue.  It took us a LOT of brainstorming and debugging before we identified the media cache as the culprit.

It's worth noting that we weren't using the media cache until 4 or 5 days prior to this incident.  So we never had ANYTHING in the media cache until then.  When we started using the media cache we didn't see any issues for a few days.  It wasn't until we ran for a few days and accumulated a media cache of this size that we started to feel the impact.

Once we started to suspect the media cache, we wrote a script to clear it out using the CLI command. But, the command was incredibly slow.  That surprised us too.  

Each time that we called the command to remove an entry it took a few seconds to finish.   We were using:

media cache delete <URL of the media>

That was an interesting revelation because it made us think the Asterisk internals were probably having to go through the list linearly and perhaps that was very slow.

So I started reading through the source code, and noticed that every audio file reading operation was going through the same code and therefore was scanning through the media cache.

So that’s ultimately how we landed on suspecting this as the issue.

I hope this helps.  Let me know if you have other questions.

By: Sean Bright (seanbright) 2019-11-22 10:46:48.883-0600

[~kevinreeves],

The media is stored in a hash table which should make the lookups fairly quick.

You are using HTTP or HTTPS playback somewhere in your code I assume? Otherwise I don't know what the media cache would be populated with.

By: Sean Bright (seanbright) 2019-11-22 11:14:43.139-0600

Are you using Asterisk DB heavily?

By: Kevin Reeves (kevinreeves) 2019-11-22 11:28:49.592-0600

Yes - we're using the media cache to Playback audio files from an http address.

I'm unclear about why this seems to be impacting performance.  The hash table should be fast - which you proved with 25k items in your development environment and you're not seeing any issues.  So I don't know what impact load has on the hash searching operations.  Maybe under heavy load we introduce a contention issue for the hash table?  

Maybe when new items are coming into the media cache and it's syncing to the astdb it locks the container just long enough to block other requests for it?

What's interesting about that question specifically is that when we were experiencing these issues and our script was trying to delete items from the media cache it was going really slow on each delete operation.  I see in the code in ast_media_cache_delete() and ast_media_cache_retrieve() that it's writing to the astdb as well.  So I wonder what the experience is with other requests to search the media cache while it's writing to the db?

Not sure - just brainstorming.

But with regards to my suggested fix - I am suggesting that only remote files should be using the media cache.  Otherwise, we shouldn't be searching that hash table at all.  Or is that a bad assumption?

By: Kevin Reeves (kevinreeves) 2019-11-22 11:31:18.389-0600

Sean - No - we're not directly/intentionally using the Asterisk DB anywhere in our code.

By: Sean Bright (seanbright) 2019-11-22 11:32:56.374-0600

bq. But with regards to my suggested fix - I am suggesting that only remote files should be using the media cache. Otherwise, we shouldn't be searching that hash table at all. Or is that a bad assumption?

I think your patch is fine. I am more interested in the underlying cause of the slow down.

By: Kevin Reeves (kevinreeves) 2019-11-22 11:46:09.913-0600

I agree.  My patch doesn't really fix the underlying issue - it simply removes local file playback from messing with the media cache at all.  If my assumption is right that non-remote files would ever be in the media cache, then there's no need to ever search there anyway.

I'm interested in the potential astdb contention issues with add/delete operations.  If a dialplan has a lot of Read/Playback requests while using media cache then under heavy load you could see a lot of requests to search the media cache at the same time.  I guess it's possible that could introduce a moment in the thread where it gets blocked while astdb is updated.  And you wouldn't really experience that issue until you had the right blend of Read/Playback file operations along with a hefty media cache.

By: Friendly Automation (friendly-automation) 2019-12-19 11:03:09.732-0600

Change 13240 merged by Friendly Automation:
main/file.c: Limit media cache usage to remote files.

[https://gerrit.asterisk.org/c/asterisk/+/13240|https://gerrit.asterisk.org/c/asterisk/+/13240]

By: Friendly Automation (friendly-automation) 2019-12-19 11:26:55.540-0600

Change 13485 merged by Friendly Automation:
main/file.c: Limit media cache usage to remote files.

[https://gerrit.asterisk.org/c/asterisk/+/13485|https://gerrit.asterisk.org/c/asterisk/+/13485]

By: Friendly Automation (friendly-automation) 2019-12-19 18:44:30.754-0600

Change 13486 merged by Joshua C. Colp:
main/file.c: Limit media cache usage to remote files.

[https://gerrit.asterisk.org/c/asterisk/+/13486|https://gerrit.asterisk.org/c/asterisk/+/13486]