
Summary: ASTERISK-25806: Deep ACD queueing with app_queue.c causes denial of service on systems with a high core count by excessive thread mutex locks
Reporter: Troy Bowman (troy)
Labels:
Date Opened: 2016-02-22 20:32:37.000-0600
Date Closed: 2016-06-10 09:20:09
Priority: Major
Regression?:
Status: Closed/Complete
Components: Applications/app_queue
Versions: 13.6.0
Frequency of Occurrence: Constant
Related Issues:
Environment: Vanilla install of Asterisk 13.6.0 (no FreePBX or the like) with a couple of minor patches of mine for after-call-work automatic pausing as well as tweaked reason-text auto-away pausing. I have yet to patch and update to 13.7.x; it's on my to-do list, though this issue doesn't seem to be addressed in any version subsequent to 13.6 at the time I write this. Asterisk runs within a QEMU KVM virtual machine, 4G memory, 4 virtual CPUs, Gentoo (I tested various CPU configurations, including bare metal, as described below). The host is a Dell T130 with dual E5-2620 v3 CPUs at 2.4GHz and 128G RAM, Gentoo. (I use Gentoo for easy customization/patching of OS-managed packages.)
Attachments:
( 0) cpu_usage.png
( 1) load.png
( 2) prof_output_many_cores.txt
( 3) prof_output_single_core.txt
( 4) queues.conf
Description: I posted this on community.asterisk.org, but I figured I might as well post it on issues.asterisk.org too, since I think this is a bug in the way thread mutex locking happens in app_queue.c.

---
TL;DR: Asterisk 13 app_queue.c with a high queued-call count causes high load and heavy system CPU time. This is exacerbated by the number of CPU cores available, because the high load comes from thread mutex locks: the more cores, the more time the mutexes take, and the less time there is for real call handling. A high number of queued callers causes choppy audio because of the lack of available CPU time, effectively denying system usability for existing bridged calls.
---

I have been battling a scalability issue with queueing calls in app_queue.c. We currently run nine ACD queues. Seven of the nine contain about 15 people. Four of those seven queues are "Support" queues, for Customers, Sales, Contractors, and Tenants. They contain the same people (using the whisper feature to inform the agent what their role should be for the call).

During the week of the huge snowstorm on the east coast about a month ago, we were hammered daily with calls way above our usual call volume, and each of those support queues was constantly filled with about ten to forty people waiting for an agent. We regularly had between about 40 and 120 calls waiting. When this happened, the system started to produce choppy audio. As I listened to the audio, it did NOT sound like the chops or static that packet loss produces, where a packet is lost every hundred milliseconds or so. It would drop audio for rather long chunks of time (still a fraction of a second per instance) before the audio would pick up again. This happened to all of the calls in progress, including the unbridged ones that were simply listening to hold music in the queue. The way the audio chopped rendered most calls in progress useless, because there wasn't enough audio for any of the parties to understand each other.

Needless to say, our Asterisk phone system was not very popular that week. I was confused because we had handled 40 or so concurrent calls before without a hitch. But something about queueing the calls was driving the system bonkers.

In my troubleshooting, I noticed that when this happened, the CPU load (meaning the number of processes waiting in line for CPU time) would easily rise above the number of cores in the machine, sometimes going over 30. That isn't 30% CPU usage; it's 30 concurrent processes waiting for CPU time. I tried to do my due diligence as a system administrator to find the bottleneck in the system. I determined that this was absolutely not a network problem: we were far from capacity in both packets per second and bandwidth. I determined that this had little to do with any CPU-intensive work, like transcoding, recording to disk, or encoding those recordings into mp3. (I am nicing mp3 encoding to the lowest priority anyway.) The disk was relatively idle even with the recording. Watching top, and switching to the threads view, I noticed that the CPU usage of each Asterisk thread when things were good was under 2%, but every once in a while, when the load would skyrocket, things would get worse and each thread would suddenly consume over 15% CPU. Yet this was still not using much actual user CPU computation time. Most CPUs on the system had a lot of idle time left over, and most of the high CPU usage was found in the system category (which I understand is low-level kernel or libc code).

My first reflex was to give the machine more CPU cores, because a high load should definitely be alleviated by more cores chomping away at the computation, right? I bumped the virtual machine from 8 virtual cores (which become 8 threads on the host) to 12 virtual cores (12 threads on the host). I did not want to go much beyond that because, though the host has 24 threads, 12 of those are "hyper-threads" (SMT), not full cores. To my surprise, this exacerbated the problem. The load often skyrocketed to near 100, the Asterisk threads under duress consumed much more CPU, and each core spent just as much time in system usage, even though there were more cores.

I thought that maybe there was contention with other virtual machines on the host, so I migrated all of the other virtual machines away, leaving the asterisk virtual machine alone on the host, having all of the cpu resources to itself. That didn't help at all.

I thought that maybe it was wasting CPU by transcoding the hold music on the fly, so I made sure that the hold music was encoded for the native call encoding. That didn't make any difference; I already had it encoded for ulaw, which the calls were using, so adding the other encodings was superfluous. I tried turning the hold music off entirely. That didn't help either.

I started testing with a test virtual machine in various hardware configurations, recompiled Asterisk for debugging, and started dumping gmon profiles. I used sipp to simulate callers queueing up in the various queues. It took a few more sipp connections to get the system to the level where it acted like production, probably because it is fake RTP traffic and because my test system had zero actual calls in progress, but I was able to duplicate the problem nonetheless. In the gprof dumps, I noticed that more CPU time was spent in thread mutex locks (like __ast_pthread_mutex_lock) and atomic integer fetches (like ast_atomic_fetchadd_int) on a system with a higher CPU count than on one with fewer. (I will try to attach the gprof dumps.) As I browsed app_queue.c, I noticed that it ao2-locks out all other threads while it checks whether it can hand a call off to an agent in try_calling, among other functions. That explained the high system time. That explained the high load. This is thread mutual-exclusion lock contention.
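To illustrate the contention pattern I mean, here is a toy pthreads analogy (NOT the actual app_queue.c or ao2 code; the file name, structure names, and numbers are made up for illustration): every "waiting caller" thread takes one queue-wide mutex to scan the member list, so with many waiting callers the threads mostly serialize on that single lock, and with many cores they spend their time bouncing the lock between caches and sleeping in the kernel instead of doing real work.

{code}
/*
 * Toy analogy of the contention pattern (NOT the real app_queue.c code):
 * every "waiting caller" thread grabs the same queue-wide mutex to scan
 * the member list, so with many callers and many cores the threads
 * mostly serialize on the lock and burn system time waiting for it.
 *
 * Build: gcc -O2 -pthread contention_demo.c -o contention_demo
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_MEMBERS   15    /* agents per queue, as in our setup          */
#define NUM_CALLERS   120   /* waiting callers hammering the same lock    */
#define ITERATIONS    5000  /* "is an agent free yet?" checks per caller  */

struct member {
	int busy;
};

struct queue {
	pthread_mutex_t lock;              /* one coarse lock for the whole queue */
	struct member members[NUM_MEMBERS];
};

static struct queue support_queue = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Each waiting caller repeatedly locks the whole queue to look for a free agent. */
static void *caller_thread(void *arg)
{
	(void) arg;
	for (int i = 0; i < ITERATIONS; i++) {
		pthread_mutex_lock(&support_queue.lock);
		for (int m = 0; m < NUM_MEMBERS; m++) {
			if (!support_queue.members[m].busy) {
				/* In the real application the call would be offered
				 * to the agent here; the demo does nothing. */
				break;
			}
		}
		pthread_mutex_unlock(&support_queue.lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t callers[NUM_CALLERS];

	for (int i = 0; i < NUM_CALLERS; i++) {
		pthread_create(&callers[i], NULL, caller_thread, NULL);
	}
	for (int i = 0; i < NUM_CALLERS; i++) {
		pthread_join(callers[i], NULL);
	}

	puts("done; run under 'time' and compare sys time against user time");
	return 0;
}
{code}

Running something like this under 'time' on boxes with different core counts shows the same shape I see in the profiles: almost no user time, lots of system time, and it gets worse with more cores, not better.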

I thought that maybe this had to do with the overhead of virtualization, and that the lock contention might be alleviated by reducing any latency that virtualization may cause. So, I revamped my test OS to fit in a Linux container instead of a KVM image. In the Linux container, Asterisk had access to the bare-hardware kernel with all 24 virtual cpus, 12 cores and 12 hyperthreads. This did not help either; it exacerbated the problem even more. For grins and giggles I tested outside of a Linux container with the same result.

In all my testing it started to become clear to me: the more CPU cores that are available to Asterisk, the more expensive queueing calls in app_queue.c becomes. At this point, I backed the production machine down to 4 virtual CPUs and started dropping queued calls out of the queue to voicemail after five minutes of hold time to keep the number of calls in each queue low. Keeping the number of queued calls in each queue under about 10 ensured that nary an RTP audio packet was dropped because of constipated CPUs. Real bridged channels use hardly any CPU in comparison, but they sure are subject to lock-contention constipation.

Could Asterisk queues possibly perform better on a single-CPU machine? My gut kept saying "no way," because Asterisk was consuming all 12 to 24 cores in system time alone to hold some 80 fake queued calls. But it's easy enough to test. I spun up the test virtual machine with only one virtual core and tested. I watched as the system load increased and held between 10 and 20 (which is still scary high, by the way), but there it was, queued with 120 calls, and not dropping any audio because of CPU time contention. I pushed it up to 160 queued calls and things began to get wiggy, but to think it was handling this capacity without the extra multiprocessing was surprising to me. This probably makes sense because the volatile variables, mutex locks, and other hot memory sit in the L1/L2 caches of the one core it's forced to run on.

As I was trying to research this problem, I found the following thread which I believe is quite related to the problem I am seeing:

https://community.asterisk.org/t/asterisk-stops-processing-calls-under-heavy-load/49218

At the end of that thread, the solution was to downgrade to the Asterisk 11.6 (Certified) version, because it apparently does not have the lock-contention flaws that appear in subsequent 11 releases, and probably in all of 13. This suggests that whatever changed after 11.6 made the problem rear its ugly head. I don't have the luxury of downgrading to 11, because I've already become highly invested in the new ways that queues are handled in Asterisk 13 in all of my queue management, reporting, and real-time monitoring scripting.

I tried using cgroups and cpuset to run Asterisk exclusively on one core, placing ALL other OS operations on the other cores. That seemed to slightly mitigate the problem, except that I couldn't find a way to get AGI scripts and other shell-outs from Asterisk to extricate themselves from the exclusive cpuset, and I actually wouldn't mind channel and bridge threads running on other cores where plenty of CPU time is available. This seems to be the only thing I can really do at this point, since I don't have any expertise in the app_queue.c code itself.

I do think this should be classified as a bug. If I am wrong to report it here, please accept my apologies. It seems to be some sloppy multiprocessing programming in app_queue.c. It seems like there needs to be a more elegant, less thread-blocking way to handle mutually exclusive locks, or maybe a less aggressive way (with some sleeping between loops) to check for available agents and distribute queued calls. Another option might be the ability to set the affinity for all app_queue.c threads to a single core, so they don't have to contend with all of the other truly multithreaded work, like bridges and channels; see the sketch below.
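To make the affinity idea concrete, here is a minimal sketch using the standard glibc call pthread_setaffinity_np (this is hypothetical, not an existing Asterisk option or patch). If the queue-maintenance threads all pinned themselves like this, the hot locks would at least stay in one core's cache while channel and bridge threads remained free to run elsewhere.

{code}
/*
 * Hypothetical illustration of pinning a thread to one CPU core with
 * pthread_setaffinity_np() -- not an existing Asterisk feature.
 *
 * Build: gcc -O2 -pthread pin_thread.c -o pin_thread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to the given core; returns 0 on success,
 * or a positive errno-style value on failure. */
static int pin_current_thread_to_core(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
	int err = pin_current_thread_to_core(0);

	if (err) {
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
		return 1;
	}
	printf("this thread now runs on CPU %d only\n", sched_getcpu());
	return 0;
}
{code}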
Comments:

By: Asterisk Team (asteriskteam) 2016-02-22 20:32:38.166-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Ross Beer (rossbeer) 2016-02-23 07:37:31.996-0600

I believe I am also seeing the same thing with Asterisk 13.7.2 queues; the previous version we ran, Asterisk 1.8, handles way more calls than Asterisk 13.

By: Rusty Newton (rnewton) 2016-03-13 13:07:04.118-0500

Thank you for the issue report. I've consulted with our core developers and it appears this limitation of app_queue is due to its fundamental design. The elements of app_queue at the root of the limitation have been this way a long time and changing them would require a deep re-factoring of app_queue. Therefore we will consider this a feature request. Since we typically do not track feature requests on the issue tracker without an associated patch - let me know if you intend to work on refactoring app_queue and then we can keep this open.

In addition to the deep refactoring of app_queue itself, any such refactoring would require tests to be written for those changes in order to be accepted, as the potential for creating bugs is very high.

That being said, patches and well-tested code are always welcome! However, consider that this is not a light task and you would probably want to coordinate with other developers on the lists.digium.com asterisk-dev list before taking on that task.

For queue applications of scale beyond what you are doing, the recommended path is to "roll your own" application using an interface like [Asterisk Rest Interface|https://wiki.asterisk.org/wiki/pages/viewpage.action?pageId=29395573].

By: Asterisk Team (asteriskteam) 2016-03-28 12:00:01.644-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Troy Bowman (troy) 2016-04-20 23:51:16.715-0500

Thanks for your reply.  I agree that fixing app_queue would be a huge undertaking.  As I said, I honestly don't have the expertise to do that; I'm not a C expert.  My first idea would be to try to work in inherently thread-safe immutable data structures, which don't need to be locked every time they're read for their current state, and then only lock when it is time to update the immutable structure because one of its attributes has actually changed.
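A rough sketch of the shape of that idea in plain C11 atomics (not a patch; the names are made up, and the hard part -- safely freeing the old snapshot, e.g. via RCU or reference counting -- is deliberately left out): readers just load an atomic pointer to an immutable snapshot and never take a lock; only writers lock, copy, modify, and publish a new snapshot.

{code}
/*
 * Sketch of "lock only on update" using C11 atomics (illustrative names).
 * Readers never lock: they load a pointer to an immutable snapshot.
 * Writers lock, copy, modify, and atomically publish a new snapshot.
 * NOTE: safe reclamation of old snapshots (RCU / refcounting) is omitted;
 * this sketch simply leaks the old copy to stay short.
 *
 * Build: gcc -std=c11 -O2 -pthread snapshot_sketch.c -o snapshot_sketch
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_MEMBERS 15

struct queue_snapshot {
	int member_busy[NUM_MEMBERS];   /* immutable once published */
};

static _Atomic(struct queue_snapshot *) current;
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reader: no lock, just an atomic pointer load. */
static int first_free_member(void)
{
	const struct queue_snapshot *snap = atomic_load(&current);

	for (int m = 0; m < NUM_MEMBERS; m++) {
		if (!snap->member_busy[m]) {
			return m;
		}
	}
	return -1;
}

/* Writer: lock, copy, modify, publish. The old snapshot is leaked here;
 * real code would need deferred freeing or reference counting. */
static void set_member_busy(int member, int busy)
{
	pthread_mutex_lock(&update_lock);
	struct queue_snapshot *old = atomic_load(&current);
	struct queue_snapshot *snap = malloc(sizeof(*snap));
	memcpy(snap, old, sizeof(*snap));
	snap->member_busy[member] = busy;
	atomic_store(&current, snap);
	pthread_mutex_unlock(&update_lock);
}

int main(void)
{
	struct queue_snapshot *initial = calloc(1, sizeof(*initial));

	atomic_store(&current, initial);
	set_member_busy(0, 1);
	printf("first free member: %d\n", first_free_member());
	return 0;
}
{code}

Since queue state changes (agent pause, login, hangup) are rare compared with how often the state is read while distributing calls, that trade-off seems like it would take the pressure off the lock.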

I don't want to turn this into a support ticket, but what does Digium or others in the community do to mitigate the high load imposed by the locking in app_queue?  Do you simply time out queue calls?  Do you move queueing to another machine that can't suffer from dropping RTP packets?

I do believe this is a rather big issue.  I will try to attach load averages versus actual CPU usage for a typical day -- today.  I'll remind the reader that any load above the number of cores in the machine means there are too many threads waiting for attention, even though the actual number crunching is relatively light.


By: Asterisk Team (asteriskteam) 2016-04-20 23:51:17.054-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Troy Bowman (troy) 2016-04-21 00:04:25.013-0500

Attached image of actual CPU usage

By: Troy Bowman (troy) 2016-04-21 00:04:42.184-0500

Attached image of CPU Load

By: Rusty Newton (rnewton) 2016-06-10 09:20:09.352-0500

{quote}
I don't want to turn this into a support ticket, but what does Digium or others in the community do to mitigate the high load imposed by the locking in app_queue? Do you simply time out queue calls? Do you move queueing to another machine that can't suffer from dropping RTP packets?
{quote}

Discussing that here would be turning this into a support ticket, since we've already determined that this issue represents a known, deep design issue and not a bug. Since no one is working on a patch I'm closing it out again. Feel free to bring up the discussion on the community forums or the mailing lists to get advice on what to do specific to your usage.

https://community.asterisk.org/
http://lists.digium.com