ASTERISK-26861: cdr: Deadlock when batch is unscheduled

[Home]

Summary: ASTERISK-26861: cdr: Deadlock when batch is unscheduled

Reporter: Alex Hermann (gaaf) Labels: stasis

Date Opened: 2017-03-13 10:30:33 Date Closed:

Priority: Major Regression?

Status: Open/New Components: CDR/General

Versions: 13.14.0 Frequency of
Occurrence Constant

Related
Issues:

Environment: Debian sid Attachments: ( 0) 26861-cli-threads.txt
( 1) 26861-gdb-all-2.txt
( 2) 26861-gdb-bt-30-28043.txt
( 3) 26861-gdb-bt-307-29270.txt
( 4) 26861-gdb-threads.txt

Description: When setting many CDR variables throughout the dialplan, at some point almost all channels/threads will wait on a lock that never gets released. SIP processing stops at that point.

I don't know if it is relevant, but I set various CDR variables before the Dial command, in the contexts for Dial's 'G' and 'U' options and in the h exten for the master.

The problem only manifests itself during a bit of call load on the server (as does every deadlock). I can pretty reliable reproduce it within 15k calls at max 50 cps).

{code}
CLI> core show taskprocessors
...
subm:cdr_engine-00000003 11547039 22051 1439 4500 5000
...
{code}

I'll attach thread list and backtraces.

Comments: By: Asterisk Team (asteriskteam) 2017-03-13 10:30:34.467-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Alex Hermann (gaaf) 2017-03-13 10:42:10.288-0500

The attached files contain:

1) output of cli "core show threads"
2) output of gdb "info threads"
3) backtrace of one of the threads waiting for the lock (30-28043)
4) backtrace of the process waiting on a condition involving (holding on to) the same lock (307-2970)

By: Alex Hermann (gaaf) 2017-03-13 14:57:21.131-0500

Another datapoint, as setting a CDR variable in the h exten seems a bit controversial, I can reproduce the issue without that too.
By: Joshua C. Colp (jcolp) 2017-03-14 05:36:25.100-0500

We need the full output of "thread apply all bt" to see what every thread is doing, just not these selected ones.
By: Alex Hermann (gaaf) 2017-03-17 12:06:41.163-0500

This is the CLI and gdb thread list, taskprocessor list and gdb backtraces for all treads. This log is unrelated to the previous as it is from different process. This one stopped early on, so only a "few" threads now.

After the message:
{code}
[Mar 17 17:42:42] WARNING[21518][C-0000019d]: taskprocessor.c:888 taskprocessor_push: The 'subm:cdr_engine-00000003' task processor queue reached 5000 scheduled tasks.
{code}

Everything stopped...

By: Alex Hermann (gaaf) 2017-03-20 11:54:40.453-0500

For the last backtraces i had scheduleronly=yes instead of no. The issue seems to trigger a lot easier/earlier in this case.