
Summary:ASTERISK-29474: core: Exceptionally long queue length with Record and ConfBridge
Reporter:N A (InterLinked)
Labels:
Date Opened:2021-06-10 07:47:08
Date Closed:
Priority:Minor
Regression?:
Status:Open/New
Components:Applications/app_confbridge
Versions:18.4.0
Frequency of Occurrence:Constant
Related Issues:is related to ASTERISK-29775 CPU spike
Environment:
Attachments:( 0) confbridge.conf
( 1) core-asterisk-running-2021-06-10T12-30-28-0400-brief.txt
( 2) core-asterisk-running-2021-06-10T12-30-28-0400-full.txt
( 3) core-asterisk-running-2021-06-10T12-30-28-0400-info.txt
( 4) core-asterisk-running-2021-06-10T12-30-28-0400-locks.txt
( 5) core-asterisk-running-2021-06-10T12-30-28-0400-thread1.txt
( 6) core-asterisk-running-2021-06-10T12-38-12-0400-brief.txt
( 7) core-asterisk-running-2021-06-10T12-38-12-0400-full.txt
( 8) core-asterisk-running-2021-06-10T12-38-12-0400-info.txt
( 9) core-asterisk-running-2021-06-10T12-38-12-0400-locks.txt
(10) core-asterisk-running-2021-06-10T12-38-12-0400-thread1.txt
(11) core-asterisk-running-2021-06-11T00-06-54+0000-brief.txt
(12) core-asterisk-running-2021-06-11T00-06-54+0000-full.txt
(13) core-asterisk-running-2021-06-11T00-06-54+0000-info.txt
(14) core-asterisk-running-2021-06-11T00-06-54+0000-locks.txt
(15) core-asterisk-running-2021-06-11T00-06-54+0000-thread1.txt
(16) core-asterisk-running-2021-06-11T00-17-11+0000-brief.txt
(17) core-asterisk-running-2021-06-11T00-17-11+0000-full.txt
(18) core-asterisk-running-2021-06-11T00-17-11+0000-info.txt
(19) core-asterisk-running-2021-06-11T00-17-11+0000-locks.txt
(20) core-asterisk-running-2021-06-11T00-17-11+0000-thread1.txt
(21) core-asterisk-running-2021-06-11T17-18-53+0000-full.txt
(22) core-asterisk-running-2021-06-25T19-54-36+0000-full.txt
(23) core-asterisk-running-2021-06-25T20-51-59+0000-brief.txt
(24) core-asterisk-running-2021-06-25T20-51-59+0000-full.txt
(25) core-asterisk-running-2021-06-25T20-51-59+0000-info.txt
(26) core-asterisk-running-2021-06-25T20-51-59+0000-locks.txt
(27) core-asterisk-running-2021-06-25T20-51-59+0000-thread1.txt
(28) coredump5.zip
(29) coredump6.zip
(30) debug_log_123456
(31) debug_log_123456.txt
(32) debug_threads.txt
(33) dialplan.txt
(34) extensions.conf
(35) extensions.conf
(36) locks.txt
Description:It's our friend "__ast_queue_frame: Exceptionally long queue length queuing to..." again.

I'm marking this as app_confbridge related because 99% of the time, deadlocks on my systems are caused by ConfBridge, and using other applications where possible does not cause this issue at all.

This issue happens with ConfBridges in multiple different scenarios - below is one that replicates every single time:

- bring another channel playing audio into a confbridge
- bring another channel that records locally into the confbridge, but muted
- have that original channel join that ConfBridge

All is good up to this point. Then, as soon as I join that same ConfBridge from another phone, even muted, the warnings start immediately and continue at least until all channels and the bridge are torn down. Usually Asterisk crashes fairly quickly when this kind of thing happens; in this specific scenario it doesn't crash quickly, but it effectively kills both channels, and the warnings continue until everything involved is torn down, sometimes persisting 10-20 seconds after that. In certain other similar scenarios, Asterisk usually crashes within a minute as CPU usage climbs to 100%.

This is just one scenario that leads to this deadlock, but it can happen in several different ways. Typically it happens when multiple parties are in the aforementioned bridge via ConfBridge. Replacing ConfBridge with ChanSpy where possible is enough to make the issue disappear, which is why this seems to be a bug with app_confbridge.

Attached are the output from ast_coredumper and "core show locks" during the deadlock.

In this particular deadlock, there seems to be some kind of infinite loop caused by endless progress updates, although I only see the progress flood when Asterisk is compiled with DONT_OPTIMIZE and DEBUG_THREADS. Regardless, the deadlock occurs either way.
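For reference, the warning itself is emitted from the frame-queueing path in channel.c. Below is a simplified sketch of that logic, assuming an approximate threshold; this is not the verbatim Asterisk source (the real code also tracks voice frames separately and drops excess frames), but it shows why a blocked consumer produces this exact symptom:
{noformat}
#include "asterisk.h"
#include "asterisk/channel.h"
#include "asterisk/logger.h"

/* Simplified sketch of __ast_queue_frame() (channel.c), for illustration only.
 * The channel must be locked before a frame is appended to its read queue,
 * which is why a stalled consumer shows up as __lll_lock_wait in the
 * backtraces attached here. */
static int queue_frame_sketch(struct ast_channel *chan, struct ast_frame *fin)
{
    struct ast_frame *cur;
    int queued_frames = 0;

    ast_channel_lock(chan); /* blocks if another thread is holding the channel */

    AST_LIST_TRAVERSE(ast_channel_readq(chan), cur, frame_list) {
        queued_frames++;    /* frames the reader has not consumed yet */
    }

    if (queued_frames > 128) {  /* threshold approximate */
        /* The reader is not draining the queue: it is either blocked
         * (e.g. waiting on a lock) or frames are arriving faster than
         * they can be consumed -- the symptom reported in this issue. */
        ast_log(LOG_WARNING, "Exceptionally long queue length queuing to %s\n",
            ast_channel_name(chan));
    }

    AST_LIST_INSERT_TAIL(ast_channel_readq(chan), fin, frame_list);
    ast_channel_unlock(chan);
    return 0;
}
{noformat}
In the scenarios described here the threshold is hit continuously, which would be consistent with the consumer thread being blocked rather than with a transient load spike.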
Comments:By: Asterisk Team (asteriskteam) 2021-06-10 07:47:09.560-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: N A (InterLinked) 2021-06-10 11:56:22.714-0500

Detailed core dump, CLI output, core show locks, etc. attached.

By: Mark Murawski (kobaz) 2021-06-10 16:17:34.567-0500

Please submit the Asterisk configuration, ConfBridge configuration, the dialplan utilized, and full instructions on how to reproduce this problem on the latest branch of a plain vanilla Asterisk source release.

By: Kevin Harwell (kharwell) 2021-06-10 16:30:45.043-0500

From the backtraces at least I am not seeing a deadlock per se. In both currently attached backtraces I see channels being blocked by activity down in the speex resampler. For example (from [^core-asterisk-running-2021-06-10T12-30-28-0400-brief.txt]) the lock in thread 109:
{noformat}
Thread 109 (Thread 0x7f77c6f89700 (LWP 31527)):
#0  0x00007f77f539229c in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007f77f538b7d1 in __GI___pthread_mutex_lock (mutex=0x556c6c3ee520) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000556c67be95c0 in __ast_pthread_mutex_lock (filename=0x556c67d47151 "channel.c", lineno=1021, func=0x556c67d4b0d0 <__PRETTY_FUNCTION__.23269> "__ast_queue_frame", mutex_name=0x556c67d474cb "chan", t=0x556c6c3ee520) at lock.c:326
#3  0x0000556c67b2a0c7 in __ao2_lock (user_data=0x556c6c3ee570, lock_how=AO2_LOCK_REQ_MUTEX, file=0x556c67d47151 "channel.c", func=0x556c67d4b0d0 <__PRETTY_FUNCTION__.23269> "__ast_queue_frame", line=1021, var=0x556c67d474cb "chan") at astobj2.c:241
#4  0x0000556c67b676ff in __ast_queue_frame (chan=0x556c6c3ee570, fin=0x7f77d4085840, head=0, after=0x0) at channel.c:1021
#5  0x0000556c67b67f81 in ast_queue_frame (chan=0x556c6c3ee570, fin=0x7f77d4085840) at channel.c:1137
#6  0x0000556c67ba5a6e in unreal_queue_frame (p=0x556c6c452920, isoutbound=1, f=0x7f77d4085840, us=0x556c6c25e600, us_locked=1) at core_unreal.c:243
#7  0x0000556c67ba5ef1 in ast_unreal_write_stream (ast=0x556c6c25e600, stream_num=0, f=0x7f77d4085840) at core_unreal.c:360
#8  0x0000556c67b73dcb in tech_write (chan=0x556c6c25e600, stream=0x556c6c452b00, default_stream=0x556c6c452b00, frame=0x7f77d4085840) at channel.c:5088
#9  0x0000556c67b75739 in ast_write_stream (chan=0x556c6c25e600, stream_num=0, fr=0x7f77d4085840) at channel.c:5434
#10 0x0000556c67b5911e in bridge_channel_handle_write (bridge_channel=0x556c6c260270) at bridge_channel.c:2541
#11 0x0000556c67b599db in bridge_channel_wait (bridge_channel=0x556c6c260270) at bridge_channel.c:2853
#12 0x0000556c67b5a1cf in bridge_channel_internal_join (bridge_channel=0x556c6c260270) at bridge_channel.c:2993
#13 0x0000556c67b3cb19 in ast_bridge_join (bridge=0x7f77e465eff0, chan=0x556c6c25e600, swap=0x0, features=0x7f77c6f85840, tech_args=0x7f77c6f85868, flags=(unknown: 0)) at bridge.c:1725
#14 0x00007f77d1c1ed1e in confbridge_exec (chan=0x556c6c25e600, data=0x7f77c6f85a30 "partyline270") at app_confbridge.c:2795
#15 0x0000556c67c1f4be in pbx_exec (c=0x556c6c25e600, app=0x556c6c13a1c0, data=0x7f77c6f85a30 "partyline270") at pbx_app.c:492
#16 0x0000556c67c0a344 in pbx_extension_helper (c=0x556c6c25e600, con=0x0, context=0x556c6c25efc0 "confcrosstalk", exten=0x556c6c25f010 "1270", priority=5, label=0x0, callerid=0x0, action=E_SPAWN, found=0x7f77c6f88ccc, combined_find_spawn=1) at pbx.c:2947
#17 0x0000556c67c0e597 in ast_spawn_extension (c=0x556c6c25e600, context=0x556c6c25efc0 "confcrosstalk", exten=0x556c6c25f010 "1270", priority=5, callerid=0x0, found=0x7f77c6f88ccc, combined_find_spawn=1) at pbx.c:4206
#18 0x0000556c67c0f2b7 in __ast_pbx_run (c=0x556c6c25e600, args=0x0) at pbx.c:4380
#19 0x0000556c67c10b2d in pbx_thread (data=0x556c6c25e600) at pbx.c:4704
#20 0x0000556c67caf897 in dummy_start (data=0x7f77ec090150) at utils.c:1299
#21 0x00007f77f5388fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#22 0x00007f77f4f5f4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}
is being blocked by thread 108:
{noformat}
Thread 108 (Thread 0x7f77c8870700 (LWP 31525)):
#0  0x00007f77f1b180da in resampler_basic_interpolate_single (st=0x7f77ec18e150, channel_index=0, in=0x7f77ec080fb0, in_len=0x7f77c886f3d8, out=0x7f77ec0919d8, out_len=0x7f77c886f3d4) at speex/resample.c:473
#1  0x00007f77f1b19239 in speex_resampler_process_native (st=0x7f77ec18e150, channel_index=0, in_len=0x7f77c886f3d8, out=0x7f77ec0919d8, out_len=0x7f77c886f3d4) at speex/resample.c:870
#2  0x00007f77f1b195c6 in speex_resampler_process_int (st=0x7f77ec18e150, channel_index=0, in=0x556c6c2ba328, in_len=0x7f77c886f420, out=0x7f77ec0919d8, out_len=0x7f77c886f424) at speex/resample.c:933
#3  0x00007f77f1b1736d in resamp_framein (pvt=0x7f77ec0918b0, f=0x556c6c2ba260) at codec_resample.c:123
#4  0x0000556c67ca6c5e in framein (pvt=0x7f77ec0918b0, f=0x556c6c2ba260) at translate.c:430
#5  0x0000556c67ca74c5 in ast_translate (path=0x7f77ec0883b0, f=0x7f77d40bd000, consume=0) at translate.c:618
#6  0x0000556c67b75099 in ast_write_stream (chan=0x556c6c3ee570, stream_num=-1, fr=0x7f77d40bd000) at channel.c:5299
#7  0x0000556c67b73e4c in ast_write (chan=0x556c6c3ee570, fr=0x7f77d40bd000) at channel.c:5098
#8  0x0000556c67bc6781 in ast_readaudio_callback (s=0x7f77ec141880) at file.c:927
#9  0x0000556c67bc69da in ast_fsread_audio (data=0x7f77ec141880) at file.c:967
#10 0x0000556c67b6f797 in __ast_read (chan=0x556c6c3ee570, dropaudio=0, dropnondefault=1) at channel.c:3595
#11 0x0000556c67b71efe in ast_read (chan=0x556c6c3ee570) at channel.c:4281
#12 0x0000556c67bc87ab in waitstream_core (c=0x556c6c3ee570, breakon=0x7f77d3ccc110 "", forward=0x556c67d58360 "", reverse=0x556c67d58360 "", skip_ms=0, audiofd=-1, cmdfd=-1, context=0x0, cb=0x0) at file.c:1627
#13 0x0000556c67bc8d26 in ast_waitstream (c=0x556c6c3ee570, breakon=0x7f77d3ccc110 "") at file.c:1780
#14 0x00007f77d3ccba64 in playback_exec (chan=0x556c6c3ee570, data=0x556c6c391d10 "custom/switch/intercepts/custom/interlinked") at app_playback.c:492
#15 0x0000556c67c1f4be in pbx_exec (c=0x556c6c3ee570, app=0x556c6c8768e0, data=0x556c6c391d10 "custom/switch/intercepts/custom/interlinked") at pbx_app.c:492
#16 0x0000556c67c19489 in pbx_outgoing_exec (data=0x556c6bf1e170) at pbx.c:7661
#17 0x0000556c67caf897 in dummy_start (data=0x556c6c1d8fb0) at utils.c:1299
#18 0x00007f77f5388fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#19 0x00007f77f4f5f4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}
I see several instances of similar stacks in both attached backtraces. It could be that something is up with the speex resampler (an infinite loop?), or it could be that the system is just running slowly, or the timing was such that the backtrace snapshots happened to capture similar thread activity.

What's the output from the following when this is going on (curious if memory is also rising):
{noformat}
$ top -p $(pidof asterisk)
{noformat}

By: Kevin Harwell (kharwell) 2021-06-10 16:31:37.479-0500

As the speex resampler may be involved, can you somehow remove it from the equation/scenario and see if the problem still occurs? If it makes sense to do so, I suppose.

By: N A (InterLinked) 2021-06-10 17:28:16.864-0500

I just watched top as I replicated this again: CPU goes to 80% and memory goes to 17%. This is on a server with 1 GB of RAM and, I believe, 1 CPU.

At idle, it's at < 1% CPU and 15% memory, so I'd say this is entirely a CPU issue and not memory related.

I'm not using the speex resampler explicitly, as far as I know, but I recompiled without the speex codec to rule it out, and it's just as bad as ever.

I will need to set up a new server in order to produce an easier, minimal reproducible example.

By: Joshua C. Colp (jcolp) 2021-06-10 18:10:58.033-0500

On top of everything else requested, also provide a full log from start to finish. You've only provided a full log from when the problem was happening, and it doesn't show what led up to it.

By: N A (InterLinked) 2021-06-10 18:16:10.866-0500

Sure thing. How much logging do you want: just the full CLI output at verbosity 10, or some debug level as well?

By: Joshua C. Colp (jcolp) 2021-06-10 18:31:22.293-0500

The wiki[1] has a page which details the logging to provide.

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: N A (InterLinked) 2021-06-10 19:25:27.327-0500

All right, here it is reproduced from scratch on a freshly installed, unmodified Asterisk 18.4, along with a minimal reproducible example. It happened instantly the first time and always reproduces as before: the second the 2nd call comes in, all hell breaks loose.

By: N A (InterLinked) 2021-06-10 19:37:30.227-0500

Here is the CPU usage after a few minutes as well:

{noformat}
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21564 root      20   0 2087044  77992  18248 S 172.8  7.4  33:00.11 asterisk
{noformat}

172% CPU, yes, you read that correctly.

By: N A (InterLinked) 2021-06-11 13:24:22.547-0500

Here's another full backtrace which provides clear suggestions on where the initial problem is: confbridge_exec -> ast_bridge_join -> bridge_channel_internal_join -> bridge_channel_wait -> bridge_channel_handle_write -> ast_write_stream -> ast_unreal_write_stream -> unreal_queue_frame -> ast_queue_frame -> __ast_pthread_mutex_lock -> __pthread_mutex_lock. At the top, it's blocked in __lll_lock_wait. Essentially, this thread is waiting on the mutex (mutex=0x7f34f800cef0) in pthread_mutex_lock and never gets it, no matter how long Asterisk continues to run. This corresponds to the lock needed to queue a frame at channel.c, line 1021 in this case.

By: Kevin Harwell (kharwell) 2021-06-11 13:28:56.550-0500

I took a look at one of the latest backtraces ([^core-asterisk-running-2021-06-11T00-17-11+0000-brief.txt]). I do see a number of threads attempting to acquire locks, but so far I'm not seeing an obvious deadlock, or a case where at least two separate threads are waiting on each other.

For instance in Thread 98:
{noformat}
Thread 98 (Thread 0x7f9c58512700 (LWP 21726)):
#0  0x00007f9c5485a9d0 in _Unwind_Find_FDE@plt () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007f9c54866b13 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007f9c54868618 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#3  0x00007f9c54b64e7f in __GI___backtrace (array=<optimized out>, size=<optimized out>) at ../sysdeps/x86_64/backtrace.c:110
#4  0x000055b8a58ca0da in __ast_bt_get_addresses (bt=0x7f9c5850ae60) at backtrace.c:94
#5  0x000055b8a597b1f5 in __ast_pthread_mutex_lock (filename=0x55b8a5ad4bc0 "bridge_channel.c", lineno=1083, func=0x55b8a5ad5d30 <__PRETTY_FUNCTION__.17261> "ast_bridge_channel_queue_frame", mutex_name=0x55b8a5ad4bd1 "bridge_channel", t=0x7f9c30023910) at lock.c:270
#6  0x000055b8a58bd0ee in __ao2_lock (user_data=0x7f9c30023960, lock_how=AO2_LOCK_REQ_MUTEX, file=0x55b8a5ad4bc0 "bridge_channel.c", func=0x55b8a5ad5d30 <__PRETTY_FUNCTION__.17261> "ast_bridge_channel_queue_frame", line=1083, var=0x55b8a5ad4bd1 "bridge_channel") at astobj2.c:241
#7  0x000055b8a58e5386 in _ast_bridge_channel_lock (bridge_channel=0x7f9c30023960, file=0x55b8a5ad4bc0 "bridge_channel.c", function=0x55b8a5ad5d30 <__PRETTY_FUNCTION__.17261> "ast_bridge_channel_queue_frame", line=1083, var=0x55b8a5ad4bd1 "bridge_channel") at /usr/src/asterisk-18.4.0/include/asterisk/bridge_channel.h:234
#8  0x000055b8a58e7c0f in ast_bridge_channel_queue_frame (bridge_channel=0x7f9c30023960, fr=0x7f9c1c031330) at bridge_channel.c:1083
#9  0x000055b8a58e7f51 in ast_bridge_queue_everyone_else (bridge=0x7f9c300133f0, bridge_channel=0x0, frame=0x7f9c1c031330) at bridge_channel.c:1136
#10 0x00007f9c507e92ad in softmix_bridge_write_control (bridge=0x7f9c300133f0, bridge_channel=0x7f9c3003ffc0, frame=0x7f9c1c031330) at bridge_softmix.c:1257
#11 0x00007f9c507e94d6 in softmix_bridge_write (bridge=0x7f9c300133f0, bridge_channel=0x7f9c3003ffc0, frame=0x7f9c1c031330) at bridge_softmix.c:1354
#12 0x000055b8a58e6f39 in bridge_channel_write_frame (bridge_channel=0x7f9c3003ffc0, frame=0x7f9c1c031330) at bridge_channel.c:711
#13 0x000055b8a58ec409 in bridge_handle_trip (bridge_channel=0x7f9c3003ffc0) at bridge_channel.c:2706
#14 0x000055b8a58ec83a in bridge_channel_wait (bridge_channel=0x7f9c3003ffc0) at bridge_channel.c:2842
#15 0x000055b8a58ed03e in bridge_channel_internal_join (bridge_channel=0x7f9c3003ffc0) at bridge_channel.c:2993
#16 0x000055b8a58cfb17 in ast_bridge_join (bridge=0x7f9c300133f0, chan=0x7f9c1c022940, swap=0x0, features=0x7f9c5850e870, tech_args=0x7f9c5850e898, flags=(unknown: 0)) at bridge.c:1725
#17 0x00007f9be9f3c41d in confbridge_exec (chan=0x7f9c1c022940, data=0x7f9c5850ea50 "coincall1623370594,silentbridge,incoglistener") at app_confbridge.c:2758
#18 0x000055b8a59b0f43 in pbx_exec (c=0x7f9c1c022940, app=0x55b8a9737740, data=0x7f9c5850ea50 "coincall1623370594,silentbridge,incoglistener") at pbx_app.c:492
#19 0x000055b8a599c006 in pbx_extension_helper (c=0x7f9c1c022940, con=0x0, context=0x7f9c1c023300 "caller2", exten=0x7f9c1c023350 "s", priority=2, label=0x0, callerid=0x0, action=E_SPAWN, found=0x7f9c58511cec, combined_find_spawn=1) at pbx.c:2947
#20 0x000055b8a59a00d4 in ast_spawn_extension (c=0x7f9c1c022940, context=0x7f9c1c023300 "caller2", exten=0x7f9c1c023350 "s", priority=2, callerid=0x0, found=0x7f9c58511cec, combined_find_spawn=1) at pbx.c:4206
#21 0x000055b8a59a0df5 in __ast_pbx_run (c=0x7f9c1c022940, args=0x0) at pbx.c:4380
#22 0x000055b8a59a266e in pbx_thread (data=0x7f9c1c022940) at pbx.c:4704
#23 0x000055b8a5a414e4 in dummy_start (data=0x7f9c1c0148e0) at utils.c:1299
#24 0x00007f9c55f4d4a4 in start_thread (arg=0x7f9c58512700) at pthread_create.c:456
#25 0x00007f9c54b57d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
{noformat}
_bridge_channel_write_frame_ locks the bridge in this stack trace, and then this thread gets blocked waiting on {{user_data=0x7f9c30023960}}, which also turns up in Thread 94:
{noformat}
Thread 94 (Thread 0x7f9bdc19d700 (LWP 21695)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9c55f4fc26 in __GI___pthread_mutex_lock (mutex=0x7f9c5862b970 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007f9c54b8dd7f in __GI___dl_iterate_phdr (callback=0x7f9c54868da0, data=0x7f9bdc19c360) at dl-iteratephdr.c:41
#3  0x00007f9c5486a11e in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#4  0x00007f9c54866b13 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#5  0x00007f9c54867d30 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#6  0x00007f9c548685e8 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f9c54b64e7f in __GI___backtrace (array=<optimized out>, size=<optimized out>) at ../sysdeps/x86_64/backtrace.c:110
#8  0x000055b8a58ca0da in __ast_bt_get_addresses (bt=0x7f9bdc19c930) at backtrace.c:94
#9  0x000055b8a597b1f5 in __ast_pthread_mutex_lock (filename=0x55b8a5ad8171 "channel.c", lineno=3087, func=0x55b8a5adc790 <__PRETTY_FUNCTION__.19719> "ast_waitfor_nandfds", mutex_name=0x55b8a5ad8dbd "c[x]", t=0x7f9c30025c30) at lock.c:270
#10 0x000055b8a58bd0ee in __ao2_lock (user_data=0x7f9c30025c80, lock_how=AO2_LOCK_REQ_MUTEX, file=0x55b8a5ad8171 "channel.c", func=0x55b8a5adc790 <__PRETTY_FUNCTION__.19719> "ast_waitfor_nandfds", line=3087, var=0x55b8a5ad8dbd "c[x]") at astobj2.c:241
#11 0x000055b8a5900f3d in ast_waitfor_nandfds (c=0x7f9c30023998, n=1, fds=0x7f9c30023a30, nfds=1, exception=0x0, outfd=0x7f9bdc19cd00, ms=0x7f9bdc19cd04) at channel.c:3087
#12 0x000055b8a58ec71f in bridge_channel_wait (bridge_channel=0x7f9c30023960) at bridge_channel.c:2827
#13 0x000055b8a58ed03e in bridge_channel_internal_join (bridge_channel=0x7f9c30023960) at bridge_channel.c:2993
#14 0x000055b8a58cfe0f in bridge_channel_ind_thread (data=0x7f9c30023960) at bridge.c:1797
#15 0x000055b8a5a414e4 in dummy_start (data=0x7f9c30023ac0) at utils.c:1299
#16 0x00007f9c55f4d4a4 in start_thread (arg=0x7f9bdc19d700) at pthread_create.c:456
#17 0x00007f9c54b57d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
{noformat}
Here, though, that _bridge_channel_ is not itself being blocked on; instead this thread is blocked on {{user_data=0x7f9c30025c80}}, which is not found elsewhere in the backtrace.

By the way, if you search on the locked bridge too, it turns up elsewhere, but only one thread is actually being blocked on it, and from its stack trace it appears not to lock any affected objects prior to being blocked:
{noformat}
Thread 92 (Thread 0x7f9bdc297700 (LWP 21693)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9c55f4fc26 in __GI___pthread_mutex_lock (mutex=0x7f9c300133a0) at ../nptl/pthread_mutex_lock.c:115
#2  0x000055b8a597b2d1 in __ast_pthread_mutex_lock (filename=0x7f9c507f0989 "bridge_softmix.c", lineno=1968, func=0x7f9c507f1ac0 <__PRETTY_FUNCTION__.17530> "softmix_mixing_loop", mutex_name=0x7f9c507f10cc "bridge", t=0x7f9c300133a0) at lock.c:326
#3  0x000055b8a58bd0ee in __ao2_lock (user_data=0x7f9c300133f0, lock_how=AO2_LOCK_REQ_MUTEX, file=0x7f9c507f0989 "bridge_softmix.c", func=0x7f9c507f1ac0 <__PRETTY_FUNCTION__.17530> "softmix_mixing_loop", line=1968, var=0x7f9c507f10cc "bridge") at astobj2.c:241
#4  0x00007f9c507e5d62 in _ast_bridge_lock (bridge=0x7f9c300133f0, file=0x7f9c507f0989 "bridge_softmix.c", function=0x7f9c507f1ac0 <__PRETTY_FUNCTION__.17530> "softmix_mixing_loop", line=1968, var=0x7f9c507f10cc "bridge") at /usr/src/asterisk-18.4.0/include/asterisk/bridge.h:483
#5  0x00007f9c507eabca in softmix_mixing_loop (bridge=0x7f9c300133f0) at bridge_softmix.c:1968
#6  0x00007f9c507eaf5c in softmix_mixing_thread (data=0x7f9c30009e90) at bridge_softmix.c:2047
#7  0x000055b8a5a414e4 in dummy_start (data=0x7f9c3000f820) at utils.c:1299
#8  0x00007f9c55f4d4a4 in start_thread (arg=0x7f9bdc297700) at pthread_create.c:456
#9  0x00007f9c54b57d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
{noformat}
Even though it's not entirely obvious whether there is a deadlock, one may still exist. Given the latest backtrace, my guess is that either one of the channels or bridge channels is being locked somewhere (I'm just not seeing it yet), or alternatively a thread locked one of those objects, never unlocked it, and then ended, so it does not show up in the backtrace.

In the latter case you'd be looking for a "lock" someplace with no associated "unlock" (maybe on an off-nominal path). I'm not saying that's what's happening here, but it's something to keep in mind.
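To illustrate, here is a minimal hypothetical sketch (plain pthreads, not actual Asterisk code) of how an off-nominal path can leave an object locked by a thread that no longer exists:
{noformat}
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical worker: takes the lock, then exits early on an error path
 * without releasing it. */
static void *leaky_thread(void *arg)
{
    bool off_nominal = true;    /* pretend an error path is taken */
    (void) arg;

    pthread_mutex_lock(&obj_lock);
    if (off_nominal) {
        /* Returns while still holding obj_lock. This thread then ends, so it
         * never appears in a later backtrace, yet every subsequent
         * pthread_mutex_lock(&obj_lock) blocks forever. */
        return NULL;
    }
    pthread_mutex_unlock(&obj_lock);
    return NULL;
}
{noformat}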

This probably wasn't all that helpful, but maybe someone will see something I missed.


By: N A (InterLinked) 2021-06-11 13:52:03.896-0500

Thanks, Kevin - that's actually very helpful in terms of thinking about how to interpret these.

Are you saying basically that if there's a deadlock proper, then grep "in __ao2_lock" would show 2 user_data's that are identical? I did another one just now and in my latest all 9 are unique. However, most of the files are the same value. I'm not sure if *that* there means anything or not.

By: Kevin Harwell (kharwell) 2021-06-14 11:02:14.219-0500

In a typical deadlock you can usually find at least two threads being blocked, each waiting on the other.
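For reference, the classic shape of that situation is two threads acquiring the same pair of locks in opposite order; a minimal illustrative sketch in plain pthreads (not Asterisk code):
{noformat}
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Thread 1 takes A then B; thread 2 takes B then A. If both get their first
 * lock before either gets its second, each blocks forever waiting on the
 * other -- in a backtrace both show up stuck in __lll_lock_wait, and the
 * address each is waiting on is held by the other thread. */
static void *thread_one(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);    /* blocks if thread_two holds lock_b */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread_two(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);    /* blocks if thread_one holds lock_a */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}
{noformat}
That cross-reference between the two waited-on addresses is what the address search described next is trying to surface.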

What I usually do is take the address of the object that the thread is attempting to lock and search for it elsewhere in the backtrace file. If you're lucky, you find that same object being used in another thread that's blocked. If so, I then look at the stack trace for that thread, and the actual code, and check whether that object has indeed been locked someplace.

If the object address search does not turn up elsewhere, that probably means the object was locked in another thread (shown as blocked in the file), but locally in a function (meaning the object was not passed up/down the stack and thus doesn't appear in the stack trace). Similar to the above, though, you can just look through the stack traces (most are usually the same) and the code, trying to find where the object was locked.

Alternatively, what you could be looking for if the above doesn't turn anything up is a thread that executed, locked the object, and then ended without unlocking it. This offending thread would of course not be in the backtrace, and the object it locked would be forever locked. Anything else attempting to access/lock the object would be blocked.

By: N A (InterLinked) 2021-06-15 08:17:05.529-0500

Some more insight on this situation:

I removed the Local dial for Caller B and this still happens. (However, removing one of the initial ConfBridges makes the issue go away.)

Likewise, the issue does not begin until Record() starts recording in the other listen-only channel.

I've modified the sample extensions.conf to reflect these things.

It seems that this issue is not related to ConfBridge + Local channels, but rather ConfBridge + Record(), at least in this specific case.

By: Mark Murawski (kobaz) 2021-06-15 09:43:53.115-0500

In Reply to:
 It seems that this issue is not related to ConfBridge + Local channels, but rather ConfBridge + Record(), at least in this specific case.

Do you have a stable reproduction of your issue with ConfBridge + MixMonitor as well?



By: N A (InterLinked) 2021-06-15 09:59:05.669-0500

Directly swapping Record out for MixMonitor does not trigger the queue warnings, but it does similarly spam the following if debug is on:

{noformat}
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:336 audiohook_read_frame_both: Failed to get 160 samples from write factory 0x7f34f4026ea8
[Jun 15 14:50:01] DEBUG[7549][C-00000033]: audiohook.c:275 audiohook_read_frame_both: Read factory 0x7f34f4026468 and write factory 0x7f34f4026ea8 both fail to provide 160 samples
{noformat}

This spams the console whenever MixMonitor is active.

MixMonitor() isn't an optimal replacement for Record() in this use case, but this suggests that there may be some issues with Record() that cause things to go wrong in conjunction with ConfBridge.

By: Mark Murawski (kobaz) 2021-06-15 11:25:58.270-0500

Despite the spam, are you locking up or experiencing any other negative side effects? I.e., is the resulting audio blank?



By: Mark Murawski (kobaz) 2021-06-15 11:27:41.346-0500

Also, are you able to reproduce this on master as well as on the latest 16 branch?

If it's not reproducible there, then at least there are some bisection possibilities to find code changes that may have introduced a regression/bug.


By: N A (InterLinked) 2021-06-15 11:36:07.344-0500

With MixMonitor, the spam is there, but the call and the recordings seem to be fine, so I'm not sure why the spam appears. There seems to be a problem with Record, or at least one that affects Record but not MixMonitor.

I tested this on 18.4; is there a specific version you want me to try, like 16.17?

In my experience, this kind of thing has actually occurred less and less over time: lots of "Exceptionally long queuing..." on 13, less on 16, and even less on 18. So I think these are being weeded out over time, but this one obviously hasn't been.

By: Mark Murawski (kobaz) 2021-06-15 11:42:55.524-0500

Yes, the latest 16 branch (16.18), and also master.



By: N A (InterLinked) 2021-06-15 12:54:21.285-0500

What I described with Record() and MixMonitor() is the same on 16.18.
Ditto for master.
I don't personally think this specific issue is a regression, as IIRC this kind of thing happened quite a bit on 13 as well; it may be something that's always been there.

By: N A (InterLinked) 2021-06-24 16:23:14.707-0500

After several more crashes, the pattern I see emerging is Record() being in the picture consistently.

Asterisk has crashed 10-20 times in the past couple of weeks in different scenarios, and while some involve ConfBridge, all seem to involve Record() in some capacity.

Unfortunately I was not around for the last crash, but these are the kinds of messages spammed in the log:
{noformat}
taskprocessor.c: The 'iax2_transmit' task processor queue reached 500 scheduled tasks again.
channel.c: Exceptionally long queue length queuing to Local/0116245674665125@carrierinject-000008a1;1
func_shell.c: Failed to execute shell command '....
{noformat}

Unlike the backtraces I posted earlier, this particular deadlock is always fatal. It took 2 minutes, 46 seconds from when this started until Asterisk crashed. The logger also broke halfway through but resumed a bit later.
This one also involved ConfBridge and Record(), and we confirmed this does not happen with MixMonitor(). So I'm quite positive now that this is something specific to using ConfBridge with Record, which creates a very unstable, locked-up situation. I will see if I can experiment with this further.

By: N A (InterLinked) 2021-06-25 15:07:15.710-0500

Here is a core dump from a scenario where Asterisk crashes fatally within about 2 minutes. I took this about 10 seconds in, so it wasn't crashing yet.

By: N A (InterLinked) 2021-06-25 16:02:44.602-0500

Here is a core dump from another server, where I set up a simple test.

This is even stranger: no errors whatsoever (no "Exceptionally long queue"), but not only does CPU go to 110% and Asterisk crash within about 20 seconds, the entire server crashes. Sometimes the server crashes in the middle of taking a core dump, so I had to do this a few times and kill Asterisk in order to get the core dump before the server crashed, since /tmp is cleared on boot.

By: N A (InterLinked) 2021-06-25 16:21:09.829-0500

And here are two more core dumps, each progressively simpler than the last.

A cursory look seems to show deadlocks, but not of the ao2_lock type.

It doesn't seem to take a lot of bridging to make the server run away with ConfBridge/Record. Just a few channels are enough to send it into oblivion.