[Home]

Summary:ASTERISK-26632: core: Possibility of a frame "imbalance" leading to stuck channels.
Reporter:Mark Michelson (mmichelson)Labels:
Date Opened:2016-11-30 12:57:36.000-0600Date Closed:2017-02-02 23:28:28.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Constant
Related
Issues:
Environment:Attachments:
Description:Recently, a commit to Asterisk introduced an API for frame deferral. This API was then subsequently implemented in several areas of the code where manual frame deferral had previously been performed. After merging these changes, the confbridge_recording test started consistently failing.

The reason for the failure is a bit strange. In order to understand it, we'll need some background both on the test as well as some of the workings in Asterisk.

The test is one of the prototype tests for what would eventually become the AppTest module in the testsuite. A local channel is used to control how the test operates. The ;2 channel is the "business end" of the channel and is in the application being tested. The ;1 channel is under the control of the test suite and can be redirected (via AMI ) to various extensions in the dialplan. The extensions allow for media to be played into the application under test, or play DTMF, or just wait.

In this test, the ;1 channel needed to be redirected a few times throughout the test. Looking at the logs, the first redirect out of the Wait() application worked fine. The second redirect out of Wait() ... did not. The ;1 channel ended up stuck in the Wait() despite the attempt to redirect it.

Now let's dive into what's going on in Asterisk during a Wait() and an attempted redirect.

The channel that is waiting starts by being told to defer frames. It then sits in a loop where it calls ast_waitfor(), followed by ast_read(). If ast_read() returns NULL, then the loop is broken, frame deferral completes, and the wait is complete.

The Redirect operation works by setting the AST_SOFTHANGUP_ASYNCGOTO flag on the channel, then queuing an AST_FRAME_NULL frame and sending SIGURG in order to wake the channel up from ast_waitfor() or any other blocking operation that may be occurring on the channel.

So what happens to the waiting channel when the redirect occurs? This is what happens:
# Since the channel is currently in ast_waitfor(), it wakes up due to the AST_FRAME_NULL being queued.
# The waiting channel then calls ast_read()
# ast_read(), seeing that the channel has a softhangup flag set, queues an AST_CONTROL_END_OF_Q frame on the channel.
# The ast_read() operation returns the AST_FRAME_NULL that the redirect operation had queued.
# The wait loop frees the AST_FRAME_NULL and starts its loop again.
# The ast_waitfor() breaks because of the AST_CONTROL_END_OF_Q frame that was queued.
# The waiting channel then calls ast_read()
# ast_read(), seeing that the channel has a softhangup flag set, queues an AST_CONTROL_END_OF_Q frame on the channel. However, since there already is an AST_CONTROL_END_OF_Q frame on the channel's read queue, no additional frame is actually queued.
# ast_read(), seeing that AST_CONTROL_END_OF_Q is the frame it is going to return, instead returns NULL
# The wait loop sees that it has a NULL return from ast_read(), breaks out of its loop, requeues any deferred frames, and returns.
# The PBX code clears the AST_SOFTHANGUP_ASYNCGOTO flag and directs the channel to its new destination in the dialplan.

This is the ideal situation. During the test, the initial AMI redirect worked exactly like above. However, the next AMI redirect didn't quite end up that way. As it turned out, the channel read queue already had a frame in it (specifically an AST_FRAME_VOICE frame). So at step 4, ast_read() returned the AST_FRAME_VOICE frame. At step 9, it returns the AST_FRAME_NULL that was queued by the redirect operation. After that point, the waiting channel becomes stuck forever in the ast_waitfor() operation. No further frames can be queued onto the channel since it already has an AST_CONTROL_END_OF_Q on it, and this seems to prevent the channel from waking up any more.

After performing some investigations, it was discovered that the frame deferral API had a difference from the manual frame deferrals that had been performed previously. The frame deferral API would place a single call to ast_queue_frame_head() with all deferred frames. The manual deferrals would loop through the deferred frames and call ast_queue_frame_head() on each individual frame. Restoring this behavior caused the test to pass.

However, the question is, why would queuing all deferred frames at once cause a channel to end up "stuck"? It would seem that there is some sort of imbalance that can occur where all frames that end up queued do not end up being read. The ast_queue_frame() function can take either a single frame or a list of frames. ast_read() always returns a single frame, though. In ast_read(), there is the following:

{code}
       /* if the channel driver returned more than one frame, stuff the excess
          into the readq for the next ast_read call
       */
       if (AST_LIST_NEXT(f, frame_list)) {
           ast_queue_frame(chan, AST_LIST_NEXT(f, frame_list));
           ast_frfree(AST_LIST_NEXT(f, frame_list));
           AST_LIST_NEXT(f, frame_list) = NULL;
       }
{code}

There are two suspicious parts to this, based on what I read:
# The call to ast_queue_frame() will end up doing nothing if the channel has an AST_CONTROL_END_OF_Q frame at the end of its read queue.
# Since f is a frame on the channel read queue, presumably performing the ast_frfree() is just going to end up freeing all the frames we just queued onto the read queue. (This one I'm not as sure about. I think I must be misreading code somewhere).

This is worth investigating in order to determine what the issue actually is here and what can be done to fix it. If you wish to reproduce the problem, then revert https://gerrit.asterisk.org/#/c/4530 and run the tests/apps/confbridge/confbridge_recording test from the testsuite.
Comments:By: Mark Michelson (mmichelson) 2016-11-30 13:29:10.699-0600

After looking a bit more, the quoted code section will never be run. The reason is that if a channel driver returns multiple frames, those all get put onto the channel's frame queue in sequence. When a proper frame is found to return, it alone is removed from the channel frame queue. Therefore, AST_LIST_NEXT(f, frame_list) will always be NULL, so there will never be any re-queuing occurring.

It appears the problem may be more basic than originally thought. If you queue multiple frames onto a channel, we only write to the channel alert pipe a single time (and can only set the channel timingfd to continuous mode once). This means that if we queue multiple frames at once, it appears that we will only actually end up reading one of those frames.

In the case of the failing test, since we likely were queuing two frames at once when frame deferral completed, it meant that reading of frames was out of sync by a single frame. And the redirect logic will only end up working properly if the channel read queue is empty.

By: Friendly Automation (friendly-automation) 2017-02-02 23:28:29.342-0600

Change 4858 merged by George Joseph:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4858|https://gerrit.asterisk.org/4858]

By: Friendly Automation (friendly-automation) 2017-02-02 23:55:04.656-0600

Change 4883 merged by zuul:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4883|https://gerrit.asterisk.org/4883]

By: Friendly Automation (friendly-automation) 2017-02-02 23:55:14.943-0600

Change 4877 merged by zuul:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4877|https://gerrit.asterisk.org/4877]

By: Friendly Automation (friendly-automation) 2017-02-03 05:33:24.226-0600

Change 4880 merged by Joshua Colp:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4880|https://gerrit.asterisk.org/4880]