[Home]

Summary:ASTERISK-01891: Mixmonitor stopped storing
Reporter:jalsot (jalsot)Labels:
Date Opened:2006-02-10 01:44:53.000-0600Date Closed:2006-04-11 16:36:50
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Applications/app_mixmonitor
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) patch.mixmonitor1
( 1) patch.mixmonitor2
Description:I noticed an that one of my conversations monitored by mixmonnitor got incomplete.
The case:
- 3 participiants in a Meetme room (SIP phones - Grandstream BT100)
- mixmonitor started from console for one of the participiants channel
 mixmonitor start SIP/xxx.net-08221f00 /tmp/conference.wav
- the call took about 40 minutes
- converence.wav had only 15 min 16 seconds, 14657644 bytes

There was no filesystem problem, no space problems.
I'm a bit worried what can be the problem, probably I should go back to old monitor.
Comments:By: jalsot (jalsot) 2006-02-11 12:56:04.000-0600

Another file found:
Real duration: 1495 seconds
recorded file duration: 517.3 seconds

This time the Mixmonitor application was called [not from console but from dialplan] and not PCM wav, but GSM encoded wav was used.

It seems, mixmonitor is useless...

By: BJ Weschke (bweschke) 2006-02-13 21:11:18.000-0600

any debug or trace information you can append to this issue for when it is happening? might help us to troubleshoot the problem or at least understand why app_mixmonitor is ending early.

By: jalsot (jalsot) 2006-02-16 11:47:20.000-0600

Unfortunately I wasn't able to artificially induce the problem. In logs I didn't find anything, no valuable details. I'll try further...

Maybe an info: on console I didn't see that mixmonitor ended, only when the call was hanged up.

By: vortex_0_o (vortex_0_o) 2006-02-17 04:24:41.000-0600

Now using show version files app_mixmonitor.c Revision 7740
Tried it quickly with about 10 simultaneous recordings storing in g729 format and the recording files were the correct size.

Last time there did not seem to be any pattern in it though and the error rate was low.

Can we please leave this bug open for the moment.

By: BJ Weschke (bweschke) 2006-02-17 07:43:12.000-0600

Sure - Recommend that we leave it open until sometime next week and if we still cannot reproduce, then we'll close it.

With all testing though - let's make certain we're using the latest version of /trunk as there have been some significant patches to the code that have corrected issues with MixMonitor.

By: jalsot (jalsot) 2006-02-17 07:47:29.000-0600

Unfortunatly this is a production box. No way using /trunk as it is unstable and quickly changing. What I can guarantee, tests on latest 1.2 branch.
Would it be possible to backport those changes to the stable 1.2 branch?

By: BJ Weschke (bweschke) 2006-02-17 07:54:04.000-0600

latest 1.2 branch should be fine too if we're talking bug fixes.

By: vortex_0_o (vortex_0_o) 2006-02-20 03:04:06.000-0600

I am also always using latest 1.2

Considering taking the risk again and trying mixmonitor in a live environment.

anyone forsee a problem with using both Monitor and Mixmonitor in the dialplan? (just to be on the safe side)
...
exten => _0[1-9].,3,SetCDRUserField(${CALLFILE}.abc)
exten => _0[1-9].,4,MixMonitor(${CALLFILE}.g729|bv(1)V(1))
exten => _0[1-9].,5,Monitor(gsm,${CALLFILE},bm)
...

By: TÓTH, Csaba (tcsaba) 2006-02-21 13:38:20.000-0600

I have got the same problem with asterisk 1.2.1 - 1.2.4.

On my box work 20-30 operator in the same time.
I use MixMonitor to store the conversations.
I tried MinMonitor with the 'b' and without the 'b' switch. The recording stopped lot's of times after 17-18 seconds.

I can't use the old Monitor application, because the two soundchannel (in and out) lost the synchron with each other.

This issue is very important for me, because my customers need the correct recorded files.

Thanks for your help...

By: jalsot (jalsot) 2006-02-22 07:25:20.000-0600

With latest branches/1.2 [r10577] I still having the problem. No any valuable information. It happened with IAX2->Zap [PRI] and also with SIP->SIP. Nothing appears in logs [this is a production box, I cannot set debug 9].
Unfortunately I still don't know how to generate this issue artificially.

No other issues seen on the box [Intel P4 3GHz, 1GB, 3Ware SATA RAID1, Ubuntu 5.04, linux-2.6.12.5]

By: Eldad Ran (eldadran) 2006-02-28 05:51:51.000-0600

I have the same problem, running debug got me this:
Spy 'MixMonitor' on channel 'SIP/mygateway-XYZW' write queue too long, dropping frames
I hope this will give you some direction.

By: reformed (reformed) 2006-03-01 15:56:03.000-0600

I just got reports from users that they recordings were stopped even though the call stil continued.

Found cut 12 recording in the space of 5 days.
All cut recording ware SIP->ZAP calls.

We didnt find any ZAp->ZAP recording that were cut...

By: kev (kev) 2006-03-02 00:08:24.000-0600

I also have this problem.  I seem to be able to induce it if I press lots of dtmf buttons during the call (using SIP to dial into asterisk).  I'll try to get some more information if I can.

By: kev (kev) 2006-03-02 00:40:57.000-0600

I haven't been able to get any debug info, but I do have a case that causes it.

It seems that it also requires changing contexts or extensions, as well as the pressing of dtmf buttons.  I put this in my default sip context:

exten => testtest,1,Wait,1
exten => testtest,n,Answer
exten => testtest,n,Set(CALLFILENAME=${EXTEN}-${TIMESTAMP})
exten => testtest,n,Monitor(wav,${CALLFILENAME})
exten => testtest,n,MixMonitor(${CALLFILENAME}-mix.wav)
exten => testtest,n,Wait,1
exten => testtest,n(loop),Playback(one-moment-please)
exten => testtest,n,Goto(test2,s,1)             ; problem observed
;exten => testtest,n,Goto(loop)                 ; no problem observed

[test2]

exten => s,1,Playback(one-moment-please)
exten => s,n,Goto(test2,s,1)                    ; problem observed
;exten => s,n,Goto(s,1)                         ; problem observed
;exten => s,n,Goto(1)                           ; no problem observed

If I repeatedly press buttons in my sip softphone (linphone) while this is looping (I hold down the key on the keyboard, and let the keyboard repeat rate flood asterisk with dtmf signals), then I can cause it in under a minute usually.  Just letting it sit there does not cause it, the buttons seem to me to be required.

I have lablled each of the Gotos according to whether I have observed the problem or not.  If the goto changes context or extension, then the problem can occur - but if not, then it's fine.

By: reformed (reformed) 2006-03-02 01:04:49.000-0600

It seems that it could be caused in the following statement:

while (1) {
if (!(f = ast_channel_spy_read_frame(&spy, SAMPLES_PER_FRAME)))
   break;


Under some conditions there will be no frames returned to MixMonitor.
Maybe it could be due to no frames beeing available to read.
It could be cause momentary loss of connectivity between with one of the SIP and since asterisk uses sync clocking it stops both streams long enough for MixMonitor to stop at ast_channel_spy_read_frame.

Any guru's willing to tell me that i'm talking b*****it? :) I hope i'm

Short of changing the taement to this at least for testing i'm not sure what could be done:
while (1) {
if (!(f = ast_channel_spy_read_frame(&spy, SAMPLES_PER_FRAME)))
 if (!(f = ast_channel_spy_read_frame(&spy, SAMPLES_PER_FRAME)))
   if (!(f = ast_channel_spy_read_frame(&spy, SAMPLES_PER_FRAME)))
      break;

If there are no frames to read MixMonitor will go though all the if's and finally finish with a break, but if there is something to read it would pick it up in later statements.

It would probably be better handled in for loop with sleep between retries :)

By: Tiszai (ptiszai) 2006-03-02 07:14:16.000-0600

May be problem is this?:
channel.c:
struct ast_frame *ast_channel_spy_read_frame(struct ast_channel_spy *spy, unsigned int samples):
if ((spy->read_queue.samples < samples) || (spy->write_queue.samples < samples))
return NULL;

By: Roberto Lopes (rflopes3) 2006-03-06 11:30:17.000-0600

Friends, the problem occurs on the following line:

ast_channel_spy_trigger_wait(&spy);

Sometime (i don't know why) the app stops on this line and NEVER leave.
This can be confirmed putting logs before and after the call for this function. When the app_mixmonitor stops recording, is bacause it's "waiting" some trigger that never comes again. Only when the call hangs up, it leaves this function.

Any suggestions ?

By: Roberto Lopes (rflopes3) 2006-03-06 11:36:56.000-0600

...and correctme if I'm wrong, this is not a minor but a major BUG because makes this app useless!!

By: Roberto Lopes (rflopes3) 2006-03-06 17:34:21.000-0600

Surelly its a dead lock.

By: Johnathan Corgan (jcorgan) 2006-03-10 11:52:07.000-0600



By: Tilghman Lesher (tilghman) 2006-03-11 08:35:59.000-0600

It's not a deadlock.  Basically, the problem is that frames stop getting queued to the spy for some reason and because of that, no signal is sent to wake up the mechanism which duplicates and hands frames off to the child spies.  What you're seeing with ast_channel_spy_trigger_wait() is a symptom of the underlying problem, not the problem itself.

By: Tabaré Salvagno (linuxfrik) 2006-03-11 14:40:40.000-0600

Recently I detect this same behaviour/problem. I agree with Corydon76 that the real problem is that once enter "ast_channel_spy_trigger_wait()" (which mormally waits till the spyer has some frames to process) it never leaves because no frames are ready to process therefore no condition is really fullfilled to leave the "wait" . I've not done any debug yet but I guess that probably something is happening at method "queue_frame_to_spies".

Some test that seems to lead to this problem were:

1. Mute on any of the both channels/ends.
2. Low microphone volume or some loudness that automatically turns the microphone volume down.
3. Silence on the channels, here we have some interesting behaviour:

 While SILENCE never leaves ast_channel_spy_trigger_wait (logic, no frames or any frames count condition) and most of the times when NO SILENCE the normal process continues (has frames and process them) BUT another times NOT. Note that probably there is a relation between this issue and the channels participants hardware or software; for instance the scenario for my TESTS was:

a. Using two Xten softphones.
b. Using propietary softphone based on Microsoft RTC and a Xten softphone.


Hope will be useful.

By: Tabaré Salvagno (linuxfrik) 2006-03-11 15:53:05.000-0600

Maybe the issue is here:

At channel.c ("queue_frame_to_spies") there is a condition "if (queue->samples > SPY_QUEUE_SAMPLE_LIMIT)..." and at certain moment it always enters here therefore never initiate the TRIGGER to continue processing frames,

switch (ast_test_flag(spy, CHANSPY_TRIGGER_MODE)) {
case CHANSPY_TRIGGER_READ:
if (dir == SPY_READ)
ast_cond_signal(&spy->trigger);
break;
case CHANSPY_TRIGGER_WRITE:
if (dir == SPY_WRITE)
ast_cond_signal(&spy->trigger);
break;
}

By: Tabaré Salvagno (linuxfrik) 2006-03-11 17:10:30.000-0600

Well, now I'm quite sure the problem is with this line "if (other_queue->samples == 0)" it seems that here triggers a flush, obviously at one moment it never occurs cause the samples queue has increased. I came up with a easy/silly test, commented the "if (other_queue->samples == 0)" and so far the issue never appeared. I must say that is not a SOLUTION at all but I hope will help asterisk folks to easily solve this issue.

By: Tabaré Salvagno (linuxfrik) 2006-03-12 00:28:14.000-0600

Last test, I noticed that the queue samples count never decreases, teorically this code was intented to do that but...

while (queue->samples > SPY_QUEUE_SAMPLE_LIMIT) {
                               struct ast_frame *drop = queue->head;

                               queue->samples -= drop->samples;
                               queue->head = drop->next;
                               ast_frfree(drop);
                       }

if queue->samples > SPY_QUEUE_SAMPLE_LIMIT condition to drop frames I wonder what's the point to drop and decrease samples count just till the LIMIT count? Doing " while (queue->samples > 0)..." seems to do the job.

By: Peng Yong (ppyy) 2006-03-12 10:55:11.000-0600

linuxfrik is right. i debug the code and i think i find the problem.

it will stop and not trigger in following condition:

spy->read_queue.samples > SPY_QUEUE_SAMPLE_LIMIT  && spy->write_queue.samples != 0

then * drop some frame from read_queue and do nothing. and looping....

it happens when the write_queue was mute for some time, then the read_queue is big than SPY_QUEUE_SAMPLE_LIMIT. it's bad if a write frame arive at this moment.

i upload two patchs. one remove the code which drop frames ,and always flush long queue. another patch drop frame from both read and write patch.

By: Peng Yong (ppyy) 2006-03-12 11:03:58.000-0600

the mixmonitor code was write by kpfleming. i don't know why the read frames is droped instead of flushing them when the read_queue is full. maybe some one can find a better way to solve the problem.

By: Edward Eastman (whisk) 2006-03-24 04:31:32.000-0600

We've been experiencing this issue too - I'm trying patch1 and will report on results.

By: Roberto Lopes (rflopes3) 2006-03-24 07:35:10.000-0600

I have tested the patch2 on a 200 agents call center and it's working fine for 15 days.

By: BJ Weschke (bweschke) 2006-03-24 08:26:21.000-0600

ppyy - were these patches intended to be used together or is one preferred vs. the other?

By: Peng Yong (ppyy) 2006-03-24 09:17:10.000-0600

bweschke, only one need to apply.

and i prefer to choose first one instead of second, the first one will not drop frame when queue full, and only flush.

By: Gary Richardson (garyrichardson) 2006-03-27 13:16:25.000-0600

I noticed the patch didn't make it into 1.2.6.

I thought I'd add that I've been using the first patch on 1.2.5 since March 13th and it's been working great..

By: BJ Weschke (bweschke) 2006-03-29 22:47:11.000-0600

checked with kpfleming. he wasn't real comfortable with the first patch approach because we're probably still cloaking the real issue and the end result is still going to be a file with distorted audio (albeit one that does go to the end of the call now) I think we still need to understand what the underlying conditions are that are causing the system to get into this state. Has anyone been able to gather any more info towards that objective?

By: Tabaré Salvagno (linuxfrik) 2006-03-30 00:35:12.000-0600

Well bweschke, please refer to all my notes to see if what I described makes any sense at all of what the source of this issue could be and how to be easily reproduced. Seems to me that the problem is related to when samples are consumed and when not , what to do when the samples queue is full, frames drop procedures but it's quite clear that the problem occures when the TRIGGER is never initiated because some condition (e.g. 'other_queue->samples == 0').



By: Kevin P. Fleming (kpfleming) 2006-04-11 16:36:43

This should be corrected in 1.2 and trunk now; the analysis in this bug seems to be mostly correct; the queues could into a state where one was full but the other was not empty, so a flush would not get triggered.