[Home]

Summary:ASTERISK-17346: [patch] MIXMON_ARGS not processed when call being monitored via chanspy
Reporter:Jeremy Kister (jkister)Labels:
Date Opened:2011-02-03 11:02:51.000-0600Date Closed:2011-03-17 14:05:44
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_mixmonitor
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) console-extenspy.txt
( 1) console-regular.txt
( 2) extensions.conf
( 3) sip.conf
Description:x146 is talking to an outside line
x143 uses extenspy to listen on x146

x146 hangs up the call to outside line (while x143 is still monitoring)

${MIXMON_ARGS} is not honored.
Comments:By: Jeremy Kister (jkister) 2011-02-04 00:08:18.000-0600

to be clear, it's the <command> section of MixMonitor(<file>.<ext>[,<options>,<command>]) that is not ran when an extenspy is active.  I just happened to use the variable ${MIXMON_ARGS}.

By: John Covert (jcovert) 2011-02-04 20:56:06.000-0600

This is a deadlock of some sort

When there is an ExtenSpy on top of a recording (something that is VERY important for my client to allow) we get stuck:

       /* Datastore cleanup. close the filestream and wait for ds destruction */
       ast_mutex_lock(&mixmonitor->mixmonitor_ds->lock);
       mixmonitor_ds_close_fs(mixmonitor->mixmonitor_ds);
       if (!mixmonitor->mixmonitor_ds->destruction_ok) {
               ast_cond_wait(&mixmonitor->mixmonitor_ds->destruction_condition, &mixmonitor->mixmonitor_ds->lock);
       }
       ast_mutex_unlock(&mixmonitor->mixmonitor_ds->lock);

destruction_ok is false; we go into ast_cond_wait and never return

This would also explain one of my client's memory leaks on their main production system.

The thread and all of its storage remain active forever.

By: Leif Madsen (lmadsen) 2011-02-07 12:06:26.000-0600

Awesome thanks for some extra information John!

The next step is likely going to be getting a 'core show locks' and backtrace from the running process once you've reproduced the issue. Thanks!

By: John Covert (jcovert) 2011-02-07 14:08:50.000-0600

It's actually not a "lock" deadlock.

It's trivial to reproduce:
exten => 12649901,1,Goto(ringlocals,s,1)

exten => 12649902,1,MixMonitor(jrctest,,/usr/bin/say MixMonitor done)
exten => 12649902,n,Goto(12649901,1)

exten => 12649903,1,Answer()
exten => 12649903,n,ExtenSpy(s@ringlocals)

[ringlocals]
exten => s,1,Dial(SIP/localsjphone,120)
exten => s,2,Macro(fastbusy)

Example with ExtenSpy **NOT** active

   -- Executing [12649902@inbound-cnet-264:1] MixMonitor("IAX2/mainpbx-8025", "jrctest,,/usr/bin/say MixMonitor done") in new stack
 == Autochan JRC Created autochan 0x4c5ebd0 to hold channel IAX2/mainpbx-8025 (0x18d3110)
   -- Executing [12649902@inbound-cnet-264:2] Goto("IAX2/mainpbx-8025", "12649901,1") in new stack
   -- Goto (inbound-cnet-264,12649901,1)
   -- Executing [12649901@inbound-cnet-264:1] Goto("IAX2/mainpbx-8025", "ringlocals,s,1") in new stack
   -- Goto (ringlocals,s,1)
   -- Executing [s@ringlocals:1] Dial("IAX2/mainpbx-8025", "SIP/localsjphone,120") in new stack

 == Begin MixMonitor Recording IAX2/mainpbx-8025
   -- Called localsjphone
   -- SIP/localsjphone-00000000 is ringing
   -- SIP/localsjphone-00000000 is ringing
   -- SIP/localsjphone-00000000 is ringing
   -- SIP/localsjphone-00000000 answered IAX2/mainpbx-8025
 == Spawn extension (ringlocals, s, 1) exited non-zero on 'IAX2/mainpbx-8025'
 == MixMonitor jrc debug point before call to audiohook_unlock 18d4000
 == MixMonitor jrc debug point after call to audiohook_unlock
 == Autochan jrc Removed autochan 0x4c5ebd0 from the list, about to free it
 == MixMonitor jrc debug point after call to autochan_destroy
 == MixMonitor jrc debug before call to ds_close_fs 18d4000
 == MixMonitor close filestream
 == (Partial Workaround-moved up) Executing [/usr/bin/say MixMonitor done]
   -- Hungup 'IAX2/mainpbx-8025'
 == MixMonitor jrc debug point mixmonitor_ds_destroy entry
 == MixMonitor jrc debug point before cond_wait (!destruction_ok)
 == MixMonitor jrc debug point mixmonitor_ds_destroy exit
 == MixMonitor jrc debug point after cond_wait
 == MixMonitor jrc debug point before kill the audiohook
 == MixMonitor jrc debug point after kill the audiohook
 == End MixMonitor Recording IAX2/mainpbx-8025
 == MixMonitor jrc debug point free 18d4000

***

Now, with ExtenSpy running:

   -- Executing [12649902@inbound-cnet-264:1] MixMonitor("IAX2/mainpbx-485", "jrctest,,/usr/bin/say MixMonitor done") in new stack
 == Autochan JRC Created autochan 0x4c82db0 to hold channel IAX2/mainpbx-485 (0x18e5310)
   -- Executing [12649902@inbound-cnet-264:2] Goto("IAX2/mainpbx-485", "12649901,1") in new stack
   -- Goto (inbound-cnet-264,12649901,1)
   -- Executing [12649901@inbound-cnet-264:1] Goto("IAX2/mainpbx-485", "ringlocals,s,1") in new stack
   -- Goto (ringlocals,s,1)
   -- Executing [s@ringlocals:1] Dial("IAX2/mainpbx-485", "SIP/localsjphone&SIP/x26&SIP/x27,120") in new stack
   -- Called localsjphone
[Feb  7 14:54:01] WARNING[23331]: app_dial.c:2039 dial_exec_full: Unable to create channel of type 'SIP' (cause 20 - Unknown)
[Feb  7 14:54:01] WARNING[23331]: app_dial.c:2039 dial_exec_full: Unable to create channel of type 'SIP' (cause 20 - Unknown)
 == Begin MixMonitor Recording IAX2/mainpbx-485
 == Autochan JRC Created autochan 0x4c83fd0 to hold channel IAX2/mainpbx-485 (0x18e5310)
   -- SIP/localsjphone-00000001 is ringing
   -- <IAX2/mainpbx-643> Playing 'spy-iax2.ulaw' (language 'en')
 == Spying on channel IAX2/mainpbx-485
[Feb  7 14:54:02] NOTICE[23331]: app_chanspy.c:479 start_spying: Attaching IAX2/mainpbx-643 to IAX2/mainpbx-485
[Feb  7 14:54:02] NOTICE[23331]: app_chanspy.c:479 start_spying: Attaching IAX2/mainpbx-643 to IAX2/mainpbx-485
 == Spawn extension (ringlocals, s, 1) exited non-zero on 'IAX2/mainpbx-485'
   -- Hungup 'IAX2/mainpbx-485'
 == MixMonitor jrc debug point before call to audiohook_unlock 18eb000
 == MixMonitor jrc debug point after call to audiohook_unlock
 == Autochan jrc Removed autochan 0x4c82db0 from the list, about to free it
 == MixMonitor jrc debug point after call to autochan_destroy
 == MixMonitor jrc debug before call to ds_close_fs 18eb000
 == MixMonitor close filestream
 == (Partial Workaround-moved up) Executing [/usr/bin/say MixMonitor done]
 == Done Spying on channel IAX2/mainpbx-485
 == Autochan jrc Removed autochan 0x4c83fd0 from the list, about to free it
   -- <IAX2/mainpbx-643> Playing 'beep.ulaw' (language 'en')
 == MixMonitor jrc debug point before cond_wait (!destruction_ok)

Note that we are stuck in cond_wait because the destruction condition is never met, not because a lock is stuck.  Note also that each time we called mixmonitor with ExtenSpy active, we got a new entry for mixmonitor_thread.

jrcovert*CLI> core show threads
0x181b800 mixmonitor_thread    started at [  476] app_mixmonitor.c launch_monitor_thread()
0x18f8000 mixmonitor_thread    started at [  476] app_mixmonitor.c launch_monitor_thread()
0x1867800 mixmonitor_thread    started at [  476] app_mixmonitor.c launch_monitor_thread()

After we drop ExtenSpy, If we call MiXMonitor we get one more:

0x18e4e00 mixmonitor_thread    started at [  476] app_mixmonitor.c launch_monitor_thread()

But it is removed when this call hangs up (no dead one left behind).

By: John Covert (jcovert) 2011-02-16 17:02:58.000-0600

My customer is getting really annoyed by this one (it affects their operation) and hopes that a developer is working on it.

There's another failure associated with being stuck, even though I've implemented a work-around to make the command execute before the hang.

My customer's clients used to be able to monitor an agent with ExtenSpy, and continue listening from call to call.

Now the call doing the Extenspy remains hung listening to the dead spied upon call even after it hangs up, and doesn't automatically switch to the agent's next call.

Works fine as long as the agent's call isn't also being recorded.

/john

By: Jeremy Kister (jkister) 2011-02-16 17:12:42.000-0600

One hack for the MixMon argument may be to set up a System(lame [...]) command in the h extension..

By: John Covert (jcovert) 2011-02-16 18:16:05.000-0600

I've already dealt with that part of the problem with the following patch

app_mixmonitor.c.not-a-real-fix.patch

By: Jonathan Rose (jrose) 2011-03-10 13:44:31.000-0600

Alrighty, I'm having a look at this.

I've tested it with a couple of different options on both Dial and MixMonitor.  It seems that this bug only really happens though when Dial is used on a SIP channel (granted, the only other type of channel I've tested it with is DAHDI, so that isn't the best sample size).  In any event, in order for execution to occur normally in the meantime, you can use the following:

When you use Dial, do something like this:
Dial(<channel>,<timeout>,g)

The g option guarantees that once the channel is hung up on the receiver's side that it will continue to execute dial plan commands from where it left off in the extension.

Then, when you use MixMonitor(whateverfile.extension,<whateveroptions>, <whatever command>) make sure you use StopMixMonitor before you reach the end of the extension.

The end result is that the mix monitor will destruct properly and then the command will run.

So yeah, the behavior as it is right now isn't ideal, but you can use that to at least be able to spy normally without having to stop the spy channel each time to use your commands.  It's good practice to always use StopMixMonitor() with MixMonitor anyway.  We still want to have a better look at it though since the behavior isn't consistent between different channel types and that's less than desirable.

By: Jonathan Rose (jrose) 2011-03-11 12:26:29.000-0600

Good news.  I have patched what appears to have been the issue, at least on my end.  If any of you who have been having the problem would be so kind as to test patch.diff to see if they problems you have are still present, I'd really appreciate the feedback.

~Jon

By: Jeremy Kister (jkister) 2011-03-11 12:36:38.000-0600

I'm running 1.8.2.3, what version should your patch apply to ?

it does apply with 1.8.2.3 with fuzz.

By: John Covert (jcovert) 2011-03-11 12:55:13.000-0600

jrose, could you please explain the theory behind your fix, so that I can try to explain it to my customer before deploying the patch?

/john

By: Jeremy Kister (jkister) 2011-03-13 12:57:52

this does seem to have fixed the mixmon command on 1.8.2.3.

i did not test keeping extenspy on during multiple calls.

By: Jonathan Rose (jrose) 2011-03-14 11:22:38

It applies to trunk, but I suppose it should technically apply to any 1.8 version.

To try and explain it, what was happening before was that applications which hooked onto a channel wouldn't be appropriately destroyed until their channel was closed.  Unfortunately, when you had something like extenspy going, the channels are kept from being destroyed since they are 'in use' but not actually active so to speak.  Execution arguments typically occur during destruction events of the application thread, and the destruction signal wasn't being received on SIP channels since they weren't being closed.

Fixing it was actually very simple (which you can probably tell from the length of that patch).  Don't destroy the applications as part of the destruction process, destroy them as part of the hangup process.  While the channels aren't being destroyed during spying, the call itself is still being hung up, so there is no reason not to go ahead and destroy the accompanying applications/functions/etc bound to that call.  

I'm sorry if my explanation is a little sketchy.  I'm actually rather new to the team right now.

By: Digium Subversion (svnbot) 2011-03-14 16:17:15

Repository: asterisk
Revision: 310726

U   branches/1.8/main/channel.c

------------------------------------------------------------------------
r310726 | jrose | 2011-03-14 16:17:14 -0500 (Mon, 14 Mar 2011) | 13 lines

Moves data store destruction from channel destruction to hangup in channel.c

This moves the data store destruction and app signaling events for a call to ast_hangup so that threads which wait for data store destruction
don't become stuck forever when attached to an application/function/etc that keeps the channel open.

(closes issue ASTERISK-17346)
Reported by: jkister
Patches:
     patch.diff uploaded by jrose (license 1225)
Tested by: jkister, jcovert, jrose

Review: https://reviewboard.asterisk.org/r/1136/

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=310726

By: Jonathan Rose (jrose) 2011-03-15 08:07:30

Ignore that SVN commit message.  This has been undone for further investigation.



By: Jonathan Rose (jrose) 2011-03-17 14:01:50

Ok, this bug is now fully resolved.  I'm committing it as we speak to 1.8 and trunk.

By: Digium Subversion (svnbot) 2011-03-17 14:03:36

Repository: asterisk
Revision: 311197

U   branches/1.8/apps/app_chanspy.c

------------------------------------------------------------------------
r311197 | jrose | 2011-03-17 14:03:35 -0500 (Thu, 17 Mar 2011) | 11 lines

This fixes a nasty chanspy bug which was causing a channel leak every time a spied on channel made a call.

In addition to the above, it makes certain channel destruction occurs so that applications don't get stuck waiting for datastore destruction while monitored by chanspy.

(closes issue ASTERISK-17346)
Reported by: jkister
Tested by: jkister, jcovert, jrose

Review: http://reviewboard.digium.internal/r/106/


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=311197

By: Digium Subversion (svnbot) 2011-03-17 14:05:44

Repository: asterisk
Revision: 311198

_U  trunk/
U   trunk/apps/app_chanspy.c

------------------------------------------------------------------------
r311198 | jrose | 2011-03-17 14:05:43 -0500 (Thu, 17 Mar 2011) | 17 lines

Merged revisions 311197 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

........
 r311197 | jrose | 2011-03-17 14:03:34 -0500 (Thu, 17 Mar 2011) | 11 lines
 
 This fixes a nasty chanspy bug which was causing a channel leak every time a spied on channel made a call.
 
 In addition to the above, it makes certain channel destruction occurs so that applications don't get stuck waiting for datastore destruction while monitored by chanspy.
 
 (closes issue ASTERISK-17346)
 Reported by: jkister
 Tested by: jkister, jcovert, jrose
 
 Review: http://reviewboard.digium.internal/r/106/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=311198