[Home]

Summary:ASTERISK-19493: ChanSpy onto a Local channel can leave a hung channel
Reporter:laszlovl (lvl)Labels:
Date Opened:2012-03-06 18:25:58.000-0600Date Closed:2012-03-16 15:23:19
Priority:MajorRegression?
Status:Closed/CompleteComponents:Applications/app_chanspy
Versions:10.5.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-19501 Channel group on a Local channel not released after it has been Chanspy'd upon
Environment:CentOS 6, 64bitAttachments:( 0) bt2.txt
( 1) bugASTERISK-19493.patch
( 2) chanspy_bt.txt
( 3) debug
Description:In certain situations, the channel running ChanSpy will get stuck and hang indefinitely.

The setup is simple: agents are receiving queue calls through a local channel, supervisors run chanspy on those local channels. I can reproduce the issue pretty constantly (though not every single time) with the following setup:

- Agent receives a queue call
- Supervisor starts spying
- Queue call is hung up
- Chanspy keeps running, waiting for a new call to spy upon
- Agent receives another queue call
- Chanspy attaches to this new call
- Queue call is hung up *by the caller*
-> Chanspy attaches itself to a leg that has already been hung up
-> The channel running ChanSpy gets stuck indefinitely

A debug log is as follows:

{quote}
[Mar  7 00:21:43] VERBOSE[27510] pbx.c:   == Spawn extension (callphone, s, 11) exited non-zero on 'Local/571@phones-2017;2'
[Mar  7 00:21:43] VERBOSE[27504] app_chanspy.c:   == Done Spying on channel Local/571@phones-2017;2
[Mar  7 00:21:47] VERBOSE[27504] app_chanspy.c:   == Spying on channel Local/571@phones-2017;1
[Mar  7 00:21:47] NOTICE[27504] app_chanspy.c: Attaching SIP/phone52-00000003 to Local/571@phones-2017;1
{quote}

In the first line of the debug log, you can see that the Local/571@phones-2017;2 channel is hungup, so it's a mystery to me how it can attach to Local/571@phones-2017;1 in the first place: the ;1 channel was hung up simultaneously with the ;2 channel.

At this point, ChanSpy is spying on a channel that doesn't actually exist anymore, and will gladly keep doing so indefinitely. As soon as you hang up the channel you'll see the following in the logs:

{quote}
[Mar  7 00:22:25] WARNING[27486] chan_sip.c: Autodestruct on dialog '44c1a9b536c14383932b5a461e89d227' with owner in place (Method: BYE)
[Mar  7 00:23:21] NOTICE[27486] chan_sip.c: Disconnecting call 'SIP/phone52-00000003' for lack of RTP activity in 62 seconds
{quote}

The channel is not actually hung up though and will remain in 'core show channels' indefinitely:
{quote}
Channel              Location             State   Application(Data)            
SIP/phone52-00000003 s@bargeinsub:4       Up      ChanSpy(Local/571@phones)
{quote}

'core show locks' reports no locks, will attach backtrace in a moment. I've tested with both r357815 and r357834 (see http://svnview.digium.com/svn/asterisk/trunk/apps/app_chanspy.c?view=log) patched in, but neither solves the issue. If more debug info is needed, please let me know!
Comments:By: laszlovl (lvl) 2012-03-06 18:26:43.591-0600

GDB backtrace

By: Michael L. Young (elguero) 2012-03-07 08:15:37.062-0600

I wonder if this is related to ASTERISK-19461?  There was a fix committed that you might want to try out and see if it resolves this.

By: laszlovl (lvl) 2012-03-07 08:21:23.418-0600

I actually already tried the patch for that issue (r357815), it did fix other ChanSpy related issues that we were suffering from, but I can still reproduce this specific scenario with that patch applied.

By: laszlovl (lvl) 2012-03-08 07:36:58.519-0600

I created a very simple environment in which the issue can be reproduced. Tested just now on a trunk checkout, r358576. A new observation is that the bug only appears to manifest when spying on a Local channel - spying directly on the SIP channel seems to work fine!

Create 3 sip users, 'caller', 'agent' and 'supervisor' and log onto them with a SIP phone.

Create extensions.conf:
{quote}
[incoming]
exten => caller,1,Queue(queue1)
exten => supervisor,1,ChanSpy(Local/s@phones,qs)

[phones]
exten => s,1,Dial(SIP/agent)
{quote}

Queues.conf:
{quote}
[queue1]
member => Local/s@phones/n
{quote}

Then:
- From the supervisor phone, dial 'supervisor' to start ChanSpy'ing
- From the caller phone, dial 'caller' to get in the queue
- Answer the queue call on the agent phone
- Hang up the queue call on the 'caller' phone

Around 33% of the times the bug will manifest:
{quote}
  -- Executing [supervisor@incoming:1] ChanSpy("SIP/supervisor-00000000", "Local/s@phones,qs") in new stack
   -- Called SIP/agent
   -- SIP/agent-00000002 is ringing
   -- Local/s@phones-792b;1 is ringing
 == Spying on channel Local/s@phones-792b;2
NOTICE[31078]: app_chanspy.c:487 start_spying: Attaching SIP/supervisor-00000000 to Local/s@phones-792b;2
   -- SIP/agent-00000002 answered Local/s@phones-792b;2
   -- Local/s@phones-792b;1 answered SIP/caller-00000001
 == Spawn extension (incoming, caller, 1) exited non-zero on 'SIP/caller-00000001'
 == Spawn extension (phones, s, 1) exited non-zero on 'Local/s@phones-792b;2'
 == Done Spying on channel Local/s@phones-792b;2
 == Spying on channel Local/s@phones-792b;1
NOTICE[31078]: app_chanspy.c:487 start_spying: Attaching SIP/supervisor-00000000 to Local/s@phones-792b;1

WARNING[31071]: chan_sip.c:3898 __sip_autodestruct: Autodestruct on dialog '339b60a20b774253bac4c5c044bc6f4c' with owner in place (Method: BYE)
NOTICE[31071]: chan_sip.c:26824 check_rtp_timeout: Disconnecting call 'SIP/supervisor-00000000' for lack of RTP activity in 62 seconds

Channel              Location             State   Application(Data)            
SIP/supervisor-00000 supervisor@incoming: Up      ChanSpy(Local/s@phones,qs)
{quote}

By: laszlovl (lvl) 2012-03-08 07:38:31.564-0600

Attaching a second - more complete - backtrace in bt2.txt

By: laszlovl (lvl) 2012-03-08 08:31:30.541-0600

Just like in ASTERISK-19501, the bug only seems to manifest when ChanSpy (in its initial attempt) decides to spy on the *;2* leg of the Local channel; when it starts out spying on the *;1* leg, everything is fine.

There must be some connection here, but I don't know where to look next.

By: laszlovl (lvl) 2012-03-08 09:25:00.748-0600

Add debug log, which reveals the following:

{quote}
[Mar  8 16:04:39] DEBUG[31362] autochan.c: Created autochan 0x143e758 to hold channel Local/s@phones-cd93;2 (0x1453ad0)
[Mar  8 16:04:39] DEBUG[31362] channel.c: Scheduling timer at (50 requested / 50 actual) timer ticks per second
[Mar  8 16:04:39] DEBUG[31362] audiohook.c: Failed to get 160 samples from read factory 0x7f86d17dfab8
[Mar  8 16:04:39] DEBUG[31362] audiohook.c: Failed to get 160 samples from write factory 0x7f86d17e05f8
{quote}

(the last two lines repeated alot of times)

By: Jonathan Rose (jrose) 2012-03-15 15:12:36.821-0500

I've been working on this for a decent chunk of time today...

Here's basically the bulk of what I've figured out so far:
This bug happens because one of the local channels fails to get destroyed. It appears to be gone if you just look at 'core show channels', but in reality it never reached its destructor.

I hacked up some stuff to back this up:

*CLI>   == Using SIP RTP CoS mark 5
   -- Executing [037@sipphones:1] Queue("SIP/123-0000000a", "markq") in new stack
   -- Started music on hold, class 'default', on SIP/123-0000000a
   -- Executing [s@phones:1] Dial("Local/s@phones-3ff9;2", "SIP/gold") in new stack
 == Using SIP RTP CoS mark 5
   -- Called SIP/gold
   -- Local/s@phones-3ff9;1 connected line has changed. Saving it until answer for SIP/123-0000000a
   -- spy on Local/s@phones-3ff9;2
 == Spying on channel Local/s@phones-3ff9;2
[2012-03-1515:09:24] NOTICE[28193]: app_chanspy.c:484 start_spying: Attaching SIP/101-00000005 to Local/s@phones-3ff9;2
   -- SIP/gold-0000000b is ringing
   -- Local/s@phones-3ff9;1 is ringing
   -- Local/s@phones-3ff9;1 connected line has changed. Saving it until answer for SIP/123-0000000a
   -- SIP/gold-0000000b answered Local/s@phones-3ff9;2
   -- Local/s@phones-3ff9;1 answered SIP/123-0000000a
   -- Stopped music on hold on SIP/123-0000000a
 == Spawn extension (sipphones, 037, 1) exited non-zero on 'SIP/123-0000000a'
   -- channel destroyed: SIP/123-0000000a
   -- channel destroyed: SIP/gold-0000000b
 == Spawn extension (phones, s, 1) exited non-zero on 'Local/s@phones-3ff9;2'
 == Done Spying on channel Local/s@phones-3ff9;2
   -- channel destroyed: Local/s@phones-3ff9;2
   -- spy on Local/s@phones-3ff9;1
 == Spying on channel Local/s@phones-3ff9;1
[2012-03-1515:09:30] NOTICE[28193]: app_chanspy.c:484 start_spying: Attaching SIP/101-00000005 to Local/s@phones-3ff9;1

You'll notice that s@phones-3ff9;1 never gets destroyed and then the chanspy immediately jumps to the ;1 when it gets back into the wait loop.

Since ;1 is some kind of nearly dead but never dying channel (not a zombie) at this point, it's going to keep watching this broken channel forever...

From my experience, this only seems to happen when chanspy gets bound to ;2 at the start.

Honestly, I'm really not sure if local channels are even meant to be used in this kind of way.  I'm currently working on figuring out why the first local channel stays 'alive' in these instances.

By: Jonathan Rose (jrose) 2012-03-16 11:35:44

This patch is a present from me to you. Try it out and if you would, drop a little feedback on me.

https://reviewboard.asterisk.org/r/1819/

By: laszlovl (lvl) 2012-03-16 18:43:39.038-0500

I can't test it right now because someone broke my development machine, I'll definitely get back to it next week though to confirm. Cheers!

By: laszlovl (lvl) 2012-03-20 18:10:23.043-0500

Yep, the issue is gone in my test scenario aswell!