[Home]

Summary:ASTERISK-21960: ooh323 channels stuck
Reporter:Dmitry Melekhov (slesru)Labels:
Date Opened:2013-06-26 05:31:05Date Closed:2013-12-19 02:25:03.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Addons/chan_ooh323
Versions:11.4.0 Frequency of
Occurrence
Related
Issues:
Environment:Centos 6/x86-64Attachments:( 0) ASTERISK-21690.patch
( 1) ASTERISK-21960-stacklockup.patch
( 2) ASTERISK-21960-stacklockup-2.patch
( 3) h323_log
Description:Hello!

I just ( about 10 minutes ago ) got link problem between asterisk and gnugk - link is down , switched to backup channel, and I got stuck channels:


ast-nsk*CLI> core show channels
Channel              Location             State   Application(Data)            
DAHDI/i1/7171-237c   2077@dahdi:5         Ring    Dial(OOH323/2077)            
DAHDI/i1/7171-237b   2170@dahdi:5         Ring    Dial(OOH323/2170)            
DAHDI/i1/7171-237e   2077@dahdi:5         Ring    Dial(OOH323/2077)            
SIP/ast-ngdu1-000000 2311@ast-ngdu1:1     Ringing AppDial((Outgoing Line))      
OOH323/2077-8965     (None)               Down    (None)                        
OOH323/2077-8963     (None)               Down    (None)                        
OOH323/2077-8974     (None)               Down    (None)                        
OOH323/2077-8970     (None)               Down    (None)                        
OOH323/2077-8971     (None)               Down    (None)                        
OOH323/4222-8954     (None)               Down    (None)                        
DAHDI/i1/7005-2379   6842@dahdi:5         Ring    Dial(OOH323/6842)            
OOH323/2041-8973     (None)               Down    (None)                        
OOH323/6842-8960     (None)               Down    (None)                        
OOH323/6278-8956     (None)               Down    (None)                        
OOH323/6278-8953     (None)               Down    (None)                        
OOH323/6278-8952     (None)               Down    (None)                        
OOH323/2073-8961     (None)               Down    (None)                        
OOH323/2073-8966     (None)               Down    (None)                        
OOH323/2073-8964     (None)               Down    (None)                        
DAHDI/i1/7103-2382   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7103-2381   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7047-2372   6278@dahdi:5         Ring    Dial(OOH323/6278)            
DAHDI/i1/7047-2371   6278@dahdi:5         Ring    Dial(OOH323/6278)            
DAHDI/i1/7047-2375   6278@dahdi:5         Ring    Dial(OOH323/6278)            
OOH323/2170-8972     (None)               Down    (None)                        
OOH323/2170-8967     (None)               Down    (None)                        
OOH323/2170-8962     (None)               Down    (None)                        
DAHDI/i1/7130-2373   4222@dahdi:5         Ring    Dial(OOH323/4222)            
OOH323/6441-8968     (None)               Down    (None)                        
OOH323/6441-8969     (None)               Down    (None)                        
DAHDI/i1/7025-237f   2073@dahdi:5         Ring    Dial(OOH323/2073)            
DAHDI/i1/7025-237d   2073@dahdi:5         Ring    Dial(OOH323/2073)            
OOH323/6441-8955     (None)               Down    (None)                        
DAHDI/i1/7025-237a   2073@dahdi:5         Ring    Dial(OOH323/2073)            
OOH323/6441-8957     (None)               Down    (None)                        
OOH323/6441-8959     (None)               Down    (None)                        
OOH323/6441-8958     (None)               Down    (None)                        
DAHDI/i1/-2315       2311@dahdi:2         Ring    Dial(SIP/2311@ast-ngdu1)      
DAHDI/i1/7103-2378   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7103-2374   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7103-2376   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7103-2377   6441@dahdi:5         Ring    Dial(OOH323/6441)            
DAHDI/i1/7171-2386   2041@dahdi:5         Ring    Dial(OOH323/2041)            
DAHDI/i1/7171-2387   2077@dahdi:5         Ring    Dial(OOH323/2077)            
DAHDI/i1/7171-2384   2077@dahdi:5         Ring    Dial(OOH323/2077)            
DAHDI/i1/7171-2385   2170@dahdi:5         Ring    Dial(OOH323/2170)            
DAHDI/i1/7171-2383   2077@dahdi:5         Ring    Dial(OOH323/2077)            
DAHDI/i1/7171-2380   2170@dahdi:5         Ring    Dial(OOH323/2170)            
48 active channels
24 active calls
9068 calls processed
ast-nsk*CLI> exit

Sorry, I restarted asterisk, just because users complained.
I don't know how to reproduce this, may be I'll have an idea ;-)
And I understand that there is no enough info :-( , but may be problem can be found just from looking into code.

patch from ASTERISK-21800 is applied.
Thank you!
Comments:By: Alexander Anikin (may213) 2013-06-26 06:05:01.581-0500

Hi Dmitry,

Please specify when stuck channels was created. They was created when link between asterisk and gk was lost or after link was restored?
I'll analyze trouble but it will be helpful if you could make same test with asterisk compiled with DEBUG_THREADS options and attach 'core show locks' output.

By: Dmitry Melekhov (slesru) 2013-06-26 06:22:05.004-0500

Hello!

To be honest- I don't know for sure when these calls were placed- after channel problem or before, just because there is storm on another end of channel ( i.e. where asterisk is ) and I see that channel changed up/down state several times before finally down and ospf switched it to backup channel, so calls can be in any state during this channels problem, real problem, imho, is that calls were not destroyed after they failed, i.e. ooh323  channels in down state... I guess this is main problem- down, but still exists from asterisk point of view...
btw, I tried ooh323 reload and it doesn't help.
Anyway, problem with placing calls when asterisk is not registered in gatekeeper is already fixed.
If I'll find a way to reproduce- I'll provide more info, but I don't sure it is possible :-( Anyway, right now I can't run asterisk with debug threads just because server hardware performance is not enough, but I'll have new server in month or so.
Thank you!



By: Alexander Anikin (may213) 2013-06-26 06:48:19.447-0500

Dmitry, thank for explanation, i will research trouble. I guess incorrect gk timeout processing there.

By: Dmitry Melekhov (slesru) 2013-06-28 05:11:14.477-0500

btw, I forgot to add that I have
rtptimeout=0
just because this timeout doesn't work correctly sometimes.
I guess that calls can be disconnected by this timeout in this case, but , anyway it doesn't work and sometimes just disconnects real calls :-)



By: Alexander Anikin (may213) 2013-08-12 11:42:04.957-0500

Hello Dmitry,

I was busy on another project but i am here now ;)
Main issue here is shutdown gatekeeper client and ooh323 stack completely if gk client error occurs.
I attach patch that change this behavior, there is new gk client state 'shutdown' and gk client reinit automatically
by monitor thread and can reinit also by 'ooh323 reload' cli command.
So please try with patch.

By: Alexander Anikin (may213) 2013-08-12 11:43:20.482-0500

Patch to auto restart gk client attached

By: Dmitry Melekhov (slesru) 2013-08-12 22:27:27.664-0500

Hello!

Thank you!
I'll apply patch and test, but test can take long time just because I don't know how to reproduce, so I need to wait for natural ;-) cause.

Thank you!


By: Alexander Anikin (may213) 2013-08-12 22:35:23.001-0500

Dmitry, i did test with using iptables on GK.
Just insert/remove DROP or REJECT rule for test system source ip in INPUT chain on GK ;)

By: Dmitry Melekhov (slesru) 2013-08-12 22:42:22.235-0500

Thank you for hint.
I'll test tomorrow then - according to internal policy I need to warn users in advance if service may be unavailable :-)

Thank you!




By: Dmitry Melekhov (slesru) 2013-08-14 01:48:53.256-0500

Hello!

After some timeout I get Restart stopped gatekeeper client
and hangup, so I hope that this problem is fixed.

Thank you!


By: Dmitry Melekhov (slesru) 2013-09-06 02:19:10.758-0500

Hello!

Wrote the same to you directly.
I tried to reproduce this by another way.
I create nonexistant peer:

{noformat}
[dm]
type=friend
;type=peer
context=default
ip=192.168.22.22
port=1720
;e164=101
disallow=all
allow=alaw
allow=ulaw
;allow=g729
fastStart=no
h245tunneling=yes
;canreinvite=no
directmedia=yes
directrtpsetup=yes
dtmfmode=inband
{noformat}

Then called it:

{noformat}
exten => 6016,1,SET(FAXOPT(t38gateway)=yes)
exten => 6016,n,Dial(OOH323/6401@dm)
{noformat}

Hear nothing for several minutes.
Then hang up.

See:
{noformat}
asterisk*CLI> core show channels
Channel              Location             State Application(Data)
OOH323/dm-176        6401@default:1       Down (None)
DAHDI/i1/6401-2ce    6016@default:2       Ring Dial(OOH323/6401@dm)
2 active channels
1 active call
{noformat}


but, what is more interesting , all other calls to ooh323 stuck too:

{noformat}
asterisk*CLI> pri show channels
PRI       B    Chan Call       PRI  Channel
Span Chan Chan Idle Level      Call Name
  1    1 Yes  Yes  Idle       No  
  1    2 Yes  Yes  Idle       No  
  1    3 Yes  No   Proceeding Yes  DAHDI/i1/6401-2ce
  1    4 Yes  Yes  Idle       No  
  1    5 Yes  Yes  Idle       No  
  1    6 Yes  Yes  Idle       No  
  1    7 Yes  Yes  Idle       No  
  1    8 Yes  No   Proceeding Yes  DAHDI/i1/6469-2d9
  1    9 Yes  Yes  Idle       No  
  1   10 Yes  Yes  Idle       No  
  1   11 Yes  Yes  Idle       No  
  1   12 Yes  Yes  Idle       No  
  1   13 Yes  Yes  Idle       No  
  1   14 Yes  Yes  Idle       No  
  1   15 Yes  Yes  Idle       No  
  1   17 Yes  Yes  Idle       No  
  1   18 Yes  Yes  Idle       No  
  1   19 Yes  Yes  Idle       No  
  1   20 Yes  Yes  Idle       No  
  1   21 Yes  Yes  Idle       No  
  1   22 Yes  No   Proceeding Yes  DAHDI/i1/6207-2d0
  1   23 Yes  Yes  Idle       No  
  1   24 Yes  No   Proceeding Yes  DAHDI/i1/6469-2d3
  1   25 Yes  No   Proceeding Yes  DAHDI/i1/6469-2d2
  1   26 Yes  No   Proceeding Yes  DAHDI/i1/6469-2d1
  1   27 Yes  Yes  Idle       No  
  1   28 Yes  Yes  Idle       No  
  1   29 Yes  Yes  Idle       No  
  1   30 Yes  Yes  Idle       No  
  1   31 Yes  No   Proceeding Yes  DAHDI/i1/6469-2d8
{noformat}

{noformat}
asterisk*CLI> core show channels
Channel              Location             State   Application(Data)            
SIP/6054-00000103    (None)               Up      AppDial((Outgoing Line))      
DAHDI/i1/6469-2d1    116011@default:5     Ring    Dial(OOH323/6011@neftisa,,g)  
DAHDI/i1/6469-2d2    116011@default:5     Ring    Dial(OOH323/6011@neftisa,,g)  
DAHDI/i1/6469-2d3    116031@default:5     Ring    Dial(OOH323/6031@neftisa,,g)  
DAHDI/i1/6469-2d8    116030@default:5     Ring    Dial(OOH323/6030@neftisa,,g)  
DAHDI/i1/6469-2d9    116024@default:5     Ring    Dial(OOH323/6024@neftisa,,g)  
OOH323/neftisa-178   6066@default:1       Down    (None)                        
OOH323/neftisa-179   6011@default:1       Down    (None)                        
OOH323/neftisa-181   6031@default:1       Down    (None)                        
OOH323/neftisa-180   6011@default:1       Down    (None)                        
OOH323/neftisa-183   6024@default:1       Down    (None)                        
OOH323/neftisa-182   6030@default:1       Down    (None)                        
OOH323/dm-176        6401@default:1       Down    (None)                        
DAHDI/i1/6401-2ce    6016@default:2       Ring    Dial(OOH323/6401@dm)          
DAHDI/i1/6207-2d0    116066@default:5     Ring    Dial(OOH323/6066@neftisa,,g)  
DAHDI/i1/6287-2db    6054@default:1       Up      Dial(SIP/6054)                
16 active channels
8 active calls
{noformat}

So real problem is deeper...


By: Dmitry Melekhov (slesru) 2013-09-06 02:26:36.044-0500

btw, asked our pri switch admin- he said that there are no  busy channels on their side...


By: Dmitry Melekhov (slesru) 2013-09-06 02:37:25.040-0500

and channel can't be hanged up:

asterisk*CLI> channel request hangup  OOH323/dm-4
Requested Hangup on channel 'OOH323/dm-4'
   -- Remote UNIX connection
   -- Remote UNIX connection
   -- Remote UNIX connection disconnected
   -- Remote UNIX connection disconnected
   -- Remote UNIX connection
   -- Remote UNIX connection disconnected
asterisk*CLI> core show channels
Channel              Location             State   Application(Data)            
OOH323/dm-4          6401@default:1       Down    (None)                        
DAHDI/i1/6401-b      6016@default:2       Ring    Dial(OOH323/6401@dm)          
2 active channels
1 active call


By: Alexander Anikin (may213) 2013-09-06 11:26:03.391-0500

Dmitry, it is possible  that neftisa was unreachable same time as dm?
As i said previously there must not be global lockup in h323 stack but only one call thread on connect syscall.

The channel that is busy on connect syscall can't hangup really, i will think about connect timeout handling.


By: Dmitry Melekhov (slesru) 2013-09-07 08:44:09.261-0500

Hello!

Honestly , I don't know. It can be.
I didn't check neftisa connectivity :-( , but, during this problem  I contacted pri switch admin and he said that there is no calls to asterisk, although there were several calls as you can see before.
Problem was solved by restarting asterisk... But this can be coincidence too- connectivity can be restored- my monitoring checks coonectivity once per 5 minutes, so ...
Another possibility is kernel bug which leads to such problems with sockets or this is bug outside of chan_ooh323.

Next several days I'll try to reproduce this problem every morning- while there are no users yet ;-)

Thank you!

By: Dmitry Melekhov (slesru) 2013-09-08 22:47:03.059-0500

Hi!

Just reproduced problem after several attempts.
ooh323 stuck completely- it even don't write to log :-(
This is what I see at last call I tried:

asterisk*CLI> ooh323 set debug
OOH323 Debugging Enabled
   -- Accepting call from '6401' to '116973' on channel 0/26, span 1
   -- Executing [116973@default:1] GotoIf("DAHDI/i1/6401-8", "1?cid") in new stack
   -- Goto (default,116973,4)
   -- Executing [116973@default:4] Set("DAHDI/i1/6401-8", "CALLERID(num)=766401") in new stack
   -- Executing [116973@default:5] Dial("DAHDI/i1/6401-8", "OOH323/6973@neftisa,,g") in new stack
---   ooh323_request - data 6973@neftisa format (alaw)
---   ooh323_alloc
+++   ooh323_alloc
---   find_peer "neftisa"
comparing with "192.168.22.29"
comparing with "192.168.6.4"
found matching peer
+++   find_peer "neftisa"
---   ooh323_new - neftisa
+++   h323_new


I'll attach full log, which contains some calls which passed successfully, hope that it is possible to find cause.

I called 6016 as before , dm peer address is 192.168.22.29.

By: Dmitry Melekhov (slesru) 2013-09-08 22:48:16.316-0500

all today's log, hope it contains stuck call...


By: Dmitry Melekhov (slesru) 2013-09-08 22:52:04.206-0500

btw, as I see there is no call to 192.168.22.29 in log :-(
last thing is gatekeeper communications.
but this is call chan_ooh323 stuck on...
may be problem somewhere there- I use gatekeeper, but I do call peer without gatekeeper?

thank you!

By: Dmitry Melekhov (slesru) 2013-09-09 03:31:11.806-0500

by the way- I guess logs I provided are useless.
Could you tell me how can I provide useful info? if I'll be able to reproduce problem again...
thank you!

By: Dmitry Melekhov (slesru) 2013-09-09 22:53:21.790-0500

Hello!

More info, not sure it is useful.
Reproduced again ( from about 50 try ;-) ):





asterisk*CLI> ooh323 set debug
OOH323 Debugging Enabled
   -- Accepting call from '6401' to '6016' on channel 0/2, span 1
   -- Executing [6016@default:1] Set("DAHDI/i1/6401-13", "FAXOPT(t38gateway)=yes") in new stack
   -- Executing [6016@default:2] Dial("DAHDI/i1/6401-13", "OOH323/6789@dm") in new stack
---   ooh323_request - data 6789@dm format (alaw)
---   ooh323_alloc
+++   ooh323_alloc
---   find_peer "dm"
comparing with "192.168.22.22"
found matching peer
+++   find_peer "dm"
---   ooh323_new - dm
+++   h323_new


this call stuck.
call neftisa then

asterisk*CLI>
   -- Accepting call from '6401' to '116973' on channel 0/4, span 1
   -- Executing [116973@default:1] GotoIf("DAHDI/i1/6401-14", "1?cid") in new stack
   -- Goto (default,116973,4)
   -- Executing [116973@default:4] Set("DAHDI/i1/6401-14", "CALLERID(num)=766401") in new stack
   -- Executing [116973@default:5] Dial("DAHDI/i1/6401-14", "OOH323/6973@neftisa,,g") in new stack
---   ooh323_request - data 6973@neftisa format (alaw)
---   ooh323_alloc
+++   ooh323_alloc
---   find_peer "neftisa"
comparing with "192.168.22.22"
comparing with "192.168.6.4"
found matching peer
+++   find_peer "neftisa"
---   ooh323_new - neftisa
+++   h323_new

stuck too.

there is no socket, so you are right- problem is not in socket

[root@asterisk ~]# netstat -np | grep 192.168.22.22
tcp        0      0 192.168.22.19:22            192.168.22.229:46290        ESTABLISHED 9153/sshd          
tcp        0      0 192.168.22.19:22            192.168.22.229:46320        ESTABLISHED 10722/sshd

thank you!

By: Dmitry Melekhov (slesru) 2013-09-11 23:05:43.076-0500

Hello!

just got the same stuck with peer , which is available, but behind nat ( tested nat), but I don't think this is nat related.
I guess this is somehow related to gatekeeper ( my asterisk is registered in gatekeeper ) and direct calls to peers, something like call during gatekeeper reregistration ( I have TimeToLive=600 in gnugk config ). Just a guess, no real info, sorry :-(

Thank you!

By: Alexander Anikin (may213) 2013-09-12 12:38:07.350-0500

Dmitry,

Looks like you're right about relation to GK reregister. There is global lockup in some situation and it's specific for asterisk 11 (bug is in codes that support direct rtp that implemented from 11 version).

There is infinitely wait of condition variable which mean that rtp structure for call is created, but if Gk client isn't in correct state
then stack call creation routine just not called and condition variable doesn't receive signal never.
And there is global stack command lock that wraps waiting of call create and we have global lockup of ooh323 processing as result.

The attached patch change from ast_cond_wait to ast_cond_timedwait for call creation with 3 seconds timeout. This is not a better way but must work.
I will think how to perform here right.
Please test with attached patch.



By: Dmitry Melekhov (slesru) 2013-09-12 22:56:05.325-0500

Hello!

Just applied patch and tried to reproduce problem by dialing again and again in 20 minutes- can't reproduce problem, so patch is in right place.
And yes, I run one server with asterisk 10 as h323-pri gateway and there are no channels stuck there, but I just thought this is because of better data channel to this location. ;-)
Thank you very much.
I guess that if bug is in direct rtp- best way is to fix it there....
btw, anyway, I'd like to have an ability to set tcp timeout for signalling :-)
Thank you!



By: Alexander Anikin (may213) 2013-09-13 03:12:14.059-0500

Hello, Dmitry

The issue isn't on direct rtp directly ;) but on preparing data structure for early remote rtp bridging in chan_ooh323. Current way with timed waiting is workaround but it's could be this is only way to solve issue. Will think about it.

And could you please open new issue for tcp timeout due to it's subject of separate issue.

By: Dmitry Melekhov (slesru) 2013-09-21 09:28:58.277-0500

Hello!

Just had electricity failure :-)
What is more interesting- it is impossible to call peer directly if there is no connection to gatekeeper.
Can be easily reproduced by
iptables -A INPUT -s 192.168.22.254 -j DROP
where 22.254 is address of gatekeeper.
I think this is wrong behaviour :-)

Thank you!

By: Dmitry Melekhov (slesru) 2013-12-17 23:13:39.150-0600

Hello!

Alexander, is it possible to include current (may be not optimal, etc) patch in current tree, so it will be in next version and I'll have to not apply patch during upgrade :-) ?

Thank you!

By: Alexander Anikin (may213) 2013-12-18 12:20:08.581-0600

Hello Dmitry,

Please test again with stacklookup-2 patch instead of first stacklookup patch. It do more correct and stack command lock removed from codes as it not need (this must be done two year ago ;))
If it will ok we can commit patches and close this issue.


By: Alexander Anikin (may213) 2013-12-18 12:28:05.584-0600

And about wrong behaviour with direct peers call with gatekeeper routed call model.
We can discuss and provide some patches in the separate issue if this task must be done.

By: Dmitry Melekhov (slesru) 2013-12-19 00:42:43.084-0600

applied

ASTERISK-21690.patch
and
ASTERISK-21960-stacklockup-2.patch

to 11.7.0
looks good, at least I can't reproduce stuck channles :-)
real usage will show, I think this issue should be closed, if I'll have another (related) issue, I'll open new one :-) like about direct calls (will create right now).

Thank you!




By: Alexander Anikin (may213) 2013-12-19 02:00:56.849-0600

Dmitry, thank you for heavy testing.
I will commit changes and close this issue.