Summary: | ASTERISK-21960: ooh323 channels stuck | ||
Reporter: | Dmitry Melekhov (slesru) | Labels: | |
Date Opened: | 2013-06-26 05:31:05 | Date Closed: | 2013-12-19 02:25:03.000-0600 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Addons/chan_ooh323 |
Versions: | 11.4.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Centos 6/x86-64 | Attachments: | ( 0) ASTERISK-21690.patch ( 1) ASTERISK-21960-stacklockup.patch ( 2) ASTERISK-21960-stacklockup-2.patch ( 3) h323_log |
Description: | Hello!
I just ( about 10 minutes ago ) got link problem between asterisk and gnugk - link is down , switched to backup channel, and I got stuck channels: ast-nsk*CLI> core show channels Channel Location State Application(Data) DAHDI/i1/7171-237c 2077@dahdi:5 Ring Dial(OOH323/2077) DAHDI/i1/7171-237b 2170@dahdi:5 Ring Dial(OOH323/2170) DAHDI/i1/7171-237e 2077@dahdi:5 Ring Dial(OOH323/2077) SIP/ast-ngdu1-000000 2311@ast-ngdu1:1 Ringing AppDial((Outgoing Line)) OOH323/2077-8965 (None) Down (None) OOH323/2077-8963 (None) Down (None) OOH323/2077-8974 (None) Down (None) OOH323/2077-8970 (None) Down (None) OOH323/2077-8971 (None) Down (None) OOH323/4222-8954 (None) Down (None) DAHDI/i1/7005-2379 6842@dahdi:5 Ring Dial(OOH323/6842) OOH323/2041-8973 (None) Down (None) OOH323/6842-8960 (None) Down (None) OOH323/6278-8956 (None) Down (None) OOH323/6278-8953 (None) Down (None) OOH323/6278-8952 (None) Down (None) OOH323/2073-8961 (None) Down (None) OOH323/2073-8966 (None) Down (None) OOH323/2073-8964 (None) Down (None) DAHDI/i1/7103-2382 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7103-2381 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7047-2372 6278@dahdi:5 Ring Dial(OOH323/6278) DAHDI/i1/7047-2371 6278@dahdi:5 Ring Dial(OOH323/6278) DAHDI/i1/7047-2375 6278@dahdi:5 Ring Dial(OOH323/6278) OOH323/2170-8972 (None) Down (None) OOH323/2170-8967 (None) Down (None) OOH323/2170-8962 (None) Down (None) DAHDI/i1/7130-2373 4222@dahdi:5 Ring Dial(OOH323/4222) OOH323/6441-8968 (None) Down (None) OOH323/6441-8969 (None) Down (None) DAHDI/i1/7025-237f 2073@dahdi:5 Ring Dial(OOH323/2073) DAHDI/i1/7025-237d 2073@dahdi:5 Ring Dial(OOH323/2073) OOH323/6441-8955 (None) Down (None) DAHDI/i1/7025-237a 2073@dahdi:5 Ring Dial(OOH323/2073) OOH323/6441-8957 (None) Down (None) OOH323/6441-8959 (None) Down (None) OOH323/6441-8958 (None) Down (None) DAHDI/i1/-2315 2311@dahdi:2 Ring Dial(SIP/2311@ast-ngdu1) DAHDI/i1/7103-2378 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7103-2374 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7103-2376 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7103-2377 6441@dahdi:5 Ring Dial(OOH323/6441) DAHDI/i1/7171-2386 2041@dahdi:5 Ring Dial(OOH323/2041) DAHDI/i1/7171-2387 2077@dahdi:5 Ring Dial(OOH323/2077) DAHDI/i1/7171-2384 2077@dahdi:5 Ring Dial(OOH323/2077) DAHDI/i1/7171-2385 2170@dahdi:5 Ring Dial(OOH323/2170) DAHDI/i1/7171-2383 2077@dahdi:5 Ring Dial(OOH323/2077) DAHDI/i1/7171-2380 2170@dahdi:5 Ring Dial(OOH323/2170) 48 active channels 24 active calls 9068 calls processed ast-nsk*CLI> exit Sorry, I restarted asterisk, just because users complained. I don't know how to reproduce this, may be I'll have an idea ;-) And I understand that there is no enough info :-( , but may be problem can be found just from looking into code. patch from ASTERISK-21800 is applied. Thank you! | ||
Comments: | By: Alexander Anikin (may213) 2013-06-26 06:05:01.581-0500 Hi Dmitry, Please specify when stuck channels was created. They was created when link between asterisk and gk was lost or after link was restored? I'll analyze trouble but it will be helpful if you could make same test with asterisk compiled with DEBUG_THREADS options and attach 'core show locks' output. By: Dmitry Melekhov (slesru) 2013-06-26 06:22:05.004-0500 Hello! To be honest- I don't know for sure when these calls were placed- after channel problem or before, just because there is storm on another end of channel ( i.e. where asterisk is ) and I see that channel changed up/down state several times before finally down and ospf switched it to backup channel, so calls can be in any state during this channels problem, real problem, imho, is that calls were not destroyed after they failed, i.e. ooh323 channels in down state... I guess this is main problem- down, but still exists from asterisk point of view... btw, I tried ooh323 reload and it doesn't help. Anyway, problem with placing calls when asterisk is not registered in gatekeeper is already fixed. If I'll find a way to reproduce- I'll provide more info, but I don't sure it is possible :-( Anyway, right now I can't run asterisk with debug threads just because server hardware performance is not enough, but I'll have new server in month or so. Thank you! By: Alexander Anikin (may213) 2013-06-26 06:48:19.447-0500 Dmitry, thank for explanation, i will research trouble. I guess incorrect gk timeout processing there. By: Dmitry Melekhov (slesru) 2013-06-28 05:11:14.477-0500 btw, I forgot to add that I have rtptimeout=0 just because this timeout doesn't work correctly sometimes. I guess that calls can be disconnected by this timeout in this case, but , anyway it doesn't work and sometimes just disconnects real calls :-) By: Alexander Anikin (may213) 2013-08-12 11:42:04.957-0500 Hello Dmitry, I was busy on another project but i am here now ;) Main issue here is shutdown gatekeeper client and ooh323 stack completely if gk client error occurs. I attach patch that change this behavior, there is new gk client state 'shutdown' and gk client reinit automatically by monitor thread and can reinit also by 'ooh323 reload' cli command. So please try with patch. By: Alexander Anikin (may213) 2013-08-12 11:43:20.482-0500 Patch to auto restart gk client attached By: Dmitry Melekhov (slesru) 2013-08-12 22:27:27.664-0500 Hello! Thank you! I'll apply patch and test, but test can take long time just because I don't know how to reproduce, so I need to wait for natural ;-) cause. Thank you! By: Alexander Anikin (may213) 2013-08-12 22:35:23.001-0500 Dmitry, i did test with using iptables on GK. Just insert/remove DROP or REJECT rule for test system source ip in INPUT chain on GK ;) By: Dmitry Melekhov (slesru) 2013-08-12 22:42:22.235-0500 Thank you for hint. I'll test tomorrow then - according to internal policy I need to warn users in advance if service may be unavailable :-) Thank you! By: Dmitry Melekhov (slesru) 2013-08-14 01:48:53.256-0500 Hello! After some timeout I get Restart stopped gatekeeper client and hangup, so I hope that this problem is fixed. Thank you! By: Dmitry Melekhov (slesru) 2013-09-06 02:19:10.758-0500 Hello! Wrote the same to you directly. I tried to reproduce this by another way. I create nonexistant peer: {noformat} [dm] type=friend ;type=peer context=default ip=192.168.22.22 port=1720 ;e164=101 disallow=all allow=alaw allow=ulaw ;allow=g729 fastStart=no h245tunneling=yes ;canreinvite=no directmedia=yes directrtpsetup=yes dtmfmode=inband {noformat} Then called it: {noformat} exten => 6016,1,SET(FAXOPT(t38gateway)=yes) exten => 6016,n,Dial(OOH323/6401@dm) {noformat} Hear nothing for several minutes. Then hang up. See: {noformat} asterisk*CLI> core show channels Channel Location State Application(Data) OOH323/dm-176 6401@default:1 Down (None) DAHDI/i1/6401-2ce 6016@default:2 Ring Dial(OOH323/6401@dm) 2 active channels 1 active call {noformat} but, what is more interesting , all other calls to ooh323 stuck too: {noformat} asterisk*CLI> pri show channels PRI B Chan Call PRI Channel Span Chan Chan Idle Level Call Name 1 1 Yes Yes Idle No 1 2 Yes Yes Idle No 1 3 Yes No Proceeding Yes DAHDI/i1/6401-2ce 1 4 Yes Yes Idle No 1 5 Yes Yes Idle No 1 6 Yes Yes Idle No 1 7 Yes Yes Idle No 1 8 Yes No Proceeding Yes DAHDI/i1/6469-2d9 1 9 Yes Yes Idle No 1 10 Yes Yes Idle No 1 11 Yes Yes Idle No 1 12 Yes Yes Idle No 1 13 Yes Yes Idle No 1 14 Yes Yes Idle No 1 15 Yes Yes Idle No 1 17 Yes Yes Idle No 1 18 Yes Yes Idle No 1 19 Yes Yes Idle No 1 20 Yes Yes Idle No 1 21 Yes Yes Idle No 1 22 Yes No Proceeding Yes DAHDI/i1/6207-2d0 1 23 Yes Yes Idle No 1 24 Yes No Proceeding Yes DAHDI/i1/6469-2d3 1 25 Yes No Proceeding Yes DAHDI/i1/6469-2d2 1 26 Yes No Proceeding Yes DAHDI/i1/6469-2d1 1 27 Yes Yes Idle No 1 28 Yes Yes Idle No 1 29 Yes Yes Idle No 1 30 Yes Yes Idle No 1 31 Yes No Proceeding Yes DAHDI/i1/6469-2d8 {noformat} {noformat} asterisk*CLI> core show channels Channel Location State Application(Data) SIP/6054-00000103 (None) Up AppDial((Outgoing Line)) DAHDI/i1/6469-2d1 116011@default:5 Ring Dial(OOH323/6011@neftisa,,g) DAHDI/i1/6469-2d2 116011@default:5 Ring Dial(OOH323/6011@neftisa,,g) DAHDI/i1/6469-2d3 116031@default:5 Ring Dial(OOH323/6031@neftisa,,g) DAHDI/i1/6469-2d8 116030@default:5 Ring Dial(OOH323/6030@neftisa,,g) DAHDI/i1/6469-2d9 116024@default:5 Ring Dial(OOH323/6024@neftisa,,g) OOH323/neftisa-178 6066@default:1 Down (None) OOH323/neftisa-179 6011@default:1 Down (None) OOH323/neftisa-181 6031@default:1 Down (None) OOH323/neftisa-180 6011@default:1 Down (None) OOH323/neftisa-183 6024@default:1 Down (None) OOH323/neftisa-182 6030@default:1 Down (None) OOH323/dm-176 6401@default:1 Down (None) DAHDI/i1/6401-2ce 6016@default:2 Ring Dial(OOH323/6401@dm) DAHDI/i1/6207-2d0 116066@default:5 Ring Dial(OOH323/6066@neftisa,,g) DAHDI/i1/6287-2db 6054@default:1 Up Dial(SIP/6054) 16 active channels 8 active calls {noformat} So real problem is deeper... By: Dmitry Melekhov (slesru) 2013-09-06 02:26:36.044-0500 btw, asked our pri switch admin- he said that there are no busy channels on their side... By: Dmitry Melekhov (slesru) 2013-09-06 02:37:25.040-0500 and channel can't be hanged up: asterisk*CLI> channel request hangup OOH323/dm-4 Requested Hangup on channel 'OOH323/dm-4' -- Remote UNIX connection -- Remote UNIX connection -- Remote UNIX connection disconnected -- Remote UNIX connection disconnected -- Remote UNIX connection -- Remote UNIX connection disconnected asterisk*CLI> core show channels Channel Location State Application(Data) OOH323/dm-4 6401@default:1 Down (None) DAHDI/i1/6401-b 6016@default:2 Ring Dial(OOH323/6401@dm) 2 active channels 1 active call By: Alexander Anikin (may213) 2013-09-06 11:26:03.391-0500 Dmitry, it is possible that neftisa was unreachable same time as dm? As i said previously there must not be global lockup in h323 stack but only one call thread on connect syscall. The channel that is busy on connect syscall can't hangup really, i will think about connect timeout handling. By: Dmitry Melekhov (slesru) 2013-09-07 08:44:09.261-0500 Hello! Honestly , I don't know. It can be. I didn't check neftisa connectivity :-( , but, during this problem I contacted pri switch admin and he said that there is no calls to asterisk, although there were several calls as you can see before. Problem was solved by restarting asterisk... But this can be coincidence too- connectivity can be restored- my monitoring checks coonectivity once per 5 minutes, so ... Another possibility is kernel bug which leads to such problems with sockets or this is bug outside of chan_ooh323. Next several days I'll try to reproduce this problem every morning- while there are no users yet ;-) Thank you! By: Dmitry Melekhov (slesru) 2013-09-08 22:47:03.059-0500 Hi! Just reproduced problem after several attempts. ooh323 stuck completely- it even don't write to log :-( This is what I see at last call I tried: asterisk*CLI> ooh323 set debug OOH323 Debugging Enabled -- Accepting call from '6401' to '116973' on channel 0/26, span 1 -- Executing [116973@default:1] GotoIf("DAHDI/i1/6401-8", "1?cid") in new stack -- Goto (default,116973,4) -- Executing [116973@default:4] Set("DAHDI/i1/6401-8", "CALLERID(num)=766401") in new stack -- Executing [116973@default:5] Dial("DAHDI/i1/6401-8", "OOH323/6973@neftisa,,g") in new stack --- ooh323_request - data 6973@neftisa format (alaw) --- ooh323_alloc +++ ooh323_alloc --- find_peer "neftisa" comparing with "192.168.22.29" comparing with "192.168.6.4" found matching peer +++ find_peer "neftisa" --- ooh323_new - neftisa +++ h323_new I'll attach full log, which contains some calls which passed successfully, hope that it is possible to find cause. I called 6016 as before , dm peer address is 192.168.22.29. By: Dmitry Melekhov (slesru) 2013-09-08 22:48:16.316-0500 all today's log, hope it contains stuck call... By: Dmitry Melekhov (slesru) 2013-09-08 22:52:04.206-0500 btw, as I see there is no call to 192.168.22.29 in log :-( last thing is gatekeeper communications. but this is call chan_ooh323 stuck on... may be problem somewhere there- I use gatekeeper, but I do call peer without gatekeeper? thank you! By: Dmitry Melekhov (slesru) 2013-09-09 03:31:11.806-0500 by the way- I guess logs I provided are useless. Could you tell me how can I provide useful info? if I'll be able to reproduce problem again... thank you! By: Dmitry Melekhov (slesru) 2013-09-09 22:53:21.790-0500 Hello! More info, not sure it is useful. Reproduced again ( from about 50 try ;-) ): asterisk*CLI> ooh323 set debug OOH323 Debugging Enabled -- Accepting call from '6401' to '6016' on channel 0/2, span 1 -- Executing [6016@default:1] Set("DAHDI/i1/6401-13", "FAXOPT(t38gateway)=yes") in new stack -- Executing [6016@default:2] Dial("DAHDI/i1/6401-13", "OOH323/6789@dm") in new stack --- ooh323_request - data 6789@dm format (alaw) --- ooh323_alloc +++ ooh323_alloc --- find_peer "dm" comparing with "192.168.22.22" found matching peer +++ find_peer "dm" --- ooh323_new - dm +++ h323_new this call stuck. call neftisa then asterisk*CLI> -- Accepting call from '6401' to '116973' on channel 0/4, span 1 -- Executing [116973@default:1] GotoIf("DAHDI/i1/6401-14", "1?cid") in new stack -- Goto (default,116973,4) -- Executing [116973@default:4] Set("DAHDI/i1/6401-14", "CALLERID(num)=766401") in new stack -- Executing [116973@default:5] Dial("DAHDI/i1/6401-14", "OOH323/6973@neftisa,,g") in new stack --- ooh323_request - data 6973@neftisa format (alaw) --- ooh323_alloc +++ ooh323_alloc --- find_peer "neftisa" comparing with "192.168.22.22" comparing with "192.168.6.4" found matching peer +++ find_peer "neftisa" --- ooh323_new - neftisa +++ h323_new stuck too. there is no socket, so you are right- problem is not in socket [root@asterisk ~]# netstat -np | grep 192.168.22.22 tcp 0 0 192.168.22.19:22 192.168.22.229:46290 ESTABLISHED 9153/sshd tcp 0 0 192.168.22.19:22 192.168.22.229:46320 ESTABLISHED 10722/sshd thank you! By: Dmitry Melekhov (slesru) 2013-09-11 23:05:43.076-0500 Hello! just got the same stuck with peer , which is available, but behind nat ( tested nat), but I don't think this is nat related. I guess this is somehow related to gatekeeper ( my asterisk is registered in gatekeeper ) and direct calls to peers, something like call during gatekeeper reregistration ( I have TimeToLive=600 in gnugk config ). Just a guess, no real info, sorry :-( Thank you! By: Alexander Anikin (may213) 2013-09-12 12:38:07.350-0500 Dmitry, Looks like you're right about relation to GK reregister. There is global lockup in some situation and it's specific for asterisk 11 (bug is in codes that support direct rtp that implemented from 11 version). There is infinitely wait of condition variable which mean that rtp structure for call is created, but if Gk client isn't in correct state then stack call creation routine just not called and condition variable doesn't receive signal never. And there is global stack command lock that wraps waiting of call create and we have global lockup of ooh323 processing as result. The attached patch change from ast_cond_wait to ast_cond_timedwait for call creation with 3 seconds timeout. This is not a better way but must work. I will think how to perform here right. Please test with attached patch. By: Dmitry Melekhov (slesru) 2013-09-12 22:56:05.325-0500 Hello! Just applied patch and tried to reproduce problem by dialing again and again in 20 minutes- can't reproduce problem, so patch is in right place. And yes, I run one server with asterisk 10 as h323-pri gateway and there are no channels stuck there, but I just thought this is because of better data channel to this location. ;-) Thank you very much. I guess that if bug is in direct rtp- best way is to fix it there.... btw, anyway, I'd like to have an ability to set tcp timeout for signalling :-) Thank you! By: Alexander Anikin (may213) 2013-09-13 03:12:14.059-0500 Hello, Dmitry The issue isn't on direct rtp directly ;) but on preparing data structure for early remote rtp bridging in chan_ooh323. Current way with timed waiting is workaround but it's could be this is only way to solve issue. Will think about it. And could you please open new issue for tcp timeout due to it's subject of separate issue. By: Dmitry Melekhov (slesru) 2013-09-21 09:28:58.277-0500 Hello! Just had electricity failure :-) What is more interesting- it is impossible to call peer directly if there is no connection to gatekeeper. Can be easily reproduced by iptables -A INPUT -s 192.168.22.254 -j DROP where 22.254 is address of gatekeeper. I think this is wrong behaviour :-) Thank you! By: Dmitry Melekhov (slesru) 2013-12-17 23:13:39.150-0600 Hello! Alexander, is it possible to include current (may be not optimal, etc) patch in current tree, so it will be in next version and I'll have to not apply patch during upgrade :-) ? Thank you! By: Alexander Anikin (may213) 2013-12-18 12:20:08.581-0600 Hello Dmitry, Please test again with stacklookup-2 patch instead of first stacklookup patch. It do more correct and stack command lock removed from codes as it not need (this must be done two year ago ;)) If it will ok we can commit patches and close this issue. By: Alexander Anikin (may213) 2013-12-18 12:28:05.584-0600 And about wrong behaviour with direct peers call with gatekeeper routed call model. We can discuss and provide some patches in the separate issue if this task must be done. By: Dmitry Melekhov (slesru) 2013-12-19 00:42:43.084-0600 applied ASTERISK-21690.patch and ASTERISK-21960-stacklockup-2.patch to 11.7.0 looks good, at least I can't reproduce stuck channles :-) real usage will show, I think this issue should be closed, if I'll have another (related) issue, I'll open new one :-) like about direct calls (will create right now). Thank you! By: Alexander Anikin (may213) 2013-12-19 02:00:56.849-0600 Dmitry, thank you for heavy testing. I will commit changes and close this issue. |