[Home]

Summary:ASTERISK-26716: ari: Channels with pre-dial handlers cannot be hung up via ARI
Reporter:Tom Pawelek (tompaw)Labels:
Date Opened:2017-01-11 17:11:24.000-0600Date Closed:2017-02-02 17:30:55.000-0600
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Applications/app_stasis
Versions:13.13.1 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:- platform independent -Attachments:( 0) ari_h_error
( 1) ari_h_error.txt
( 2) tmp_core.tar
Description:A local channel with a pre-dial handler will fail to be properly hung up in 90% of the cases - Asterisk does not send BYE/CANCEL on .hangup()

; Original channel
{noformat}
ch = client.channels.originate(endpoint="Local/12345@cont-test", app="ari-test")
{noformat}
; Dialplan
{noformat}
[test-handler]
exten => testcall,1,NoOp(foo)
exten => testcall,n,Return()

[cont-test]
exten => _XXX.,1,Answer()
exten => _XXX.,n,Dial(PJSIP/bar@pstn,,b(test-handler^testcall^1(foo)))
{noformat}
; Attempt to hang up
{noformat}
client.channels.hangup(channelId=ch_id)
{noformat}
Only b( ) handlers cause the issue to appear, B( ) works perfectly fine.
Comments:By: Asterisk Team (asteriskteam) 2017-01-11 17:11:25.395-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2017-01-12 09:07:55.666-0600

We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Rusty Newton (rnewton) 2017-01-12 09:09:32.743-0600

Strange! Can you collect the debug log mentioned in the previous comment when you reproduce the issue?

Be sure to include the SIP trace and that warning,error,notice,verbose,debug channels are all enabled, and both verbose and debug are turned up to 5 or above.

By: Tom Pawelek (tompaw) 2017-01-12 16:55:10.784-0600

I will be on it ASAP, due to high volume nature of our systems it is a little tricky, but I will try and pull one tonight.

By: Rusty Newton (rnewton) 2017-01-12 18:09:56.961-0600

Thanks [~tompaw] !

By: Tom Pawelek (tompaw) 2017-01-14 08:23:21.611-0600

Update: while trying to isolate the issue I stumbled upon a plausible cause. It seems that the use of b( ) causes rogue basic (non-stasis) bridges to be created, which in turn are not being handled by our ARI stack. I am in the process of gathering more data and will update this bug as soon as I have something solid.

By: Rusty Newton (rnewton) 2017-01-14 15:35:05.120-0600

Weird, thanks for following up.

By: Tom Pawelek (tompaw) 2017-01-14 18:19:12.224-0600

I managed to capture a debug log showing the behavior. In this case, I originate a "pstn" Local channel, which executes a Dial(PJSIP/foo,,b(fix-cli)). The fix-cli subroutine is required, because this channel is inside a stasis bridge which overwrites is native CLI.

Upon sending a .hangup() instruction, only ;1 leg of the Local channel hangs up. The PJSIP (;2) never sends a BYE/CANCEL and as you can see continues with the call a few seconds later. Logs below:

;; ARI Call:

client.channels.hangup(channelId=ch_id)


;; Debug:

[2017-01-14 09:32:46] DEBUG[2887] channel.c: Soft-Hanging (0x20) up channel 'Local/13051002000-18001002000@cont-fas-00000054;1'
[2017-01-14 09:32:46] DEBUG[2887] res_ari.c: Examining ARI response:
204 No Content


[2017-01-14 09:32:46] DEBUG[2887] http.c: HTTP keeping session open.  status_code:204
[2017-01-14 09:32:46] DEBUG[2893] bridge_channel.c: Setting 0x7f4970085560(Local/13051002000-18001002000@cont-fas-00000054;1) state from:0 to:1
[2017-01-14 09:32:46] DEBUG[2893] bridge_channel.c: Bridge cid-836297: pulling 0x7f4970085560(Local/13051002000-18001002000@cont-fas-00000054;1)
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket string of length 691
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket text frame, length 691
[2017-01-14 09:32:46] VERBOSE[2893] bridge_channel.c: Channel Local/13051002000-18001002000@cont-fas-00000054;1 left 'simple_bridge' stasis-bridge <cid-836297>
[2017-01-14 09:32:46] DEBUG[2893] bridge_channel.c: Bridge cid-836297: 0x7f4970085560(Local/13051002000-18001002000@cont-fas-00000054;1) is leaving simple_bridge technology
[2017-01-14 09:32:46] DEBUG[2893] bridge_native_rtp.c: Bridge 'cid-836297' can not use native RTP bridge as two channels are required
[2017-01-14 09:32:46] DEBUG[2893] bridge.c: Bridge technology native_rtp is not compatible with properties of existing bridge.
[2017-01-14 09:32:46] DEBUG[2893] bridge.c: Bridge technology holding_bridge does not have any capabilities we want.
[2017-01-14 09:32:46] DEBUG[2893] bridge.c: Bridge technology softmix does not have any capabilities we want.
[2017-01-14 09:32:46] DEBUG[2893] bridge.c: Chose bridge technology simple_bridge
[2017-01-14 09:32:46] DEBUG[2893] bridge.c: Bridge cid-836297 is already using the new technology.
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket string of length 697
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket text frame, length 697
[2017-01-14 09:32:46] DEBUG[2893] stasis/control.c: 1484404357.679, cid-836297: Channel leaving bridge
[2017-01-14 09:32:46] DEBUG[2893] stasis/app.c: bridge 'cid-836297': is 2 interested in mg-bridges
[2017-01-14 09:32:46] DEBUG[2893] stasis/control.c:   reason: Channel was departed from bridge
[2017-01-14 09:32:46] DEBUG[2891] stasis/control.c: 1484404357.679: Channel departing bridge
[2017-01-14 09:32:46] DEBUG[2891] bridge.c: Waiting for 0x7f4970085560(Local/13051002000-18001002000@cont-fas-00000054;1) bridge thread to die.
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket string of length 702
[2017-01-14 09:32:46] DEBUG[1317] res_http_websocket.c: Writing websocket text frame, length 702
[2017-01-14 09:32:46] DEBUG[2891] stasis/app.c: channel '1484404357.679': is 1 interested in mg-bridges
[2017-01-14 09:32:46] DEBUG[2891] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000054;1'


;; Few seconds later:

[2017-01-14 09:33:07] DEBUG[1256] res_pjsip_session.c: Received response
[2017-01-14 09:33:07] DEBUG[1256] res_pjsip_session.c: Response is 183 Session Progress
[2017-01-14 09:33:07] DEBUG[1256] res_pjsip_session.c: Received response
[2017-01-14 09:33:07] DEBUG[1256] res_pjsip_session.c: Response is 183 Session Progress
[2017-01-14 09:33:07] VERBOSE[2892][C-00000054] app_dial.c: PJSIP/flowroute-00000081 is making progress passing it to Local/13051002000-18001002000@cont-fas-00000054;2

;; ^^ ZOMBIE

*NOTE* It is the mere presence of b( ) that triggers the behavior. I've tested it with a simple subroutine of NoOp + Return and had the same issue with hangup.

By: Tom Pawelek (tompaw) 2017-01-15 12:24:51.190-0600

*UPDATE:* As I mentioned before, the issues only happens in ~75% of the cases. I've managed to capture a debug where 2 call attempts are placed without any changes in the dialplan: the first one is hung up properly, the second one is not.


+Changes to core_local.c+ (My futile attempts at locating the bug)

{code}
static int local_hangup(struct ast_channel *ast)
{
       ast_debug(1, "*TMP* Entering local_hangup\n");

       struct local_pvt *p = ast_channel_tech_pvt(ast);
       int res;

       if (!p) {
               ast_debug(1, "*TMP* No_p\n");
               return -1;
       }

       /* give the pvt a ref to fulfill calling requirements. */
       ao2_ref(p, +1);
       res = ast_unreal_hangup(&p->base, ast);
       if (!res) {
               ast_debug(1, "*TMP* No_res\n");
               int unlink;

               ao2_lock(p);
               unlink = !p->base.owner && !p->base.chan;
               ao2_unlock(p);
               if (unlink) {
                       ast_debug(1, "*TMP* Unlink_locals\n");
                       ao2_unlink(locals, p);
               }
       }
       ao2_ref(p, -1);

       return res;
}
{code}

+Correct hang up+

_C-00000005 correctly hangs up PJSIP/flowroute-00000006_

[2017-01-15 12:58:32] DEBUG[26445] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000005;1'
[2017-01-15 12:58:32] DEBUG[26445] core_local.c: *TMP* Entering local_hangup
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 881
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 881
[2017-01-15 12:58:32] DEBUG[26445] core_local.c: *TMP* No_res
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 648
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 648
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] channel.c: Hanging up channel 'PJSIP/flowroute-00000006'
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] chan_pjsip.c: AST hangup cause 0 (no match found in PJSIP)
[2017-01-15 12:58:32] DEBUG[26291] devicestate.c: Changing state for Local/13051002000-18001002000@cont-fas - state 1 (Not in use)
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 704
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] app_dial.c: Exiting with DIALSTATUS=CANCEL.
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 704
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] pbx.c: Spawn extension (cont-fas,13051002000-18001002000,5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-00000005;2'
[2017-01-15 12:58:32] VERBOSE[26446][C-00000005] pbx.c: Spawn extension (cont-fas, 13051002000-18001002000, 5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-00000005;2'
[2017-01-15 12:58:32] DEBUG[26307] res_pjsip_session.c: Method is CANCEL
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] channel.c: Soft-Hanging (0x10) up channel 'Local/13051002000-18001002000@cont-fas-00000005;2'
[2017-01-15 12:58:32] DEBUG[26307] res_config_pgsql.c: PostgreSQL query attempt 1
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000005;2'
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 674
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] core_local.c: *TMP* Entering local_hangup
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 674
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] core_local.c: *TMP* No_res
[2017-01-15 12:58:32] DEBUG[26446][C-00000005] core_local.c: *TMP* Unlink_locals
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 704
[2017-01-15 12:58:32] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 704
[2017-01-15 12:58:32] DEBUG[26291] core_local.c: Checking if extension 13051002000-18001002000@cont-fas exists (devicestate)
[2017-01-15 12:58:32] DEBUG[26338] stasis/app.c: channel '1484503110.30': is 0 interested in mg-bridges
[2017-01-15 12:58:32] DEBUG[26338] stasis/app.c: channel '1484503110.30' unsubscribed from mg-bridges
[2017-01-15 12:58:32] DEBUG[26291] devicestate.c: Changing state for Local/13051002000-18001002000@cont-fas - state 1 (Not in use)


+Zombie hang up+

_No attempts to hang up PJSIP (i.e. Local/;2)_

[2017-01-15 12:58:44] DEBUG[26455] bridge.c: Waiting for 0x7f680403d900(Local/13051002000-18001002000@cont-fas-00000006;1) bridge thread to die.
[2017-01-15 12:58:44] DEBUG[26455] stasis/app.c: channel '1484503123.40': is 1 interested in mg-bridges
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 749
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 749
[2017-01-15 12:58:44] DEBUG[26455] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000006;1'
[2017-01-15 12:58:44] DEBUG[26455] core_local.c: *TMP* Entering local_hangup
[2017-01-15 12:58:44] DEBUG[26455] core_local.c: *TMP* No_res
[2017-01-15 12:58:44] DEBUG[26456][C-00000006] res_rtp_asterisk.c: No remote address on RTP instance '0x7f6810028060' so dropping frame
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 881
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 881
[2017-01-15 12:58:44] DEBUG[26291] devicestate.c: Changing state for Local/13051002000-18001002000@cont-fas - state 1 (Not in use)
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 648
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 648
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 704
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 704
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 676
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 676
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 704
[2017-01-15 12:58:44] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 704
[2017-01-15 12:58:44] DEBUG[26338] stasis/app.c: channel '1484503123.40': is 0 interested in mg-bridges
[2017-01-15 12:58:44] DEBUG[26338] stasis/app.c: channel '1484503123.40' unsubscribed from mg-bridges
[2017-01-15 12:58:44] DEBUG[26447] http.c: HTTP Request URI is /ari/bridges/cid-819901


+Zombie follow up+

[2017-01-15 12:58:46] VERBOSE[26456][C-00000006] app_dial.c: PJSIP/flowroute-00000008 is making progress passing it to Local/13051002000-18001002000@cont-fas-00000006;2

;; 10 minutes later I finally hang up from the PSTN

[2017-01-15 13:10:26] DEBUG[26320] channel.c: Soft-Hanging (0x20) up channel 'Local/13051002000-18001002000@cont-fas-00000006;2'
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge_channel.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: pulling 0x7f6804005b90(Local/13051002000-18001002000@cont-fas-0
0000006;2)
[2017-01-15 13:10:26] VERBOSE[26456][C-00000006] bridge_channel.c: Channel Local/13051002000-18001002000@cont-fas-00000006;2 left 'simple_bridge' basic-bridge <d03ac899-2abb
-4bca-8274-f52accf08925>
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge_channel.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: 0x7f6804005b90(Local/13051002000-18001002000@cont-fas-00000006;
2) is leaving simple_bridge technology
[2017-01-15 13:10:26] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 692
[2017-01-15 13:10:26] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 692
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925 is dissolved, not performing smart bridge operation.
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: actually destroying basic bridge, nobody wants it anymore
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: calling basic bridge destructor
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: calling simple_bridge technology stop
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] bridge.c: Bridge d03ac899-2abb-4bca-8274-f52accf08925: calling simple_bridge technology destructor
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] app_dial.c: Exiting with DIALSTATUS=ANSWER.
[2017-01-15 13:10:26] DEBUG[26338] res_http_websocket.c: Writing websocket string of length 702
[2017-01-15 13:10:26] DEBUG[26338] res_http_websocket.c: Writing websocket text frame, length 702
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] pbx.c: Spawn extension (cont-fas,13051002000-18001002000,5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-0000000
6;2'
[2017-01-15 13:10:26] VERBOSE[26456][C-00000006] pbx.c: Spawn extension (cont-fas, 13051002000-18001002000, 5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-000
00006;2'
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] channel.c: Soft-Hanging (0x10) up channel 'Local/13051002000-18001002000@cont-fas-00000006;2'
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000006;2'
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] core_local.c: *TMP* Entering local_hangup
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] core_local.c: *TMP* No_res
[2017-01-15 13:10:26] DEBUG[26456][C-00000006] core_local.c: *TMP* Unlink_locals

Zombie bridge (simple / audio): *d03ac899-2abb-4bca-8274-f52accf08925*
Zombie leg: *Local/13051002000-18001002000@cont-fas-00000006;2*

By: Tom Pawelek (tompaw) 2017-01-15 17:52:19.678-0600

*UPDATE*: I have started digging in core_unreal and core_local. Here are my findings:

+Proper hangup+

[2017-01-15 18:35:16] DEBUG[963] core_local.c: *TMP* Entering local_hangup
[2017-01-15 18:35:16] DEBUG[963] core_unreal.c: *TMP* ast_unreal_lock_all(Local/13051002000-18001002000@cont-fas-00000004;1) -> owner: Local/13051002000-18001002000@cont-fas-00000004;1, chan: Local/13051002000-18001002000@cont-fas-00000004;2
[2017-01-15 18:35:16] DEBUG[963] core_unreal.c: *TMP* Inside unreal_hangup(Local/13051002000-18001002000@cont-fas-00000004;1) P_CHAN_IF_2
[2017-01-15 18:35:16] DEBUG[963] channel.c: *TMP* Received ast_queue_hangup_with_cause for Local/13051002000-18001002000@cont-fas-00000004;2
[2017-01-15 18:35:16] DEBUG[963] channel.c: *TMP* ast_queue_hangup_with_cause returning 0 for Local/13051002000-18001002000@cont-fas-00000004;2
[2017-01-15 18:35:16] DEBUG[963] core_local.c: *TMP* No_res
[2017-01-15 18:35:16] DEBUG[909] res_http_websocket.c: Writing websocket string of length 648
[2017-01-15 18:35:16] DEBUG[909] res_http_websocket.c: Writing websocket text frame, length 648
[2017-01-15 18:35:16] DEBUG[964][C-00000004] channel.c: Hanging up channel 'PJSIP/flowroute-00000003'
[2017-01-15 18:35:16] DEBUG[964][C-00000004] chan_pjsip.c: AST hangup cause 0 (no match found in PJSIP)
[2017-01-15 18:35:16] DEBUG[964][C-00000004] app_dial.c: Exiting with DIALSTATUS=CANCEL.
[2017-01-15 18:35:16] DEBUG[909] res_http_websocket.c: Writing websocket string of length 704
[2017-01-15 18:35:16] DEBUG[909] res_http_websocket.c: Writing websocket text frame, length 704
[2017-01-15 18:35:16] DEBUG[964][C-00000004] pbx.c: Spawn extension (cont-fas,13051002000-18001002000,5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-00000004;2'
[2017-01-15 18:35:16] DEBUG[859] devicestate.c: Changing state for Local/13051002000-18001002000@cont-fas - state 1 (Not in use)
[2017-01-15 18:35:16] VERBOSE[964][C-00000004] pbx.c: Spawn extension (cont-fas, 13051002000-18001002000, 5) exited non-zero on 'Local/13051002000-18001002000@cont-fas-00000004;2'
[2017-01-15 18:35:16] DEBUG[964][C-00000004] channel.c: Soft-Hanging (0x10) up channel 'Local/13051002000-18001002000@cont-fas-00000004;2'
[2017-01-15 18:35:16] DEBUG[875] res_pjsip_session.c: Method is CANCEL
[2017-01-15 18:35:16] DEBUG[964][C-00000004] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000004;2'
[2017-01-15 18:35:16] DEBUG[875] res_config_pgsql.c: PostgreSQL query attempt 1
[2017-01-15 18:35:16] DEBUG[964][C-00000004] core_local.c: *TMP* Entering local_hangup
[2017-01-15 18:35:16] DEBUG[964][C-00000004] core_unreal.c: *TMP* ast_unreal_lock_all(Local/13051002000-18001002000@cont-fas-00000004;2) -> owner: NULL, chan: Local/13051002000-18001002000@cont-fas-00000004;2
[2017-01-15 18:35:16] DEBUG[964][C-00000004] core_local.c: *TMP* No_res
[2017-01-15 18:35:16] DEBUG[964][C-00000004] core_local.c: *TMP* Unlink_locals

+Zombie hangup+

[2017-01-15 18:35:39] DEBUG[978] channel.c: Hanging up channel 'Local/13051002000-18001002000@cont-fas-00000005;1'
[2017-01-15 18:35:39] DEBUG[978] core_local.c: *TMP* Entering local_hangup
[2017-01-15 18:35:39] DEBUG[978] core_unreal.c: *TMP* ast_unreal_lock_all(Local/13051002000-18001002000@cont-fas-00000005;1) -> owner: Local/13051002000-18001002000@cont-fas-00000005;1, chan: Local/13051002000-18001002000@cont-fas-00000005;2
[2017-01-15 18:35:39] DEBUG[978] core_unreal.c: *TMP* Inside unreal_hangup(Local/13051002000-18001002000@cont-fas-00000005;1) P_CHAN_IF_2
[2017-01-15 18:35:39] DEBUG[909] res_http_websocket.c: Writing websocket string of length 749
[2017-01-15 18:35:39] DEBUG[978] channel.c: *TMP* Received ast_queue_hangup_with_cause for Local/13051002000-18001002000@cont-fas-00000005;2
[2017-01-15 18:35:39] DEBUG[909] res_http_websocket.c: Writing websocket text frame, length 749
[2017-01-15 18:35:39] DEBUG[978] channel.c: *TMP* ast_queue_hangup_with_cause returning 0 for Local/13051002000-18001002000@cont-fas-00000005;2
[2017-01-15 18:35:39] DEBUG[978] core_local.c: *TMP* No_res
*[2017-01-15 18:35:39] DEBUG[979][C-00000005] res_rtp_asterisk.c: No remote address on RTP instance '0x7f862c03c310' so dropping frame*
[...]
[2017-01-15 18:35:40] VERBOSE[979][C-00000005] app_dial.c: PJSIP/flowroute-00000005 is making progress passing it to Local/13051002000-18001002000@cont-fas-00000005;2

+Observations+

* First of all, the flow through hangup functions is identical. P_CHAN_IF_2 means both cases call the ast_queue_hangup_with_cause(p->chan, cause) function inside ast_unreal_hangup.
* In both cases, ast_queue_hangup_with_cause returns 0 for the ;2 channels, which means the issue arises +after+ the ;2 hangup has been sheduled.
* Every time the problem manifests itself, there is this pesky "res_rtp_asterisk.c: No remote address on RTP instance" error, which does not appear in successful hangups. Maybe it's related? Could an RTP issue prevent a channel from hanging up properly?
* In both calls, the ;2 channels go through the same "Don't queue anything" case inside __ast_queue_frame.

By: Tom Pawelek (tompaw) 2017-01-15 22:34:44.877-0600

*UPDATE* I managed to confirm both ;2 channels end up inside the following condition (__ast_read):

{code}
if (ast_channel_softhangup_internal_flag(chan)) {
       ast_debug(1, "*TMP* ast_read_soft(%s)\n", ast_channel_name(chan));
       ast_queue_control(chan, AST_CONTROL_END_OF_Q);
}
{code}

I will continue down the rabbit hole tomorrow.

By: Tom Pawelek (tompaw) 2017-01-16 14:13:52.971-0600

*UPDATE* It seems that the faulty ;2 channel never reaches either of the cases in https://github.com/asterisk/asterisk/blob/13.13/main/channel.c#L3940

The proper ;2 hangup hits both AST_CONTROL_HANGUP and AST_CONTROL_END_OF_Q

By: Tom Pawelek (tompaw) 2017-01-16 15:09:01.775-0600

Rogue leg: Local/13051002000-18001002000@cont-fas-00000003;2

By: Tom Pawelek (tompaw) 2017-01-17 10:53:19.088-0600

I have managed to go around the original CLI issue (which required b() in the first place) by using __CONNECTED_LINE_SEND_SUB.

Even though this too is a subroutine it causes no issues as of now (4hrs live @ 15 production outbound servers).

By: Richard Mudgett (rmudgett) 2017-01-19 16:43:43.720-0600

[~tompaw] The https://gerrit.asterisk.org/#/c/4530/ patch will for the most part fix the problem documented by this issue.  We had earlier discussed that it wasn't possible for that patch to fix the problem.  However, I think that the missing log message forming that conclusion was because the debug message simply was not in the code at the time [^ari_h_error.txt] was captured.  Several of the debug log messages are missing line terminators in the log while the altered code in [^tmp_core.tar] has line terminators on all added debug messages.

I also managed to simulate the [^ari_h_error.txt] problem yesterday.  I initially set up the scenario but was not able to reproduce the problem even though it should have happened.  The key was finding out that you have to be using the timerfd timing module in that scenario for the problem to happen.  The timerfd module is the only timer that uses the alert-pipe with the channel read queue.  When the alert-pipe has fewer entries in it than the read queue you can get "stuck" local channels because the alert-pipe is the only trigger to read frames out of the local channel.  I normally use the DAHDI timer module because I have cards in my test box.

By: Tom Pawelek (tompaw) 2017-01-21 08:02:19.924-0600

@rmudgett Thanks for the update, I will test the patch Sunday or Monday, as today I am traveling. Did you manage to reproduce the error after switching to software timers?

I am pretty sure I only added the line terminators after capturing the log for the sake of clarity - the [^ari_h_error.txt] was captured while all the debug messages were already present in the asterisk binary. I know that the "h" series logs were the very last I captured.

The referenced issue you pointed out mentioned frame-related problems caused by confbridge_recording. For what it's worth - all my bridges have a Snoop channel participant, which records the conversation. I don't know if confbridge_recording has anything to do with stasis Snooping, though.

I wonder how interception routines are different from pre-dial handlers. Both show up as "Internal Subroutine" in debug log, which suggests they use the same internals, however only the latter cause this issue to arise.

By: Friendly Automation (friendly-automation) 2017-02-02 17:30:56.108-0600

Change 4881 merged by zuul:
Frame deferral: Revert API refactoring.

[https://gerrit.asterisk.org/4881|https://gerrit.asterisk.org/4881]

By: Friendly Automation (friendly-automation) 2017-02-02 17:56:54.111-0600

Change 4856 merged by zuul:
Frame deferral: Revert API refactoring.

[https://gerrit.asterisk.org/4856|https://gerrit.asterisk.org/4856]

By: Friendly Automation (friendly-automation) 2017-02-02 18:09:38.225-0600

Change 4878 merged by Joshua Colp:
Frame deferral: Revert API refactoring.

[https://gerrit.asterisk.org/4878|https://gerrit.asterisk.org/4878]

By: Friendly Automation (friendly-automation) 2017-02-02 18:09:56.159-0600

Change 4875 merged by Joshua Colp:
Frame deferral: Revert API refactoring.

[https://gerrit.asterisk.org/4875|https://gerrit.asterisk.org/4875]

By: Friendly Automation (friendly-automation) 2017-02-02 22:12:05.059-0600

Change 4857 merged by zuul:
res_agi: Prevent an AGI from eating frames it should not. (Re-do)

[https://gerrit.asterisk.org/4857|https://gerrit.asterisk.org/4857]

By: Friendly Automation (friendly-automation) 2017-02-02 23:22:36.498-0600

Change 4882 merged by zuul:
res_agi: Prevent an AGI from eating frames it should not. (Re-do)

[https://gerrit.asterisk.org/4882|https://gerrit.asterisk.org/4882]

By: Friendly Automation (friendly-automation) 2017-02-02 23:28:30.788-0600

Change 4858 merged by George Joseph:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4858|https://gerrit.asterisk.org/4858]

By: Friendly Automation (friendly-automation) 2017-02-02 23:55:05.628-0600

Change 4883 merged by zuul:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4883|https://gerrit.asterisk.org/4883]

By: Friendly Automation (friendly-automation) 2017-02-02 23:55:08.018-0600

Change 4876 merged by zuul:
res_agi: Prevent an AGI from eating frames it should not. (Re-do)

[https://gerrit.asterisk.org/4876|https://gerrit.asterisk.org/4876]

By: Friendly Automation (friendly-automation) 2017-02-02 23:55:13.442-0600

Change 4877 merged by zuul:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4877|https://gerrit.asterisk.org/4877]

By: Friendly Automation (friendly-automation) 2017-02-03 05:33:10.457-0600

Change 4879 merged by Joshua Colp:
res_agi: Prevent an AGI from eating frames it should not. (Re-do)

[https://gerrit.asterisk.org/4879|https://gerrit.asterisk.org/4879]

By: Friendly Automation (friendly-automation) 2017-02-03 05:33:22.311-0600

Change 4880 merged by Joshua Colp:
channel.c: Fix unbalanced read queue deadlocking local channels.

[https://gerrit.asterisk.org/4880|https://gerrit.asterisk.org/4880]