Summary: | ASTERISK-28960: bridge: System gets into state where bridge is terminated after joining | ||
Reporter: | Robert Sutton (rsutton@noojee.com.au) | Labels: | fax webrtc |
Date Opened: | 2020-06-24 00:36:55 | Date Closed: | 2020-07-13 12:00:02 |
Priority: | Minor | Regression? | |
Status: | Closed/Complete | Components: | Bridges/bridge_simple |
Versions: | 16.10.0 | Frequency of Occurrence | Occasional |
Related Issues: | |||
Environment: | ubunut 18.04, docker asterisk 16.10 | Attachments: | |
Description: | system gets into state where bridges collapse within a few seconds
calls kept flowing into the system, but bridges would collapse without error within a few seconds of starting. Seems to be happening about once a week. This particular system has only ever run 16.10 I have a core dump from the running system and full logs including a period with debug set to 4 below is a typical call flow while the system was in this state. {noformat} Jun 24 10:53:29 -- Executing [555555555@routesv2-inbound:1] Set("PJSIP/trunk-00000fa4", "AGIEXITONHANGUP=yes") in new stack Jun 24 10:53:29 -- Executing [555555555@routesv2-inbound:2] Set("PJSIP/trunk-00000fa4", "recovery-did=555555555") in new stack Jun 24 10:53:29 -- Executing [555555555@routesv2-inbound:3] AGI("PJSIP/trunk-00000fa4", "agi://127.0.0.1/route?targ=555555555&ctx=Pre-Inbound") in new stack Jun 24 10:53:34 -- AGI Script Executing Application: (mixmonitor) Options: (/var/spool/asterisk/monitor/1592960009.5408-4444444444-3333333333-I-1.wav,,curl 'http://127.0.0.1:8080/rest/recordingMigration/migrate?guid=1592960014936-35584&apiKey=xxxxxxxxxxxx') Jun 24 10:53:34 == Begin MixMonitor Recording PJSIP/trunk-00000fa4 Jun 24 10:53:34 -- Started music on hold, class 'queue17', on channel 'PJSIP/trunk-00000fa4' Jun 24 10:53:35 -- PJSIP/417-00000faa is ringing Jun 24 10:53:35 -- PJSIP/417-00000faa is ringing Jun 24 10:53:40 -- PJSIP/417-00000faa answered Jun 24 10:53:40 -- Executing [activity-agi@activity:1] Set("PJSIP/417-00000faa", "AGIEXITONHANGUP=yes") in new stack Jun 24 10:53:40 -- Executing [activity-agi@activity:2] AGI("PJSIP/417-00000faa", "agi://127.0.0.1/activityAgi") in new stack Jun 24 10:53:40 == Begin MixMonitor Recording PJSIP/trunk-00000fa4 Jun 24 10:53:40 -- AGI Script Executing Application: (bridge) Options: (PJSIP/trunk-00000fa4,x) Jun 24 10:53:40 -- Stopped music on hold on PJSIP/trunk-00000fa4 Jun 24 10:53:40 == Spawn extension (routesv2-inbound, 555555555, 3) exited non-zero on 'Surrogate/PJSIP/trunk-00000fa4' Jun 24 10:53:40 -- Channel PJSIP/trunk-00000fa4 joined 'simple_bridge' basic-bridge <67256011-e289-430d-bc1c-08b96c97811e> Jun 24 10:53:40 -- Channel PJSIP/417-00000faa joined 'simple_bridge' basic-bridge <67256011-e289-430d-bc1c-08b96c97811e> Jun 24 10:53:40 == Begin MixMonitor Recording PJSIP/trunk-00000fa4 Jun 24 10:53:40 OK == End MixMonitor Recording PJSIP/trunk-00000fa4 Jun 24 10:53:40 OK == End MixMonitor Recording PJSIP/trunk-00000fa4 Jun 24 10:53:41 -- Channel PJSIP/trunk-00000fa4 left 'simple_bridge' basic-bridge <67256011-e289-430d-bc1c-08b96c97811e> Jun 24 10:53:41 -- Channel PJSIP/417-00000faa left 'simple_bridge' basic-bridge <67256011-e289-430d-bc1c-08b96c97811e> Jun 24 10:53:41 -- Manager 'njcontact' from 127.0.0.1, hanging up channel: PJSIP/417-00000faa Jun 24 10:53:41 == Spawn extension (activity, activity-agi, 2) exited non-zero on 'PJSIP/417-00000faa' Jun 24 10:53:42 OK == End MixMonitor Recording PJSIP/trunk-00000fa4 {noformat} for future reference: {noformat} the issue started somewhere between 10:38 and 10:46 debug set to level 4 at 10:56:08 core dump triggered at 10:55:32 system restart at 10:57:30 approx {noformat} | ||
Comments: | By: Asterisk Team (asteriskteam) 2020-06-24 00:36:56.603-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur. By: Kevin Harwell (kharwell) 2020-06-24 16:23:48.770-0500 Would it be possible to also enable SIP tracing in the Asterisk log: {noformat} $CLI> pjsip set logger on {noformat} Or get a pcap during an occurrence? Also could you try bumping the debug up to level 5 as well? Thanks! By: Robert Sutton (rsutton@noojee.com.au) 2020-06-24 19:49:00.787-0500 I have a pcap from yesterdays event. here is data extracted from it around a single call. The call recording from the point the bridge was initiated only contains the agent's voice {noformat} cat call-agent.txt |Time | agentip | | | | 192.168.160.59 | |7043.977842| 100 Trying| |SIP Status 100 Trying | |(64091) ------------------> (5060) | |7044.164403| 180 Ringing |SIP Status 180 Ringing | |(64091) ------------------> (5060) | |7045.293780| 200 OK SDP (g711U te |SIP Status 200 OK | |(64091) ------------------> (5060) | |7045.294888| ACK | |SIP ACK From: "E CPAP DIRECT GOLD COAST RECEPTION" <sip:3333333333@192.168.160.59> To:<sip:112@agentip> CSeq:25407 | |(64091) <------------------ (5060) | |7045.294888| ACK | |SIP ACK From: "E CPAP DIRECT GOLD COAST RECEPTION" <sip:3333333333@192.168.160.59> To:<sip:112@agentip> CSeq:25407 | |(64091) <------------------ (5060) | |7056.095022| BYE | |SIP Request BYE CSeq:25408 | |(64091) ------------------> (5060) | |7056.095022| BYE | |SIP Request BYE CSeq:25408 | |(64091) ------------------> (5060) | |7056.095412| 200 OK | |SIP Status 200 OK | |(64091) <------------------ (5060) | |7056.095412| 200 OK | |SIP Status 200 OK | |(64091) <------------------ (5060) | cat call-caller.txt |Time | 103.47.234.6 | | | | 192.168.160.59 | |6943.106587| INVITE SDP (g711A g7 |SIP INVITE From: <sip:22222222222@providerip> To:<sip:11111111111@externalip:5060;line=fsadhxs> Call-ID:CLA_XRQCIWh2blh6VFVeWQtQW3dcRgN2fGouAy9UBgdYA0oONRdaBGh3d2pFf09dR1kFUEV3 CSeq:21926098 | |(5060) ------------------> (5060) | |6943.106587| INVITE SDP (g711A g7 |SIP INVITE From: <sip:22222222222@providerip> To:<sip:11111111111@externalip:5060;line=fsadhxs> Call-ID:CLA_XRQCIWh2blh6VFVeWQtQW3dcRgN2fGouAy9UBgdYA0oONRdaBGh3d2pFf09dR1kFUEV3 CSeq:21926098 | |(5060) ------------------> (5060) | |6943.108645| 100 Trying| |SIP Status 100 Trying | |(5060) <------------------ (5060) | |6943.108645| 100 Trying| |SIP Status 100 Trying | |(5060) <------------------ (5060) | |6943.113836| 183 Session Progress |SIP Status 183 Session Progress | |(5060) <------------------ (5060) | |6943.113836| 183 Session Progress |SIP Status 183 Session Progress | |(5060) <------------------ (5060) | |6943.113964| 200 OK SDP (g711U g7 |SIP Status 200 OK | |(5060) <------------------ (5060) | |6943.113964| 200 OK SDP (g711U g7 |SIP Status 200 OK | |(5060) <------------------ (5060) | |6943.140271| ACK | |SIP Request INVITE ACK 200 CSeq:21926098 | |(5060) ------------------> (5060) | |6943.140271| ACK | |SIP Request INVITE ACK 200 CSeq:21926098 | |(5060) ------------------> (5060) | |7056.096775| BYE | |SIP Request BYE CSeq:26142 | |(5060) <------------------ (5060) | |7056.096775| BYE | |SIP Request BYE CSeq:26142 | |(5060) <------------------ (5060) | |7056.116854| 200 OK | |SIP Status 200 OK | |(5060) ------------------> (5060) | |7056.116854| 200 OK | |SIP Status 200 OK | |(5060) ------------------> (5060) | egrep -i '3333333333|1592959781|00000EF7|6f508be4eab4|00000f25' asterisk-logs.txt Jun 24 10:49:41 -- AGI Script Executing Application: (mixmonitor) Options: (/var/spool/asterisk/monitor/1592959775.5179-anonymous-03//127.0.0.1:8080/rest/recordingMigration/migrate?guid=1592959781143-20668&apiKey=xxxxxxxxxxxxxxx') Jun 24 10:49:41 -- Executing [11111111111@routesv2-inbound:1] Set("PJSIP/trunk-00000ef7", "AGIEXITONHANGUP=yes") in new stack Jun 24 10:49:41 -- Executing [11111111111@routesv2-inbound:2] Set("PJSIP/trunk-00000ef7", "recovery-did=11111111111") in new stack Jun 24 10:49:41 -- Executing [11111111111@routesv2-inbound:3] AGI("PJSIP/trunk-00000ef7", "agi://127.0.0.1/route?targ=11111111111&ctx= Jun 24 10:49:47 -- AGI Script Executing Application: (mixmonitor) Options: (/var/spool/asterisk/monitor/1592959781.5182-3333333333-0://127.0.0.1:8080/rest/recordingMigration/migrate?guid=1592959787095-23261&apiKey=xxxxxxxxxxxxxxx') Jun 24 10:49:47 == Begin MixMonitor Recording PJSIP/trunk-00000ef7 Jun 24 10:49:47 -- Started music on hold, class 'queue', on channel 'PJSIP/trunk-00000ef7' Jun 24 10:49:58 == Executing [curl 'http://127.0.0.1:8080/rest/recordingMigration/migrate?guid=1592959781143-20668&apiKey=8d26f182-978 Jun 24 10:51:22 -- PJSIP/112-00000f25 is ringing Jun 24 10:51:22 -- PJSIP/112-00000f25 is ringing Jun 24 10:51:23 -- PJSIP/112-00000f25 answered Jun 24 10:51:23 -- Executing [activity-agi@noojee-activity:1] Set("PJSIP/112-00000f25", "AGIEXITONHANGUP=yes") in new stack Jun 24 10:51:23 -- Executing [activity-agi@noojee-activity:2] AGI("PJSIP/112-00000f25", "agi://127.0.0.1/activityAgi") in new stack Jun 24 10:51:23 == Begin MixMonitor Recording PJSIP/trunk-00000ef7 Jun 24 10:51:23 -- AGI Script Executing Application: (bridge) Options: (PJSIP/trunk-00000ef7,x) Jun 24 10:51:23 -- Stopped music on hold on PJSIP/trunk-00000ef7 Jun 24 10:51:23 -- Channel PJSIP/trunk-00000ef7 joined 'simple_bridge' basic-bridge <a96a4bc4-e774-414a-8688-6f508be4eab4> Jun 24 10:51:23 == Spawn extension (routesv2-inbound, 11111111111, 3) exited non-zero on 'Surrogate/PJSIP/trunk-00000ef7' Jun 24 10:51:23 -- Channel PJSIP/112-00000f25 joined 'simple_bridge' basic-bridge <a96a4bc4-e774-414a-8688-6f508be4eab4> Jun 24 10:51:23 == Begin MixMonitor Recording PJSIP/trunk-00000ef7 Jun 24 10:51:24 OK == End MixMonitor Recording PJSIP/trunk-00000ef7 Jun 24 10:51:24 OK == End MixMonitor Recording PJSIP/trunk-00000ef7 Jun 24 10:51:34 -- Channel PJSIP/112-00000f25 left 'simple_bridge' basic-bridge <a96a4bc4-e774-414a-8688-6f508be4eab4> Jun 24 10:51:34 -- Channel PJSIP/trunk-00000ef7 left 'simple_bridge' basic-bridge <a96a4bc4-e774-414a-8688-6f508be4eab4> Jun 24 10:51:34 == Spawn extension (noojee-activity, activity-agi, 2) exited non-zero on 'PJSIP/112-00000f25' Jun 24 10:51:34 [Jun 24 10:51:34] NOTICE[418]: manager.c:4491 action_hangup: Request to hangup non-existent channel: PJSIP/112-00000F25 {noformat} By: Kevin Harwell (kharwell) 2020-06-25 13:45:25.696-0500 Looks like from the log the mixmonitor stopped recording for one leg of the call just after the bridge was joined. Not sure why it stopped, but probably why you only hear the agent's voice. I'm having trouble though following the call trace, and want to make sure I'm reading it right. For the "call-agent" is that Asterisk ----> agent endpoint? And for "call-caller" is that the caller endpoint ----> Asterisk? For the caller everything is doubled but has the same timestamp, so I'm assuming this is an artifact of extracting the data? Is there any way we could get a copy of the original pcap? If it has private data in it you don't want to attach here (made public) you can send it as an attachment to asteriskteam@digium.com with ASTERISK-28960 as the subject instead. By: Kevin Harwell (kharwell) 2020-06-25 13:46:24.699-0500 Also can you attach or post the relevant pjsip endpoint configuration, and executing dialplan? By: Robert Sutton (rsutton@noojee.com.au) 2020-06-25 18:52:19.154-0500 I've provided pcap, cli output, and live core dump via "asteriskteam@digium.com with ASTERISK-28960 as the subject" You are correct about "call-agent" is Asterisk ----> agent endpoint and "call-caller" is caller endpoint ----> Asterisk I don't know why there is doubling in the log, this is how wireshark presented the data. could this be a symptom of the underlying cause? below is the dialplan involved, as you can see all the action takes place in AGI In the AGI the caller is played hold music and possibly other messages while waiting, we are not using asterisk's native queues One or more agents are originated via dialplan to AGI. When an agent answers, a bridge is initiated via AGI {noformat} [routesv2-inbound] exten => _[s0-9].,1,set(AGIEXITONHANGUP=yes) exten => _[s0-9].,n,set(recovery-did=${EXTEN}) exten => _[s0-9].,n,AGI(agi://${TOMCAT_HOST}/route?targ=${EXTEN}&ctx=Pre-Inbound) exten => _[s0-9].,n,goto(routesv2-checks,${EXTEN},1) exten => _[s0-9].,n,Hangup [activity] exten => activity-agi,1,set(AGIEXITONHANGUP=yes) exten => activity-agi,n,AGI(agi://${TOMCAT_HOST}/activityAgi) exten => activity-agi,n,wait(0.5) exten => activity-agi,n,goto(activity-agi|1) {noformat} agent endpoint configuration (realtime) {noformat} ParameterName : ParameterValue ========================================================= 100rel : yes accept_multiple_sdp_answers : false accountcode : acl : deny/permit aggregate_mwi : true allow : (ulaw|g729|ilbc) allow_overlap : true allow_subscribe : true allow_transfer : true aors : 112 asymmetric_rtp_codec : false auth : 112 bind_rtp_to_media_address : false bundle : false call_group : callerid : Email Desk callerid_privacy : allowed_not_screened callerid_tag : connected_line_method : invite contact_acl : context : routesv2-handset cos_audio : 0 cos_video : 0 device_state_busy_at : 2 direct_media : false direct_media_glare_mitigation : none direct_media_method : invite disable_direct_media_on_nat : false dtls_auto_generate_cert : No dtls_ca_file : dtls_ca_path : dtls_cert_file : dtls_cipher : dtls_fingerprint : SHA-256 dtls_private_key : dtls_rekey : 0 dtls_setup : active dtls_verify : No dtmf_mode : rfc4733 fax_detect : false fax_detect_timeout : 0 follow_early_media_fork : true force_avp : false force_rport : true from_domain : from_user : g726_non_standard : false ice_support : true identify_by : username,ip ignore_183_without_sdp : false inband_progress : false incoming_mwi_mailbox : language : mailboxes : 112 max_audio_streams : 1 max_video_streams : 1 media_address : media_encryption : no media_encryption_optimistic : false media_use_received_transport : false message_context : moh_passthrough : false moh_suggest : default2 mwi_from_user : mwi_subscribe_replaces_unsolicited : no named_call_group : named_pickup_group : notify_early_inuse_ringing : false one_touch_recording : false outbound_auth : 112 outbound_proxy : pickup_group : 1 preferred_codec_only : false record_off_feature : automixmon record_on_feature : automixmon refer_blind_progress : true rewrite_contact : true rpid_immediate : false rtcp_mux : false rtp_engine : asterisk rtp_ipv6 : false rtp_keepalive : 0 rtp_symmetric : true rtp_timeout : 0 rtp_timeout_hold : 0 sdp_owner : - sdp_session : Asterisk send_connected_line : yes send_diversion : true send_pai : false send_rpid : false set_var : srtp_tag_32 : false sub_min_expiry : 0 subscribe_context : suppress_q850_reason_headers : false t38_udptl : false t38_udptl_ec : none t38_udptl_ipv6 : false t38_udptl_maxdatagram : 0 t38_udptl_nat : false timers : yes timers_min_se : 90 timers_sess_expires : 1800 tone_zone : tos_audio : 0 tos_video : 0 transport : transport-udp-nat trust_connected_line : yes trust_id_inbound : true trust_id_outbound : true use_avpf : false use_ptime : false user_eq_phone : false voicemail_extension : *3 webrtc : no {noformat} By: Robert Sutton (rsutton@noojee.com.au) 2020-06-25 18:56:18.859-0500 I was reading the issue title and saw that it is ambiguous. The system remains stable for long periods of time, but when it gets into this state bridges fail. By: Joshua C. Colp (jcolp) 2020-06-29 03:57:50.727-0500 You're using the Bridge dialplan application which does more than just bridge, it also steals a channel away from another thread which does involve setting a temporary hangup flag. When this occurs do bridges that do not occur as a result of the Bridge dialplan application work? By: Robert Sutton (rsutton@noojee.com.au) 2020-06-29 04:27:28.526-0500 On this particular instance I didn't check other bridges. But on another occasion when we had similar symptoms, we shut down the AGI and tried to used asterisk queues, but the calls were still unable to bridge. By: Joshua C. Colp (jcolp) 2020-06-29 05:02:07.815-0500 I would suggest adding some additional logging to the dialplan[1] to state who hung up in the situation. This will narrow down the potential area. As it is I'm not seeing anything standing out, except a lot of BYE requests coming in with a non-normal reason (Recovery on timer expiry) suddenly. You can see it by filtering using "sip.reason_cause_q850 == 102". There's a few at the start, but about 7000 packets in there is a lot. If this is what your trunk is - then I'd expect that channel to be hung up and things to terminate. That may be what is happening. In fact, in some cases Asterisk has already bridged things enough that the cause code is getting passed through to the remote side (not in all cases mind you, but some). [1] https://wiki.asterisk.org/wiki/pages/viewpage.action?pageId=20185363 By: Asterisk Team (asteriskteam) 2020-07-13 12:00:01.273-0500 Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1]. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines |