[Home]

Summary:ASTERISK-29035: chan_local: Multistream support breaks T.38 faxing
Reporter:Matthias Hensler (cubbi)Labels:fax patch
Date Opened:2020-08-18 16:14:05Date Closed:2021-03-22 07:48:54
Priority:MinorRegression?Yes
Status:Closed/CompleteComponents:Channels/chan_local
Versions:16.12.0 18.1.0 Frequency of
Occurrence
Constant
Related
Issues:
is duplicated byASTERISK-29167 T.38 fax relay won't pass through local channels in recent versions
is duplicated byASTERISK-29206 T.38 does not work through Local Channels
Environment:CentOS 7Attachments:( 0) ASTERISK-29035.diff
( 1) Inbound-16.11.0-LocalChannel.pcap
( 2) Inbound-16.12.0-LocalChannel.pcap
( 3) Inbound-16.12.0-NoLocal.pcap
Description:After updating from 16.11.0 to 16.12.0 all faxes via SendFax() and ReceiveFax()-applications break, when placed on a Local-channel and T.38 is used.

Our dialplan makes use of Local-channels when dialing internal numbers and is also used for issuing outgoing fax-traffic via an originate-call using the manager-API. With the rework of chan_local in 16.12.0 (I suspect ASTERISK-28938 here) this breaks our outgoing fax-traffic when the destination uses T.38. If G.711a is used the problem does not occur.

When looking at some pcap-traces I can see, that the destination reinvites to T.38 after a call is established and sends lots of T.38 packets. However Asterisk simple ignores any T.38-packet and does not respond to any of these packets, until the call is then finally dropped and fax-transmission ultimately fails.


Fortunately the problem is very easy to reproduce with an inbound fax also. For that I used an older test-instance of Asterisk 17.1.0 which dials in to our real Asterisk instance where the call is placed via a `Dial(Local/...)` to the appropriate extension where ReceiveFax() is running.

When running Asterisk 16.11.0 everything works as expected (see Inbound-16.11.0-LocalChannel.pcap) where the connection is established with T.38 and the fax is received properly.

When switching to Asterisk 16.12.0 on the receiving end the exact same setup breaks (see Inbound-16.12.0-LocalChannel.pcap). The connection is established with T.38 and packets are send from the test-instance, but no replies are issued by Asterisk 16.12.0 (we are running ReceiveFax() in debug mode and it does not log anything with 16.12.0, so obviously chan_local does not properly relays the T.38-frames).

Just by removing the `Dial(Local/...)` from the dialplan and jumping directly to the ReceiveFax()-extension makes the setup work again on Asterisk 16.12.0 (see Inbound-16.12.0-NoLocal.pcap). So I am pretty sure that the culprit results from the recent changes to chan_local.


I know that using local-channels is not the nicest thing to do, but we rely on it to process call redirections on local numbers, while still keeping proper call data records for correct billing.

If more debug output besides the pcap-traces, or a proper dialplan example is needed, I am happy to provide it of course.
Comments:By: Asterisk Team (asteriskteam) 2020-08-18 16:14:06.355-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Matthias Hensler (cubbi) 2020-08-18 16:15:53.340-0500

Attached are pcap-traces between Asterisk 17.1.0 and Asterisk 16.11.0 (working with chan_local) / 16.12.0 (not working with chan_local, but fine when no local-channel is used)

By: Hendrik Wedhorn (hwedhorn) 2020-08-28 09:55:44.637-0500

Same setup, same issue here.

By: Matthias Hensler (cubbi) 2020-08-28 10:01:41.206-0500

FYI: we worked around the problem by adding "F" to SendFax()/ReceiveFax() and force G711a for the time being.
see https://wiki.asterisk.org/wiki/display/AST/Asterisk+16+Application_SendFAX_res_fax


By: Hendrik Wedhorn (hwedhorn) 2020-08-31 06:21:00.738-0500

In our case we need T.38. I just had success with reverting the changes from core_local.c from asterisk-16.11.1 -> asterisk-16.12.0. Maybe i find a solution without removing that entire new fearure from core_local.c - If that's the case i will provide a patch in the upcoming hours.

By: Dang Thanh Loi (loi.dangthanh@gmail.com) 2020-11-02 20:50:49.089-0600

Same setup for SendFax here, resulted in the same issue for both 16.12.0 and the latest tag 16.14.0
I tried reverting the exact commit 563f2f9 issued from ASTERISK-28938 then T.38 + Local channel get back working properly.
Any fix to the new release appreciated. Thanks

By: Steve Sether (stevesether) 2020-12-02 10:16:23.005-0600

We have this same problem as well.  Unfortunately this is preventing us from upgrading to Asterisk 16, something we've had in-flight for a while now.

I'd certainly be willing to test any patches released to fix this.

By: Steve Sether (stevesether) 2020-12-02 11:05:36.516-0600

For anyone reading this, it looks like Asterisk 16-CERT4  removed the problematic Multistream support that causes this bug.  Thanks to whomever remove this from the CERT version.  I'll be switching the CERT5.

By: Bill Kervaski (bkervaski) 2020-12-08 13:31:46.268-0600

Same with 16.15 on Centos 8.

This is a breaking change for T.38 and anybody that uses Local Channels and T.38 will be stuck on 16.11.

I've confirmed it with SpanDSP and SendFAX/ReceiveFAX as well as using T.38 analog adapters on both ends.  

Anytime there is a Local Channel in the loop T.38 fails.

By: Benjamin Keith Ford (bford) 2021-01-11 14:14:01.795-0600

I've just attached a patch for this. If someone is able to test it, please let me know what the results are. It looks like there was an issue that already existed with SendFax when trying to use a local channel, but for ReceiveFax, it appears to resolve the issue.

By: Matthias Hensler (cubbi) 2021-01-12 17:37:11.927-0600

I just added the patch to our main server running Asterisk 16.15.1 and can confirm that it seems to fix the issue. At least I cannot reproduce the problem with ReceiveFax any longer. I have to check some productive calls over the next days if it fixes all issues on incoming and outgoing faxes.

By: Matthias Hensler (cubbi) 2021-01-14 05:34:17.218-0600

Unfortunately I had to revert the patch, as our Asteriskinstance stopped working properly after less than <36 hours with this patch.

Symptoms:
* I was able to connect via the CLI and issue commands like "core show channels verbose". At the time of the problem there were 303 active channels, 154 active calls and 37134 calls processed.
* In that moment the manager-API did not work properly and yielded no results (connection was possible, but in some way no results were given). Unfortunately I had no time to investigate further
* active calls were still being processed and working
* new calls were no longer possible
* according to a tcpdump a lot of register-requests where dispatched to asterisk but not being replied to
* our verboselog did only show manager-logins and warnings about chan_sip not able to send 10 type frames (which is perfectly normal).

The last interesting messages on the verboselog were:
[2021-01-14 12:08:10] VERBOSE[26758][C-0000cedf] res_agi.c: agi://127.0.0.1/faxhandler,outhangup,"97416","SUCCESS","OK","10","9600","+49 241/....-.....  ","8031x7700","",TERM: Faxhandler: Aktion=outhangup, ID=97416
[2021-01-14 12:08:10] VERBOSE[26758][C-0000cedf] res_agi.c: <Local/0049241....@routing_billing-000008d5;1>AGI Script agi://127.0.0.1/faxhandler completed, returning 0
[2021-01-14 12:08:10] VERBOSE[26758][C-0000cedf] pbx.c: Executing [h@outbound_fax:7] NoOp("Local/0049241....@routing_billing-000008d5;1", "Handler-Result: 1") in new stack
[2021-01-14 12:08:10] VERBOSE[26758][C-0000cedf] pbx.c: Executing [h@outbound_fax:8] Hangup("Local/0049241....@routing_billing-000008d5;1", "") in new stack
[2021-01-14 12:08:10] VERBOSE[26758][C-0000cedf] pbx.c: Spawn extension (outbound_fax, h, 8) exited non-zero on 'Local/0049241....@routing_billing-000008d5;1'
[2021-01-14 12:08:11] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:11] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:12] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:12] VERBOSE[27384][C-0000cf06] res_srtp.c: SRTCP unprotect failed on SSRC 1334759288 because of authentication failure
[2021-01-14 12:08:13] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:13] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:13] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:14] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:14] VERBOSE[29430] manager.c: Manager 'relaix_intern' logged on from 127.0.0.1
[2021-01-14 12:08:14] VERBOSE[29430] manager.c: Manager 'relaix_intern' logged off from 127.0.0.1
[2021-01-14 12:08:15] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:15] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:15] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:17] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:17] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:17] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:17] VERBOSE[27384][C-0000cf06] res_srtp.c: SRTCP unprotect failed on SSRC 1334759288 because of authentication failure
[2021-01-14 12:08:18] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:19] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:19] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:19] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:20] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:21] WARNING[4818][C-0000c148] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:21] WARNING[5183][C-0000c15c] chan_sip.c: Can't send 10 type frames with SIP write
[2021-01-14 12:08:22] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:22] VERBOSE[27384][C-0000cf06] res_srtp.c: SRTCP unprotect failed on SSRC 1334759288 because of authentication failure
[2021-01-14 12:08:23] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:24] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:26] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:27] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:27] VERBOSE[27384][C-0000cf06] res_srtp.c: SRTCP unprotect failed on SSRC 1334759288 because of authentication failure
[2021-01-14 12:08:28] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1
[2021-01-14 12:08:29] WARNING[26759][C-0000cede] channel.c: Exceptionally long voice queue length queuing to Local/0049241....@routing_billing-000008d5;1


So it seems that asterisk entered a deadlock after an outgoing (SendFax)-call was finished and already hang up.

As this is our productionsystem and the issue happend at a prime time, i had to roll back to our previous version for the moment.

By: Benjamin Keith Ford (bford) 2021-01-14 08:56:56.743-0600

Thanks for the feedback! We'll look further into the SendFax issues.

By: Benjamin Keith Ford (bford) 2021-01-27 10:16:00.806-0600

There's a new patch up on Gerrit that handles the locking differently and should fix the issue. If anyone can test this before we merge it to ensure it resolves the locking issue, we'd really appreciate it!

https://gerrit.asterisk.org/c/asterisk/+/15332

By: Matthias Hensler (cubbi) 2021-02-10 15:32:25.428-0600

Sorry for the late feedback. I had to plan a bit ahead because of the trouble we had with the last patch, and also because I had an unrelated roadblock with a broken glibc (isnanl() on EL7).

I just rolled out the (new) patch on our production environment. T38-faxes on local channels now work. As for the locking problem I think we will need some days to see if we run in a problem again.

As the locking issue was triggered within the first day, I will try to give some feedback beginning next week if no issue arises till then.

By: Matthias Hensler (cubbi) 2021-02-16 04:18:26.909-0600

As promised some feedback: so far we did not encounter any issues. I do not know what triggered the first lockup, but it was encountered in less than 24 hours with the first patch. With the new patch we have an uptime of >5 days now and 90k Calls processed.

So thumbs up for the moment.

By: Benjamin Keith Ford (bford) 2021-02-16 09:32:42.170-0600

Thanks for the feedback! This patch should be going into the next release of Asterisk.

By: Friendly Automation (friendly-automation) 2021-02-16 16:36:23.953-0600

Change 15363 merged by Friendly Automation:
core_unreal: Fix T.38 faxing when using local channels.

[https://gerrit.asterisk.org/c/asterisk/+/15363|https://gerrit.asterisk.org/c/asterisk/+/15363]

By: Friendly Automation (friendly-automation) 2021-02-16 18:13:27.567-0600

Change 15332 merged by Kevin Harwell:
core_unreal: Fix T.38 faxing when using local channels.

[https://gerrit.asterisk.org/c/asterisk/+/15332|https://gerrit.asterisk.org/c/asterisk/+/15332]

By: Friendly Automation (friendly-automation) 2021-02-16 18:13:55.925-0600

Change 15362 merged by Kevin Harwell:
core_unreal: Fix T.38 faxing when using local channels.

[https://gerrit.asterisk.org/c/asterisk/+/15362|https://gerrit.asterisk.org/c/asterisk/+/15362]

By: Matthias Hensler (cubbi) 2021-03-19 04:14:06.757-0500

Unfortunately I have to reopen this issue. The current patch (used on Asterisk 16.16.2) did cause another deadlock on our asterisk. As last time I could connect to the console and check open channels, but no new call was possible and trying to gracefully shut down did not work either. At the time of the deadlock we had 144 channels active (112516 calls processed).

The last successful call was a T38-fax on a local channel. According to the logs and our external monitoring that fax was processed, however some seconds before the fax was finished new calls were no longer possible. INVITES to asterisk were not replied to. Also the problematic T38-call (while the fax was already finished) did not hang up properly in the sip-trace (the caller sent several "BYE" without any answer from asterisk).

I add some of the last logfilelines, although I doubt they a of much help.

This is the excerpt for the fax-call from our verbose log:
{quote}
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:9] NoOp("Local/004940XXX@routing_intern-00001c77;2", "Registrierung folgt") in new stack
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:10] Set("Local/004940XXX@routing_intern-00001c77;2", "CDR(EinkaufTelefonieProviderID)=0") in new stack
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:11] Set("Local/004940XXX@routing_intern-00001c77;2", "AGI_SUCCESS=0") in new stack
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:12] AGI("Local/004940XXX@routing_intern-00001c77;2", "agi://127.0.0.1/faxhandler,inreg,"004940XXX","004940XXX",TERM") in new stack
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] res_agi.c: agi://127.0.0.1/faxhandler,inreg,"004940XXX","004940XXX",TERM: Faxhandler: Aktion=inreg, Rufnummer=004940XXX, Callerid=004940XXX
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] res_agi.c: agi://127.0.0.1/faxhandler,inreg,"004940XXX","004940XXX",TERM: Faxhandler: zugewiesene ID=104168, FN=/var/spool/asterisk/fax/0000104168-inbound-4399.tif
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] res_agi.c: <Local/004940XXX@routing_intern-00001c77;2>AGI Script agi://127.0.0.1/faxhandler completed, returning 0
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:13] GotoIf("Local/004940XXX@routing_intern-00001c77;2", "1?:Overlimit") in new stack
[2021-03-19 08:47:46] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:14] Answer("Local/004940XXX@routing_intern-00001c77;2", "") in new stack
[2021-03-19 08:47:46] VERBOSE[27306][C-0002ec8b] app_dial.c: Local/004940XXX@routing_intern-00001c77;1 answered SIP/trunk_XXXXXX-00033467
[2021-03-19 08:47:46] VERBOSE[27311][C-0002ec8b] bridge_channel.c: Channel Local/004940XXX@routing_intern-00001c77;1 joined 'simple_bridge' basic-bridge <a722361d-b600-4cd9-ac30-b4b62a6c4394>
[2021-03-19 08:47:46] VERBOSE[27306][C-0002ec8b] bridge_channel.c: Channel SIP/trunk_XXXXXX-00033467 joined 'simple_bridge' basic-bridge <a722361d-b600-4cd9-ac30-b4b62a6c4394>
[2021-03-19 08:47:46] VERBOSE[4141][C-0002ec8b] netsock2.c: Using UDPTL TOS bits 184
[2021-03-19 08:47:46] VERBOSE[4141][C-0002ec8b] netsock2.c: Using UDPTL TOS bits 184 in TCLASS field.
[2021-03-19 08:47:46] VERBOSE[4141][C-0002ec8b] netsock2.c: Using UDPTL CoS mark 5
[2021-03-19 08:47:47] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:15] Wait("Local/004940XXX@routing_intern-00001c77;2", "1") in new stack
[2021-03-19 08:47:48] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:16] ReceiveFAX("Local/004940XXX@routing_intern-00001c77;2", "/var/spool/asterisk/fax/0000104168-inbound-4399.tif,fd") in new stack
[2021-03-19 08:47:48] VERBOSE[27308][C-0002ec8b] res_fax.c: Channel 'Local/004940XXX@routing_intern-00001c77;2' receiving FAX '/var/spool/asterisk/fax/0000104168-inbound-4399.tif'
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [004940XXX@inbound_fax:17] Hangup("Local/004940XXX@routing_intern-00001c77;2", "") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Spawn extension (inbound_fax, 004940XXX, 17) exited non-zero on 'Local/004940XXX@routing_intern-00001c77;2'
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:1] NoOp("Local/004940XXX@routing_intern-00001c77;2", "Hangup occured, processing fax") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:2] NoOp("Local/004940XXX@routing_intern-00001c77;2", "Status=SUCCESS, Str=OK, Error=") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:3] NoOp("Local/004940XXX@routing_intern-00001c77;2", "Station=040XXX        , Pages=3, Rate=14400, Resolution=8031x7700") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:4] Set("Local/004940XXX@routing_intern-00001c77;2", "AGI_SUCCESS=0") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:5] AGI("Local/004940XXX@routing_intern-00001c77;2", "agi://127.0.0.1/faxhandler,inhangup,"104168","SUCCESS","OK","3","14400","040XXX        ","8031x7700","",TERM") in new stack
[2021-03-19 08:48:36] VERBOSE[27308][C-0002ec8b] res_agi.c: agi://127.0.0.1/faxhandler,inhangup,"104168","SUCCESS","OK","3","14400","040XXX        ","8031x7700","",TERM: Faxhandler: Aktion=inhangup, ID=104168
[2021-03-19 08:48:37] VERBOSE[27308][C-0002ec8b] res_agi.c: <Local/004940XXX@routing_intern-00001c77;2>AGI Script agi://127.0.0.1/faxhandler completed, returning 0
[2021-03-19 08:48:37] VERBOSE[27308][C-0002ec8b] pbx.c: Executing [h@inbound_fax:6] NoOp("Local/004940XXX@routing_intern-00001c77;2", "Handler-Result: 1") in new stack
[2021-03-19 08:48:37] VERBOSE[27311][C-0002ec8b] bridge_channel.c: Channel Local/004940XXX@routing_intern-00001c77;1 left 'simple_bridge' basic-bridge <a722361d-b600-4cd9-ac30-b4b62a6c4394>
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] bridge_channel.c: Channel SIP/trunk_XXXXXX-00033467 left 'simple_bridge' basic-bridge <a722361d-b600-4cd9-ac30-b4b62a6c4394>
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Spawn extension (gosub-outbound_peer_intern, 004940XXXXXted non-zero on 'SIP/trunk_XXXXXX-00033467'
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Executing [h@gosub-outbound_peer_intern:1] Return("SIP/trunk_XXXXXX-00033467", "0") in new stack
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Executing [004940XXX@routing_full:9] NoOp("SIP/trunk_XXXXXX-00033467", "Macro outbound_peer_intern Retval=0") in new stack
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Executing [004940XXX@routing_full:10] GotoIf("SIP/trunk_XXXXXX-00033467", "1?SkipFax:") in new stack
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx_builtins.c: Goto (routing_full,004940XXX,13)
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Executing [004940XXX@routing_full:13] GotoIf("SIP/trunk_XXXXXX-00033467", "0?LokalFailed:") in new stack
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Executing [004940XXX@routing_full:14] Hangup("SIP/trunk_XXXXXX-00033467", "") in new stack
[2021-03-19 08:48:37] VERBOSE[27306][C-0002ec8b] pbx.c: Spawn extension (routing_full, 004940XXX, 14) exited non-zero on 'SIP/trunk_XXXXXX-00033467'
{quote}

And the last message from our fax-log:
{quote}
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Page no = 3
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Image size = 1728 x 2287 pixels
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Image resolution = 8031/m x 7700/m
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Compression = T.6 (3)
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Compressed image size = 11837 bytes
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from state 13 to 14
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Tx:  MCF with final frame tag
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Tx:  ff 13 8c
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   760: IFP c0 01 10
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   760: (0) data v21/hdlc-sig-end + 0 byte(s)
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Receive complete in phase T30_PHASE_D_RX, state 14
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 HDLC signal status is Carrier down (-1) in state 14
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from phase T30_PHASE_D_RX to T30_PHASE_D_TX
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set rx type 0
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set tx type 4
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 No signal is present
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   761: IFP 00
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   761: indicator no-signal
[2021-03-19 08:48:33] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    61: indicator v21-preamble
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    62: (0) data v21/hdlc-data + 1 byte(s)
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    62: IFP c0 01 80 00 00 ff
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    63: (0) data v21/hdlc-data + 1 byte(s)
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    63: IFP c0 01 80 00 00 c8
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    64: (0) data v21/hdlc-data + 1 byte(s)
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    64: IFP c0 01 80 00 00 31
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Send complete in phase T30_PHASE_D_TX, state 14
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    65: (0) data v21/hdlc-fcs-OK-sig-end + 0 byte(s)
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    65: IFP c0 01 40
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Send complete in phase T30_PHASE_D_TX, state 14
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from phase T30_PHASE_D_TX to T30_PHASE_D_RX
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set rx type 4
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set tx type 0
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Start T4
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Tx    66: indicator no-signal
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   762: IFP 06
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   762: indicator v21-preamble
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 A signal is present
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 HDLC signal status is Carrier up (-2) in state 14
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 HDLC signal status is Framing OK (-6) in state 14
[2021-03-19 08:48:34] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Start T4A
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   763: IFP c0 01 80 00 01 ff c8
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   763: (0) data v21/hdlc-data + 2 byte(s)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   764: IFP c0 02 80 00 00 df 20
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   764: (0) data v21/hdlc-data + 1 byte(s)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   764: (1) data v21/hdlc-fcs-OK + 0 byte(s)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Type DCN - CRC OK (clean)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Stop T4A (15520 remaining)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Rx:  DCN with final frame tag
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Rx:  ff 13 fb
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Rx final frame in state 14
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Disconnecting
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from phase T30_PHASE_D_RX to T30_PHASE_E
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set rx type 0
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set tx type 1
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from state 14 to 2
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   765: IFP c0 01 10
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   765: (0) data v21/hdlc-sig-end + 0 byte(s)
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Receive complete in phase T30_PHASE_E, state 2
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 HDLC signal status is Carrier down (-1) in state 2
[2021-03-19 08:48:35] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 No signal is present
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   766: IFP 00
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38 Rx   766: indicator no-signal
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Send complete in phase T30_PHASE_E, state 2
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from state 2 to 32
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.30 Changing from phase T30_PHASE_E to T30_PHASE_CALL_FINISHED
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set rx type 9
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T Set tx type 9
[2021-03-19 08:48:36] FAX[27308][C-0002ec8b] res_fax.c: FLOW T.38T FAX exchange complete
{quote}

By: Asterisk Team (asteriskteam) 2021-03-19 04:14:07.570-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Matthias Hensler (cubbi) 2021-03-19 04:23:48.427-0500

Maybe unrelated, but upon inspecting the problematic call, I see that the Caller used a private unrouteable IP in the SDP for the RTP-stream. Since the call was in T38 that caused no issue, however just after the fax-transfer finished the asterisk sent an INVITE to that private IP.

I can provide a pcap-trace for that call, but because of sensible data I would like to provide it through a different channel and not attaching it to the public jira-ticket.

By: Joshua C. Colp (jcolp) 2021-03-19 04:38:26.029-0500

For a deadlock a backtrace[1] would need to be provided to see what is specifically deadlocked (in this case the channel, chan_sip, Local channel, what). Without a backtrace it's incredibly hard to determine what happened, in particular for something that you seem to have only experienced once after using the patch for so long.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Joshua C. Colp (jcolp) 2021-03-19 04:39:44.409-0500

{panel:title=Private Submission of Information Disclaimer}You have indicated that you wish to submit unredacted information privately. It is not recommended to do this as it will substantially restrict the number of individuals who can help with your issue, as submitted information is only available to Sangoma. Note that submission of such information does not change the priorization of this issue. If you still wish to proceed you may do so by sending it to asteriskteam@digium.com with the issue number in the subject. For large files please send a link where they can then be downloaded. By sending this information you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/. Any exchange of private information between you and an Asterisk community member outside of the Asterisk JIRA is not subject to the Website Terms of Use and should be privately discussed between yourself and the Asterisk community member.{panel}



By: Matthias Hensler (cubbi) 2021-03-19 08:42:54.691-0500

Hi Joshua,

I am not sure if we just hit a corner case or just one of the few users with the patch already in production. Unfortunately I cannot run this version on our production system for further debugging, as triggering the issue seems to be difficult and might occur with bad timing. So for the moment I have to rely on an unpatched version with T38 disabled.

I am almost positive that the patch causes the deadlock, as we never had any similar issues without the patch. I will try to do some testing on my development box, maybe I am able to track it down further there.

By: Joshua C. Colp (jcolp) 2021-03-19 08:45:58.965-0500

How long were you running with the patch and how much faxing was done?

We could try to reproduce the issue too, but if it is timing related or relies on something else then the chances decrease further, and without a backtrace or being able to reproduce it to get a backtrace it further becomes difficult to isolate and resolve.

By: Joshua C. Colp (jcolp) 2021-03-19 08:48:58.878-0500

Actually, I have an idea.

By: Joshua C. Colp (jcolp) 2021-03-19 08:58:48.126-0500

A patch is now up.

By: Matthias Hensler (cubbi) 2021-03-19 09:19:16.038-0500

To answer the first question: the instance was running since saturday, so for just shy of 6 days. However we already had 16.16.1 with that patch running from feb 18th (and 16.16.0 from Feb 10th) without issues. Since the diff between 16.16.1 and 16.16.2 is very short (and although related to T38 it is linked to res_pjsip and not chan_sip) I would suspect that the issue is just difficult to reproduce (a lot of days in between).

Unfortunately I cannot check for statistics we had with 16.16.0 and 16.16.1, but for the now deadlocked 16.16.2 we had 112k calls in total since start. In that time we had 321 successful SendFax()-instances and 183 ReceiveFax()-instances. All SendFax() was done via Local-Channel, most ReceiveFax()-instances too. As for how often T38 vs G711a was used I have no details, sorry.

This deadlock seemed to be triggered on a ReceiveFax()-instance, while the deadlock on Jan 14th was on SendFax()

By: Matthias Hensler (cubbi) 2021-03-19 09:22:40.205-0500

As for the patch I suspect f6c9a508af95ce754a4cfba9638fd614e4a310ac on the asterisk-16 branch? (https://gerrit.asterisk.org/c/asterisk/+/15651/3/main/core_unreal.c).

I will try that.

By: Joshua C. Colp (jcolp) 2021-03-19 09:25:43.598-0500

Yes, that is the patch. I've asked the person familiar with locking in that area to take a look as well.

By: Friendly Automation (friendly-automation) 2021-03-22 07:48:56.313-0500

Change 15651 merged by Joshua Colp:
core_unreal: Fix deadlock with T.38 control frames.

[https://gerrit.asterisk.org/c/asterisk/+/15651|https://gerrit.asterisk.org/c/asterisk/+/15651]

By: Friendly Automation (friendly-automation) 2021-03-22 07:49:26.872-0500

Change 15653 merged by Joshua Colp:
core_unreal: Fix deadlock with T.38 control frames.

[https://gerrit.asterisk.org/c/asterisk/+/15653|https://gerrit.asterisk.org/c/asterisk/+/15653]

By: Friendly Automation (friendly-automation) 2021-03-22 07:49:51.061-0500

Change 15637 merged by Joshua Colp:
core_unreal: Fix deadlock with T.38 control frames.

[https://gerrit.asterisk.org/c/asterisk/+/15637|https://gerrit.asterisk.org/c/asterisk/+/15637]

By: Friendly Automation (friendly-automation) 2021-03-22 07:50:08.954-0500

Change 15650 merged by Joshua Colp:
core_unreal: Fix deadlock with T.38 control frames.

[https://gerrit.asterisk.org/c/asterisk/+/15650|https://gerrit.asterisk.org/c/asterisk/+/15650]

By: Friendly Automation (friendly-automation) 2021-03-22 07:50:31.556-0500

Change 15652 merged by Joshua Colp:
core_unreal: Fix deadlock with T.38 control frames.

[https://gerrit.asterisk.org/c/asterisk/+/15652|https://gerrit.asterisk.org/c/asterisk/+/15652]