[Home]

Summary:ASTERISK-28209: res_pjsip_t38: Passthrough causes crash when re-invite collision
Reporter:Gregory Massel (gmza)Labels:fax pjsip
Date Opened:2018-12-14 11:53:02.000-0600Date Closed:2019-09-06 09:27:03
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_pjsip_session Resources/res_pjsip_t38
Versions:16.1.0 Frequency of
Occurrence
Constant
Related
Issues:
is related toASTERISK-28495 res_pjsip_t38: 200 OK with SDP answer with declined stream causes crash
Environment:Ubuntu 16.04 LTS, E3-1285, 32GB RAMAttachments:( 0) core-dump-2018-12-13-08h04-brief.txt
( 1) core-dump-2018-12-13-08h04-full.txt
( 2) core-dump-2018-12-13-08h04-locks.txt
( 3) core-dump-2018-12-13-08h04-thread1.txt
( 4) core-dump-2018-12-13-08h42-brief.txt
( 5) core-dump-2018-12-13-08h42-full.txt
( 6) core-dump-2018-12-13-08h42-locks.txt
( 7) core-dump-2018-12-13-08h42-thread1.txt
( 8) t38call.pcap
( 9) t38call2.pcap
Description:After upgrading to 16.1.0, Asterisk crashed six times in the space of an hour. After downgrading to 13.24.0, it has been rock solid for a number of days. Prior to that it was on 13.23.1 and rock solid for weeks.

I cannot isolate which component is causing the issue, however, the one possible area in which this server is unique (versus another similar server that is running stable on 16.1.0) is that it uses res_odbc and func_odbc extensively. But blaming those components would be speculative.

I have attached two backtraces from two different crash events.
Comments:By: Asterisk Team (asteriskteam) 2018-12-14 11:53:03.788-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Gregory Massel (gmza) 2018-12-14 11:54:24.305-0600

backtraces from core dumps attached

By: Joshua C. Colp (jcolp) 2018-12-14 11:58:21.761-0600

This appears to be the result of faxing. We would need to see the configuration, console output, and SIP trace to see the interaction.

By: Gregory Massel (gmza) 2018-12-15 04:11:02.382-0600

Thanks. Based on that I can isolate the two matching calls. Attached are the PCAPs

By: Gregory Massel (gmza) 2018-12-15 04:11:50.196-0600

PCAPs of 2x T.38 calls at times corresponding with crashes

By: Gregory Massel (gmza) 2018-12-15 04:16:27.126-0600

Endpoint configs have:

allow=!all,alaw,ulaw,g729,gsm

direct_media = no
force_rport = yes
rewrite_contact = yes

rtp_keepalive = 30
rtp_symmetric = yes
rtp_timeout = 60
rtp_timeout_hold = 300

t38_udptl = yes
t38_udptl_nat = yes
t38_udptl_maxdatagram = 400
t38_udptl_ec = redundancy


By: Joshua C. Colp (jcolp) 2018-12-15 04:38:20.618-0600

We need not just the pcaps but also the Asterisk console output so we can see the interaction there and ideally the dialplan as well.

By: Gregory Massel (gmza) 2018-12-15 04:40:40.094-0600

The logs are absolutely massive and there is nothing of interest at the exact time of crash, however, I've attempted to extract what is relevant and corresponding to the fax calls.

[Dec 13 08:04:39] VERBOSE[3868][C-00000cd4] app_dial.c: PJSIP/0216718077-00001c97 answered PJSIP/floraniche307-00001c96
[Dec 13 08:04:39] VERBOSE[3868][C-00000cd4] bridge_channel.c: Channel PJSIP/floraniche307-00001c96 joined 'simple_bridge' basic-bridge <3cb480c8-d524-469d-b7bf-77d688588d7c>
[Dec 13 08:04:44] -- crash without log entry --

After Asterisk restart:
[Dec 13 08:05:01] Asterisk 16.1.0 built by vpbx @ vpbx2 on a x86_64 running Linux on 2018-12-12 18:34:16 UTC
[..]
[Dec 13 08:05:27] VERBOSE[4034] res_pjsip/pjsip_options.c: Contact floraniche307/sip:floraniche307@41.164.155.150:5060;x-ast-txp=transport-udp is now Unreachable.  RTT: 0.000 msec
[Dec 13 08:06:24] VERBOSE[4036] res_pjsip/pjsip_configuration.c: Endpoint floraniche307 is now Reachable

----------

[Dec 13 08:16:59] VERBOSE[8176][C-000004bc] app_dial.c: PJSIP/0216718077-00000b00 is making progress passing it to PJSIP/floraniche307-00000afe
[Dec 13 08:17:28] VERBOSE[8176][C-000004bc] app_dial.c: PJSIP/0216718077-00000b00 answered PJSIP/floraniche307-00000afe
[Dec 13 08:17:28] VERBOSE[8376][C-000004bc] bridge_channel.c: Channel PJSIP/0216718077-00000b00 joined 'simple_bridge' basic-bridge <857a366b-64b1-42b1-8d7c-9fc4e529efc0>
[Dec 13 08:17:28] VERBOSE[8176][C-000004bc] bridge_channel.c: Channel PJSIP/floraniche307-00000afe joined 'simple_bridge' basic-bridge <857a366b-64b1-42b1-8d7c-9fc4e529efc0>
[Dec 13 08:17:33] -- crash without log entry --

After restart:
[Dec 13 08:18:01] Asterisk 16.1.0 built by vpbx @ vpbx2 on a x86_64 running Linux on 2018-12-12 18:34:16 UTC
[Dec 13 08:19:04] VERBOSE[8453] res_pjsip/pjsip_configuration.c: Endpoint floraniche307 is now Reachable
[Dec 13 08:19:04] VERBOSE[8453] res_pjsip/pjsip_options.c: Contact floraniche307/sip:floraniche307@41.164.155.150:5060;x-ast-txp=transport-udp is now Reachable.  RTT: 107.500 msec

----------

[Dec 13 08:31:30] VERBOSE[13650][C-000006a3] app_dial.c: PJSIP/0216718077-00000f5f answered PJSIP/floraniche307-00000f5e
[Dec 13 08:31:30] VERBOSE[13650][C-000006a3] bridge_channel.c: Channel PJSIP/floraniche307-00000f5e joined 'simple_bridge' basic-bridge <ebcd54f0-6bb0-4dd2-b25d-465dbe4ab659>
[Dec 13 08:31:36] -- crash without log entry --

After restart:
[Dec 13 08:32:01] Asterisk 16.1.0 built by vpbx @ vpbx2 on a x86_64 running Linux on 2018-12-12 18:34:16 UTC
[Dec 13 08:33:07] VERBOSE[13957] res_pjsip/pjsip_configuration.c: Endpoint floraniche307 is now Reachable
[Dec 13 08:33:07] VERBOSE[13957] res_pjsip/pjsip_options.c: Contact floraniche307/sip:floraniche307@41.164.155.150:5060;x-ast-txp=transport-udp is now Reachable.  RTT: 113.133 msec

----------

[Dec 13 08:42:47] VERBOSE[18865][C-00000539] app_dial.c: PJSIP/0216718077-00000c26 answered PJSIP/floraniche307-00000c25
[Dec 13 08:42:47] VERBOSE[19026][C-00000539] bridge_channel.c: Channel PJSIP/0216718077-00000c26 joined 'simple_bridge' basic-bridge <56b2be82-1420-4562-88bb-7092cc3eb655>
[Dec 13 08:42:47] VERBOSE[18865][C-00000539] bridge_channel.c: Channel PJSIP/floraniche307-00000c25 joined 'simple_bridge' basic-bridge <56b2be82-1420-4562-88bb-7092cc3eb655>
[Dec 13 08:42:53] -- crash without log entry --

After restart:
[Dec 13 08:43:02] Asterisk 16.1.0 built by vpbx @ vpbx2 on a x86_64 running Linux on 2018-12-12 18:34:16 UTC
[Dec 13 08:43:44] VERBOSE[19131] res_pjsip/pjsip_options.c: Contact floraniche307/sip:floraniche307@41.164.155.150:5060;x-ast-txp=transport-udp is now Unreachable.  RTT: 0.000 msec


By: Gregory Massel (gmza) 2018-12-15 04:43:32.033-0600

Also what is different on this server versus the other (that is stable), is that this one qualifies its pjsip contacts.

I note that there has been a fair amount of work in the qualification code for Asterisk 16.1.0.

It would appear that the above calls are all T.38 and all to/from the same destination, however, critically it also appears that there may have been a qualification event at the same time as the calls that may have seen the contact as being unreachable.

By: Gregory Massel (gmza) 2018-12-15 04:51:31.634-0600

The dialplan is not one that you're going to be able to replicate. It is complex uses a large number of func_odbc queries to dynamically route calls.
The last dial plan action relating to these calls looks like this:

[Dec 13 08:42:17] VERBOSE[18865][C-00000539] pbx.c: Executing [s@swvpbx-sub-dialout:56] Dial("PJSIP/floraniche307-00000c25", "PJSIP/0216718077/sip:0333457349@196.216.192.4:5060,120,") in new stack
[Dec 13 08:42:17] VERBOSE[18865][C-00000539] app_dial.c: Called PJSIP/0216718077/sip:0333457349@196.216.192.4:5060

based on:

same  =>   n,Dial(PJSIP/${TRUNK}/sip:${NUMBER_DIALLED}@196.216.192.1:5060,120,${CALLOPT})

where ${CALLOPT} was empty, ${TRUNK} was 0216718077 and ${NUMBER_DIALLED} was 0333457349
in each and every case

By: Gregory Massel (gmza) 2018-12-15 05:00:44.098-0600

So the scenario seems to be that endpoint A calls endpoint B, B answers, B initiates a T.38 re-INVITE, call starts exchanging T.38, all is good...
During the call, Asterisk sends a qualify (OPTIONS) to A.
Asterisk segfaults.

By: Gregory Massel (gmza) 2018-12-15 05:06:47.673-0600

BTW, this fax machine backed off on 13 December, however, I can see that, on 14 December at 10:06:34 (when the server was downgraded to Asterisk 13.24.0), it re-tried again, and the fax transmitted on first attempt. The call was 02:28 in duration, 01:57 in connected duration, and fax transmission was 1 page. Our monitoring software was able to reproduce the fax from the PCAP of the completed call and it is a normal transmission.

The config was not changed with the downgrade; it was just a case of recompiling from source and re-installing using the exact same config files.

By: Joshua C. Colp (jcolp) 2018-12-17 07:09:46.543-0600

I'm triaging this and have a question about the scenario - you say fax relay, is this strictly T.38 passthrough or are you using the fax gateway functionality? What does "relay" mean exactly?

By: Joshua C. Colp (jcolp) 2018-12-17 07:10:46.111-0600

And I assume 196.216.192.5 is Asterisk based on the traffic I'm seeing?

By: Gregory Massel (gmza) 2018-12-17 07:20:46.710-0600

Yes, 196.216.192.5 is the Asterisk box.
By "relay" I mean T.38 passthrough (i.e that both endpoints talk T.38 and Asterisk does *NOT* gateway between G.711 and T.38).
T.38 Gateway is NOT enabled.



By: Joshua C. Colp (jcolp) 2018-12-17 09:58:15.611-0600

I think what is happening is that the re-invite collision is causing the pending media state to get cleared out, so when the re-invite goes out again the negotiation does not have the proper state to act on that it assumes to exist and causes a crash. The pending state needs to remain on a 491, as the re-invite will indeed go out - albeit just delayed.

By: Gregory Massel (gmza) 2019-09-05 18:32:18.737-0500

Could this issue potentially be the same as ASTERISK-28495?
If so, it may be resolved?

By: Kevin Harwell (kharwell) 2019-09-06 09:26:40.781-0500

Yes, you are correct. I believe this issue should be resolved by the fix for ASTERISK-28495.