ASTERISK-27609: sip_tcptls_read: SIP TCP/TLS server has shut down after 120s

[Home]

Summary: ASTERISK-27609: sip_tcptls_read: SIP TCP/TLS server has shut down after 120s

Reporter: Chris Jones (chrisnjones) Labels:

Date Opened: 2018-01-22 10:50:26.000-0600 Date Closed: 2018-03-30 13:55:12

Priority: Major Regression?

Status: Closed/Complete Components: Channels/chan_sip/TCP-TLS

Versions: 15.1.3 Frequency of
Occurrence Constant

Related
Issues:

Environment: AWS EC2 running Centos 7 Attachments: ( 0) asterisk-27609.core.tar.gz
( 1) asterisk-27609.pcap.tar.gz
( 2) asterisk-27609.tcpdump
( 3) ASTERISK-27609.txt
( 4) asterisk-cli-output.txt

Description: I have a SIP provider (Twilio) that has provided working guidelines for configuring Asterisk to have a Secure SIP tunnel between it and Twilio. I have implemented the TLS and SRTP, and configured Twilio to be Secure-only.

I can make and receive calls just fine - TLS and SRTP is working.

If I hang up the call within 2 minutes: the call disconnects perfectly fine on both ends.

If the call is longer than two minutes,
- at exactly the 120s mark, I get the following sip debug:

DEBUG[30015]: iostream.c:157 iostream_read: TLS clean shutdown alert reading data
DEBUG[30015]: chan_sip.c:2905 sip_tcptls_read: SIP TCP/TLS server has shut down

- if I hang up the call on the Asterisk side, my mobile phone that dialed into Twilio, does not hang up.
- after 300seconds beyond the initial 120+ seconds, my mobile phone hangs up. I feel this a Twilio timeout. Interesting enough, Twilio sends a BYE but Asterisk responds that the call leg does not exist anymore.

Because call setup and SRTP both work, the hangup issue is an aggravation because it could increase toll charges (extra 5 minutes every call if the other doesnt hang up). And we need to resolved before we can deploy our app.

Comments: By: Asterisk Team (asteriskteam) 2018-01-22 10:50:27.565-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Chris Jones (chrisnjones) 2018-01-22 11:08:16.548-0600

debug
By: Chris Jones (chrisnjones) 2018-01-22 11:08:36.659-0600

Asterisk CLI output during test
By: George Joseph (gjoseph) 2018-01-22 11:23:05.144-0600

Hi Chris,

At what time in the logs did the issue happen (my eyes glaze over reading those things :) )? Also were you able to get a wireshark capture as Alexander Traud suggested on the asterisk-dev list? It'll help us determine who's actually terminating the connection.
By: Chris Jones (chrisnjones) 2018-01-22 11:34:20.330-0600

Hi George.
Thanks for taking a look so quickly. [Jan 22 11:01:37] in the CLI log.

I have not taken a wireshark capture because it is an EC2 server up in AWS, but I can certainly try.

Chris
By: Richard Mudgett (rmudgett) 2018-01-22 11:39:52.336-0600

Asterisk is rebooting in the middle of that call. It either crashed \[1] in which case we need backtraces or you did a "core restart now" command.

\[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
By: Chris Jones (chrisnjones) 2018-01-22 11:52:18.228-0600

core dump 20170122
By: Chris Jones (chrisnjones) 2018-01-22 11:52:49.815-0600

Hi Richard,
Just uploaded the core dump.

Chris
By: Richard Mudgett (rmudgett) 2018-01-22 11:57:30.010-0600

Core dumps are *useless* to us. Only your machine can interpret them. Follow the instructions I linked to generate a backtrace.
By: Chris Jones (chrisnjones) 2018-01-22 12:02:52.388-0600

astdumped the core into brief, full, locks and threads file per https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
By: Chris Jones (chrisnjones) 2018-01-22 12:13:14.414-0600

Naked wireshark packet capture ~15,000

To help filter, Twilio SIP servers are:

asterisk223.pstn.twilio.com
54.172.60.2
54.172.60.0
54.172.60.1
54.172.60.3

There is only one legitimate call in this packet capture. About 2:37 after the cll was initiated, the call should have hung up. My mobile hung until I ended the call at 3:00 after initiation.
By: George Joseph (gjoseph) 2018-01-22 12:18:01.288-0600

I'll look at the pcap when I get back from lunch but what I see in the logs is...
iostream.c:157 iostream_read: TLS clean shutdown alert reading data
which means while we were reading, we received an alert from the remote end that indicates they are doing a clean shutdown.
In that case, the protocol says we have to close the connection.

By: George Joseph (gjoseph) 2018-01-22 14:05:53.820-0600

The pcap seems top be corrupted. can you try again?

Also, the crash was resolved in ASTERISK-27534

By: Chris Jones (chrisnjones) 2018-01-22 14:53:41.397-0600

Yes sir, I will try again.

Do you know if the crash was fixed in the 15.2 branch, and which specific version? I had it loaded on another server, since decommissioned, and had the same error.

I don't mind upgrading to a new version - we are not in production. Alternatively, I can also downgrade to 13.8 cert2 if that makes more sense.
By: Chris Jones (chrisnjones) 2018-01-22 15:04:54.089-0600

I looked at the "wireshark" capture on the server and it is just a bunch of one liners in text format. Not the format you needed.

I followed these directions for a tcpdump:
https://www.wireshark.org/docs/wsug_html_chunked/AppToolstcpdump.html

See if that works better. Hopefully you will have full detail.
By: George Joseph (gjoseph) 2018-01-22 15:23:50.705-0600

The patch isn't in a released version of 15 yet so it's only in the base 15 branch in git/gerrit.
If you're going to downgrade, go to the latest 13 version, not as far back as 13.8.
I'll take a look at the new pcap.

By: George Joseph (gjoseph) 2018-01-22 15:45:22.130-0600

The new pcap shows that things appear to be normal then twilio sending an encrypted alert (closing) to which asterisk responds with an encrypted alert. Then twilio closes the TCP connection and asterisk follows suit. Unfortunately we have no way of knowing WHY twilio sent the alert. 30 seconds earlier, twilio sent a TCP keepalive packet which was ACKed and there were no other errors I can see. I don't suppose they can provide any logs from their side?

By: Chris Jones (chrisnjones) 2018-01-22 16:41:51.795-0600

I mistyped earlier, should have been 13.18 cert2.

We have already opened the ticket with Twilio using the same call ID awaiting their response.

I'm going to hold on moving the software to any version just yet. Will let you know what Twilio reports.
By: George Joseph (gjoseph) 2018-01-23 08:16:53.902-0600

Cool. Let us know.
By: Chris Jones (chrisnjones) 2018-01-29 09:55:42.430-0600

Finally something from Twilio; will keep you in the loop:

I looked into the logs and found in ASTERISK-27609-logger.conf.txt file that your Asterisk server is getting BYE correctly:

[Jan 22 11:02:17] VERBOSE[813] chan_sip.c:<--- SIP read from TLS:54.172.60.3:35991 --->BYE sip:+18442033202@52.91.76.243:5061;transport=tls SIP/2.0CSeq: 14423 BYEFrom: <sip:+12059087858@asterisk224.pstn.twilio.com;isup-oli=62>;tag=66369999_6772d868_4b1d52f7-a5b4-46ac-af12-9d8a2b01028bTo: <sip:+18442033202@pbx.statement.services;user=phone>;tag=as482692c8Call-ID: 1b99772d4affe3f5a053c29e43277aef@0.0.0.0Max-Forwards: 68Via: SIP/2.0/TLS https://urldefense.proofpoint.com/v2/url?u=http-3A __54.172.60.3&d=DwIG-g&c=x_Y1Lz9GyeGp2OvBCa_eow&r=t2YgCTqxzerFRlsCFDyfTv6e6hm4_syTqyTbXV6oNO8&m=I2bUe66XRIMdUN4DmoHZRQ8qMVZ2Kap6oTYtaLeB2-E&s=i_5ngCXtrrRjhAsfvv-9lsXhSHLWPMR8fx_CaxNOUag&e=:5061;branch=z9hG4bKc03c.e98fba07.0Via: SIP/2.0/UDP https://urldefense.proofpoint.com/v2/url?u=http-3A__ 172.18.16.53&d=DwIG-g&c=x_Y1Lz9GyeGp2OvBCa_eow&r=t2YgCTqxzerFRlsCFDyfTv6e6hm4_syTqyTbXV6oNO8&m=I2bUe66XRIMdUN4DmoHZRQ8qMVZ2Kap6oTYtaLeB2-E&s=XFb46VuOPONdh-jw0M27Cam7NcmE3Z81qG-8SzrBHes&e=:5060;rport=5060;received=https://urldefense.proofpoint.com/v2/url?u=http-3A__52.71.97.121&d=DwIG-g&c=x_Y1Lz9GyeGp2OvBCa_eow&r=t2YgCTqxzerFRlsCFDyfTv6e6hm4_syTqyTbXV6oNO8&m=I2bUe66XRIMdUN4DmoHZRQ8qMVZ2Kap6oTYtaLeB2-E&s=XCL-si03hzCX-CeoWqVFcoJY4SBOV9UqsjEwQ66xDbQ&e=;branch=z9hG4bK4b1d52f7-a5b4-46ac-af12-9d8a2b01028b_6772d868_362-11331624058400453305Reason: Q.850;cause=16;text="Terminated"Supported: 100relUser-Agent: Twilio GatewayX-Twilio-CallSid: CA0122c0799eac2ad9ded58f3d7d9da4b4Content-Length: 0

Further investigating I found that this call was terminated from our side, but as it's a single example I'd like to wait until we have a trend.
By: George Joseph (gjoseph) 2018-01-29 14:58:52.357-0600

Yep, let us know.
By: Chris Jones (chrisnjones) 2018-02-08 10:07:39.215-0600

After many call examples, this is what Twilio came back with, with my comments in [ ]:

The TLS (TCP) connection is probably getting torn down while the call is up. [TRUE] It's not something Asterisk related. It is likely their side is behind a NAT, and the NAT is not staying open for the entire call which is causing the connection to be lost. Can they adjust their NAT to allow connections to stay open longer? [No, AWS does not allow for NAT configuration] This usually involves sending CR-LF periodically. [???] Or maybe they can implement SIP session timers, so that SIP keepalives (re-INVITEs) are sent to keep the connection up. [Enabled by default, but nothing pointing to 120 seconds. See below]

Session Timers: Accept
Session Refresher: uas
Session Expires: 1800 secs
Session Min-SE: 90 secs

canreinvite is set to 'no'; however, we tried setting it to 'yes', but observed the same call behavior.

Any other advice?

By: Chris Jones (chrisnjones) 2018-02-08 10:49:08.273-0600

From our AWS ticket ruling out NAT gateway:

The NAT gateway will close a connection if it is idle for 5 minutes [1]. However TTL while called time to live is based on hops, as the connection made and current the hops and TTL will not depreciate.

[1] https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-gateway.html#nat-gateway-troubleshooting-timeout
By: Chris Jones (chrisnjones) 2018-02-08 16:20:19.209-0600

Forgot to mention earlier: if I disable TLS, all calls, no matter the length, work just fine.
Enable TLS, and the TCP connection drops at 120 seconds.
By: George Joseph (gjoseph) 2018-02-09 10:41:57.035-0600

Given that chan_sip is no longer supported, the only other thing I can offer is to set "keepalive=60" in sip.conf.
I'm going to "open" this issue in case a community member can help.

By: Chris Jones (chrisnjones) 2018-02-09 11:11:30.904-0600

OK George, I think I tried that, but I can try again.

Question though, Twilio has asked for:

For Signaling, when using TCP or TLS, this may be achieved by periodically sending SIP OPTIONS to Twilio

I have qualify=yes on the trunk. Wouldn't that send OPTIONS to Twilio for the default timer? Or is my understanding of qualify wrong?

Chris
By: Chris Jones (chrisnjones) 2018-03-30 13:54:52.520-0500

Had another issue with DTMF not passing through to a dial out in our IVR.

Got over my stubborness and just swapped out chan_sip for chan_pjsip and the TLS hangup issue has been resolved. Closing issue.

Thanks all.
By: Chris Jones (chrisnjones) 2018-03-30 13:55:12.601-0500

swapped chan_sip for chan_pjsip resolved issue.