[Home]

Summary:ASTERISK-27001: res_pjsip: TLS connection not stable
Reporter:Ian Gilmour (tuxian)Labels:patch pjsip
Date Opened:2017-05-16 06:05:45Date Closed:2017-12-22 09:18:54.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:pjproject/pjsip
Versions:13.15.0 Frequency of
Occurrence
Related
Issues:
is related toASTERISK-26984 chan_pjsip TLS incorrectly tears down connection
Environment:centos 6.8(64-bit)Attachments:( 0) output.tgz
( 1) pjproject-2.6.patch
Description:Hi,

I have a development Asterisk 13.15.0 test setup (uses the bundled pjsip-2.6).

On startup Asterisk registers 1 Asterisk users with a remote OpenSIPS server, over TLS, using the PJSIP stack. As part of the test this Asterisk PJSIP user is reregistered with OpenSIPS Server every couple of mins.

All outgoing/incoming pjsip call media is encrypted using SRTP and via an external RTPPROXY running alongside the external OpenSIPS Server.

Asterisk is additionally configured to use PJSIP on 127.0.0.1:5060 to allow calls from a locally run SIPp process. All SIPp calls are TCP+RTP.

I use SIPp to run multiple concurrent loopback calls (calls vary in duration) through Asterisk to the OpenSIPS server and back to an echo() service running on the same Asterisk).
i.e.
{noformat}
 SIPp <-TCP/RTP-> Asterisk <-TLS/SRTP-> OpenSIPS server (+ rtpproxy) <-TLS/SRTP-> Asterisk (echo service).
{noformat}

With no calls running the PJSIP TLS connection stays up and I see it reregistering the user every ~2mins.

When I start to run the SIPp test I start seeing the PJSIP stack having TLS issues - closing the current port as a result, in this state outgoing SIPp calls obviously start failing.  A few seconds later Asterisk (PJSIP) opens a new port, reregistering with the OpenSIPS server, and the calls continue. With SIPp running the connection is being reestablished every ~10-20 minutes due to TLS issues.

If I switch Asterisk to use the chan_sip stack rather than the PJSIP stack for the TLS connection to the OpenSIPS server the connection stays up with no call failures.

I patched a couple of PJSIP files to help me see what's going on and I have played with the PJSIP TLS code. I can improve the reliability of the connection by ignoring a specific OpenSSL error condition (see the code within #if EXPERIMENTAL...#endif in the attached patch). In the original code this error causes of >90% of the connection failures I see. With this mod in place the TLS connection stays up for hours rather than minutes at a time, on the same outgoing port, and calls work fine. I doubt this mod is the proper fix though.
Comments:By: Asterisk Team (asteriskteam) 2017-05-16 06:05:46.696-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2017-05-24 12:45:11.886-0500

Thanks for the report and troubleshooting patch.

Can you additionally attach a debug log captured for a demonstration of the issue from Asterisk's perspective?

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

A pcap that correlates to the same call or set of calls would be helpful too.

By: Ian Gilmour (tuxian) 2017-05-25 14:40:50.350-0500

output.tgz contains 4 files:
* console.txt - console output with:
** pjproject set log level 5
** core set verbose 4
** core set debug 4
** pjsip set logging on
* netstat.txt - looped netstat output every 5 secs to highlight when the port changes occur.
* 2 x tcpdump-xxx-yyy.txt - 2 pcap outputs showning what happens around each port change to SIP server.

Notes:
* tcpdump-xxx-yyy.txt timestamps are in UTC. Timestamps in other files are in UK BST - so 1 hour time difference.
* test carried out on local Centos 6.9 VM, using Virtualbox on OS-X host, using unmodified Asterisk-13.15.1 source code, with bundled PJSIP.

By: Ian Gilmour (tuxian) 2017-05-31 05:58:38.640-0500

I did a SIPp test over a long weekend with a slightly modified version of the pjproject-2.6.patch in place (the only change was to modify the patch to count the number of times it ignores the SSL BIO error).

i.e.
{noformat}
/* SSL might just return SSL_ERROR_WANT_READ in
* re-negotiation.
*/
if (err != SSL_ERROR_NONE && err != SSL_ERROR_WANT_READ)
{
#define EXPERIMENTAL 1
#if EXPERIMENTAL
   // experimental...
                   if (err == SSL_ERROR_SYSCALL && size_ == -1
&& ERR_peek_error() == 0 && errno == 0) {
static int count=0;
status = STATUS_FROM_SSL_ERR2("Read", ssock, size_, err, len);
PJ_LOG(2,("SSL", "BIO error: %d", count));
count++;
       // ignore these errors
       ;
   } else {
       /* Reset SSL socket state, then return PJ_FALSE */
       status = STATUS_FROM_SSL_ERR2("Read", ssock, size_, err, len);
       reset_ssl_sock_state(ssock);
       goto on_error;
   }
#else
   status = STATUS_FROM_SSL_ERR2("Read", ssock, size_, err, len);
   reset_ssl_sock_state(ssock);
   goto on_error;
#endif
}

status = do_handshake(ssock);
{noformat}

In my test config. I only have 1 TLS connection to a SIP server so the above suffices.

The test ran for 4 days and completed 100,000+ loopback calls. It closed the existing connection and opened a new TLS connection to the SIP server a total of 5 times over the 4 days. 3 were due to the SIP server being restarted, so not an Asterisk issue, the other 2 reconnections, because of the extra SSL logging in the pjproject-2.6.patch, were reported as being due to:

{noformat}
WARNING: pjproject: SSL SSL_ERROR_SSL (Read): Level: 0 err: <336151548> <SSL routines-SSL3_READ_BYTES-sslv3 alert bad record mac> len: 6000
{noformat}

The SSL BIO error count was 75 by the end of the 4 day test. i.e. without the pjproject-2.6.patch applied Asterisk would have closed and reopened the TLS connection a further 75 times.


By: Ross Beer (rossbeer) 2017-05-31 10:37:30.517-0500

I confirm that the patch has vastly improved TLS connectivity, however, I have noticed a couple of disconnects while testing. Before applying this patch TLS would disconnect hundreds of times within an hour.

Should the patch be passed to the PJSIP team to investigate further?

By: George Joseph (gjoseph) 2017-07-17 16:23:58.009-0500

Does either the asterisk instance or the OpenSIPS instance have qualify turned on?   Can you attach the asterisk side pjsip config elements (transport, endpoint, aor, registration).   Without sensitive details of course.

I can _kinda_ reproduce the issue but i want to make sure we're looking at the same things.



By: Ian Gilmour (tuxian) 2017-07-18 02:24:45.472-0500

Hi George, you can find my pjsip config on ASTERISK-27070. Qualify is not enabled on Asterisk or OpenSIPS.

By: George Joseph (gjoseph) 2017-07-19 08:35:26.873-0500

Thanks Ian.

By: George Joseph (gjoseph) 2017-07-20 09:28:16.958-0500

I can't exactly tell from the output but with your patch how many sockets are open between asterisk and opensips?   I.E.  Does opensips reuse the tcp connection that asterisk established to it when it registered or does it create a new connection back to asterisk for the hairpin calls to go back on?

Also, is opensips doing contact rewriting?  I.E.  If asterisk is configured on port 5061 (which will be in the headers) and sends packets from ephemeral port 30000, is opensips rewriting the contact with 30000 as the port?  It's fine if it is, I just need to know.



By: Ian Gilmour (tuxian) 2017-07-20 10:25:53.117-0500

Hi George, in my normal setup Asterisk is behind NAT. Asterisk is configured to listen (for TLS) on port 5062, but this port is never used (it's actually blocked in Asterisk host iptables). OpenSIPS has no direct access to Asterisk listening port. So all SIP traffic (loopback call requests, reregistrations, etc.) are via TCP connection Asterisk establishes during initial registration.



By: George Joseph (gjoseph) 2017-07-21 05:57:34.780-0500

Perfect, that's the way it should be.  Thanks!



By: George Joseph (gjoseph) 2017-07-26 11:20:56.366-0500

I've cleaned the patch up a bit and submitted it to Teluu for inclusiuon.  [~tuxian], I CCd you on the email I sent.  When they merge it, I'll add it to the bundled patches directory.


By: Ian Gilmour (tuxian) 2017-07-31 06:53:39.962-0500

Hi George,

Further to our email discussion...

Whilst I have no problem with the patch going in as is (it definitely improved things in my testing) I still have concerns about the PJSIP TLS stack code (even with the patch applied).

My patch fixed >90% of my TLS disconnection errors, but it did not eliminate them altogether. As reported earlier, I still see the TLS connection being torn down (for other reasons) and a reconnection taking place shortly afterwards. I'd like to eliminate these too, if possible.

Nor am I 100% sure that it is valid to simply ignore the OpenSSL TLS error condition that resulted in >90% of the connection errors I saw - as my patch does. I simply tried ignoring it and found that it seemed to work, and that it seemed to have no adverse side-effects (at least not in my testing).

I still feel there is more investigation required in this area. Possibly the additional error reporting in the patch will help with this.


By: Alexander Traud (traud) 2017-10-16 03:22:37.891-0500

Ian, is ASTERISK-27347 related? That was the cause for many TCP/TLS connection dropouts here in my case.

By: Ian Gilmour (tuxian) 2017-10-17 02:37:45.869-0500

Hi Alexander - I don't think ASTERISK-27347 is related. With your patch applied I don't see any noticable improvement. I've added further comments to your Jira.

By: Asterisk Team (asteriskteam) 2017-10-17 02:37:46.140-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Kevin Harwell (kharwell) 2017-10-23 16:48:30.917-0500

[~tuxian], due to you commenting this issue got re-opened. If you feel this issue needs more work we'll need some more information before it can be moved forward. For instance specific scenarios duplicating the failures (if possible) you see with the patch in place, debug logs and pcaps as well.

Otherwise it can be closed for now, and at such time you are able to narrow the problem down further you can reopen this issue or create a new one.


By: Ross Beer (rossbeer) 2017-11-07 03:57:55.684-0600

Since moving to PJPROJECT 2.7 in the 13 GIT release, it appears this issue has returned

By: Joshua C. Colp (jcolp) 2017-12-22 09:18:54.900-0600

Ross is tracking his TLS issue at ASTERISK-27474. [~tuxian] if you are able to provide the info that [~kharwell] asked for then feel free to comment and this will automatically reopen.