[Home]

Summary:ASTERISK-27529: Endpoint with TLS lost inbound registration
Reporter:Dmitriy Serov (Demon)Labels:pjsip
Date Opened:2017-12-22 13:12:41.000-0600Date Closed:
Priority:MinorRegression?Yes
Status:Open/NewComponents:Resources/res_pjsip
Versions:15.2.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Attachments:( 0) debug9-1.txt
( 1) demon.tcpdump.txt
( 2) endpoint.txt
( 3) pjsip-config-tls-lost.txt
( 4) registration_with_lost.txt
( 5) TLS-Qualify-ServerPort-Bug.zip
Description:Regression.15.1.4 -> 15.2.0-rc.
Endpoint with TLS lost inbound registration.
After some time of first register (connect) in asterisk log "is now Unreachable.  RTT: 0.000 msec".
Homer shows packets REGISTER and answers OK. But no line on Reachable in logs.
Client thinks is online. And asterisk does not. As a result - lost calls.
Comments:By: Asterisk Team (asteriskteam) 2017-12-22 13:12:41.515-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Dmitriy Serov (Demon) 2017-12-22 13:14:39.648-0600

netstat -nalp | grep 5061 | grep <ip>
shows ESTABLISHED connection

By: Joshua C. Colp (jcolp) 2017-12-22 13:16:03.836-0600

Can you please provide the console output with SIP logging as an attachment? Including the REGISTER.

By: Dmitriy Serov (Demon) 2017-12-22 13:17:56.614-0600

pjsip show endpoint

By: Dmitriy Serov (Demon) 2017-12-22 13:27:18.722-0600

Log with logger on attached.
Packet REGISTER, answer OK. And... where is Contact is now Reachable?

By: Joshua C. Colp (jcolp) 2017-12-22 13:31:51.411-0600

It appears as though the underlying TLS connection was dropped, as the new registration displaced the old one which would not occur if it were a refresh. This would also explain it going unreachable.

Getting a wireshark capture would confirm what is happening to the connection. Otherwise this appears to be the same thing as ASTERISK-27474.

By: Dmitriy Serov (Demon) 2017-12-22 13:59:06.825-0600

demon.tcpdump.txt attached.
TCPdump stopped when a contact becomes unavailable


By: Richard Mudgett (rmudgett) 2017-12-22 14:19:02.670-0600

Have you read \[1]?  You seem to have remove_existing enabled with an unknown max_contacts set.

\[1] http://blogs.asterisk.org/2017/11/29/pjsip-mis-configuration-can-cause-loss-sip-registrations/


By: Dmitriy Serov (Demon) 2017-12-22 14:35:47.607-0600

now:
max_contacts=1
rewrite_contact=yes
remove_existing=yes

What's wrong?
However, a device refreshing a registration may register a different contact address and exceed “max_contacts”?
IP address, port and other - the same... max_contacts=2 will help?

By: Richard Mudgett (rmudgett) 2017-12-22 14:58:56.379-0600

I didn't say anything was wrong.  I noticed in the attached log that remove_existing was enabled and that a contact was removed because of it.  I was just pointing out that blog to see if you had considered the consequences of those settings described by the blog.

By: Dmitriy Serov (Demon) 2017-12-22 15:05:14.233-0600

Maybe it's a "coincidence", but the max_contacts=2 helped. I'll keep looking.
But to 15.2.0, I did not notice such problems with max_contacts=1.
And, of course, I think is wrong this behavior when is required a limit of 1 contact.

Thank you for blog url.

By: Dmitriy Serov (Demon) 2017-12-22 15:53:16.731-0600

No max_contacts=2 did not help. Contact is unavailable again.


By: Joshua C. Colp (jcolp) 2018-01-02 05:26:44.834-0600

Can you take a gander at this too? It was already mentioned on the other issue, so you can close this one as a duplicate if you want.

By: Dmitriy Serov (Demon) 2018-01-02 07:37:32.667-0600

I don't know how to close this issue.
You can close as duplicate of ASTERISK-27474.
I look forward to the patch for asterisk 15. Thank you

By: George Joseph (gjoseph) 2018-01-02 07:41:47.884-0600

I think [~jcolp]'s comment was directed at me.
I'm going to leave this open for now until I can investigate a bit more.


By: Asterisk Team (asteriskteam) 2018-01-25 12:00:01.816-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Dmitriy Serov (Demon) 2018-07-11 13:10:45.938-0500

Asterisk 15.5.0-rc1 with bundled pjproject.
Server config is in the attached file "pjsip-config-tls-lost.txt".
Softphone PhonerLite configured to use TLS (port 5061).

Two cases:
1. In case of endpoint is configured without "transport=transport-tls" inbound registration is stable and doesn't lost.
2. In case of endpoint is configured with "transport=transport-tls" inbound registration lost after minutes (range from 2 minutes to a couple of hours):
[2018-07-11 19:40:20] VERBOSE[18092] res_pjsip/pjsip_options.c: Contact 10228/sip:10228@109.xx.xx.xx:63119;transport=TLS is now Reachable.  RTT: 8.305 msec
[2018-07-11 20:31:25] VERBOSE[12819] res_pjsip/pjsip_options.c: Contact 10228/sip:10228@109.xx.xx.xx:63119;transport=TLS is now Unreachable.  RTT: 0.000 msec


By: Kevin Harwell (kharwell) 2018-07-12 17:36:26.356-0500

I have been unable to duplicate this issue with the given information and nothing immediately stands out. In order to move forward we'll need a complete/full debug log from Asterisk that includes the appropriate protocol debug options enabled, e.g. 'pjsip set logger on'.

The configuration you attached also does not appear to be aligned with the verbose messages you posted. For instance the following:
{noformat}
2018-07-11 20:31:25] VERBOSE[12819] res_pjsip/pjsip_options.c: Contact 10228/sip:10228@109.xx.xx.xx:63119;transport=TLS is now Unreachable. RTT: 0.000 msec
{noformat}
Can only be seen if endpoint qualification has been enabled. However in the attached pjsip.conf I do not see the qualify_frequency option set on the aor for the given endpoint in question. As such please post the complete configuration that was used when the problem occurs.

A wireshark compatible packet capture, captured at the same time as the Asterisk log output would also be helpful.

Please follow the instructions on the wiki [1] for how to collect debugging and protocol specific debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Asterisk Team (asteriskteam) 2018-07-27 12:00:01.641-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Dmitriy Serov (Demon) 2018-10-09 16:56:44.166-0500

Asterisk 15.6.1
Endpoint has single transport (TLS).
I have noticed that contact keeps Reachable in for a while. All the while, asterisk sends an OPTIONS packet from port 5061 to the client port (and receives a response).
After a few hours, asterisk for some reason decides to send packets from another dynamic port.
In this case, the client is no longer responsible (of course). And netstats shows SYN_SENT state of this connection.
The connection from port 5061 is still in use (ESTABLISHED) and accepts and responds to REGISTER packets from the client.
For this reason, the client considers the connection stable.

Question: why the server sends OPTIONS with dynamic port (not 5061)?

By: Asterisk Team (asteriskteam) 2018-10-09 16:56:44.683-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Joshua C. Colp (jcolp) 2018-10-10 05:40:49.619-0500

It would send from a different source port if it thought the connection dropped and it had to establish a new outgoing one (or try to). Core level logs and such would show this happening if it thought so. Do you have current logs showing all of this?

By: Dmitriy Serov (Demon) 2018-10-10 06:03:59.299-0500

attached debug9-1.txt

This is debug level 9 log. when there was already a switch from port 5061 to dynamic. I.e. the previous OPTIONS has been a dynamic port (different each time). This is not the moment when there was the first OPTIONS with the "wrong" port.

netstat -nalp | grep 109.60.xxx.xxx
tcp        0      1 85.142.148.xx:57263     109.60.xxx.xxx:51359    SYN_SENT    3186/asterisk
tcp        0      0 85.142.148.xx:5061      109.60.xxx.xxx:59714    ESTABLISHED 3186/asterisk

I see two connections: established with port 5061 and SYN_SENT with dynamic port.

By: Joshua C. Colp (jcolp) 2018-10-15 05:38:41.131-0500

That is not enough console log to look at it. I was more looking to see if PJSIP wrongly thought a connection had dropped. What is the output of "pjsip show contact" on the specific contact when it happens? Is there anything further up in the console log?

By: Dmitriy Serov (Demon) 2018-10-15 08:06:32.927-0500

Unfortunately, the server has thousands of registrations and it is not easy to collect logs. But it is possible.

Collecting the logs, I found the following sequence of events:
1. client establishes TCP connection
2. the client sends a REGISTER (authorized). Contact becomes Available
3. the server, using the Contact parameters, sends OPTIONS to check the connection on the desired destination port
4 at some point, the client does not respond in time to OPTIONS (in my case, the delay was 7 seconds). Not an important reason (I will understand from the author of PhonerLite)
5. server marks Contact as Unavailable. The connection between the client and the server is NOT LOST. The client does not know about the problems.
6. The client then sends the REGISTER packets and receives responses over the existing connection. The server, seeing that Contact is unavailable, tries to establish a new connection from another server port and does not reach the client.

7. The client will never go online, because he does not know about the server problems. The server will never reach the client via a" foreign " connection.

By: Joshua C. Colp (jcolp) 2018-10-18 04:51:37.249-0500

Still need to see the information I previously mentioned. In particular the "pjsip show contact" output would be very helpful.

By: Dmitriy Serov (Demon) 2018-10-23 16:53:23.626-0500

I was hoping to work around the issue by disabling Qualify. But the problem remained. After some time, the INVITE packet was sent by the server not from port 5061, but from another (dynamic) port. The situation was repeated, but with the packet INVITE.

In the archive TLS-Qualify-ServerPort-Bug.zip two log files:
1. cutted.txt - abbreviated file with rows that relate to the endpoint 10228.
You can see that 23:32:10 qualify was from port 5061 and was successful.
23:35:10 server src port somehow changed from 5061 to 33217. Port 33217 previously in the logs was not mentioned even once.
Something happened between the two.

2. full.txt - full log (debug level 9) between this two moments. 35 MB :(

"pjsip show contact" is useless. The client port does not change, and only the client port is displayed everywhere. The contact status changes to Unavailable after a timeout.

By: George Joseph (gjoseph) 2018-10-25 14:33:02.575-0500

I just can't seem to reproduce this issue either in 15.6 or 16.0.  Can you create a sipp scenario that we can try?


By: Asterisk Team (asteriskteam) 2018-11-09 12:00:01.778-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Dmitriy Serov (Demon) 2019-02-09 05:02:01.799-0600

The reason is found and the workaround works. I will tell you about the causes of the problem.
It is important that it was:
- allow_reload=true in the transport config
- reloading pjsip config

Sometimes the algorithm thinks that the transport has changed (although there were no real changes):
{code}
/res/res_pjsip/config_transport.c
               ast_sorcery_diff(sorcery, perm_state->transport, transport, &changes);
               if (!changes && !has_state_changed(perm_state->state, temp_state->state)) {
{code}
and then a new listener is created instead of the old one.

and then
{code}
third-party/pjproject/source/pjsip/src/pjsip/sip_transport.c
       /* If transport is found and listener is specified, verify listener */
       else if (sel && sel->type == PJSIP_TPSELECTOR_LISTENER &&
                sel->u.listener && transport->factory != sel->u.listener)
       {

           transport = NULL;
           /* This will cause a new transport to be created which will be a
            * 'duplicate' of the existing transport (same type & remote addr,
            * but different factory). Any future hash lookup will return
            * the new one, and eventually the old one will still be freed
            * (by application or #1774).
            */
       }
{code}
When you try to send from the server to the OPTIONS or INVITE client, a new connection is created (from the server to the client, which by its nature will never be correct) and replaces the old connection (with the same Protocol, address and port) in the transport hash.
As a result, we have a lost socket that successfully responds to the client's REGISTER and creates the illusion of successful authorization from the client.

allow_reload=false is workaround i used now.

By: Asterisk Team (asteriskteam) 2019-02-09 05:02:02.646-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.