ASTERISK-27944: res_pjsip_t38: Crash receiving 1xx responses other than 100 before 200 for T.38 reINVITE

[Home]

Summary: ASTERISK-27944: res_pjsip_t38: Crash receiving 1xx responses other than 100 before 200 for T.38 reINVITE

Reporter: Joshua Elson (joshelson) Labels: fax patch pjsip

Date Opened: 2018-06-25 15:59:44 Date Closed: 2018-07-05 10:49:52

Priority: Minor Regression?

Status: Closed/Complete Components: Resources/res_pjsip_t38

Versions: 15.2.2 15.4.1 Frequency of
Occurrence

Related
Issues:
is duplicated by ASTERISK-27951 Segfault in res_pjsip_t38.so on incoming fax

Environment: Attachments: ( 0) core.com-2018-08-01T15-46-31-0600-brief.txt
( 1) core.com-2018-08-01T15-46-31-0600-full.txt
( 2) core.com-2018-08-01T15-46-31-0600-locks.txt
( 3) core.com-2018-08-01T15-46-31-0600-thread1.txt
( 4) core.iad.2018-08-10T12-34-02-0600-brief.txt
( 5) core.iad.2018-08-10T12-34-02-0600-full.txt
( 6) core.iad.2018-08-10T12-34-02-0600-locks.txt
( 7) core.iad.2018-08-10T12-34-02-0600-thread1.txt
( 8) core.iad.com.2018-06-25T14-06-14-0600-full.txt
( 9) core.iad.com.2018-06-25T14-06-14-0600-locks.txt
(10) core.iad.com.2018-06-25T14-06-14-0600-thread1.txt
(11) core.iad.com-2018-06-25T14-06-14-0600-brief.txt
(12) core.iad.com-2018-07-09T09-14-58-0600-brief.txt
(13) core.iad.com-2018-07-09T09-14-58-0600-full.txt
(14) core.iad.com-2018-07-09T09-14-58-0600-locks.txt
(15) core.iad.com-2018-07-09T09-14-58-0600-thread1.txt
(16) crash_dialplan.txt
(17) jira_asterisk_27944_v2_w_debug_v15.patch

Description: Crash of Asterisk on T38 session rejection via PJSIP.

This is happening somewhat routinely on a number of our nodes. The error message in the logs will appear as:

[2018-06-25 14:06:04] WARNING[3368][C-00005bd3] res_fax.c: channel 'PJSIP/FC-DFW-PROXY-0000f050' refused to negotiate T.38

There will be a crash with the attached backtrace.

Comments: By: Asterisk Team (asteriskteam) 2018-06-25 15:59:45.923-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Joshua Elson (joshelson) 2018-06-25 16:02:38.738-0500

Backtrace attached.
By: Joshua C. Colp (jcolp) 2018-06-25 16:49:43.563-0500

What version is the backtrace actually against? Please also provide the SIP trace and console log, so we can see how it got into this state.
By: Joshua Elson (joshelson) 2018-06-25 16:54:00.461-0500

This was on 15.2.2, although we've seen it on all 15.x versions.

Let me see if I can get a way to pull the SIP trace and the console log. This doesn't happen on all T38 negotiation rejections, but a subset of them.

Looking for a fully reproducible way to create this state and haven't quite found it yet, but it does occur with regularity on our cluster nodes, even under light to moderate load.
By: Joshua Elson (joshelson) 2018-06-25 17:02:31.270-0500

Aha! Was able to pull both. The SIP capture from our proxy's perspective doesn't show anything of real interest... we see a full 488 reject sequence completed properly.

The dialplan output, on the other hand, is far more interesting. This is slightly scrubbed, but it appears that we're in a state where audio faxing is being done when the T38 rejection happens.

The dialplan output ends when the crash occurs.
By: Joshua C. Colp (jcolp) 2018-06-27 04:34:31.692-0500

It seems as though the T.38 was rejected but we then received a 200 OK to a T.38 reinvite request, which is a state transition not currently handled.
By: Richard Mudgett (rmudgett) 2018-07-02 11:58:38.684-0500

I reproduced the crash. We crash because the peer sent a 1xx response other than {{100 Trying}} before completing the T.38 re-INVITE with a {{200 OK}}. I suspect the peer was sending {{180 Ringing}} to the T.38 re-INVITE.
By: Friendly Automation (friendly-automation) 2018-07-05 10:49:53.868-0500

Change 9321 merged by Jenkins2:
res_pjsip_t38.c: Fix crash by ignoring 1xx messages.

[https://gerrit.asterisk.org/9321|https://gerrit.asterisk.org/9321]
By: Friendly Automation (friendly-automation) 2018-07-05 10:55:02.394-0500

Change 9323 merged by Jenkins2:
res_pjsip_t38.c: Fix crash by ignoring 1xx messages.

[https://gerrit.asterisk.org/9323|https://gerrit.asterisk.org/9323]
By: Friendly Automation (friendly-automation) 2018-07-05 10:57:55.813-0500

Change 9322 merged by Jenkins2:
res_pjsip_t38.c: Fix crash by ignoring 1xx messages.

[https://gerrit.asterisk.org/9322|https://gerrit.asterisk.org/9322]
By: Joshua Elson (joshelson) 2018-07-09 10:56:47.139-0500

With this patch applied, we observed this crash again this morning on one of our nodes. Possible that something else is going on here as well?
By: Richard Mudgett (rmudgett) 2018-07-09 11:40:10.298-0500

It is possible that the crash may be a different manifestation to the cause of ASTERISK-27657 where the patch to ASTERISK-27080 is causing the T.38 state to become rejected.
By: Joshua Elson (joshelson) 2018-08-01 17:42:19.249-0500

With both the above referenced patches applied, we are still observing the crash. It's intermittent, but it happening on multiple nodes at least once a day.
By: Asterisk Team (asteriskteam) 2018-08-01 17:42:19.622-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.
By: Richard Mudgett (rmudgett) 2018-08-01 18:11:29.948-0500

The original and latest backtraces show the same thing. :( I don't understand how it can get in that state after the original fix. The backtraces are showing the code is optimized. Optimization may be playing tricks on the backtrace and not showing what is going on. The patch for ASTERISK-27563 fixed a problem where res_pjsip modules were always optimized even when DONT_OPTIMIZE was enabled.

* An unoptimized backtrace may show a better picture of what is going on.
* Does it seem to crash for a particular caller or calling device?
By: Joshua Elson (joshelson) 2018-08-01 18:49:45.715-0500

We appear to need that patch to get unoptimized traces in res_pjsip. We'll get that taken care of on our end ASAP.

It does appear to be limited to a couple flows that we're still looking to see if we can untangle. It's mostly just an on-net Polycom phone calling a fax line on the same server and getting rejected. It doesn't appear to happen all the time, but does happen.

We'll get a properly unoptimized trace here as soon as it crashes again.
By: Richard Mudgett (rmudgett) 2018-08-10 07:01:34.599-0500

Waiting on new backtraces and more data on the call flow involved (logs).
By: Joshua Elson (joshelson) 2018-08-10 14:04:08.501-0500

Got it. This one has full symbols.
By: Richard Mudgett (rmudgett) 2018-08-10 19:59:10.044-0500

I still don't understand how we could have another final response processed for the T.38 reINVITE. Maybe the INVITE transaction is timing out at the same time as the 200 OK is coming in? I've looked at the code but I don't think that situation could cause the crash.

[^jira_asterisk_27944_v2_w_debug_v15.patch] - This patch adds protection from receiving more than one final response to the T.38 reINVITE. We will ignore any unexpected response. It also has a NOTICE log level BUGBUG message to make it easy to grep the logs to see if we likely avoided a crash.

By: Joshua Elson (joshelson) 2018-08-10 20:53:10.794-0500

You're the man! We'll give it a shot and report back Richard.
By: Joshua Elson (joshelson) 2018-08-15 13:53:19.097-0500

Richard - we've found several instances of the NOTICE BUGBUG in infrastructure and it has in fact prevented the crash. Is this something we could modify to make a permanent part of infrastructure going forward?
By: Richard Mudgett (rmudgett) 2018-08-15 14:28:17.410-0500

Yes it can be put into the codebase. However, that patch is purely defensive and does not fix the root cause. We are apparently getting two different final responses to the T.38 reinvite transaction for that callback which appears to violate the SIP protocol. The message does give the channel name that it happens with so you could have a better chance of narrowing down what devices or call flows are involved.
By: Friendly Automation (friendly-automation) 2018-08-20 10:55:33.793-0500

Change 9961 merged by George Joseph:
res_pjsip_t38.c: Fix crash if already saw a final T.38 reINVITE response.

[https://gerrit.asterisk.org/9961|https://gerrit.asterisk.org/9961]
By: Friendly Automation (friendly-automation) 2018-08-20 10:55:56.510-0500

Change 9960 merged by George Joseph:
res_pjsip_t38.c: Fix crash if already saw a final T.38 reINVITE response.

[https://gerrit.asterisk.org/9960|https://gerrit.asterisk.org/9960]
By: Friendly Automation (friendly-automation) 2018-08-20 10:56:20.098-0500

Change 9959 merged by George Joseph:
res_pjsip_t38.c: Fix crash if already saw a final T.38 reINVITE response.

[https://gerrit.asterisk.org/9959|https://gerrit.asterisk.org/9959]
By: Joshua C. Colp (jcolp) 2018-08-21 11:25:10.710-0500

I'm closing this out as fixed since the patch itself solves the crash problem. If you are able to provide the specific scenario then we can potentially look deeper into understanding it.