Summary: ASTERISK-28923: T.38 Segfaults in chan_pjsip_queryoption
Reporter: Yury Kirsanov (lt_flash)
Labels: fax patch
Date Opened: 2020-05-30 05:52:09
Date Closed: 2020-06-10 14:00:48
Priority: Major
Regression?:
Status: Closed/Complete
Components: Channels/chan_pjsip
Versions: 16.10.0
Frequency of Occurrence: Frequent
Related Issues:
Environment: Linux Ubuntu 18.04.04 LTS
Attachments: (0) 0001-res_fax-Don-t-start-a-gateway-if-either-channel-is-h.patch
(1) crash.tar.gz
Description: We're running Asterisk in a heavily loaded production environment with many devices in both UDP and TCP mode, compiled with the bundled version of PJSIP. We recently enabled T.38 fax receiving and started getting segfaults in chan_pjsip.so, even when the load is light (e.g. on weekends). I tried building against an external (non-bundled) version of PJProject (2.10-dev); it works fine but didn't resolve the issue. So I compiled a debug build of Asterisk, and it looks like thread 1 crashes while handling T.38 fax options. Could you please let me know whether I'm correct, or what else could be wrong? I'm attaching the coredump files in an archive. Thanks.
Comments:By: Asterisk Team (asteriskteam) 2020-05-30 05:52:11.279-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Yury Kirsanov (lt_flash) 2020-05-30 05:53:23.495-0500

Coredump parsed with the ast_coredumper utility

By: Yury Kirsanov (lt_flash) 2020-05-30 05:54:28.813-0500

In logs I can see:

{noformat}
[Sat May 30 11:06:20 2020] asterisk[81358]: segfault at 8 ip 00007fd456c391d8 sp 00007fd3d159a230 error 4 in chan_pjsip.so[7fd456c2e000+1d000]
[Sat May 30 11:06:44 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 11:34:16 2020] asterisk[87569]: segfault at 8 ip 00007f56791cd1d8 sp 00007f56469be230 error 4 in chan_pjsip.so[7f56791c2000+1d000]
[Sat May 30 11:34:23 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 11:39:27 2020] asterisk[88714]: segfault at 8 ip 00007f8f695c11d8 sp 00007f8eab24e230 error 4 in chan_pjsip.so[7f8f695b6000+1d000]
[Sat May 30 11:39:32 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 12:57:16 2020] asterisk[106814]: segfault at 8 ip 00007ff3491cd1d8 sp 00007ff28f6b3230 error 4 in chan_pjsip.so[7ff3491c2000+1d000]
[Sat May 30 12:57:20 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 15:33:44 2020] asterisk[1425]: segfault at 8 ip 00007f38c95c11d8 sp 00007f393c73e230 error 4 in chan_pjsip.so[7f38c95b6000+1d000]
[Sat May 30 15:33:48 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 16:53:14 2020] asterisk[15169]: segfault at 8 ip 00007f19373201d8 sp 00007f1890490230 error 4 in chan_pjsip.so[7f1937315000+1d000]
[Sat May 30 16:53:18 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 17:11:51 2020] asterisk[17795]: segfault at 8 ip 00007f298bb421d8 sp 00007f298a00b230 error 4 in chan_pjsip.so[7f298bb37000+1d000]
[Sat May 30 17:11:54 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 17:52:29 2020] asterisk[22201]: segfault at 8 ip 00007f5d277561d8 sp 00007f5ce67ca230 error 4 in chan_pjsip.so[7f5d2774b000+1d000]
[Sat May 30 17:52:32 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060. Sending cookies.  Check SNMP counters.
[Sat May 30 17:57:29 2020] asterisk[23031]: segfault at 8 ip 00007f5b2c7a11d8 sp 00007f5b2a975230 error 4 in chan_pjsip.so[7f5b2c796000+1d000]
{noformat}

every time the issue happens.

The SYN flood warnings appear after Asterisk segfaults, because every TCP-enabled client immediately tries to reconnect.
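
The kernel lines above can be checked for a common crash site by subtracting the module's load address (the number before the `+` in the brackets) from the instruction pointer. The following is a minimal sketch, not part of the original report; it assumes the x86-64 Linux segfault message format shown above and the usual x86 page-fault error-code bits (bit 0 = page present, bit 1 = write, bit 2 = user mode). The function names `parse_segfault` and `summarize` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Matches kernel segfault lines in the format seen in the log excerpt above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<module>\S+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return crash details for a segfault line, or None for any other line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "module": m["module"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped module.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


def summarize(lines):
    """Count crashes per (module, offset), skipping non-segfault lines."""
    hits = (parse_segfault(line) for line in lines)
    return Counter((h["module"], hex(h["offset"])) for h in hits if h)


# Two of the lines from the log excerpt above, plus one unrelated line.
sample = [
    "[Sat May 30 11:06:20 2020] asterisk[81358]: segfault at 8 ip 00007fd456c391d8 "
    "sp 00007fd3d159a230 error 4 in chan_pjsip.so[7fd456c2e000+1d000]",
    "[Sat May 30 11:06:44 2020] TCP: request_sock_TCP: Possible SYN flooding on port 7060.",
    "[Sat May 30 11:34:16 2020] asterisk[87569]: segfault at 8 ip 00007f56791cd1d8 "
    "sp 00007f56469be230 error 4 in chan_pjsip.so[7f56791c2000+1d000]",
]

print(summarize(sample))
```

Applied to the excerpt, every segfault line resolves to the same offset, 0xb1d8, inside chan_pjsip.so. Under the assumed bit layout, error 4 is a user-mode read of an unmapped page, and a fault address of 8 is consistent with reading a field at offset 8 through a NULL pointer.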

By: George Joseph (gjoseph) 2020-06-01 07:19:02.355-0500

The stack traces were very helpful, thanks.  Would it be possible to get a packet capture of a call that triggers the issue?

Did the segfaults happen on an earlier version of Asterisk?  If not, what was the last version that didn't have the segfaults?


By: Yury Kirsanov (lt_flash) 2020-06-01 08:08:45.589-0500

Hi George,
Thanks for your reply. Unfortunately, this happens completely at random; I can't capture a trace of any call that reliably causes this behaviour, and I'm not even sure it is T.38 fax related. I can provide more coredumps if required; the server usually crashes about 10-15 times on a busy day.

Yes, the segfaults were happening on Asterisk 16.9.0 as well, and on some earlier version too. I have updated Asterisk twice on the way to the current 16.10.0, so unfortunately I don't know of a stable version where this doesn't happen.

I can rebuild Asterisk with thread, malloc or any other debug (probably except for ref debug as it immediately creates a huge log that consumes all disk space) if required, please let me know if you want some other debug versions!

By: George Joseph (gjoseph) 2020-06-01 08:53:37.475-0500

I understand about the packet capture.  It's tough, especially on a heavily loaded system.

A few more "thread1" text files would help confirm that the issue always happens in the same place and is always fax related.

The biggest help will be if you can run ast_coredumper with the {{--tarball-coredumps}} option.  That will capture the asterisk binaries as well as the actual raw coredump which will allow us to examine it more closely.  You can do this on an existing coredump file if there's one still around...

{{ast_coredumper --tarball-coredumps --no-default-search <path to coredump>}}

The file will be big and contain sensitive information, so don't attach it to the issue.  Instead, upload it to a file hosting service like Google Drive, DropBox, etc. and send the link in an email to asteriskteam@digium.com with the subject "ASTERISK-28923: Coredumps".  We'll download the tarball and let you know when you can delete it from the hosting service.

By: Yury Kirsanov (lt_flash) 2020-06-01 08:57:30.780-0500

Thanks, no worries, I will do this as soon as I get a new coredump file, which will probably be tomorrow!

By: Yury Kirsanov (lt_flash) 2020-06-01 19:44:51.627-0500

Hi George,
I've just emailed the link with newest coredump archive. Thanks!

By: George Joseph (gjoseph) 2020-06-03 11:34:32.677-0500

[~lt_flash] I've attached a test patch that can be applied to Asterisk 16.10.  You can apply it with {{ patch -p1 < 0001-res_fax-Don-t-start-a-gateway-if-either-channel-is-h.patch }}.  Give it a try and let me know how it goes.  


By: Yury Kirsanov (lt_flash) 2020-06-03 12:20:16.410-0500

Patch has been applied successfully, I will wait for tomorrow and report how it goes, thanks!

By: Yury Kirsanov (lt_flash) 2020-06-04 02:51:02.243-0500

Hi George,
So far the patch looks very promising: we didn't have a single segfault all day today! I will monitor the server for a couple more days and report back next Monday, 8th of June, on whether we've had any more segfaults.

By: Yury Kirsanov (lt_flash) 2020-06-08 09:24:45.769-0500

Hi,
We have experienced NO segfaults since applying this patch on the 4th of June. I will deploy a non-debug build of Asterisk today, but I think the patch has fixed the issue. Thanks a lot for the prompt reply and the patch!

By: George Joseph (gjoseph) 2020-06-08 09:37:38.691-0500

Excellent.  We have patches up on Gerrit, but I don't think they'll make it into the next releases, which are scheduled for this week.  They will definitely be in the releases after that.

By: Yury Kirsanov (lt_flash) 2020-06-10 01:11:08.125-0500

George,
The patched version of Asterisk, built without the 'don't optimize' and debug flags, works just fine, so I assume the issue is resolved and this ticket can be closed. Thanks again for your help!

By: Friendly Automation (friendly-automation) 2020-06-10 14:00:49.086-0500

Change 14495 merged by George Joseph:
res_fax: Don't start a gateway if either channel is hung up

[https://gerrit.asterisk.org/c/asterisk/+/14495|https://gerrit.asterisk.org/c/asterisk/+/14495]

By: Friendly Automation (friendly-automation) 2020-06-10 14:01:00.291-0500

Change 14494 merged by George Joseph:
res_fax: Don't start a gateway if either channel is hung up

[https://gerrit.asterisk.org/c/asterisk/+/14494|https://gerrit.asterisk.org/c/asterisk/+/14494]

By: Friendly Automation (friendly-automation) 2020-06-10 14:01:13.370-0500

Change 14493 merged by George Joseph:
res_fax: Don't start a gateway if either channel is hung up

[https://gerrit.asterisk.org/c/asterisk/+/14493|https://gerrit.asterisk.org/c/asterisk/+/14493]

By: Friendly Automation (friendly-automation) 2020-06-10 14:01:25.492-0500

Change 14460 merged by George Joseph:
res_fax: Don't start a gateway if either channel is hung up

[https://gerrit.asterisk.org/c/asterisk/+/14460|https://gerrit.asterisk.org/c/asterisk/+/14460]