Summary: ASTERISK-29037: segfault in opus codec stats transmission
Reporter: Andrew Yager (andrewyager)
Labels:
Date Opened: 2020-08-20 10:16:47
Date Closed: 2020-08-24 05:01:56
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: Codecs/codec_opus
Versions: 13.35.0
Frequency of Occurrence: Constant
Related Issues:
Environment: CentOS 7.8
Attachments:
(0) core.6345-brief.txt
(1) core.6345-thread1.txt
Description: Apparently during codec decrease count and license usage telemetry, we see a segfault generated.

Comments:

By: Asterisk Team (asteriskteam) 2020-08-20 10:16:48.646-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Andrew Yager (andrewyager) 2020-08-20 10:19:39.687-0500

Backtrace attached. Asterisk was compiled with better backtraces and don't optimise enabled (BETTER_BACKTRACES and DONT_OPTIMIZE).

By: Andrew Yager (andrewyager) 2020-08-20 10:26:56.362-0500

A little further to this: it seems that this bug is only triggered when a massive number of TCP sessions are being set up/established at around the same time, such as immediately after an Asterisk restart. The major issue with this is that a restart (whether intended or the result of an unexpected crash) could lead an Opus call to attempt to send its telemetry, which then fails, possibly because something else is going on elsewhere in libcurl.

I've left Asterisk stable for 5 - 10 minutes and then triggered an Opus call, and have not been able to replicate the crash in that case; but if I restart Asterisk and then start an Opus call within the first 2 - 3 minutes, while all the peers are qualifying, the segfault will trigger.

By: Kevin Harwell (kharwell) 2020-08-20 13:25:26.758-0500

Unfortunately, given the current data, this does not appear to be a crash in Asterisk itself, but further down in a supporting library. However, there is the possibility that bad data was passed into the curl method.

Any kind of errors or warnings in the log prior to crashing?

Could you install the debug symbols for libcurl and possibly the other libraries involved, and then run the ast_coredumper script [1] on it after a crash?

Lastly, in the meantime you could also try reporting the issue to, or contacting, the projects behind the other libraries involved to see if they have more information or an already reported similar issue.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Andrew Yager (andrewyager) 2020-08-22 06:22:25.765-0500

The issue is likely this one:

https://ludovicrousseau.blogspot.com/2020/01/new-version-of-pcsc-lite-1826.html

which has been fixed in a current build of pcsc-lite. Will do some further investigating to confirm.

By: Andrew Yager (andrewyager) 2020-08-22 06:32:03.532-0500

Actually, I can probably just disable pcscd altogether, as I suspect that libnss is calling it unnecessarily on this system. Thanks for the pointer.

By: Joshua C. Colp (jcolp) 2020-08-24 04:46:32.103-0500

I'm assigning this back to you for feedback since you believe the issue is actually with another library. If you confirm that, please comment so that it goes into the history in case someone else runs into the same issue.

By: Andrew Yager (andrewyager) 2020-08-24 05:00:30.448-0500

This _specific_ issue is fixed in the pcsc-lite library as of this version: https://ludovicrousseau.blogspot.com/2020/01/new-version-of-pcsc-lite-1826.html.

The underlying issue is that the version of pcsc-lite shipping with CentOS 7 uses select() on a file descriptor (FD) that allows it to talk to a smartcard library. While initialising the NSS library to open an SSL/TLS connection, the OS will attempt to load keys/details from a connected device, and so attempts to open access to a smartcard reader, even if none exists. In this case, the select() call in the shipped version of the library fails under load because the process has more than 1024 open connections, and select() cannot handle file descriptors numbered 1024 (the usual FD_SETSIZE) or higher. poll() does not have this limitation, and the updated version of pcsc-lite switches to it to address this issue.
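For anyone who wants to see the select() versus poll() difference in isolation, here is a minimal Python sketch. It is not taken from pcsc-lite, NSS or Asterisk. The helper names (high_numbered_fd, try_select, try_poll) are made up for illustration, the value 1024 is assumed to match FD_SETSIZE on the system, and the process is assumed to be allowed to raise its open-file soft limit slightly above 1024.

```python
import fcntl
import os
import resource
import select

FD_SETSIZE = 1024  # assumed; the usual glibc value, not read from system headers


def high_numbered_fd(min_fd=FD_SETSIZE):
    """Return a readable pipe fd numbered >= min_fd, plus all fds to close later."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = min_fd + 16
    if soft != resource.RLIM_INFINITY and soft < wanted:
        # Raise the soft limit so descriptors numbered above min_fd can exist.
        new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    read_end, write_end = os.pipe()
    os.write(write_end, b"x")  # make the read end readable
    # F_DUPFD returns the lowest free descriptor numbered >= min_fd.
    high = fcntl.fcntl(read_end, fcntl.F_DUPFD, min_fd)
    return high, [high, read_end, write_end]


def try_select(fd):
    """Wait on fd with select(); report the error if the fd is out of range."""
    try:
        readable, _, _ = select.select([fd], [], [], 0)
    except ValueError as exc:
        return f"failed: {exc}"
    return "ok" if fd in readable else "not ready"


def try_poll(fd):
    """Wait on fd with poll(), which has no FD_SETSIZE ceiling."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(0)  # timeout in milliseconds; 0 returns immediately
    ready = any(f == fd and ev & select.POLLIN for f, ev in events)
    return "ok" if ready else "not ready"


if __name__ == "__main__":
    fd, to_close = high_numbered_fd()
    print("fd number:", fd)
    print("select():", try_select(fd))
    print("poll():  ", try_poll(fd))
    for f in to_close:
        os.close(f)
```

Python refuses the out-of-range descriptor with a ValueError. In C, passing such a descriptor to FD_SET is undefined behaviour, which is consistent with a crash rather than a clean error, although this sketch does not reproduce the segfault itself.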

The fix, in our case, was simply to remove the pcsc library set altogether. Anecdotally, this also improved the performance of the entire TLS stack (presumably because we've reduced the call stack for any encrypted call by having less to do); and it is a firm reminder to understand what the libraries on your system are there for, and to remove any components that aren't required.

By: Andrew Yager (andrewyager) 2020-08-24 05:01:56.796-0500

The issue is with a third-party library interacting with libnss/curl and is not related to Asterisk.