ASTERISK-29130: prometheus: Crash when scraping bridge

Summary: ASTERISK-29130: prometheus: Crash when scraping bridge

Reporter: Francisco Correia (fcorreia)
Labels:

Date Opened: 2020-10-16 08:27:21
Date Closed: 2021-04-02 07:37:47

Priority: Minor
Regression?

Status: Closed/Complete
Components: Resources/General

Versions: 18.0.0
Frequency of Occurrence: Constant

Related Issues:
is duplicated by ASTERISK-29378 res_prometheus: Crash when scraping bridges and creating a bridge at the same time

is related to ASTERISK-29374 res_prometheus: Crash when scraping channels

Environment: OS: CentOS 7 Version: asterisk-18.0.0-rc2.tar.gz

Attachments:
( 0) core.1383-brief.txt
( 1) core.1383-full.txt
( 2) core.1383-info.txt
( 3) core.1383-locks.txt
( 4) core.1383-thread1.txt

Description: The Prometheus resource triggers a segmentation fault while building the metrics response during an active call in a Stasis application.

{noformat}
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -f -g -C /etc/asterisk/asterisk.conf'.
Program terminated with signal 11, Segmentation fault.
#0 bridges_scrape_cb (response=0x7f337c0ccc78) at prometheus/bridges.c:129
129 PROMETHEUS_METRIC_SET_LABEL(&bridge_metrics[index], 1, "id", (snapshot->uniqueid));
{noformat}

Comments: By: Asterisk Team (asteriskteam) 2020-10-16 08:27:22.085-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].
By: Joshua C. Colp (jcolp) 2020-10-16 08:30:49.742-0500

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that lead to the crash.
3. All configuration information necessary to reproduce the crash.

Thanks!

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Francisco Correia (fcorreia) 2020-10-16 09:02:13.512-0500

Steps to Reproduce
- Prometheus resource enabled and configured on 8088/metrics, with a Prometheus server scraping metrics at a 30-second interval

1) Call enters a Stasis application
2) Stasis application starts playing music on hold on the caller channel
3) Stasis application dials one of the dummy extensions with a 10-second timeout
4) Stasis application repeats the dial until some extension answers

Asterisk crashes.
I think the crash is a consequence of the Prometheus server invoking the metrics endpoint on Asterisk while a call is active.
By: George Joseph (gjoseph) 2021-03-31 15:50:34.105-0500

We've had another report of this but I think I know what's going on...
In channels_scrape_cb() we get the channel cache, then get the count of channels in it. We then use the count to calculate how many channel_metric elements we need.
Then we iterate over the channel cache, all without locking it. I think the number of channels in the cache went UP between the time we got the count and the time we started iterating. Since we only had enough metric elements for the original count, we ran out and attempted to set elements that were actually beyond the original allocation.

The same thing happens in bridges_scrape_cb().
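
The count-then-iterate hazard described above can be illustrated with a small standalone sketch. This is not the res_prometheus code: `toy_snapshot`, `toy_metric`, and `toy_fill_metrics` are hypothetical names invented for illustration. The `capacity` argument models the count taken before iteration, and `live_count` models the number of entries actually seen while iterating.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached snapshot; not Asterisk's real type. */
struct toy_snapshot {
    const char *uniqueid;
};

/* Hypothetical stand-in for one metric slot. */
struct toy_metric {
    const char *id_label;
};

/*
 * Fill at most `capacity` metric slots from `live_count` live entries.
 * `capacity` is the size the metric array was allocated with (the stale
 * count); `live_count` is what the iteration actually encounters.
 * Returns how many entries did not fit; 0 means the counts agreed.
 */
size_t toy_fill_metrics(struct toy_metric *metrics, size_t capacity,
                        const struct toy_snapshot *live, size_t live_count)
{
    size_t index;
    size_t skipped = 0;

    for (index = 0; index < live_count; index++) {
        if (index >= capacity) {
            /* Without this check, the write below would land past the
             * end of the allocation, which is the suspected crash. */
            skipped++;
            continue;
        }
        metrics[index].id_label = live[index].uniqueid;
    }

    return skipped;
}
```

With two allocated slots and three live entries, the function reports one entry that would otherwise have been written out of bounds. A bounds check like this only hides the symptom, since entries are silently dropped; iterating over a stable copy of the container avoids the mismatch altogether.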
By: George Joseph (gjoseph) 2021-04-01 08:52:43.090-0500

[~fcorreia], [~sduthil], [~bweschke]: There are fix reviews up on gerrit (see above) if any of you want to give them a try.

By: Friendly Automation (friendly-automation) 2021-04-02 07:37:49.031-0500

Change 15719 merged by Friendly Automation:
res_prometheus: Clone containers before iterating

[https://gerrit.asterisk.org/c/asterisk/+/15719|https://gerrit.asterisk.org/c/asterisk/+/15719]
By: Friendly Automation (friendly-automation) 2021-04-02 07:39:04.907-0500

Change 15724 merged by Friendly Automation:
res_prometheus: Clone containers before iterating

[https://gerrit.asterisk.org/c/asterisk/+/15724|https://gerrit.asterisk.org/c/asterisk/+/15724]
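
The title of the merged changes, "Clone containers before iterating", suggests the general technique sketched below: take a private copy of the shared container first, then derive both the count and the iteration from that copy so the two cannot disagree. This is a toy illustration under stated assumptions, not the merged patch. `toy_container`, `toy_container_clone`, and `toy_scrape` are hypothetical names, and a real implementation would take the copy under whatever lock protects the shared container.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared container of bridge/channel ids. */
struct toy_container {
    const char **ids;
    size_t count;
};

/* Copy src into dst. Returns 0 on success, -1 on allocation failure.
 * In real code this copy would be made while holding the container lock. */
int toy_container_clone(const struct toy_container *src,
                        struct toy_container *dst)
{
    dst->ids = NULL;
    dst->count = 0;

    if (src->count == 0) {
        return 0;
    }

    dst->ids = malloc(src->count * sizeof(*dst->ids));
    if (!dst->ids) {
        return -1;
    }

    memcpy(dst->ids, src->ids, src->count * sizeof(*dst->ids));
    dst->count = src->count;
    return 0;
}

/* Build an array of labels from a private clone of `shared`.
 * Returns the number of labels, or -1 on allocation failure.
 * The caller frees *labels_out. */
long toy_scrape(const struct toy_container *shared, const char ***labels_out)
{
    struct toy_container snap;
    const char **labels;
    size_t i;

    *labels_out = NULL;

    if (toy_container_clone(shared, &snap) != 0) {
        return -1;
    }
    if (snap.count == 0) {
        return 0;
    }

    /* Size the output from the clone's count, then iterate the same clone. */
    labels = calloc(snap.count, sizeof(*labels));
    if (!labels) {
        free(snap.ids);
        return -1;
    }

    for (i = 0; i < snap.count; i++) {
        labels[i] = snap.ids[i];
    }

    free(snap.ids);
    *labels_out = labels;
    return (long) snap.count;
}
```

Because the allocation size and the loop bound both come from `snap`, entries added to the shared container after the clone is taken cannot push the loop past the allocation; they simply appear in the next scrape.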