[Home]

Summary:ASTERISK-29698: Segfault if sorcery object_lifetime_maximum and qualify_frequency the same value
Reporter:Alexei Gradinari (alexei gradinari)Labels:patch
Date Opened:2021-10-21 17:49:29Date Closed:2022-01-28 13:29:30.000-0600
Priority:MinorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip Resources/res_sorcery_memory_cache
Versions:16.21.1 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bt.txt
( 1) bt2.txt
( 2) core-brief.txt
( 3) core-full.txt
( 4) core-full.txt
( 5) core-locks.txt
( 6) core-thread1.txt
( 7) core-thread1.txt
( 8) include_asterisk_schedh.patch
( 9) main_schedc.patch
(10) memcachetrace.patch
(11) refs-0x1a642c8.txt
(12) trace.xz
Description:sorcery.conf:
[res_pjsip]
endpoint=realtime,ps_endpoints
endpoint/cache=memory_cache,maximum_objects=3000,expire_on_reload=yes,object_lifetime_maximum=30
aor=realtime,ps_aors
aor/cache=memory_cache,maximum_objects=3000,expire_on_reload=yes,object_lifetime_maximum=30

The endpoint with aor's static contact and qualify_frequency = 30

I think one thread 'sorcery_memory_cache' clears cache due to expiry
and another thread 'pjsip_options' uses the same cache object which is already freed.
Comments:By: Asterisk Team (asteriskteam) 2021-10-21 17:49:30.429-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Alexei Gradinari (alexei gradinari) 2021-10-21 17:50:59.475-0500

backtrace of SIGSEGV thread

By: Benjamin Keith Ford (bford) 2021-10-22 08:05:33.445-0500

Is this only an issue with realtime or are you able to replicate it with configuration files as well?

By: Alexei Gradinari (alexei gradinari) 2021-10-22 08:26:25.963-0500

There are only realtime endpoints/aor/contacts in this configuration.
I can setup test server with file configuration to check this issue.


By: Alexei Gradinari (alexei gradinari) 2021-10-22 08:33:31.353-0500

With realtime and object_lifetime_maximum=60 and qualify_frequency = 30 no segfaults more then 15 hours.
With realtime and object_lifetime_maximum=30 and qualify_frequency = 30  there were segfaults every 15-30 minutes.

Total cached endpoints: 17
of them 12 with static contacts and 5 with dynamic.


By: Alexei Gradinari (alexei gradinari) 2021-10-22 10:53:13.849-0500

Tested with file configuration without realtime.
The result the same - segfault, same backtrace.


By: Alexei Gradinari (alexei gradinari) 2021-10-22 10:55:45.341-0500

without realtime backtrace of SIGSEGV thread

By: Joshua C. Colp (jcolp) 2021-10-25 05:07:15.362-0500

Can you please provide a FULL backtrace instead of for just a single thread?

By: Benjamin Keith Ford (bford) 2021-10-25 11:57:27.270-0500

Unfortunately the backtrace doesn't reveal much. Created an issue for this to investigate further.

By: Alexei Gradinari (alexei gradinari) 2021-10-25 15:07:14.325-0500

I see it like this:

The thread 'pjsip_options' in function sorcery_memory_cache_create locks cache->objects then calls remove_from_cache and calls schedule_cache_expiration which sets cache->del_expire=1

The thread 'sorcery_memory_cache'  trywrlock, but there is lock and the cache->del_expire==1, so call ao2_ref(cache, -1); and call sorcery_memory_cache_destructor which calls ast_heap_destroy (ALL THIS WITHOUT LOCK)

Then thread 'pjsip_options' calls ast_heap_peek, BUT heap is already destroyed by sorcery_memory_cache_destructor

By: Alexei Gradinari (alexei gradinari) 2021-10-25 15:13:10.817-0500

I think it was regression when tried to fix this issue ASTERISK-25441
with this changes https://gerrit.asterisk.org/c/asterisk/+/1360


By: Joshua C. Colp (jcolp) 2021-11-09 14:07:14.702-0600

Can you elaborate on what you think above? In schedule_cache_expiration the reference to cache is only unreferenced if it is successfully deleted from the scheduler. If the scheduled item is running then the unref in AST_SCHED_DEL_UNREF is never executed, and instead the unref is done in expire_objects_from_cache. It's one or the other - not both - so there would be no double unreference.

This reference is incremented in the first place when the scheduled item for expire_objects_from_cache is added.

By: Alexei Gradinari (alexei gradinari) 2021-11-09 15:44:30.584-0600

I tried to run on test server with refdebug = yes and on startup got another segfault in __ao2_ref

astobj2.c:658
if (ref_log && !(obj->priv_data.options & AO2_ALLOC_OPT_NO_REF_DEBUG)) {

but obj is already freed, look at line 642
ast_free(obj);

Should I create a new issue or you'll do it?


By: Joshua C. Colp (jcolp) 2021-11-09 16:24:41.634-0600

You should.

By: Alexei Gradinari (alexei gradinari) 2021-11-09 16:47:40.475-0600

Attached refs-0x1a642c8.txt which contains refdebug for cache=cache@entry=0x1a642c8
I hope this help.


By: Alexei Gradinari (alexei gradinari) 2021-11-09 17:27:44.215-0600

I think the issue with ao2_ref(cache, -1) at line 513
I think it's not needed.

By: Joshua C. Colp (jcolp) 2021-11-09 19:00:34.611-0600

It is needed if the intent of the code in question is correct. Without it there would be a memory leak of the cache, because the reference would never be released in the scenario. That is according to the original intent of the code. I don't believe simply removing it is the correct fix, it has to be understood in greater detail precisely the order of things. For example based on the log it may be that schedule_cache_expiration is getting called twice, and scheduling cache expiration to occur twice. That's not supposed to happen - which would mean that the core issue is it getting scheduled twice in the first place.

By: Alexei Gradinari (alexei gradinari) 2021-11-10 15:40:37.501-0600

Both threads 'pjsip_options' and 'sorcery_memory_cache' call schedule_cache_expiration.
Wasn't it intended to do so?

By: Joshua C. Colp (jcolp) 2021-11-11 03:56:35.193-0600

The pjsip_options code doesn't directly know about the cache. The cache may do things as a result of actions by pjsip_options.

By: Michael Bradeen (mbradeen) 2021-11-11 11:55:57.470-0600

patch for additional tracing in res_sorcery_memory_cache.c

By: Michael Bradeen (mbradeen) 2021-11-11 11:59:58.213-0600

[~alexei gradinari] - We so far have not been able to reproduce this locally.  Please try the provided patch then enable trace logging, re-produce the issue then forward us the resulting logs and backtrace

https://issues.asterisk.org/jira/secure/attachment/60881/memcachetrace.patch

By: Alexei Gradinari (alexei gradinari) 2021-11-11 14:01:00.232-0600

[~mbradeen] - I'm on vacation. I will be able to test it only after 2 weeks.

By: Alexei Gradinari (alexei gradinari) 2021-11-23 11:58:56.397-0600

res_sorcery_memory_cache trace logging
SIGSEGV in 22908

By: Michael Bradeen (mbradeen) 2021-12-08 10:42:55.530-0600

[~alexei gradinari] - Please try the attached scheduler patches to see if this resolves the segfault.

By: Alexei Gradinari (alexei gradinari) 2021-12-09 10:27:50.884-0600

The uptime is 22.5 hours on test server with the scheduler patches.


By: Alexei Gradinari (alexei gradinari) 2021-12-10 14:27:38.049-0600

No segfaults since 2021-12-08 13pm on test server with the scheduler patches.

By: Friendly Automation (friendly-automation) 2022-01-28 13:29:32.119-0600

Change 17924 merged by Friendly Automation:
sched: fix and test a double deref on delete of an executing call back

[https://gerrit.asterisk.org/c/asterisk/+/17924|https://gerrit.asterisk.org/c/asterisk/+/17924]

By: Friendly Automation (friendly-automation) 2022-01-28 13:34:56.593-0600

Change 17923 merged by Friendly Automation:
sched: fix and test a double deref on delete of an executing call back

[https://gerrit.asterisk.org/c/asterisk/+/17923|https://gerrit.asterisk.org/c/asterisk/+/17923]

By: Friendly Automation (friendly-automation) 2022-01-28 13:35:31.410-0600

Change 17644 merged by Friendly Automation:
sched: fix and test a double deref on delete of an executing call back

[https://gerrit.asterisk.org/c/asterisk/+/17644|https://gerrit.asterisk.org/c/asterisk/+/17644]

By: Friendly Automation (friendly-automation) 2022-01-28 13:36:56.518-0600

Change 17922 merged by Joshua Colp:
sched: fix and test a double deref on delete of an executing call back

[https://gerrit.asterisk.org/c/asterisk/+/17922|https://gerrit.asterisk.org/c/asterisk/+/17922]