[Home]

Summary:ASTERISK-29786: cdr.c: FRACK! causing crashes
Reporter:Jon Sparks (jrsparks86)Labels:
Date Opened:2021-12-03 08:35:54.000-0600Date Closed:2021-12-22 12:00:01.000-0600
Priority:MinorRegression?
Status:Closed/CompleteComponents:CDR/General Core/Stasis
Versions:18.6.0 18.7.0 18.7.1 18.8.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Attachments:( 0) core.ip-10-10-12-153.us-east-2.compute.internal-2021-12-03T14-42-43+0000-brief.txt
( 1) core.ip-10-10-12-153.us-east-2.compute.internal-2021-12-03T14-42-43+0000-full.txt
( 2) core.ip-10-10-12-153.us-east-2.compute.internal-2021-12-03T14-42-43+0000-info.txt
( 3) core.ip-10-10-12-153.us-east-2.compute.internal-2021-12-03T14-42-43+0000-locks.txt
( 4) core.ip-10-10-12-153.us-east-2.compute.internal-2021-12-03T14-42-43+0000-thread1.txt
( 5) log
Description:We are running 18.8.0 and experiencing a couple strange crashes with Asterisk. Here are the fracks we are seeing.
{noformat}
FRACK!, Failed assertion bad magic number 0x0 for object 0xffff80059bd8 (0) - channel.c - ast_waitfor_nandfds
FRACK!, Failed assertion user_data is NULL (0) - cdr.c - handle_channel_snapshot_update_message - 2327/2334
FRACK!, Failed assertion 0 (0) - cdr.c - handle_channel_snapshot_update_message - 2302
FRACK!, Failed assertion strcasecmp(snapshot->base->name, cdr->party_a.snapshot->base->name) == 0 (0) - cdr.c -base_process_party_a - 1542
{noformat}


A few of the latest backtraces:
{noformat}
Got 12 backtrace records
# 0: [0xaaaab64a1270] asterisk utils.c:2583 __ast_assert_failed()
# 1: [0xaaaab64b7594] asterisk utils.h:708 _ast_assert()
# 2: [0xaaaab64bb418] asterisk cdr.c:1545 base_process_party_a()
# 3: [0xaaaab64be61c] asterisk cdr.c:2312 handle_channel_snapshot_update_message()
# 4: [0xaaaab6468ad8] asterisk stasis_message_router.c:202 router_dispatch()
# 5: [0xaaaab644caec] asterisk stasis.c:787 subscription_invoke()
# 6: [0xaaaab644e514] asterisk stasis.c:1267 dispatch_exec_async()
# 7: [0xaaaab6481b20] asterisk taskprocessor.c:1235 ast_taskprocessor_execute()
# 8: [0xaaaab647e260] asterisk taskprocessor.c:201 default_tps_processing_function()
# 9: [0xaaaab649db68] asterisk utils.c:1428 dummy_start()
#10: [0xffff836614fc] libpthread.so.0 pthread_create.c:477 start_thread()
#11: [0xffff8336d67c] libc.so.6 :0 clone()
{noformat}

{noformat}
Got 12 backtrace records
# 0: [0xaaaad2dc7270] asterisk utils.c:2583 __ast_assert_failed()
# 1: [0xaaaad2c08178] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0xaaaad2c08270] asterisk astobj2.c:224 __ao2_lock()
# 3: [0xaaaad2de4738] asterisk cdr.c:2328 handle_channel_snapshot_update_message()
# 4: [0xaaaad2d8ead8] asterisk stasis_message_router.c:202 router_dispatch()
# 5: [0xaaaad2d72aec] asterisk stasis.c:787 subscription_invoke()
# 6: [0xaaaad2d74514] asterisk stasis.c:1267 dispatch_exec_async()
# 7: [0xaaaad2da7b20] asterisk taskprocessor.c:1235 ast_taskprocessor_execute()
# 8: [0xaaaad2da4260] asterisk taskprocessor.c:201 default_tps_processing_function()
# 9: [0xaaaad2dc3b68] asterisk utils.c:1428 dummy_start()
#10: [0xffff978f44fc] libpthread.so.0 pthread_create.c:477 start_thread()
#11: [0xffff9760067c] libc.so.6 :0 clone()
{noformat}

{noformat}
Got 11 backtrace records
# 0: [0xaaaad2dc7270] asterisk utils.c:2583 __ast_assert_failed()
# 1: [0xaaaad2ddd594] asterisk utils.h:708 _ast_assert()
# 2: [0xaaaad2de458c] asterisk cdr.c:2302 handle_channel_snapshot_update_message()
# 3: [0xaaaad2d8ead8] asterisk stasis_message_router.c:202 router_dispatch()
# 4: [0xaaaad2d72aec] asterisk stasis.c:787 subscription_invoke()
# 5: [0xaaaad2d74514] asterisk stasis.c:1267 dispatch_exec_async()
# 6: [0xaaaad2da7b20] asterisk taskprocessor.c:1235 ast_taskprocessor_execute()
# 7: [0xaaaad2da4260] asterisk taskprocessor.c:201 default_tps_processing_function()
# 8: [0xaaaad2dc3b68] asterisk utils.c:1428 dummy_start()
# 9: [0xffff978f44fc] libpthread.so.0 pthread_create.c:477 start_thread()
#10: [0xffff9760067c] libc.so.6 :0 clone()
{noformat}

I’m seeing different issues with each frack. Two things I have noticed:

When “Failed assertion strcasecmp” occurs it seems to wait until the last call ends and it does a core dump and acts like a graceful restart.
When “Failed assertion user_data is NULL” or “Failed assertion 0” occurs it is more likely the server will crash with active calls not graceful.

I have yet to be able to reproduce it in our dev environment or find a root cause. In our prod environment the fracks were created 27,000 times in the last 24 hours. We believe it’s related to hangups with the stasis app.

Comments:By: Asterisk Team (asteriskteam) 2021-12-03 08:35:55.679-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Jon Sparks (jrsparks86) 2021-12-03 08:37:47.335-0600

I have few weeks of data around this issue. Let me know what would be the most helpful and I will get it gathered for you.

By: Joshua C. Colp (jcolp) 2021-12-03 08:40:31.588-0600

Please attach a full log file showing it, as well as describing what exactly you are doing with Asterisk.

By: Joshua C. Colp (jcolp) 2021-12-03 08:42:25.273-0600

Thank you for the crash report. However, we need more information to investigate the crash. Please provide:

1. A backtrace generated from a core dump using the instructions provided on the Asterisk wiki [1].
2. Specific steps taken that lead to the crash.
3. All configuration information necesary to reproduce the crash.

Thanks!

[1]: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

And as you mentioned crash, we also need a backtrace.

By: Jon Sparks (jrsparks86) 2021-12-03 11:25:03.126-0600

I attached the logs from the most recent hard crash. I will attempt my best to describe what is happening. We have multiple Asterisk servers that do many different things. Currently we are attempting to stay under 15 calls per server and cycling them constantly because of this issue. Call load or uptime doesn’t seem to matter. The server can be up for 5 minutes to 12 hours before the crash. In the case of the graceful restarts the frack could happen hours before the server restarts.

In the case of these failures the most common part is we have calls in a conference room. Then we place a new call into the conference room that runs a stasis app. That app could be doing a mixture of things. The case of the logs provided we are listening to the call to collect data about it.

Our belief is when the hangup occurs that there is a race condition that Asterisk freaks out about or we are crossing some kind of limit not known to us.

Looking at these logs before I sent them I noticed stasis/m:cdr:aggregator-00000005 is pretty interesting with the queue and max queue. I can look back to see if that is similar to other crashes.

By: Joshua C. Colp (jcolp) 2021-12-03 11:39:30.354-0600

There are backtraces attached, but no Asterisk logs themselves.

By: Jon Sparks (jrsparks86) 2021-12-03 11:56:17.243-0600

Let me see what I can do about that because there are 27,000 log entries with data I can't just upload.

By: Jon Sparks (jrsparks86) 2021-12-06 13:30:45.882-0600

We were able to recreate it in our dev environment. Gathering the log for you now.

One interesting note:
Unable to allocate chan channel structure - core_unreal.c - ast_unreal_new_channels - 1263

By: Jon Sparks (jrsparks86) 2021-12-06 13:50:49.850-0600

The full log from the recent crash in our dev environment. We are in the process of digging into it deeper.

By: Joshua C. Colp (jcolp) 2021-12-06 13:55:47.873-0600

What is this log from? It doesn't contain thread identifiers, and it's very confusingly organized and displayed.

By: Joshua C. Colp (jcolp) 2021-12-08 07:22:56.106-0600

As well, are you generating your own channel unique ids when creating/originating channels and guaranteeing they are unique? I believe that's the problem here - the requests are using the same uniqueid for multiple simultaneous channels. They have to be unique.

By: Joshua C. Colp (jcolp) 2021-12-08 07:29:55.961-0600

The full ARI requests would help to see in that regard. I think you are hanging up a channel and expecting it to occur immediately when it's asynchronous, then afterwards the same uniqueid is used to originate a new channel before the Local channels are fully hung up causing a conflict and going into off-nominal territory.

By: Asterisk Team (asteriskteam) 2021-12-22 12:00:00.843-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines