ASTERISK-29846: channels: bad ao2 ref causes crash

[Home]

Summary: ASTERISK-29846: channels: bad ao2 ref causes crash

Reporter: N A (InterLinked) Labels:

Date Opened: 2022-01-08 05:36:29.000-0600 Date Closed: 2022-10-20 05:47:36

Priority: Minor Regression?

Status: Closed/Complete Components: General

Versions: 18.9.0 Frequency of
Occurrence Frequent

Related
Issues:

Environment: Debian 10 Attachments: ( 0) core-asterisk-2022-02-01T17-04-52Z-brief.txt
( 1) core-asterisk-2022-02-01T17-04-52Z-full.txt
( 2) core-asterisk-2022-02-01T17-04-52Z-info.txt
( 3) core-asterisk-2022-02-01T17-04-52Z-locks.txt
( 4) core-asterisk-2022-02-01T17-04-52Z-thread1.txt
( 5) core-asterisk-2022-02-10T00-46-55Z-full.txt

Description: Currently, Asterisk crashes randomly every few days, not seemingly linked to anything going on at the moment. In fact, prior to this crash, the system was completely idle (no calls) for the previous 34 minutes.

Here is a backtrace from right before the last crash, useful as it has line numbers:

01:35:03 ERROR[13076] : Got 10 backtrace records
# 0: [0x55ca492928cc] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x55ca49109db3] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x55ca4910a73a] asterisk astobj2.c:501 __ao2_ref()
# 3: [0x55ca4917269a] asterisk cli.c:1218 handle_chanlist()
# 4: [0x55ca4917859f] asterisk cli.c:3020 ast_cli_command_full()
# 5: [0x55ca49178734] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 6: [0x55ca49101e0c] asterisk asterisk.c:1424 netconsole()
# 7: [0x55ca4928fb57] asterisk utils.c:1572 dummy_start()
# 8: [0x7f257c305fa3] libpthread.so.0 pthread_create.c:487 start_thread()
# 9: [0x7f257bd994cf] libc.so.6 clone.S:97 clone()

Comments: By: Asterisk Team (asteriskteam) 2022-01-08 05:36:30.679-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].
By: George Joseph (gjoseph) 2022-01-10 08:08:58.553-0600

Are you going to work this issue yourself?

By: N A (InterLinked) 2022-01-10 09:15:19.525-0600

Not sure of the cause at the moment, so not at this time, just reporting for now.
By: Joshua C. Colp (jcolp) 2022-01-11 12:44:51.926-0600

Do you have the actual full backtrace and an accompanying core debug log?
By: N A (InterLinked) 2022-01-11 12:53:39.649-0600

I do not. I was hoping this might be enough to clue somebody in on something...

I'm having an issue with this system where it seems like it doesn't dump a backtrace unless I manually start as asterisk -g in the /tmp directory (it runs as the asterisk user), even though coredumps=yes is in asterisk.conf. As such, that has already squandered a few opportunities where it crashed and there were no core dumps to process.

If a backtrace is required, I will wait until this happens and again and hopefully there is a core dump.
By: Joshua C. Colp (jcolp) 2022-01-11 12:56:09.092-0600

The given information doesn't really give enough insight, it merely states it that an ao2 object was likely destroyed. That's it. A full backtrace and log would be needed to provide the data on what was going on.
By: N A (InterLinked) 2022-01-11 14:18:28.272-0600

What about this part of the backtrace though?

```
3: [0x55ca4917269a] asterisk cli.c:1218 handle_chanlist()
4: [0x55ca4917859f] asterisk cli.c:3020 ast_cli_command_full()
5: [0x55ca49178734] asterisk cli.c:3048 ast_cli_command_multiple_full()
```

In any case, I'll be on the lookout for this and post further debugs/backtraces to the issue if and as I'm able to procure them.
By: Joshua C. Colp (jcolp) 2022-01-11 14:38:38.553-0600

Yes, it shows that it's in the handle_chanlist function. For example, though, the given line number doesn't match 18.9.0. An actual backtrace would better confirm things and show more information.
By: N A (InterLinked) 2022-01-18 13:59:20.653-0600

Here's another backtrace received 2 seconds after starting Asterisk today (recovering from a crash, may or may not have been caused by this same thing):

# 0: [0x562a20a259e1] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x562a2089cec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x562a2089d84f] asterisk astobj2.c:501 __ao2_ref()
# 3: [0x562a209057af] asterisk cli.c:1218 handle_chanlist()
# 4: [0x562a2090b6b4] asterisk cli.c:3020 ast_cli_command_full()
# 5: [0x562a2090b849] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 6: [0x562a20894f21] asterisk asterisk.c:1442 netconsole()
# 7: [0x562a20a22c6c] asterisk utils.c:1572 dummy_start()
# 8: [0x7fc80ea6bfa3] libpthread.so.0 pthread_create.c:487 start_thread()
# 9: [0x7fc80e4ff4cf] libc.so.6 clone.S:97 clone()

That is this line here:
ast_cli(a->fd, "%d call%s processed\n", ast_processed_calls(), ESS(ast_processed_calls()));
}
ao2_ref(channels, -1);

By: George Joseph (gjoseph) 2022-01-19 08:08:59.427-0600

Do you have an actual coredump we can examine? You'd need to use {{--enable-dev-mode}} and set {{ DO_CRASH}} in menuselect..
The next time it crashes, you'll then need to run ast_coredumper with the {{--tarball-coredumps}} option to capture everything. After that, host the tarball somewhere (Google Drive, DropBox, etc.) and email the link to asteriskteam <at> digium.com.

By: N A (InterLinked) 2022-01-19 08:38:58.801-0600

I've been trying to but I'm having difficulty in getting core dumps on this machine from chronic crashes. I think it has to do with a combination of Asterisk not running as root and what directory it starts in. I couldn't get a core dump, even manually by asterisk -g && asterisk -gr and then forcing a crash, with sysctl -w kernel.core_pattern=/var/crash, but when I changed it back to sysctl -w kernel.core_pattern=core, then I could, so maybe it will work better now. It could be that using /tmp was also a problem.

I don't see a DO_CRASH in menuselect, is that's what it's called? It's not in compiler flags with dev mode.
By: George Joseph (gjoseph) 2022-01-19 09:57:58.040-0600

It's in menuselect under {{Compiler Flags - Development}} third from the bottom. Only shows up if you use {{--enable-dev-mode}}.

By: N A (InterLinked) 2022-01-22 14:30:12.240-0600

Here are some additional backtraces from just restarting Asterisk (as part of a normal patch -> install -> restart). The signs are all definitely pointing towards PJSIP as causing the issue here:

Core debug is still 5.
[2022-01-22 20:28:08] ERROR[28680]: Got 11 backtrace records
# 0: [0x563ab7a0ca2c] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x563ab7883ec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x563ab788484f] asterisk astobj2.c:501 __ao2_ref()
# 3: [0x563ab7886f33] asterisk astobj2_container.c:494 ao2_iterator_init()
# 4: [0x563ab78ec063] asterisk cli.c:1152 handle_chanlist()
# 5: [0x563ab78f26ff] asterisk cli.c:3020 ast_cli_command_full()
# 6: [0x563ab78f2894] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 7: [0x563ab787bf21] asterisk asterisk.c:1442 netconsole()
# 8: [0x563ab7a09cb7] asterisk utils.c:1572 dummy_start()
# 9: [0x7fa8f2328fa3] libpthread.so.0 pthread_create.c:487 start_thread()
#10: [0x7fa8f1dbc4cf] libc.so.6 clone.S:97 clone()

[2022-01-22 20:28:08] ERROR[28680]: cli.c:1153 handle_chanlist: FRACK!, Failed assertion user_data is NULL (0)

[2022-01-22 20:28:09] == Manager registered action PJSIPShowAuths
[2022-01-22 20:28:09] ERROR[28680]: Got 10 backtrace records
# 0: [0x563ab7a0ca2c] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x563ab7883ec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x563ab78871cb] asterisk astobj2_container.c:563 __ao2_iterator_next()
# 3: [0x563ab78ec661] asterisk cli.c:1153 handle_chanlist()
# 4: [0x563ab78f26ff] asterisk cli.c:3020 ast_cli_command_full()
# 5: [0x563ab78f2894] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 6: [0x563ab787bf21] asterisk asterisk.c:1442 netconsole()
# 7: [0x563ab7a09cb7] asterisk utils.c:1572 dummy_start()
# 8: [0x7fa8f2328fa3] libpthread.so.0 pthread_create.c:487 start_thread()
# 9: [0x7fa8f1dbc4cf] libc.so.6 clone.S:97 clone()

[2022-01-22 20:28:09] ERROR[28680]: astobj2_container.c:499 ao2_iterator_restart: FRACK!, Failed assertion user_data is NULL (0)
[2022-01-22 20:28:09] == Manager registered action PJSIPShowAors

[2022-01-22 20:28:09] Loading res_pjsip_endpoint_identifier_ip.so.
[2022-01-22 20:28:09] ERROR[28680]: Got 11 backtrace records
# 0: [0x563ab7a0ca2c] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x563ab7883ec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x563ab7886fad] asterisk astobj2_container.c:499 ao2_iterator_restart()
# 3: [0x563ab78870df] asterisk astobj2_container.c:540 ao2_iterator_destroy()
# 4: [0x563ab78ec67f] asterisk cli.c:1204 handle_chanlist()
# 5: [0x563ab78f26ff] asterisk cli.c:3020 ast_cli_command_full()
# 6: [0x563ab78f2894] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 7: [0x563ab787bf21] asterisk asterisk.c:1442 netconsole()
# 8: [0x563ab7a09cb7] asterisk utils.c:1572 dummy_start()
# 9: [0x7fa8f2328fa3] libpthread.so.0 pthread_create.c:487 start_thread()
#10: [0x7fa8f1dbc4cf] libc.so.6 clone.S:97 clone()

[2022-01-22 20:28:09] ERROR[28680]: astobj2_container.c:540 ao2_iterator_destroy: FRACK!, Failed assertion user_data is NULL (0)
[2022-01-22 20:28:09] ERROR[28680]: Got 11 backtrace records
# 0: [0x563ab7a0ca2c] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x563ab7883ec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x563ab788484f] asterisk astobj2.c:501 __ao2_ref()
# 3: [0x563ab788710e] asterisk astobj2_container.c:541 ao2_iterator_destroy()
# 4: [0x563ab78ec67f] asterisk cli.c:1204 handle_chanlist()
# 5: [0x563ab78f26ff] asterisk cli.c:3020 ast_cli_command_full()
# 6: [0x563ab78f2894] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 7: [0x563ab787bf21] asterisk asterisk.c:1442 netconsole()
# 8: [0x563ab7a09cb7] asterisk utils.c:1572 dummy_start()
# 9: [0x7fa8f2328fa3] libpthread.so.0 pthread_create.c:487 start_thread()
#10: [0x7fa8f1dbc4cf] libc.so.6 clone.S:97 clone()

[2022-01-22 20:28:09] ERROR[28680]: cli.c:1216 handle_chanlist: FRACK!, Failed assertion user_data is NULL (0)
[2022-01-22 20:28:09] == res_pjsip_endpoint_identifier_ip.so => (PJSIP IP endpoint identifier)

[2022-01-22 20:28:09] ERROR[28680]: Got 10 backtrace records
# 0: [0x563ab7a0ca2c] asterisk utils.c:2727 __ast_assert_failed()
# 1: [0x563ab7883ec8] asterisk astobj2.c:212 log_bad_ao2()
# 2: [0x563ab788484f] asterisk astobj2.c:501 __ao2_ref()
# 3: [0x563ab78ec7fa] asterisk cli.c:1218 handle_chanlist()
# 4: [0x563ab78f26ff] asterisk cli.c:3020 ast_cli_command_full()
# 5: [0x563ab78f2894] asterisk cli.c:3048 ast_cli_command_multiple_full()
# 6: [0x563ab787bf21] asterisk asterisk.c:1442 netconsole()
# 7: [0x563ab7a09cb7] asterisk utils.c:1572 dummy_start()
# 8: [0x7fa8f2328fa3] libpthread.so.0 pthread_create.c:487 start_thread()
# 9: [0x7fa8f1dbc4cf] libc.so.6 clone.S:97 clone()

[2022-01-22 20:28:09] -- Remote UNIX connection disconnected

By: Joshua C. Colp (jcolp) 2022-01-22 15:45:28.605-0600

I disagree that those messages point to it being PJSIP. The log shows the CLI command being executed when Asterisk is being started, specifically at the point PJSIP stuff is loading. That doesn't mean it's PJSIP. It could be that the system is not yet completely initialized resulting in undefined behavior of the CLI command for listing channels.
By: Joshua C. Colp (jcolp) 2022-01-22 15:52:44.031-0600

I should probably be even more specific: In regards to the specific backtrace records you've provided today.
By: N A (InterLinked) 2022-01-22 15:53:39.726-0600

Hmm... so maybe PJSIP isn't causing this issue per se, but that still sounds like a bug. Should I open another issue for that?
I do see the Remote UNIX connection in there, which appears right as Asterisk is starting; fairly certain that's not anything I have. This was restarting Asterisk manually so it the monitor script wasn't even involved here.
By: Joshua C. Colp (jcolp) 2022-01-22 15:55:02.300-0600

In fact, looking at the implementation of ao2_iterator_init that is precisely the cause. The channel cache by name is not yet initialized/allocated when you're invoking the CLI command, causing ao2_iterator_init to FRACK when calling ao2_t_ref and producing the records.
By: Joshua C. Colp (jcolp) 2022-01-22 15:56:28.502-0600

You can if you want.
By: Joshua C. Colp (jcolp) 2022-01-22 15:57:52.173-0600

I should also add that even with the log you sent privately, without a backtrace it's a needle in a haystack and ultimately it's the system that produces and stores the core dump, not Asterisk, so there is limited things we can do.
By: Kevin Harwell (kharwell) 2022-01-24 13:49:41.404-0600

The misaligned source code numbers is an oddity. Is this a "clean" build of Asterisk or patched in some way? When you installed did you overwrite an old install? Is it possible "old", or other version built modules are being loaded?

Even so, I agree with [~jcolp]. Looking at the code the cause appears to be an uninitialized channel cache (at least for the last crash), which is initialized quite early in the process.

Given it's crashing in different places though my guess is either some kind of patch or versioning problem, or there is a memory corruption somewhere. I would suggest ensuring you have a clean build and install first. In order to help track down a potential corruption we'll need a full backtrace with [MALLOC_DEBUG|https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag] enabled.
By: N A (InterLinked) 2022-01-28 06:02:08.385-0600

All right, some potential more clues in this debug right before the latest crash at 06:36 this morning UTC.

[2022-01-28 06:35:26] DEBUG[5490][C-000001fd] channel.c: Deadlock avoided for write to channel 'Local/0@radio-feed-00000637;1'
[2022-01-28 06:35:26] VERBOSE[13222] res_pjsip/pjsip_options.c: Contact Modem1/sip:Modem1@REDACTED:35579;transport=TLS;x-ast-orig-host=192.168.10.101:6062 is$
[2022-01-28 06:35:26] VERBOSE[26746] res_pjsip/pjsip_options.c: Contact voipfone/sip:sip.voipfone.net is now Unreachable. RTT: 0.000 msec
[2022-01-28 06:35:28] DEBUG[13235] res_pjsip_outbound_registration.c: Received REGISTER response 200(OK)
[2022-01-28 06:35:29] DEBUG[13235] res_pjsip_outbound_registration.c: Processing REGISTER response 200 from server 'sip:sip.voipfone.net' for client 'sip:REDACTED@$
[2022-01-28 06:35:29] DEBUG[5491][C-000001fd] channel.c: Deadlock avoided for write to channel 'Local/0@radio-feed-00000637;2'
[2022-01-28 06:35:29] DEBUG[13235] res_pjsip_outbound_registration.c: Outbound registration to 'sip:sip.voipfone.net' with client 'sip:REDACTED@sip.voipfone.net' s$
[2022-01-28 06:35:30] DEBUG[13235] res_pjsip_outbound_registration.c: Scheduling outbound registration to server 'sip:sip.voipfone.net' from client 'sip:REDACTED@s$
[2022-01-28 06:35:30] DEBUG[13221] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:35:31] DEBUG[27311] chan_iax2.c: ip callno count decremented to 1 for REDACTED
[2022-01-28 06:35:33] DEBUG[13223] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:35:40] DEBUG[13202] threadpool.c: Worker thread idle timeout reached. Dying.
[2022-01-28 06:35:45] DEBUG[13220] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:35:46] DEBUG[30398] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:35:49] DEBUG[28180] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:02] DEBUG[5490][C-000001fd] channel.c: Deadlock avoided for write to channel 'Local/0@radio-feed-00000637;1'
[2022-01-28 06:36:05] DEBUG[13236] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:07] DEBUG[13246] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:07] DEBUG[13220] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:07] DEBUG[13238] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:07] DEBUG[27315] chan_iax2.c: ip callno count incremented to 2 for REDACTED
[2022-01-28 06:36:12] DEBUG[13245] res_pjsip_sips_contact.c: Upgrading contact URI on outgoing SIP request to SIPS due to SIPS Request URI
[2022-01-28 06:36:14] DEBUG[27321] sched.c: Attempted to delete nonexistent schedule entry 57!
[2022-01-28 06:36:15] DEBUG[27311] chan_iax2.c: ip callno count decremented to 1 for REDACTED
[2022-01-28 06:40:02] Asterisk 18.9.0 built by root @ voip on a x86_64 running Linux on 2021-12-27 21:51:01 UTC
[Jan 28 06:40:02] DEBUG[13784] config.c: Parsing /etc/asterisk/logger.conf

The "sched.c: Attempted to delete nonexistent schedule entry 57!" looks mighty suspicious here, especially given that was what happened immediately before the crash. I don't think I've seen that before.

This time, all my fiangling paid off and it DID dump a core, but I was somewhat of an idiot and ran my automated tool, and for some reason the automated POST request failed and then all the files were wiped out afterwards, so I don't have any backtrace to show for this one. I've modified that now to not delete the backtrace files, in case it fails.

When it crashes again in a few days (probably Feb. 1 or 2 by the pattern at this point), I'll be sure to get that backtrace uploaded here.
By: N A (InterLinked) 2022-01-28 06:08:10.593-0600

Sorry, I forgot to comment on this before:
The source code should only be different by a few lines, at most.

if (!concise) {
numchans = ast_active_channels();
ast_cli(a->fd, "%d active channel%s\n", numchans, ESS(numchans));
if (ast_option_maxcalls)
ast_cli(a->fd, "%d of %d max active call%s (%5.2f%% of capacity)\n",
ast_active_calls(), ast_option_maxcalls, ESS(ast_active_calls()),
((double)ast_active_calls() / (double)ast_option_maxcalls) * 100.0);
else
ast_cli(a->fd, "%d active call%s\n", ast_active_calls(), ESS(ast_active_calls()));

ast_cli(a->fd, "%d call%s processed\n", ast_processed_calls(), ESS(ast_processed_calls()));
}
ao2_ref(channels, -1);

return CLI_SUCCESS; <--- This is line 1218 for me:

I think it's off by 1 line, though, because the ao2_ref is line 1217, rather than 1218 as in those backtraces above, but it seems close enough?

By: N A (InterLinked) 2022-02-01 17:10:39.573-0600

Finally, *finally* I have prevailed! Only the 2nd backtrace I've gotten, and the first one I haven't accidentally deleted.

Asterisk crashed this morning, and at the time, I cd'd to /var/crash and then ran asterisk -g. It appears running as root and starting asterisk in the /var/crash directory finally allowed a core to dump successfully. Autorestarting via the cron job does the same thing but for some reason doesn't seem to work. Luckily, it took only a few hours for Asterisk to crash again (never thought I'd say that) and this time it did core dump.

Anyways, here is the backtrace: https://code.phreaknet.org/asterisk/backtrace.txt

Looks like an invalid free in the logger thread, which would explain why the issue seems to be random, looking at a debug level.

Thread 1 (Thread 0x7f3caf94a700 (LWP 19812)):
#0 0x00007f3cb1e387bb in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {134238211, 139898675710720, 5, 0, 0, 139898665703576, 139898620517696, 139898675737786, 94527391385472, 94527390831189, 94527391347888, 94527391347888, 94527391385472, 94527390831175, 94527391385472, 0}}
pid = <optimized out>
tid = <optimized out>
#1 0x00007f3cb1e23535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x25205d6435255b20, sa_sigaction = 0x25205d6435255b20}, sa_mask = {__val = {45253707243635, 0 <repeats 13 times>, 139898620517328, 139898620517584}}, sa_flags = -1349215280, sa_restorer = 0x1000}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007f3cb1e7a508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f3cb1f8528d "%sn") at ../sysdeps/posix/libc_fatal.c:181
ap = {{gp_offset = 24, fp_offset = 0, overflow_arg_area = 0x7f3caf949ce0, reg_save_area = 0x7f3caf949c70}}
fd = 2
list = <optimized out>
nlist = <optimized out>
cp = <optimized out>
written = <optimized out>
#3 0x00007f3cb1e80c1a in malloc_printerr (str=str@entry=0x7f3cb1f870a0 "free(): corrupted unsorted chunks") at malloc.c:5341
#4 0x00007f3cb1e828a2 in _int_free (av=0x7f3ca8000020, p=0x7f3ca8281150, have_lock=<optimized out>) at malloc.c:4348
size = 992
fb = <optimized out>
nextchunk = 0x7f3ca8281530
nextsize = 48
nextinuse = <optimized out>
prevsize = <optimized out>
bck = 0x7f3ca8000080
fwd = <optimized out>
__PRETTY_FUNCTION__ = "_int_free"
#5 0x000055f8df6786fa in __ast_free (ptr=0x7f3ca8281160, file=0x55f8df8d4141 "logger.c", lineno=187, func=0x55f8df8d57b8 <__PRETTY_FUNCTION__.16325> "logmsg_free") at astmm.c:1556
#6 0x000055f8df84e5e3 in logmsg_free (msg=0x7f3ca8281160) at logger.c:187
__PRETTY_FUNCTION__ = "logmsg_free"
#7 0x000055f8df853c3b in logger_thread (data=0x0) at logger.c:1885
next = 0x7f3ca84322e0
msg = 0x7f3ca8281160
__PRETTY_FUNCTION__ = "logger_thread"
#8 0x000055f8df7fea21 in dummy_start (data=0x55f8e0b63290) at utils.c:1572
By: Kevin Harwell (kharwell) 2022-02-01 18:07:41.828-0600

Is this a "clean" build of Asterisk or patched in some way? When you installed did you overwrite an old install? Is it possible "old", or other version built modules are being loaded?

Also, what's the output of the following:
{noformat}
*CLI> core show version
{noformat}
If you haven't already please execute the {{ast_coredumper}} script against the core, and attach all *.txt output files to this issue.

Again this is looking like a memory corruption, so enabling {{MALLOC_DEBUG}} could be helpful.
By: N A (InterLinked) 2022-02-01 18:15:37.610-0600

It's 18.9.0. It's not 100% clean, as there hasn't been an 18.10 yet so I have some unreleased patches included and some other modules that I haven't submitted for review, but nothing that seems immediately relevant off-hand. Most of those in question are dialplan apps, and given this happens randomly when the system has been idle for a long time, it can't have anything to do with something like a dialplan app. It's something in the core itself.

No old modules are being loaded. My install script automatically does rm -f /usr/lib/asterisk/modules/* whenever a new version is installed.

I'm happy to recompile with MALLOC_DEBUG if needed, though this is a prod system so not sure how much additional load that will add...

By: N A (InterLinked) 2022-02-01 18:16:39.625-0600

Attached are all the files produced by astcoredumper.
By: Joshua C. Colp (jcolp) 2022-02-02 10:18:14.496-0600

Your backtrace appears to contain a memory corruption. We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.

By: Kevin Harwell (kharwell) 2022-02-02 10:54:49.222-0600

{quote}
It's not 100% clean, as there hasn't been an 18.10 yet so I have some unreleased patches included and some other modules that I haven't submitted for review, but nothing that seems immediately relevant off-hand.
{quote}
The problem is there is just no way to know. Especially since it appears to be a memory corruption. One of your modules could be corrupting memory in some way even if it's not immediately obvious.

Issues can only be accepted if we know for sure the problem is occurring within code found within the current Asterisk code base as it would be nigh impossible to fix an issue in which we have no view into.

Until you can guarantee the issue is Asterisk only (full clean build with no private non submitted patches applied) there's not much we can do aside from offer ideas to help you debug things (see some of the above comments for ideas). And until such can occur I'd suggest moving this discussion to the one of the community forums (dev list, IRC, etc...) for more help.
By: N A (InterLinked) 2022-02-09 18:56:29.565-0600

Not sure if this is the same issue or something else, but here is the next successful backtrace I have gotten. Probably had a dozen crashes since the last backtrace where no core dump was left. Not sure if this would be related.
By: Joshua C. Colp (jcolp) 2022-02-10 04:50:53.050-0600

The latest backtrace is ASTERISK-29535.
By: N A (InterLinked) 2022-02-10 05:22:17.432-0600

Hmm, I don't use ARI at all, and pretty sure I disabled it; is that one preventable/disablable somehow?
This may be complicated further if there are several issues simultaneously causing crashes here.
By: Joshua C. Colp (jcolp) 2022-02-10 05:37:35.644-0600

That crash is not strictly ARI, the user just happened to use it. Any information about it would be on the JIRA issue.
By: N A (InterLinked) 2022-02-13 19:04:48.813-0600

Another crash half an hour ago, no core dump, but I had the CLI open when it crashed and these are the last things there - interestingly, a bunch of warnings (which is uncommon):

[2022-02-14 00:31:11] WARNING[25311]: chan_iax2.c:1203 iax_error_output: Information element length exceeds message size
[2022-02-14 00:31:11] WARNING[25243]: pjproject: <?>: SSL 6 [SSL_ERROR_ZERO_RETURN] (Read) ret: 0 len: 65535
[2022-02-14 00:31:12] WARNING[25311]: chan_iax2.c:10242 socket_process_helper: Undecodable frame received from 'REDACTED:4569'
[2022-02-14 00:31:13] WARNING[25311]: chan_iax2.c:1203 iax_error_output: Information element length exceeds message size
[2022-02-14 00:31:13] -- Contact REDACTED is now Unreachable. RTT: 0.000 msec
[2022-02-14 00:31:13] WARNING[25311]: chan_iax2.c:10242 socket_process_helper: Undecodable frame received from 'REDACTED:4569'
[2022-02-14 00:31:22] ERROR[9354]: res_pjsip.c:4861 endpt_send_request: Error 120032 'Broken pipe' sending OPTIONS request to endpoint REDACTED
[2022-02-14 00:31:29] == Endpoint REDACTED is now Unreachable

I can't recall seeing a broken pipe error in Asterisk before, but it seems like multiple factors are pointing towards this likely being a PJSIP-related issue.
By: Joshua C. Colp (jcolp) 2022-02-14 04:35:26.632-0600

Since I see TLS involved, this may indeed just be a duplicate of ASTERISK-29535.
By: N A (InterLinked) 2022-03-18 05:13:02.355-0500

The "case of the missing core dumps" continues...
I recompiled Asterisk with DO_CRASH and it continues to crash all right, but I'm still not getting any core dumps.

Enough disk space is available for the core dump
I have BETTER_BACKTRACES, DONT_OPTIMIZE, MALLOC_DEBUG, and DO_CRASH, among others.

Any ideas?
By: Joshua C. Colp (jcolp) 2022-03-18 05:36:17.663-0500

Core dumps are not an Asterisk thing, they're a system thing. There's things we can use to try to influence it (such as requesting a core dump occurs) but ultimately it's the system itself that does it/handles it. If you're not getting core dumps and Asterisk is actually segfaulting, then the system itself would be the place to look. I don't know anything further beyond that.
By: Kevin Harwell (kharwell) 2022-03-21 10:29:08.881-0500

[~InterLinked]

Double check the output of the following:
{noformat}
$ cat /proc/sys/kernel/core_pattern
{noformat}
That specifies where/how core files are written out. I have had it get overwritten on me before by something else (never figured out what).

Are there any files in {{/var/crash/}}? Depending you might need to use something like {{apport-unpack}} to extract the actual core from the file.
By: N A (InterLinked) 2022-03-21 10:41:04.052-0500

Thanks, I've tried playing with that in various ways.

I initially had it set to core, then /var/crash/core, then core again, and now recently back to /var/crash/core again.

I have a script that starts Asterisk if it detects a crash by first cd'ing to /var/crash so that is the working directory. Permissions on /var/crash are 777.

I'm on Debian, I tried installing the coredumpctl package but that didn't do it either.

/var/crash is always empty whenever I check it. However, if I get a segfault due to a "known" issue while I'm working on something, there is a core dump in there. But every time one of these "mysterious" crashes happens, there's nothing there.

By: Joshua C. Colp (jcolp) 2022-03-24 07:59:33.621-0500

What does it currently say/do when it occurs? Are you sure it's actually a crash?
By: N A (InterLinked) 2022-03-24 10:22:19.706-0500

Yes, it's a crash. Or at least, Asterisk is no longer running, and if that's not the result of a crash, not sure what else.

"It" doesn't say anything. I have a cron job that runs every 5 minutes that executes asterisk -rx "core show channels", and if it fails, it runs cd /var/crash && asterisk -g (the g is theoretically redundant, since I have the dump core option in asterisk.conf, but I'm not taking any chances...)

It then sends me an email so I know within 5 minutes if Asterisk crashed.

I also have recently added to the email to print out an ls -la /var/crash BEFORE starting it again, so I can see the contents, and in all of these mysterious crashes, it is simply empty. The res_calendar thing from the other day was an exception as I did get core dumps from those.

Additionally, if I get a segfault in a module that I wrote, I also get a core dump from that (typically while I am testing, so I can see in real time). So it's not like I can't get core dumps - I can and do - but never in these "mysterious" cases that happen randomly, a few times a week.

The only indication is that one minute Asterisk is running and the next it isn't, and I get an email.

Thus, it's almost as if whatever is causing the crash is somehow outside the scope of the processes that would generally result in a core dump. I don't really know what that would mean in the context of Asterisk or what type of issue that could indicate.

Really, I think the only possibly clue would be I get kept the debug log running continuously to a file, and that's what I initially did, but you indicated that wouldn't pinpoint the issue and indeed it seemed inconsistent in tracking that over time. Unfortunately it does not seem like we can really get anything further than those. Happy to try out any further suggestions though as at this point there is nothing to lose.
By: Kevin Harwell (kharwell) 2022-03-24 10:53:33.866-0500

Could there be some other external process that is killing Asterisk? For example if you issue a {{$ kill -9 <asterisk pid>}} I don't think it will dump the core even if configured to do so.
By: Kevin Harwell (kharwell) 2022-03-24 11:00:48.599-0500

{quote}
I have a cron job that runs every 5 minutes that executes asterisk -rx "core show channels", and if it fails, it runs cd /var/crash && asterisk -g (the g is theoretically redundant, since I have the dump core option in asterisk.conf, but I'm not taking any chances...)
{quote}
Just trying to think of anything here, but how does the script know the call fails? And if it does recognize some failure how does it know Asterisk itself is not running/executing still? Guessing it's checking for the PID?

Before starting Asterisk does the script check to see if Asterisk is not actually running (still running or restarted by something else)? Could it be possible the script is messing up and starting multiple instances of Asterisk? That could potentially cause odd behavior if so.
By: N A (InterLinked) 2022-03-24 11:07:10.889-0500

I don't have anything set up that would be killing Asterisk, for any reason.

Not sure if this answers your question, but here is the full script itself:

{noformat}
#!/bin/bash
size=`/usr/sbin/asterisk -rx 'core show channels' 2>&1 | grep 'Unable' | wc -l`
if [ $size -gt 0 ]
then
start=`/usr/sbin/asterisk -g` # get asterisk going again
printf "%s\r\n\t%s\r\n" "Greetings," "Asterisk was just discovered to be down due to temporary equipment trouble. The switch has automatically been started again. Please investigate the cause of the fault ASAP." | mail -s "Asterisk Crash Alert" -a "From: Asterisk <alerts@REDACTED>" "REDACTED"
fi
{noformat}

Cron job:
{noformat}
*/5 * * * * /etc/asterisk/scripts/asteriskmonitor.sh >/dev/null 2>&1
{noformat}

Basically I know that if Asterisk is down and you try running rasterisk, you get "Unable to connect to remote Asterisk", so I'm looking for that output, and inferring that Asterisk is not running from that.

I can confirm that if I get one of these, Asterisk was actually down. Occasionally I'll have no dial tone and be unable to connect to Asterisk from SSH, and sure enough then I get one of these emails within that 5 minute interval - as Asterisk was no longer running.
By: N A (InterLinked) 2022-03-24 11:09:42.829-0500

Actually that was the older version... here is the one I've been using recently:

{noformat}
#!/bin/bash
size=`/usr/sbin/asterisk -rx 'malloc trim' 2>&1 | grep 'Unable' | wc -l`
if [ $size -gt 0 ]
then
ls -la /var/crash > /tmp/core.txt
ulimit -c >> /tmp/core.txt
crash=`cat /tmp/core.txt`
cd /var/crash
ulimit -c unlimited
start=`/usr/sbin/asterisk -g` # get asterisk going again
printf "%s\r\n\t%s\r\n%s\r\n" "Greetings," "Asterisk was just discovered to be down due to temporary equipment trouble. The switch has automatically been started again. Please investigate the cause of the fault ASAP." "$crash" | mail -s "Asterisk Crash Alert" -a "From: Asterisk <alerts@REDACTED>" "REDACTED"
fi
{noformat}
By: Kevin Harwell (kharwell) 2022-03-24 11:41:06.770-0500

Probably should be alright, and doubtful it's the case here but I have seen it where Asterisk is running, but a remote console is unable to attach (can happen if the executing user does not have permissions to write the asterisk.pid file).

Aslo if you are not aware there is already an execute, monitor, check exit status, copy any core files, and restart script in the code base: {{contrib/scripts/safe_asterisk}}
By: N A (InterLinked) 2022-03-24 20:13:40.681-0500

Here's the output of an email I received 2 minutes ago...

{noformat}
Greetings,
Asterisk was just discovered to be down due to temporary equipment trouble. The switch has automatically been started again. Please investigate the cause of the fault ASAP.
total 8
drwxrwxrwx 2 asterisk asterisk 4096 Mar 23 10:33 .
drwxr-xr-x 15 root root 4096 Jan 18 19:54 ..
0
{noformat}

As can be seen, /var/crash is completely empty, and the directory itself was "last modified" yesterday morning (server time is in UTC).

As far as those other things go, the cron job is run as root so permissions would not be an issue there.

I'll check out the other scripts, I know that safe_asterisk exists but didn't know about monitor.

This was something quick and dirty I put together when I realized that Asterisk would crash and hours would go by and I'd start getting nasty emails and phone calls ;)
By: Joshua C. Colp (jcolp) 2022-03-27 17:15:22.504-0500

Does it show up in dmesg as having crashed?
By: N A (InterLinked) 2022-03-29 19:35:31.897-0500

Ah, and t'was there the rub:

{noformat}
[976049.815002] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=asterisk,pid=18565,uid=1004
[976049.818972] Out of memory: Killed process 18565 (asterisk) total-vm:842964kB, anon-rss:214332kB, file-rss:0kB, shmem-rss:0kB, UID:1004 pgtables:872kB oom_score_adj:0
[976049.973479] oom_reaper: reaped process 18565 (asterisk), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
{noformat}

So that would explain it all, as you said... Asterisk is causing the system to run out of memory, or at least is the most memory-intensive application at such a time.

I will try adding some more swap and see how that goes. In any case, at least the mystery of the missing backtraces is solved - thanks for that tip!

Back on the original subject of this thread, the PJSIP issue that causes that assert every now and then still happens occasionally. I imagine with DO_CRASH, that *would* cause a crash.
So this issue can remain open for that at this point, I think.
By: Kevin Harwell (kharwell) 2022-03-30 13:50:26.588-0500

I might be missing something, but I'm not seeing any info with regards to a PJSIP crash occurring.

Any problem, crash or otherwise, needs to be fully differentiated from the system running out of memory. When a system gets close to the memory limit odd things can occur. Please ensure the crash occurs when the system is not out of memory.

If the crash is happening outside of that then on next crash please run {{ast_coredumper}} against the core, and attach the *.txt output here.

And yes, DO_CRASH should force Asterisk to crash on an Asterisk related assert.
By: N A (InterLinked) 2022-04-30 14:07:57.767-0500

Here is another one, with slightly more detail:

Not sure if it's PJSIP related, but this one here has been consistent for months and there is certainly a bug with the channel list:

{noformat}
[2022-04-30 18:08:42] NOTICE[12062]: loader.c:2389 load_modules: 316 modules will be loaded.
[2022-04-30 18:08:43] ERROR[12204]: Got 11 backtrace records
# 0: asterisk __ast_assert_failed()
# 1: asterisk <unknown>()
# 2: asterisk __ao2_ref()
# 3: asterisk ao2_iterator_init()
# 4: asterisk <unknown>()
# 5: asterisk ast_cli_command_full()
# 6: asterisk ast_cli_command_multiple_full()
# 7: asterisk <unknown>()
# 8: asterisk <unknown>()
# 9: [0x51e5fa3] libpthread.so.0 pthread_create.c:487 start_thread()
#10: [0x5794eff] libc.so.6 clone.S:97 clone()

==12062==
==12062== Process terminating with default action of signal 6 (SIGABRT): dumping core
==12062== at 0x56D37BB: raise (raise.c:51)
==12062== by 0x56BE534: abort (abort.c:79)
==12062== by 0x31A440: ast_do_crash (utils.c:2699)
==12062== by 0x31A4DF: __ast_assert_failed (utils.c:2728)
==12062== by 0x17EEFB: log_bad_ao2 (astobj2.c:212)
==12062== by 0x17F87E: __ao2_ref (astobj2.c:501)
==12062== by 0x182085: ao2_iterator_init (astobj2_container.c:492)
==12062== by 0x1EAFD5: handle_chanlist (cli.c:1152)
==12062== by 0x1F1DDA: ast_cli_command_full (cli.c:3106)
==12062== by 0x1F1F82: ast_cli_command_multiple_full (cli.c:3133)
==12062== by 0x1738F4: netconsole (asterisk.c:1441)
==12062== by 0x3175F5: dummy_start (utils.c:1572)
{noformat}
By: George Joseph (gjoseph) 2022-05-02 06:07:23.318-0500

Still looks like {{core show channels}} is being run before Asterisk is fully initialized.

https://issues.asterisk.org/jira/browse/ASTERISK-29846?focusedCommentId=257824&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-257824

You could probably fix this yourself by adding code like...
{code}
while (!ast_test_flag(&ast_options, AST_OPT_FLAG_FULLY_BOOTED)) {
usleep(100);
}
{code}
to cli.c/handle_chanlist before any container access is attempted.

By: Friendly Automation (friendly-automation) 2022-09-09 20:41:27.872-0500

Change 19086 merged by Friendly Automation:
cli: Prevent assertions on startup from bad ao2 refs.

[https://gerrit.asterisk.org/c/asterisk/+/19086|https://gerrit.asterisk.org/c/asterisk/+/19086]
By: Friendly Automation (friendly-automation) 2022-09-09 20:41:35.672-0500

Change 19085 merged by Friendly Automation:
cli: Prevent assertions on startup from bad ao2 refs.

[https://gerrit.asterisk.org/c/asterisk/+/19085|https://gerrit.asterisk.org/c/asterisk/+/19085]
By: Friendly Automation (friendly-automation) 2022-09-09 20:41:38.760-0500

Change 18523 merged by Friendly Automation:
cli: Prevent assertions on startup from bad ao2 refs.

[https://gerrit.asterisk.org/c/asterisk/+/18523|https://gerrit.asterisk.org/c/asterisk/+/18523]
By: Friendly Automation (friendly-automation) 2022-09-09 20:41:46.732-0500

Change 19084 merged by Friendly Automation:
cli: Prevent assertions on startup from bad ao2 refs.

[https://gerrit.asterisk.org/c/asterisk/+/19084|https://gerrit.asterisk.org/c/asterisk/+/19084]
By: Friendly Automation (friendly-automation) 2022-09-09 20:41:50.421-0500

Change 19087 merged by Friendly Automation:
cli: Prevent assertions on startup from bad ao2 refs.

[https://gerrit.asterisk.org/c/asterisk/+/19087|https://gerrit.asterisk.org/c/asterisk/+/19087]