ASTERISK-29376: res_rtp_asterisk: Coredump with t.140 RED enabled

[Home]

Summary: ASTERISK-29376: res_rtp_asterisk: Coredump with t.140 RED enabled

Reporter: Thomas Johnson (tom5870) Labels:

Date Opened: 2021-03-30 08:35:00 Date Closed:

Priority: Minor Regression?

Status: Open/New Components: Resources/res_rtp_asterisk

Versions: 17.8.1 Frequency of
Occurrence Frequent

Related
Issues:

Environment: RedHat 8 Attachments: ( 0) core.asterisk.2431.ipswitchdev66.microautomation.local.1617722782-brief.txt
( 1) core.asterisk.2431.ipswitchdev66.microautomation.local.1617722782-full.txt
( 2) core.asterisk.2431.ipswitchdev66.microautomation.local.1617722782-info.txt
( 3) core.asterisk.2431.ipswitchdev66.microautomation.local.1617722782-locks.txt
( 4) core.asterisk.2431.ipswitchdev66.microautomation.local.1617722782-thread1.txt
( 5) core.asterisk.296402.ipswitchdev66.microautomation.local.1622643818-brief.txt
( 6) debug_log_123456.gz
( 7) full
( 8) temp.zip

Description: Asterisk will randomly coredump if RED is allowed. The exception is in an apparent scheduled callback to red_write. I tried modifying the function to check for NULL pointers, but it still crashes due to possibly an invalid pointer.

Comments: By: Asterisk Team (asteriskteam) 2021-03-30 08:35:02.500-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].
By: Joshua C. Colp (jcolp) 2021-03-30 08:42:03.560-0500

Please attach backtraces and information. This can be done after filing by going to More and then Attach Files.

Additionally Asterisk 17 no longer receives bug fixes. Current supported bug fix versions are 16 and 18 and as well T.140 is generally community supported.
By: Thomas Johnson (tom5870) 2021-03-30 14:06:46.628-0500

Any suggestions on why I am not getting a coredump, only the log entry I posted originally. Asterisk is run with -g

root 524458 1 0 10:27 ? 00:00:00 /bin/sh /usr/sbin/safe_asterisk -U asterisk -G asterisk
asterisk 524656 2878 0 10:27 ? 00:01:05 node /var/www/html/admin/modules/core/node/fastagi-server.js
asterisk 524892 2878 0 10:27 ? 00:01:14 letschat
asterisk 596462 1 0 14:38 ? 00:00:00 dirmngr --daemon --homedir /home/asterisk/.gnupg
asterisk 612432 524458 20 15:04 ? 00:00:09 /usr/sbin/asterisk -f -U asterisk -G asterisk -vvvg -c
asterisk 612488 2878 4 15:04 ? 00:00:02 node /var/www/html/admin/modules/ucp/node/index.js

By: George Joseph (gjoseph) 2021-03-31 08:09:31.704-0500

You might be getting them but just not in a place you expect. Try running {{sysctl kernel.core_pattern}} to see where the kernel is placing them. You can also try running {{coredumpctl}}

By: Thomas Johnson (tom5870) 2021-04-06 10:36:26.760-0500

Backtrace from the core dump of the crash.
By: Thomas Johnson (tom5870) 2021-04-06 13:09:50.382-0500

Per my original ticket, I had already changed the line to:

if (rtp!=NULL && rtp->red!=NULL && rtp->red->t140.datalen > 0) {

And it still crashes.
By: Sean Bright (seanbright) 2021-04-06 13:13:18.889-0500

Sorry, I don't see your original ticket.
By: Sean Bright (seanbright) 2021-04-06 16:01:03.276-0500

Are you able to load the core file with gdb? If so, what is the output of {{print *rtp}}?
By: Thomas Johnson (tom5870) 2021-04-06 19:22:37.619-0500

I just get "No symbol "rtp" in current context". Not familiar with using gdb at all, sorry

By: Sean Bright (seanbright) 2021-04-07 08:17:33.880-0500

Try this. If it doesn't work, then don't worry about it. Run {{gdb}} like this:

{noformat}
$ gdb /path/to/asterisk /path/to/core
<gdb loads some stuff here>
(gdb) print *rtp
{noformat}
By: Thomas Johnson (tom5870) 2021-04-07 08:30:56.511-0500

Still get No symbol "rtp" in current context.

By: Sean Bright (seanbright) 2021-04-07 08:32:19.879-0500

OK Thanks
By: Kevin Harwell (kharwell) 2021-04-09 11:53:52.702-0500

[~tom5870] Would you be willing to send the core file itself? If so run something like the following:
{noformat}
$ /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps --no-default-search <path_to_coredump>
{noformat}
And please send the resulting tarball to asteriskteam@digium.com with the issue number in the subject. For large files please send a link where they can then be downloaded.

Please also note though:
{panel:title=Private Submission of Information Disclaimer} Note that submission of such information does not change the priorization of this issue. By sending this information you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/. Any exchange of private information between you and an Asterisk community member outside of the Asterisk JIRA is not subject to the Website Terms of Use and should be privately discussed between yourself and the Asterisk community member.{panel}
By: Thomas Johnson (tom5870) 2021-04-09 12:28:41.307-0500

I sent a link to that E-mail address, the file was too big to send via E-mail.
By: Kevin Harwell (kharwell) 2021-04-09 13:23:20.397-0500

Thanks got the link, and downloaded. However I am unable to load it as it appears the stack trace appears corrupted. Probably has something to do with:
{noformat}
warning: exec file is newer than core file.
{noformat}
when loading things into _gdb_.

I think that's why you are getting:
{noformat}
(gdb) print *rtp
No symbol "rtp" in current context.
{noformat}

Does this crash occur often? Next time it happens please execute the _ast_coredumper_ command again right afterwards on the core file. Then try [~seanbright]'s recommendation again:
{noformat}
$ gdb /path/to/asterisk /path/to/core
<gdb loads some stuff here>
(gdb) print *rtp
{noformat}
And post the output.

Thanks!
By: Thomas Johnson (tom5870) 2021-04-10 12:28:58.777-0500

(gdb) print *rtp
Cannot access memory at address 0x1

Sent a new link to the ast_coredumper output in a Email to the above Email.

By: Kevin Harwell (kharwell) 2021-04-12 12:04:58.629-0500

I've loaded the new core file and am still seeing some oddities. As well the output from the backtrace(s) do not seem to be "lining up" with the source code from the reported version. Are you running a custom patched version of Asterisk? Or have you modified the source in any other way?

Also the version you are running, while it's still receiving security fixes (for a few more months at least), it is no longer receiving bug fixes.

[Please upgrade to a most recent, and fully supported version of Asterisk|https://www.asterisk.org/downloads/asterisk/all-asterisk-versions/] (I recommend 18), ensure Asterisk has not been altered in any way from the base source (no custom patches, etc...) and then see if you are still able to reproduce the problem.
By: Thomas Johnson (tom5870) 2021-04-12 13:25:14.333-0500

With Asterisk 18.3, print *rtp on the core file gives this huge result:

{noformat}
$1 = {s = 275, f = {frametype = 0, subclass = {integer = 0, {format = 0x2257bf8, topology = 0x2257bf8}, frame_ending = 0}, datalen = 0, samples = 0, mallocd = 0,
mallocd_hdr_len = 0, offset = 0, src = 0x0, data = {ptr = 0x0, uint32 = 0, pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 0, tv_usec = 0},
frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0, seqno = 0, stream_num = 0}, rawdata = '\000' <repeats 8255 times>, ssrc = 1640019621, ssrc_orig = 0,
ssrc_saved = 0 '\000', cname = "740819dc-27b2-48ce-9d02-00e3b99287f3", themssrc = 0, themssrc_valid = 0, lastts = 0, lastividtimestamp = 0,
lastovidtimestamp = 0, lastitexttimestamp = 0, lastotexttimestamp = 0, lastrxseqno = 0, expectedrxseqno = -1, missing_seqno = {elems = 0x0, max = 0,
current = 0}, expectedseqno = -1, seedrxseqno = 0, seedrxts = 0, rxcount = 0, rxoctetcount = 0, txcount = 0, txoctetcount = 0, cycles = 0, rxjitter = 0,
rxtransit = 0, lasttxformat = 0x2257bf8, lastrxformat = 0x2257bf8, resp = 0 '\000', last_seqno = 0, last_end_timestamp = 0, dtmf_duration = 0,
dtmf_timeout = 0, dtmfsamples = 0, dtmfmode = AST_RTP_DTMF_MODE_NONE, lastdigitts = 0, sending_digit = 0 '\000', send_digit = 0 '\000', send_payload = 0,
send_duration = 0, flags = 8, rxcore = {tv_sec = 0, tv_usec = 0}, txcore = {tv_sec = 0, tv_usec = 0}, drxcore = 0, dtmfmute = {tv_sec = 0, tv_usec = 0},
smoother = 0x0, seqno = 28245, sched = 0x44a8c50, rtcp = 0x7fd1dc051260, asymmetric_codec = 0, bundled = 0x0, stream_num = -1, ssrc_mapping = {
elems = 0x7fd1dc052600, max = 1, current = 0}, bind_address = {ss = {ss_family = 2, __ss_padding = "8x", '\000' <repeats 115 times>, __ss_align = 0},
len = 16}, strict_rtp_state = STRICT_RTP_LEARN, strict_rtp_address = {ss = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0},
len = 0}, rtp_source_learn = {proposed_address = {ss = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}, len = 0}, start = {
tv_sec = 1618251061, tv_usec = 606695}, received = {tv_sec = 0, tv_usec = 0}, max_seq = 0, packets = 4, stream_type = AST_MEDIA_TYPE_UNKNOWN}, red = 0x0,
send_buffer = 0x0, recv_buffer = 0x0, transport_wide_cc = {packet_statistics = {elems = 0x0, max = 0, current = 0}, last_seqno = 0, last_extended_seqno = 0,
feedback_count = 0, cycles = 0, schedid = -1}, cond = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, __g1_start32 = {
__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}},
__size = '\000' <repeats 47 times>, __align = 0}, ice = 0x0, role = AST_RTP_ICE_ROLE_CONTROLLED, turn_rtp = 0x0, turn_rtcp = 0x0,
turn_state = PJ_TURN_STATE_NULL, passthrough = 0, rtp_passthrough = 0, rtcp_passthrough = 0, ice_port = 14456, rtp_loop = {ss = {ss_family = 0,
__ss_padding = '\000' <repeats 117 times>, __ss_align = 0}, len = 0}, rtcp_loop = {ss = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>,
__ss_align = 0}, len = 0}, ioqueue = 0x0, remote_ufrag = '\000' <repeats 255 times>, remote_passwd = '\000' <repeats 255 times>,
local_ufrag = "3cd463d67157fc3d0b32fcf3055484e3", '\000' <repeats 223 times>, local_passwd = "2a99b3bb0c5b6e7050830b8d777e3432", '\000' <repeats 223 times>,
ice_local_candidates = 0x7fd1dc0b24c0, ice_active_remote_candidates = 0x0, ice_proposed_remote_candidates = 0x0, ice_original_rtp_addr = {ss = {ss_family = 2,
__ss_padding = "8x", '\000' <repeats 115 times>, __ss_align = 0}, len = 16}, ice_num_components = 2, ice_media_started = 0, ssl_ctx = 0x0,
dtls_verify = AST_RTP_DTLS_VERIFY_NONE, suite = 0, local_hash = AST_RTP_DTLS_HASH_SHA256, local_fingerprint = '\000' <repeats 159 times>,
remote_hash = AST_RTP_DTLS_HASH_SHA256, remote_fingerprint = '\000' <repeats 63 times>, rekey = 0, rekeyid = -1, dtls = {ssl = 0x0, read_bio = 0x0,
write_bio = 0x0, dtls_setup = AST_RTP_DTLS_SETUP_ACTIVE, connection = AST_RTP_DTLS_CONNECTION_NEW, timeout_timer = -1}}
{noformat}

Core output was sent to the E-mail.

By: Sean Bright (seanbright) 2021-04-12 13:47:40.336-0500

I know you have said that you have tried this, but could you make this change to line 8448 or {{res_rtp_asterisk.c}}, run {{make && make install}} and see if you are able to reproduce?

{noformat}
if (rtp->red && rtp->red->t140.datalen > 0) {
{noformat}

By: Thomas Johnson (tom5870) 2021-04-12 13:58:04.091-0500

Yes, it reproduces with that line. From what I saw earlier it looked like the pointer was 0x000..001 and not null.

By: Sean Bright (seanbright) 2021-04-12 14:01:50.900-0500

No I mean - can you reproduce it and send the {{print *rtp}} and {{print *rtp->red}} output with that change in place?
By: Thomas Johnson (tom5870) 2021-04-12 14:26:13.737-0500

(gdb) print *rtp
Cannot access memory at address 0x1
(gdb) print *rtp->red
Cannot access memory at address 0x2409

By: Sean Bright (seanbright) 2021-04-12 14:28:44.163-0500

Thanks
By: Sean Bright (seanbright) 2021-04-12 14:30:38.897-0500

That's a different crash. Odd.

Clearly this is something easy for you to reproduce - can you provide instructions so that I can try to reproduce here?
By: Thomas Johnson (tom5870) 2021-04-12 14:55:17.300-0500

I have the following values for my chan_sip calls
textsupport=yes
allow=red
allow=t140

After that, I have a piece of code that dials 20 calls one after the other. I answer all of them one after the other and start hanging them up after answering all 20. In the process of answering and hanging up, the Asterisk would crash. When the calls are answered by our Softphone, it sends the UTF BOM (\xFEFF). SipCon1 does this also, but only can do one call at a time. It might take longer to crash just doing a single call at a time.

By: Kevin Harwell (kharwell) 2021-04-12 15:39:23.739-0500

Could you bump up the debug level to 5 (see how to [here|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information]), and reproduce and attach the log file here? Maybe something will show just before the crash
By: Thomas Johnson (tom5870) 2021-04-12 15:56:07.656-0500

Here is the debug log.
By: Kevin Harwell (kharwell) 2021-04-12 16:26:19.082-0500

Thanks unfortunately the logger is unable to keep up at that level:
{noformat}
[2021-04-12 16:50:19] WARNING[893335] logger: Log queue threshold (1000) exceeded. Discarding new messages.
[2021-04-12 16:50:19] WARNING[893148] logger: Logging resumed. 1160 messages discarded.
{noformat}
Maybe try putting the debug at level 3? I want to see if it outputs a certain debug statement.
By: Thomas Johnson (tom5870) 2021-04-13 07:04:51.660-0500

Here it is a debug level 3.
By: Kevin Harwell (kharwell) 2021-04-14 10:27:53.862-0500

From what I can tell from the debug log your system seems to be a bit overloaded. Or maybe there is some other problem causing resources to diminish. At least it appears chan_sip eventually starts to struggle with things. You might want to investigate into what's happening and why the system is getting stressed. It's possible this triggers the situation that causes the eventual crash. If that's the case then fixing the one might help you avoid the other.

But of course Asterisk shouldn't be crashing so I'll go ahead and open this up. However please note as @jcolp stated:
{quote}
T.140 is generally community supported
{quote}
This means that a community developer _may_ work the issue when time and resources become available.
By: Thomas Johnson (tom5870) 2021-04-15 10:27:33.881-0500

I have crashed it with single calls also, but was running 20 simultaneous calls before so it can happen faster. I understand it may take a while to get fixed.
By: Thomas Johnson (tom5870) 2021-04-27 15:14:21.110-0500

Wanted to check in and see if anyone had a chance to look into this further. I understand that T.140 is a community supported issue, but this is kind of a big issue for us which is stopping us to go live with T.140.

Thanks
By: Joshua C. Colp (jcolp) 2021-04-27 15:22:53.778-0500

If anyone is looking at an issue or has comments they usually leave comments on the issue. As there are none, it is unlikely anyone is looking at this.
By: Sean Bright (seanbright) 2021-04-30 15:30:22.823-0500

So I labbed this up and sent over 100k T.140/RED enabled calls to it and never ran in to this bug. If you want to move this forward you will need to have someone look at your live environment, add some tracing to the code, etc.
By: Thomas Johnson (tom5870) 2021-04-30 15:42:27.450-0500

Can you provide us with any custom code files (with any logging statements needed) which we can replace in our environment, recompile and run the tests again?
By: Thomas Johnson (tom5870) 2021-05-07 18:35:27.860-0500

Just asking again, since I'm not familiar with all the Asterisk code, can someone provide us custom code with the logging statements added?

By: Thomas Johnson (tom5870) 2021-06-02 10:06:40.132-0500

I modified red_write to add some logging. However, I am confused, the crash (from the brief.txt) occurs on the if statement, but it was able to log after the crash. I've attached a temp.zip which includes the asterisk log, and the modified res_rtp_astersik.c from Asterisk 18.4. Also attached the brief since the whole .tar.gz is too big. Any suggestions?

I seem to be able to consistently crash Asterisk by placing two t140 RED calls answering both and the dropping the first one. The crash does seem to be in the drop. My thought is the pointer was freed and the scheduled task still ran, but if you look at the modified res_rtp_asterisk, the logs saying it was being freed were not logged.

By: Kevin Harwell (kharwell) 2021-06-02 12:38:21.405-0500

{quote}
I modified red_write to add some logging. However, I am confused, the crash (from the brief.txt) occurs on the if statement, but it was able to log after the crash. I've attached a temp.zip which includes the asterisk log, and the modified res_rtp_astersik.c from Asterisk 18.4.
....
{quote}
I'm assuming multiple threads are accessing the write function at the same time, so one thread could have logged the statement output, and then another thread actually caused the crash. That would explain why it appears to have logged after the crash.

Something else to maybe try perhaps is compiling Asterisk with [MALLOC_DEBUG|https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag]. See the following as to why it might be helpful: https://www.asterisk.org/using-malloc_debug-find-memory-misuse/
By: Thomas Johnson (tom5870) 2021-06-02 13:56:19.187-0500

I tried recompiling, but doesn't seem to be helpful.

Per my previous comment it is even easier, Asterisk crashes if the call is on hold when it disconnects from either side.

By: Thomas Johnson (tom5870) 2021-06-02 17:00:45.565-0500

A co-worker pointed out that I hadn't checked the mmlog, which just has" id-New Session" in it for each time Asterisk has started since making the MALLOC_DEBUG change.

By: Kevin Harwell (kharwell) 2021-06-02 17:12:44.715-0500

Probably won't find anything useful in the mmlog. However, it might output a new warning in the main Asterisk log/debug log and or search new backtraces for {{0xdeaddead}}.
By: Thomas Johnson (tom5870) 2021-06-02 19:33:23.934-0500

Do not see it in the backtraces or asterisk log. I had set verbose to 255 and debug to 255 before testing.

By: Thomas Johnson (tom5870) 2021-06-03 07:42:42.006-0500

Correct that, not seeing the last 'd', but I do see this:

[2021-06-03 08:37:20] ERROR[508116] res_rtp_asterisk.c: FRACK!, Failed assertion bad magic number 0xdeaddea for object 0x7f8bd011eaf0 (0)
[2021-06-03 08:37:20] ERROR[508116] : Got 9 backtrace records
# 0: [0x5dae00] asterisk utils.c:2454 __ast_assert_failed()
# 1: [0x465b35] asterisk astobj2.c:220 log_bad_ao2()
# 2: [0x465b9b] asterisk astobj2.c:224 __ao2_lock()
# 3: [0x7f8b981c4f5b] res_rtp_asterisk.so res_rtp_asterisk.c:8548 red_write()
# 4: [0x58df4f] asterisk sched.c:793 ast_sched_runq()
# 5: [0x7f8bca3e129f] chan_sip.so chan_sip.c:30051 do_monitor()
# 6: [0x5d8199] asterisk utils.c:1299 dummy_start()
# 7: [0x7f8c1333715a] libpthread.so.0 :0 __libpthread_freeres()
# 8: [0x7f8c12884f73] libc.so.6 :0 clone()
By: Thomas Johnson (tom5870) 2021-06-03 09:39:10.848-0500

I attached a full file with the statements similar to above in it, didn't save the log from that test before continuing.
By: Thomas Johnson (tom5870) 2021-06-03 11:15:22.366-0500

After looking at that last full log and the code, I have the following workaround:
static int red_write(const void *data)
{
/* do not reschedule if the pointers were NULL */
int ret=1;
struct ast_rtp_instance *instance = (struct ast_rtp_instance*) data;
struct ast_rtp *rtp = ast_rtp_instance_get_data(instance);
ao2_lock(instance);
if (rtp) {
if (rtp->red) {
if (rtp->red->t140.datalen > 0) {
ast_rtp_write(instance, &rtp->red->t140);
}
} else {
ret=0;
}
} else {
ret=0;
}
ao2_unlock(instance);

return ret;
}

I don't consider this the proper fix to submit it. So, please continue to provide comments when possible.

By: Thomas Johnson (tom5870) 2021-06-10 12:07:23.460-0500

Just checking in to see if the above code recommendation made sense. My understanding is that this is not the right way of fixing it.

Did anyone have any thoughts on this?

By: Kevin Harwell (kharwell) 2021-06-10 12:49:54.940-0500

You've been running the above for a while and it hasn't crashed again?

If that's the case then given that a previous suggestion of
{quote}
if (rtp->red && rtp->red->t140.datalen > 0) {
{quote}
did not work (i.e. crashes still occurred), but your code above does then the only difference I see between the two is the check to see if {{rtp}} is NULL or not.

Is what you have the right fix? Not sure. It really depends on why, or how {{rtp}} is NULL at this point in time when it's retrieved.

If the {{rtp}} object is indeed the problem then one suggestion I have is to lock the instance prior to retrieving the {{rtp}} object. That will at least ensure it's not somehow freed between retrieving it and the NULL check.

Again though not saying it's the right fix. Someone would need to investigate more into why the object can be NULL here.