ASTERISK-26972: jitterbuffer: Crash in adaptive implementation with large timer interval

[Home]

Summary: ASTERISK-26972: jitterbuffer: Crash in adaptive implementation with large timer interval

Reporter: Richard Kenner (kenner) Labels:

Date Opened: 2017-04-28 09:56:10 Date Closed:

Priority: Major Regression? No

Status: Open/New Components: Core/Jitterbuffer

Versions: 14.3.0 Frequency of
Occurrence Occasional

Related
Issues:

Environment: Attachments: ( 0) jitterbuffer-crash.txt

Description: I've gotten three crashes with arithmetic exceptions in abstract_jb.c where
framedata->timer_interval is (way) over 100. Here's the traceback for one of them:
{noformat}
#0 0x0000000000435018 in hook_event_cb (chan=<value optimized out>,
frame=0x87a180, event=<value optimized out>, data=0x2aec10001690)
at abstract_jb.c:1010
#1 0x000000000050ec25 in framehook_list_push_event (
framehooks=0x2aec100258b0, frame=0x1e7d3600,
event=AST_FRAMEHOOK_EVENT_READ) at framehook.c:118
#2 0x00000000004b9f0c in __ast_read (chan=0x1e71cc18, dropaudio=0)
at channel.c:3950
#3 0x000000000047b9d9 in bridge_channel_handle_interval (
bridge_channel=0x2aec004702f8) at bridge_channel.c:1466
#4 bridge_channel_wait (bridge_channel=0x2aec004702f8)
at bridge_channel.c:2619
#5 0x000000000047c888 in bridge_channel_internal_join (
bridge_channel=0x2aec004702f8) at bridge_channel.c:2757
#6 0x0000000000468a18 in ast_bridge_join (bridge=0x2aec005276d8,
chan=0x1e71cc18, swap=0x0, features=0x2aec088b2b60,
tech_args=<value optimized out>, flags=<value optimized out>)
at bridge.c:1713
{noformat}
and here are some things I've extracted from the dump:
{noformat}
(gdb) print (struct jb_framedata *) $rbp
$6 = (struct jb_framedata *) 0x2aec10001690
(gdb) p *$
$7 = {jb_impl = 0x5dfa20, jb_conf = {flags = 909184, max_size = 700,
resync_threshold = 1000, impl = "adaptive\000\000\000\020",
target_extra = 40}, start_tv = {tv_sec = 1492522127, tv_usec = 824808},
last_format = 0x2aebcc0121f0, timer = 0x2aec1002d4c0,
timer_interval = 4460210, timer_fd = 126, first = 1, jb_obj = 0x2aec10004230}
{noformat}
{noformat}
(gdb) p *$6.last_format
$8 = {name = 0x603455 "slin", codec = 0x2aebcc012110, attribute_data = 0x0,
interface = 0x0}
(gdb) print *$6.last_format->codec
$9 = {id = 8, name = 0x603455 "slin",
description = 0x60345a "16 bit Signed Linear PCM",
type = AST_MEDIA_TYPE_AUDIO, sample_rate = 8000, minimum_ms = 10,
maximum_ms = 70, default_ms = 20, minimum_bytes = 160,
samples_count = 0x4ca520 <g726_length>,
get_length = 0x4ca530 <slin_samples>, smooth = 1, mod = 0x0}
(gdb) p/x 909184
$10 = 0xddf80
(gdb) p frame
$11 = (struct ast_frame *) 0x87a180
(gdb) print *frame
$12 = {frametype = AST_FRAME_NULL, subclass = {integer = 0, format = 0x0,
frame_ending = 0}, datalen = 0, samples = 0, mallocd = 0,
mallocd_hdr_len = 0, offset = 0, src = 0x0, data = {ptr = 0x0, uint32 = 0,
pad = "\000\000\000\000\000\000\000"}, delivery = {tv_sec = 0,
tv_usec = 0}, frame_list = {next = 0x0}, flags = 0, ts = 0, len = 0,
seqno = 0}
{noformat}
All three crashes were in the same place with the exact same bogus timer_interval.

Is there anything else that would be useful to get out of this dump? Obviously, going back in time and finding the frame that set timer_interval would be very useful, but I don't see how to get that from the dumps. Suggestions?

Comments: By: Asterisk Team (asteriskteam) 2017-04-28 09:56:11.329-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Kevin Harwell (kharwell) 2017-04-28 10:45:30.899-0500

What audio codecs were involved at the time?
By: Richard Kenner (kenner) 2017-04-28 10:49:28.123-0500

Unknown for sure, since I wasn't monitoring the system at the time but, given previous history, it wouldn't surprise me if that were when somebody was using a Polycom over a jitterbuffer, which would have been Siren14.
By: Kevin Harwell (kharwell) 2017-04-28 11:57:05.987-0500

If you have it (or can capture one next time) a debug log right before the crash could be helpful.

Also when this occurs do you happen to know what the call scenario was at the time? For instance was it just two people talking, in a conference, a transfer was happening, etc...?

Does it happen on the same endpoint every time or seemingly random? Either way the endpoint's configuration (from [pj]sip.conf) could also help point things in a direction if that too could be gathered.
By: Richard Kenner (kenner) 2017-04-28 16:15:58.074-0500

I noticed the dump two days later. so it's really only logs that I'm going by. The crashes occurred within minutes of each other, so they must have been the same call or conference being retried.

What's wierd here is that the log on the same second as the crash says:

[Apr 18 09:28:58] WARNING[29000][C-00008f0e] chan_iax2.c: Resyncing the jb. last_delay 0, this delay -6487433, threshold 1000, new offset 6487433

But we're not using iax2! Although I see those messages a lot.

There are no relevant CDR entries, obviously, because the calls never completed.

I immediately "fixed" this by adding a bounds check on timer_interval, but that's not the right fix, of course.

This is very politically sensitive here because we also had crashes caused by another Asterisk crash and there's a growing movement here to abandon Asterisk "and use something that works", so I'm, unfortunately, not in a position to try to reproduce this and all the information I can get has to come from dumps and log messages.
By: Richard Kenner (kenner) 2017-04-28 16:38:27.928-0500

I looked at our conference system log and there was a conference that was going on at the time of the crash. I don't trust the CDR records from that conference, but it shows four participants: one, a Polycom, probably using Siren14 via chan_sip and on a line with a jitterbuffer, one person calling from another chan_sip endpoint, and two coming in via a DAHDI T1 trunk from our SV8300 PBX (unfortunately, one was our CEO, which is why this crash was such an issue).
By: Joshua Elson (joshelson) 2017-05-10 16:41:40.076-0500

Also seeing this issue with the adaptive jitterbuffer on 14.4.

Seems to exactly mirror the analysis by the initial reporter, but adding another backtrace in case that's helpful.