
Summary: ASTERISK-28906: ari: Race condition when destroying holding bridge multiple times with channels in it
Reporter: Juris Breicis (jbreicis)
Labels:
Date Opened: 2020-05-19 05:07:06
Date Closed:
Priority: Minor
Regression?:
Status: Open/New
Components: Core/Bridging/bridge_roles Resources/res_stasis
Versions: 16.6.1
Frequency of Occurrence: Occasional
Related Issues:
Environment: Ubuntu 18.04.3 LTS in KVM. Asterisk built from source.
Attachments: ( 0) 04062020-core-info.txt
( 1) 04062020-core-locks.txt
( 2) 04062020-core-thread1.txt
( 3) 04062020-sanitized.core-brief.txt
( 4) 04062020-sanitized.core-full.txt
( 5) core-locks.txt
( 6) core-locks.txt
( 7) core-thread1.txt
( 8) core-thread1.txt
( 9) sanitized.core-brief.txt
(10) sanitized.core-brief.txt
(11) sanitized.core-full.txt
(12) sanitized.core-full.txt
Description: A bit of background: I am on Asterisk 16.6.2 and the box itself is handling ~500 simultaneous calls.
The whole call management part is fully handled by a Stasis app through ARI.

Due to a recent bug in our own Stasis software (multiple consecutive bridge destroys in a row), we have spotted an Asterisk crash on rare occasions:

{noformat}
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f7e96b77801 in __GI_abort () at abort.c:79
#2 0x00007f7e96bc0897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f7e96cedb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007f7e96bc790a in malloc_printerr (str=str@entry=0x7f7e96cef828 "double free or corruption (fasttop)") at malloc.c:5350
#4 0x00007f7e96bcf004 in _int_free (have_lock=0, p=0x7f7e7c01d670, av=0x7f7e7c000020) at malloc.c:4230
#5 __GI___libc_free (mem=0x7f7e7c01d680) at malloc.c:3124
#6 0x000055997dfc833d in __ast_free (ptr=<optimized out>, file=file@entry=0x55997e1afe4f "bridge_roles.c", lineno=lineno@entry=78, func=func@entry=0x55997e1b0150 <__PRETTY_FUNCTION__.14665> "bridge_role_destroy") at astmm.c:1577
#7 0x000055997dff8779 in bridge_role_destroy (role=<optimized out>) at bridge_roles.c:78
#8 ast_channel_clear_bridge_roles (chan=<optimized out>) at bridge_roles.c:374
#9 0x00007f7e35d9d71a in bridge_stasis_pull (self=0x7f7dc8001020, bridge_channel=0x7f7e7012adf0) at stasis/stasis_bridge.c:292
#10 0x000055997dff5239 in bridge_channel_internal_pull (bridge_channel=bridge_channel@entry=0x7f7e7012adf0) at bridge_channel.c:2170
#11 0x000055997dff67da in bridge_channel_internal_join (bridge_channel=bridge_channel@entry=0x7f7e7012adf0) at bridge_channel.c:2921
#12 0x000055997dfdc045 in bridge_channel_depart_thread (data=data@entry=0x7f7e7012adf0) at bridge.c:1787
#13 0x000055997e122f85 in dummy_start (data=<optimized out>) at utils.c:1249
#14 0x00007f7e9771f6db in start_thread (arg=0x7f7d6f6b8700) at pthread_create.c:463
#15 0x00007f7e96c5888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}

I suspect it could be a race condition where there are multiple attempts to free the same memory segment (after it has already been freed).

We will apply a fix to our Stasis app ourselves to remedy the issue for our case. However, I suspect this is not expected behaviour. Is there anything else you would need from me to be able to fix this for other users?
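
For reference, a minimal sketch (Python, using the requests library) of the kind of ARI call sequence involved; the base URL, credentials, and channel ID are illustrative placeholders, not our actual app:

{noformat}
# Hypothetical sketch only: connection details and the channel ID are placeholders.
import requests

ARI = "http://localhost:8088/ari"   # assuming the default Asterisk HTTP port
AUTH = ("ariuser", "arisecret")     # placeholder ARI credentials

# Create a holding bridge and add an already Stasis-controlled channel to it.
bridge = requests.post(f"{ARI}/bridges", auth=AUTH,
                       params={"type": "holding"}).json()
requests.post(f"{ARI}/bridges/{bridge['id']}/addChannel", auth=AUTH,
              params={"channel": "SOME-CHANNEL-ID"})

# The backend bug: the same bridge is destroyed more than once, and the second
# DELETE races with the teardown started by the first.
requests.delete(f"{ARI}/bridges/{bridge['id']}", auth=AUTH)
requests.delete(f"{ARI}/bridges/{bridge['id']}", auth=AUTH)
{noformat}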

Comments:
By: Asterisk Team (asteriskteam) 2020-05-19 05:07:07.884-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2020-05-19 05:14:48.153-0500

To attempt to reproduce this is it as easy as creating a bridge, adding a channel, and calling destroy on the bridge multiple times?

By: Joshua C. Colp (jcolp) 2020-05-19 05:15:32.347-0500

Can you also attach a full backtrace, and not just of the specific thread?

By: Joshua C. Colp (jcolp) 2020-05-19 05:16:27.593-0500

As well, are you using any other functionality? For example, are you dialing channels using create and dial? Do you have the specific ARI requests that are used?

By: Juris Breicis (jbreicis) 2020-05-19 05:25:30.580-0500

First off, we have not been able to reproduce this with a limited test (a very simple Stasis service which does exactly what you described).
This may be because it occurs quite rarely: the machine is constantly servicing ~1k channels and ~500 bridges with a lifetime of ~3 minutes, and the crash happens once every 24-48 hours.

We are using quite a lot of ARI functionality - almost everything related to channels and bridges, from creating individual channels and bridges to recording through Snoop channels, etc. However, this crash seems to be introduced by a bug in our Stasis app where we incorrectly tracked bridge status within our backend and would call bridge destroy multiple times without checking whether it had already been destroyed; the guard sketched below is roughly what we are adding on our side.
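
A minimal sketch of that guard (Python, requests-based; the helper name, base URL, and credentials are illustrative placeholders, not our actual code):

{noformat}
# Hypothetical sketch of the "destroy only once" guard on our side.
import requests

ARI = "http://localhost:8088/ari"   # assuming the default Asterisk HTTP port
AUTH = ("ariuser", "arisecret")     # placeholder ARI credentials

_destroyed = set()                  # bridge IDs we have already torn down

def destroy_bridge_once(bridge_id):
    """Destroy a bridge only once and treat a 404 as 'already gone'."""
    if bridge_id in _destroyed:
        return
    _destroyed.add(bridge_id)
    resp = requests.delete(f"{ARI}/bridges/{bridge_id}", auth=AUTH)
    if resp.status_code not in (204, 404):   # 204 = destroyed, 404 = not found
        resp.raise_for_status()
{noformat}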

Yes, I can add the full crash log. (However, the specific Asterisk instance I got this backtrace from did not have the DONT_OPTIMIZE build flag.) Will that still be okay, or should I wait for a crash on one of the instances with the DONT_OPTIMIZE flag?

By: Joshua C. Colp (jcolp) 2020-05-19 05:28:57.489-0500

Ideally with DONT_OPTIMIZE. As well, the code in question is also executed with holding bridges - are you using those, and is that what is being destroyed?

By: Juris Breicis (jbreicis) 2020-05-19 05:33:49.744-0500

I am not sure if the failure is caused by a holding bridge or by a mixing bridge, as we are using both of these. So I will first upload the crash log without the DONT_OPTIMIZE flag, and one with DONT_OPTIMIZE as soon as (and if) I get one - we are in the process of applying our Stasis app fix at the moment.

A stack trace can contain sensitive information, right? What is the process for sharing it with you?

By: Joshua C. Colp (jcolp) 2020-05-19 05:43:28.047-0500

It generally doesn't, but what is sensitive is unique to each deployment/person/company. Some consider the names of their endpoints sensitive while others don't. If there is something sensitive you can replace it with something else.

By: Juris Breicis (jbreicis) 2020-05-19 06:22:54.042-0500

I have added traces from the original crash. Can you please hide them from public visibility, as they contain phone numbers and other information third parties might consider sensitive?

By: Joshua C. Colp (jcolp) 2020-05-19 06:44:22.868-0500

There is no ability to hide attachments from public view except for restricting the entire issue (which is done for security issues for a period of time after which they are made public), as a result I have deleted the attachments. You will need to sanitize any sensitive information to your liking as I previously mentioned.

By: Juris Breicis (jbreicis) 2020-05-19 07:56:55.197-0500

I have reuploaded stack traces.

Sensitive information in the core-brief and core-full files has been replaced with ||SANITIZED||

By: Juris Breicis (jbreicis) 2020-05-20 05:17:19.190-0500

After applying the fix to our own Stasis backend service to destroy only once, we caught another crash; I have added these traces to the upload as well.

By: Juris Breicis (jbreicis) 2020-05-20 05:18:43.885-0500

Another crash log; this time the traceback pointed to a different fault:

{noformat}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f598d874801 in __GI_abort () at abort.c:79
#2  0x00007f598d8bd897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f598d9eab9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f598d8c490a in malloc_printerr (str=str@entry=0x7f598d9ec3f0 "malloc_consolidate(): invalid chunk size") at malloc.c:5350
#4  0x00007f598d8c4bae in malloc_consolidate (av=av@entry=0x7f5890000020) at malloc.c:4441
#5  0x00007f598d8c87d8 in _int_malloc (av=av@entry=0x7f5890000020, bytes=bytes@entry=8272) at malloc.c:3703
#6  0x00007f598d8ce0b1 in __libc_calloc (n=n@entry=1, elem_size=elem_size@entry=8272) at malloc.c:3436
#7  0x000055e9afb3f421 in __ast_repl_calloc (func=0x55e9afd647d0 <__PRETTY_FUNCTION__.17546> "playtones_alloc", lineno=133, file=0x55e9afd63dee "indications.c", size=8272, nmemb=1) at astmm.c:1558
#8  __ast_calloc (nmemb=nmemb@entry=1, size=size@entry=8272, file=file@entry=0x55e9afd63dee "indications.c", lineno=lineno@entry=133, func=func@entry=0x55e9afd647d0 <__PRETTY_FUNCTION__.17546> "playtones_alloc") at astmm.c:1628
#9  0x000055e9afce79f2 in playtones_alloc (chan=0x7f58ec7c7e10, params=0x7f58481c6b90) at indications.c:133
#10 0x000055e9afb9044f in ast_activate_generator (chan=chan@entry=0x7f58ec7c7e10, gen=gen@entry=0x55e9affd8c20 <playtones>, params=params@entry=0x7f58481c6b90) at channel.c:2940
#11 0x000055e9afce80a1 in ast_playtones_start (chan=chan@entry=0x7f58ec7c7e10, vol=vol@entry=0, playlst=<optimized out>, interruptible=interruptible@entry=1) at indications.c:385
#12 0x000055e9afb8e9d9 in indicate_data_internal (chan=chan@entry=0x7f58ec7c7e10, _condition=_condition@entry=3, data=data@entry=0x0, datalen=datalen@entry=0) at channel.c:4558
#13 0x000055e9afb90ccd in ast_indicate_data (chan=chan@entry=0x7f58ec7c7e10, _condition=3, data=0x0, datalen=0) at channel.c:4617
#14 0x000055e9afb6e09b in bridge_channel_handle_control (fr=0x7f58ec356710, bridge_channel=0x7f59740f6070) at bridge_channel.c:2348
#15 bridge_channel_handle_write (bridge_channel=0x7f59740f6070) at bridge_channel.c:2420
#16 bridge_channel_wait (bridge_channel=0x7f59740f6070) at bridge_channel.c:2771
#17 bridge_channel_internal_join (bridge_channel=bridge_channel@entry=0x7f59740f6070) at bridge_channel.c:2911
#18 0x000055e9afb53045 in bridge_channel_depart_thread (data=data@entry=0x7f59740f6070) at bridge.c:1787
{noformat}


By: Joshua C. Colp (jcolp) 2020-05-20 05:28:58.428-0500

It sounds as if there is actually memory corruption going on, likely based on some interaction with how you are using ARI, and it is not limited to the bridge destroy as you mentioned. In the case you saw, that was just what got corrupted. I think we'd need a better understanding of everything that is in use, potentially even a log of the ARI requests and operations being done before the crash occurs.

As well since this is now not limited to the bridge destroy, I can't be certain that some other underlying issue hasn't already been fixed. Can you try under a recent version of Asterisk?

By: Juris Breicis (jbreicis) 2020-05-21 06:05:04.964-0500

OK, Joshua. We'll start by updating to the very latest version in the 16 series, with the DONT_OPTIMIZE flag enabled for every instance, and will report back if we get a new crash.
As for the full ARI log, how long of a trail do you think you would need - the last 5 minutes? 10?

By: Joshua C. Colp (jcolp) 2020-05-21 06:06:07.485-0500

The last 15 minutes is probably enough.

By: Juris Breicis (jbreicis) 2020-05-21 06:11:02.568-0500

Would ARI traffic captured in a .pcap file be OK?

By: Joshua C. Colp (jcolp) 2020-05-21 06:12:28.967-0500

An Asterisk log with debug enabled would be best, but if that is not available then a pcap would at least give some insight into what is going on. Essentially it has to be determined what in your usage patterns is causing the issue, as this hasn't been experienced by others.

By: Juris Breicis (jbreicis) 2020-06-04 09:54:12.668-0500

We just got the same crash on 17.4.0 compiled with the DONT_OPTIMIZE flag set; I will be attaching the core dump.
I have a full klog with the ARI commands as well (but at the moment our sanitization script is removing phone numbers and IP addresses from it, so I'll attach it later).
It is quite large, ~500 MB for the last 15 minutes before the crash. Should I gzip it before uploading, or are such large files fine?

By: Juris Breicis (jbreicis) 2020-06-04 10:00:14.594-0500

Crash on Asterisk 17.4.0;

By: George Joseph (gjoseph) 2020-06-04 10:21:37.521-0500

Don't upload anything that large. :)
Instead, do this...

Assuming you still have the raw coredump, run ast_coredumper with the following options:
{{ast_coredumper --tarball-coredumps --no-default-search <path to raw coredump>}}
It'll create a tarball with the raw coredump and the Asterisk binaries.
Upload that to the hosting service of your choice (Google Drive, Dropbox, etc.), along with any other *unsanitized* logs, pcaps, etc., and send a link to it via email to asteriskteam@digium.com using the subject "ASTERISK-28906: Coredumps". We'll download the files and then let you know when you can delete them.




By: Juris Breicis (jbreicis) 2020-06-04 10:42:03.744-0500

Thanks!

Sent an email with link.

By: George Joseph (gjoseph) 2020-06-04 12:42:10.383-0500

Unfortunately, the actual raw coredump wasn't in the tarball.  If you still have it, can you send us a link to it?


By: George Joseph (gjoseph) 2020-06-04 15:57:16.668-0500

We got the correct coredump.


By: Juris Breicis (jbreicis) 2020-06-08 07:02:55.698-0500

Quick question; do you need dumps from other crashes with the same traceback?

By: George Joseph (gjoseph) 2020-06-08 09:15:23.109-0500

No, I think we're OK. As you implied in the issue description, the issue appears to be that we're not locking correctly when trying to clear the bridge roles. We're looking at it now.


By: Juris Breicis (jbreicis) 2020-12-11 12:19:19.106-0600

Hi,
Just wanted to ask if you have had any progress investigating this further on your end?

By: Joshua C. Colp (jcolp) 2020-12-11 12:24:56.821-0600

The issue is open with no one assigned. No one is actively working on it, and any updates will be posted here.

By: Juris Breicis (jbreicis) 2020-12-11 12:26:30.289-0600

Thanks for the super quick update!