[Home]

Summary:ASTERISK-27238: Bridging: Crash freeing a frame that's already been freed
Reporter:Richard Kenner (kenner)Labels:
Date Opened:2017-08-31 17:16:02Date Closed:2017-11-16 07:59:49.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/Bridging
Versions:14.6.0 Frequency of
Occurrence
Related
Issues:
is duplicated byASTERISK-27412 core: Audiohook freeing interpolated frame when it shouldn't.
is duplicated byASTERISK-27235 Crash when freeing frame in bridge
Environment:Centos 7Attachments:( 0) confbridge.conf
( 1) traceback.txt
( 2) valgrind.txt
Description:See traceback.txt traceback.  The frame being freed is below:

{noformat}
$2 = {frametype = AST_FRAME_VOICE, subclass = {integer = 0,
   format = 0x24bc940, frame_ending = 0}, datalen = 0, samples = 320,
 mallocd = 1, mallocd_hdr_len = 545, offset = 64,
 src = 0x7f554c00c7a8 "func_jitterbuffer interpolation", data = {ptr = 0x0,
   uint32 = 0, pad = "\000\000\000\000\000\000\000"}, delivery = {
   tv_sec = 1504146592, tv_usec = 647484}, frame_list = {
   next = 0x7f5544002de0}, flags = 0, ts = 0, len = 0, seqno = 0}
{noformat}
Comments:By: Asterisk Team (asteriskteam) 2017-08-31 17:16:04.235-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Richard Mudgett (rmudgett) 2017-08-31 19:06:22.725-0500

Your backtrace appears to contain a memory corruption. We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.

Crashes in malloc or free are usually signs of memory corruption.  However, crashes in free do have the possibility of it being a double free.

By: Richard Kenner (kenner) 2017-08-31 20:15:06.636-0500

Yeah, I was starting to wonder the same thing.  This is a test system, a new VM on which Asterisk is being installed, so the CPU load of valgrind isn't an issue, though I am concerned about reproducability.  I'll try both ways and see which, if either, gives any useful information.

By: Richard Kenner (kenner) 2017-08-31 20:28:03.546-0500

Wow, that did not take long:  All I did was call into a conference bridge from *one* extension and I see dozens of:

==16245== Thread 77:
==16245== Invalid read of size 8
==16245==    at 0x52F71E: ast_frdup (frame.c:316)
==16245==    by 0x5D275F: ast_slinfactory_feed (slinfactory.c:128)
==16245==    by 0x2D73F340: softmix_bridge_write_voice (bridge_softmix.c:652)
==16245==    by 0x2D73F47B: softmix_bridge_write (bridge_softmix.c:731)
==16245==    by 0x483EF5: bridge_channel_write_frame (bridge_channel.c:648)
==16245==    by 0x488693: bridge_handle_trip (bridge_channel.c:2475)
==16245==    by 0x488A77: bridge_channel_wait (bridge_channel.c:2611)
==16245==    by 0x489220: bridge_channel_internal_join (bridge_channel.c:2757)
==16245==    by 0x46EDDF: ast_bridge_join (bridge.c:1715)
==16245==    by 0x20066A1F: confbridge_exec (app_confbridge.c:2448)
==16245==    by 0x5944C3: pbx_exec (pbx_app.c:491)
==16245==    by 0x5803B9: pbx_extension_helper (pbx.c:2923)
==16245==  Address 0xa1ab810 is 96 bytes inside a block of size 545 free'd
==16245==    at 0x4C29CDD: free (vg_replace_malloc.c:530)
==16245==    by 0x52F20D: __frame_free (frame.c:157)
==16245==    by 0x52F24A: ast_frame_free (frame.c:171)
==16245==    by 0x609444: ast_translate (translate.c:626)
==16245==    by 0x4BBDC7: __ast_read (channel.c:4315)
==16245==    by 0x4BC145: ast_read (channel.c:4398)
==16245==    by 0x48855A: bridge_handle_trip (bridge_channel.c:2431)
==16245==    by 0x488A77: bridge_channel_wait (bridge_channel.c:2611)
==16245==    by 0x489220: bridge_channel_internal_join (bridge_channel.c:2757)
==16245==    by 0x46EDDF: ast_bridge_join (bridge.c:1715)
==16245==    by 0x20066A1F: confbridge_exec (app_confbridge.c:2448)
==16245==    by 0x5944C3: pbx_exec (pbx_app.c:491)
==16245==  Block was alloc'd at
==16245==    at 0x4C2A975: calloc (vg_replace_malloc.c:711)
==16245==    by 0x6115DF: _ast_calloc (utils.h:573)
==16245==    by 0x52F840: ast_frdup (frame.c:333)
==16245==    by 0x5D275F: ast_slinfactory_feed (slinfactory.c:128)
==16245==    by 0x2D73F340: softmix_bridge_write_voice (bridge_softmix.c:652)
==16245==    by 0x2D73F47B: softmix_bridge_write (bridge_softmix.c:731)
==16245==    by 0x483EF5: bridge_channel_write_frame (bridge_channel.c:648)
==16245==    by 0x488693: bridge_handle_trip (bridge_channel.c:2475)
==16245==    by 0x488A77: bridge_channel_wait (bridge_channel.c:2611)
==16245==    by 0x489220: bridge_channel_internal_join (bridge_channel.c:2757)
==16245==    by 0x46EDDF: ast_bridge_join (bridge.c:1715)
==16245==    by 0x20066A1F: confbridge_exec (app_confbridge.c:2448)


By: Richard Kenner (kenner) 2017-08-31 20:38:58.810-0500

More precisely, valgrind.txt grew to almost 15k lines in a few seconds of the connection.  The above wasn't always exactly the same, but there were errors reading and writing 4 or 8 bytes with very similar tracebacks.

By: Richard Kenner (kenner) 2017-08-31 20:42:49.234-0500

Here's the full valgrind.txt.


By: Richard Kenner (kenner) 2017-09-01 15:53:03.165-0500

I think I may have a one-character fix to this.  In __ast_read, at around line 4313 is a call to ast_translate with the "consume" parameter set to 1.  I think it should be zero because, as I understand it, the frame is indeed going to be used.  When I make that change and do lots of entering and exiting a conference bridge, I get *absolutely no* valgrind errors.

Of course, if this change is wrong, I'm creating a massive memory leak.

By: Rusty Newton (rnewton) 2017-09-05 18:02:37.794-0500

Are you able to test your fix? If you are creating a massive memory leak, you should be able to spot that. You are probably aware also of the commands provided with MALLOC_DEBUG.

https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag

If you can test your fix and submit the patch to Gerrit, then you can get some reviews on it and we can move it through if it works out.

https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

By: Richard Kenner (kenner) 2017-09-05 18:25:10.627-0500

I tested it to the extent of running valgrind and seeing  all the errors go away.  I would think that a leak of this size would show up really quickly in increasing process memory usage and I didn't see that: it stayed rock solid.  I didn't realize that MALLOC_DEBUG would be useful here compared to valgrind, but I can certainly do that and see if it shows anything.

Unfortunately, I don't think that setting up an environment for me to be able to use git/Gerrit for a one-character patch is worthwhile.

Moreover, I quite understand that the code here is extremely complex and that I don't have the fundamental knowledge of Asterisk frame and memory management needed to know if this seems correct or if there's something else wrong.  It would seem to me to be odd that a bug of this severity would not have been seen before.  In addition, the code here is the same in 14.3 and that doesn't doesn't exhibit this issue for me.  So I'm more than a bit skeptical that this is the correct fix, but view my suggesting it as more a way to focus somebody who *does* understand this stuff to a possible place to look.

By: Richard Kenner (kenner) 2017-09-05 18:44:58.423-0500

I may well have created a leak:

Exiting with the following memory not freed:
  3079377 bytes (3079377 in caches) in 8447 allocations. frame.c ast_frdup()

Somebody who understands this code and how memory allocation of frames is done needs to look at this.

By: Richard Kenner (kenner) 2017-09-05 21:39:25.081-0500

Hmmm... because all of that was in cache, maybe there wasn't a leak afterall.  But I tried another patch, this time reverting the "consume" value back to 1 and this time changing the final test in ast_translate from "if (consume)" to "if (consume && out != f)".  This also fixes the issue, shows a similar memory behavior as above and, when run under valgrind, produces the following on exit:

{noformat}
==29680== LEAK SUMMARY:
==29680==    definitely lost: 221 bytes in 3 blocks
==29680==    indirectly lost: 0 bytes in 0 blocks
==29680==      possibly lost: 5,655 bytes in 28 blocks
==29680==    still reachable: 812,714 bytes in 5,128 blocks
==29680==         suppressed: 0 bytes in 0 blocks
{noformat}


By: Rusty Newton (rnewton) 2017-09-06 17:26:22.967-0500

bq. All I did was call into a conference bridge from one extension and I see dozens of

Can you attach the configuration for the conference bridge, channel technology and dialplan that you are using to reproduce the issue? That way we can check it out here. Sounds like it might be easy to reproduce.



By: Richard Kenner (kenner) 2017-09-06 17:35:51.117-0500

I think it should be trivial to at least see it in valgrind, even if you don't see the crash.  The incoming channel was a SIP phone (Aastra) with the sln16 codec, I'll attach confbridge.conf.  The dialplan is huge, but the relevant part is under [Conferences]:

{noformat}
exten => _20Z,1,Answer(1000)                    ; Answer and delay a bit.      
same => n,GosubIf($[x${AUTH}=x]?Authenticate,s,1([redacted])); Authenticate.        
same => n,GosubIf($[x${CALLERID(num)}=x]?Ask-CID,s,1) ; Ask for CID if none.  
same => n,Set(STATS_INC(conf_room)=1)          ; Count it.                    
same => n,ConfBridge(${EXTEN},,,default_menu)  ; Enter the conference.  
{noformat}


By: Rusty Newton (rnewton) 2017-09-18 18:24:14.225-0500

I'm unable to reproduce so far. Does your Aastra phone talk slin16 natively or do you have transcoding going on?

By: Richard Kenner (kenner) 2017-09-18 18:30:10.219-0500

I was concerned about reproducability because I didn't think that a bug this serious could be in the system, but it's slin16 native.  I've also seen this (and other errors) with the Digium Siren14 codec, but the time I ran valgrind and posted the results, it was a single Aastra 6867i phone with this in its config file:

{noformat}
sip use basic codecs: 1
sip silence suppression: 0
sip vmail: "#6"
sip customized codec: payload=0;ptime=30;silsupp=off,payload=8;ptime=30;silsupp=off,payload=18;ptime=30;silsupp=off,payload=113;ptime=30;silsupp=off
{noformat}

and it negotiated as native slin16.

By: Richard Kenner (kenner) 2017-09-18 18:34:09.221-0500

I am running with some patches that we put in for the previous version of Asterisk to deal with issues with the Siren14 codec.  I believe these were fixed other ways, but the patches seem safe, though probably unnecessary with the official fix in.  Here they are, for the record:

{noformat}
*** ./main/frame.c.orig 2017-02-13 15:00:19.000000000 -0500
--- ./main/frame.c      2017-03-22 17:08:32.000000000 -0400
***************
*** 674,678 ****
       }
 
!       if (!adjustment) {
               return 0;
       }
--- 674,678 ----
       }
 
!       if (!adjustment || !fdata) {
               return 0;
       }
*** main/slinfactory.c.orig     2017-02-13 15:00:19.000000000 -0500
--- main/slinfactory.c  2017-04-12 09:37:18.000000000 -0400
***************
*** 88,92 ****
        * set to either 160 or 240.
        */
!       if (!f->data.ptr) {
               return 0;
       }
--- 88,92 ----
        * set to either 160 or 240.
        */
!       if (!f->data.ptr || !f->datalen) {
               return 0;
       }
***************
*** 174,178 ****
                       frame_data = frame_ptr->data.ptr;
 
!                       if (frame_ptr->samples <= ineed) {
                               memcpy(offset, frame_data, frame_ptr->samples *
sizeof(*offset));
                               sofar += frame_ptr->samples;
--- 174,180 ----
                       frame_data = frame_ptr->data.ptr;
 
!                       if (frame_ptr->datalen == 0)
!                         ;
!                       else if (frame_ptr->samples <= ineed) {
                               memcpy(offset, frame_data, frame_ptr->samples *
sizeof(*offset));
                               sofar += frame_ptr->samples;
{noformat}



By: Richard Kenner (kenner) 2017-10-01 09:12:16.537-0500

I wanted to provide an update.  I'm using valgrind as my test here, rather than looking at crashes.  I removed the patches above, as well as two patches I had tried to use to fix this issue.  When I do that, I still see the valgrind errors, though now they're a bit harder to reproduce and appear to need both somebody coming into conference bridge using sln16 and ulaw (so there's a conversion from 8k to 16k).

What I found is that if I protect the "ast_frfree (middle_frame)" around line 955 in audiohook.c with
{noformat}
 if (middle_frame != start_frame)
{noformat}

the problem goes away.  That seems like quite a safe patch because the routine should never be returning a frame that it just freed, but it's a mystery to me why others aren't seeing this issue or whether this is actually the proper fix.



By: Richard Kenner (kenner) 2017-10-01 09:35:48.159-0500

I just did a bit of archeology.  I should have remembered this earlier, but the code in question is very new: it was added as the fix for ASTERISK-26926, which I also reported.   So only people who would have run into that bug will run into this one.  But that change is clearly incorrect: audiohook_list_translate_to_slin can return its operand unchanged, which is start_frame, and then it returns the frame it just freed.  So it now seems to me that my patch above is actually the correct patch.  I would also recommend changing "return frame" a few lines above to "return start_frame"; they're both equal, but the latter is more consistent with the rest of the code (which doesn't use "frame" anymore) and makes it easier to read.


By: Corey Farrell (coreyfarrell) 2017-11-14 23:59:08.062-0600

I believe this bug is being worked through ASTERISK-27412 however the fix will only be applied to 13 and 15.

See https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.

By: Richard Mudgett (rmudgett) 2017-11-15 11:48:02.844-0600

The patch for ASTERISK-27413 would need to be applied, the new {{cache_media_frames}} option added by the patch disabled in {{asterisk.conf}}, and valgrind rerun to determine if it is the same thing as ASTERISK-27412.

By: Richard Kenner (kenner) 2017-11-15 15:34:39.882-0600

I can confirm that the first part of jira_asterisk_27412_v15.1.patch is the same patch that I found for this issue and what fixed it for me.  So this certainly does seem to be a duplicate of that issue. I can certainly do the above if it would be likely to help, but I don't think it necessary.

By: Richard Mudgett (rmudgett) 2017-11-15 16:40:05.934-0600

The second part of that patch is something unrelated that valgrind found.  It is not contributing to the memory corruption.  Many memcpy implementations are "safe" to use on overlapping blocks if you are shifting the block down in memory.  Memcpy is not guaranteed to copy memory correctly if the blocks overlap where memmove will work on overlapping blocks.

By: Friendly Automation (friendly-automation) 2017-11-16 07:59:49.942-0600

Change 7231 merged by Joshua Colp:
audiohook.c: Fix freeing a frame and still using it.

[https://gerrit.asterisk.org/7231|https://gerrit.asterisk.org/7231]

By: Friendly Automation (friendly-automation) 2017-11-16 08:17:19.606-0600

Change 7232 merged by Joshua Colp:
audiohook.c: Fix freeing a frame and still using it.

[https://gerrit.asterisk.org/7232|https://gerrit.asterisk.org/7232]

By: Friendly Automation (friendly-automation) 2017-11-16 08:27:57.054-0600

Change 7233 merged by Joshua Colp:
audiohook.c: Fix freeing a frame and still using it.

[https://gerrit.asterisk.org/7233|https://gerrit.asterisk.org/7233]