[Home]

Summary:ASTERISK-26978: rtp: Crash in ast_rtp_codecs_payload_code()
Reporter:Ross Beer (rossbeer)Labels:
Date Opened:2017-05-03 08:07:51Date Closed:2017-07-10 10:53:40
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/RTP
Versions:GIT Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-26969 Deadlock - chan_pjsip
is duplicated byASTERISK-27115 Deadlock? db_sync_thread at db.c
Environment:Fedora 23Attachments:( 0) backtrace_20170502_clean.txt
( 1) backtrace_20170511_clean.txt
( 2) backtrace_20170515_clean.txt
( 3) backtrace_20170518_02_Clean.txt
( 4) backtrace_20170524_clean.txt
( 5) backtrace_20170605_clean.txt
( 6) backtrace_20170607_8_CLEAN.txt
Description:A crash has been introduced in a recent GIT commit:

[5561] rtp_engine.c: Fix deadlock potential copying RTP payload maps.

Please see attached backtrace, if required I can provide an unredacted trace.
Comments:By: Asterisk Team (asteriskteam) 2017-05-03 08:07:52.168-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Richard Mudgett (rmudgett) 2017-05-03 14:00:00.272-0500

The crash is not even in the function changed by gerrit review 5561.  The review changed ast_rtp_codecs_payloads_copy() while the crash is in ast_rtp_codecs_payload_code().  The crash appears to be because of a stale bridged RTP instance pointer pointing to freed memory.  Using MALLOC_DEBUG could confirm this.  How it could happen is not readily apparent in the attached backtrace nor in the private fuller backtraces.

By: Ross Beer (rossbeer) 2017-05-03 15:38:04.056-0500

Thank you for looking at the backtraces, I am now recompiling with MALLOC_DEBUG to see if I can catch the issue.

By: Ross Beer (rossbeer) 2017-05-05 05:23:38.727-0500

The MALLOC_DEBUG has shown the following memory corruption:

WARNING: Memory corrupted after free of 0x7fb7e0186330 allocated at rtp_engine.c ast_rtp_instance_new() line 446
WARNING: Memory corrupted after free of 0x7f611400d5f0 allocated at rtp_engine.c ast_rtp_instance_new() line 446

This also causes a deadlock. I will send the unredacted backtrace to you directly.

By: Richard Mudgett (rmudgett) 2017-05-05 11:37:20.048-0500

The new backtrace isn't showing anything different than the one two days ago. MALLOC_DEBUG does give more of a corroborating hint that the rtp instance bridged with pointer has already been freed. It says we have corrupted an already freed rtp instance. I was more hoping for a 0xdeaddead crash as those tend to be very helpful.

The new backtrace and the one two days ago show that the called channel has left the native rtp bridge without clearing the caller's rtp instance bridged with pointer.  The caller channel then tries to pass a rtp frame to the dead callee channel's rtp instance and deadlocks on the already freed memory.

What I suspect is happening is when a channel leaves the native rtp bridge it cannot find the rtp instance of one of the channels and thus cannot clear the bridged with pointer for that channel.  I think what the native rtp bridge needs to do is save a pointer and ao2 ref of each rtp instance when the second channel gets in the bridge.  Then when one of the channels leaves the bridge we can ensure that both rtp instances will get their bridged with pointer cleared.

By: Ross Beer (rossbeer) 2017-05-09 05:21:58.501-0500

Does the patch https://gerrit.asterisk.org/#/c/5601/ attempt to resolve this issue?

By: Richard Mudgett (rmudgett) 2017-05-09 09:47:35.607-0500

No it does not.  If it were an attempt to resolve the issue I would have tagged it as such.

By: Ross Beer (rossbeer) 2017-05-11 10:35:47.267-0500

Further backtrace attached showing an issue here:

{noformat}
#0  0x000000000059b3b8 in ast_rtp_codecs_payload_code (codecs=0x7f2d8e8586e8, asterisk_format=1, format=0x226bf08, code=0) at rtp_engine.c:1019
1019 type = AST_VECTOR_GET(&codecs->payloads, i);
{noformat}

By: Ross Beer (rossbeer) 2017-05-15 10:01:46.581-0500

Please see attached a further backtrace. This is all related to the 'ast_rtp_codecs_payload_code'

By: Ross Beer (rossbeer) 2017-05-18 09:37:35.300-0500

Another crash in RTP, however, this time in res_rtp_asterisk.c:4776

#0  0x00007fbcc62c2619 in bridge_p2p_rtp_write (hdrlen=12, len=172, rtpheader=0x7fbd1c3440b0, instance1=0x7fbd342d46d0, instance=0x7fbd1c342b50) at res_rtp_asterisk.c:4776

By: Ross Beer (rossbeer) 2017-05-24 04:02:00.467-0500

Further, crashes, this is happening multiple times a day.

Would adding the following resolve the issue? Its an additional NULL check on the codec.

#0  0x000000000051f883 in ast_format_cmp (format1=0x2bf73b8, format2=0x151) at format.c:247

{noformat}
if (format1->codec == NULL || format2->codec == NULL) {
               return AST_FORMAT_CMP_NOT_EQUAL;
       }
{noformat}

By: Richard Mudgett (rmudgett) 2017-05-24 11:41:43.682-0500

[~rossbeer] I've outlined what I think is needed to fix the issue on May 5 above.  When the channels join the native rtp bridge the bridge technology needs to save a pointer with a ref to the rtp instance structure for both channels in the bridge.  Then when a channel leaves the bridge the rtp instance's bridged with pointer can be guaranteed to be cleared.  As it is now when a channel leaves the bridge there is no guarantee that the rtp instance's bridged with pointer gets cleared.  Somehow both channel's rtp instance pointers are not being found so one of the rtp instance's bridged with pointer is not being cleared.  As a result, the rtp code tries to natively bridge a frame to a destroyed rtp instance and deadlocks on a destroyed lock.

By: Ross Beer (rossbeer) 2017-06-05 05:46:38.072-0500

This issue is still happening on the latest GIT version.

I have put out a bounty with no takers. This issue causes Asterisk to crash multiple times a day.

By: Ross Beer (rossbeer) 2017-06-07 10:06:41.717-0500

Uploading another backtrace.

By: George Joseph (gjoseph) 2017-06-08 08:07:04.614-0500

Do you have the exact asterisk version string for yesterday's backtrace?

By: Ross Beer (rossbeer) 2017-06-08 08:39:49.353-0500

The version was Asterisk GIT-13-13.15.0-rc1-196-g3e8eea0M with the TLS patch from ASTERISK-27001 for pjsip