[Home]

Summary:ASTERISK-23310: bridged channel crashes in bridge_p2p_rtp_write
Reporter:Jeremy Lainé (sharky)Labels:
Date Opened:2014-02-14 11:05:29.000-0600Date Closed:2014-03-04 09:41:22.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_rtp_asterisk
Versions:11.7.0 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-23419 [patch]Crash, segfault due to attempt to read an invalid rtp instance
Environment:Debian wheezyAttachments:( 0) ASTERISK-23310.diff
( 1) broken-peer
( 2) crash-backtrace.txt
( 3) extensions.conf
( 4) refcount-and-log.tar.gz
( 5) sip.conf
Description:I have encountered numerous crashes using Bridge, which always involve bridge_p2p_rtp_write. To sumarize the setup:

- two calls are fired up via Originate (one to an operator, one to a contact)
- the operator gets put into an extension which basically does Answer + Wait(1800)
- when the contact picks up, he gets bridged to the operator
- the contact hangs up => the operator returns to the extension and waits some more

.. and at some point things fall apart, and I start getting "bad magic number" warnings and eventually asterisk crashes, but always in bridge_p2p_rtp_write.

An example:

[Feb 14 15:47:15] DEBUG[25480][C-00000196] res_rtp_asterisk.c: Remote address is null, most likely RTP has been stopped
[Feb 14 15:47:15] DEBUG[25480][C-00000196] res_rtp_asterisk.c: Remote address is null, most likely RTP has been stopped
[Feb 14 15:47:15] DEBUG[25480][C-00000196] res_rtp_asterisk.c: Remote address is null, most likely RTP has been stopped
[Feb 14 15:47:15] DEBUG[24287] chan_sip.c: Auto destroying SIP dialog '13e862723108b2aa2dbf50f14b7327ed@MY.IP.ADDRESS.HERE:5060'
[Feb 14 15:47:15] DEBUG[24287] chan_sip.c: Destroying SIP dialog 13e862723108b2aa2dbf50f14b7327ed@MY.IP.ADDRESS.HERE:5060
[Feb 14 15:47:15] VERBOSE[24287] chan_sip.c: Really destroying SIP dialog '13e862723108b2aa2dbf50f14b7327ed@MY.IP.ADDRESS.HERE:5060' Method: BYE
[Feb 14 15:47:15] DEBUG[24287] rtp_engine.c: Destroyed RTP instance '0x7f5b980d17b8'
[Feb 14 15:47:15] ERROR[25480][C-00000196] astobj2.c: bad magic number for object 0x7f5b98146488. Object is likely destroyed.
[Feb 14 15:47:15] ERROR[25480][C-00000196] astobj2.c: bad magic number for object 0x7f5b98146488. Object is likely destroyed.
[Feb 14 15:47:15] DEBUG[25480][C-00000196] res_rtp_asterisk.c: Unsupported payload type received
Comments:By: Jeremy Lainé (sharky) 2014-02-15 10:48:47.309-0600

I have seen "bag magic number" errors in the following situation:

- leg B (contact) is issued a Bridge to leg A (operator)
- leg B is sent a reinvite
- leg A is sent a reinvite
- leg A accepts the reinvite with a 200 code
- leg B fails to reinvite with a 500 code
- leg B says BYE
- leg A goes back to the Wait
- leg B is destroyed
- bad magic number

By: Jeremy Lainé (sharky) 2014-02-23 08:15:43.564-0600

I can confirm that the described call sequence systematically leads to "bad magic number" warnings, and sometimes an actual crash. Attached is a simple python script called "broken-peer" which is a simple SIP UAS which has the following behaviour:

- accepts INVITEs
- echoes any received RTP
- fails re-INVITEs

It can be used to trigger the bad magic number situation.

By: Kinsey Moore (kmoore) 2014-02-26 10:28:18.118-0600

Hi Jeremy,
Could you enable reference count debugging as described at the following link, run the scenario again, and attach the resulting /tmp/refs to the issue?
https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging#ReferenceCountDebugging-EnablingReferenceCountDebuggingforallModules

The debug log (similar to what is provided in the example in the description) from the same run would also be very helpful.

By: Jeremy Lainé (sharky) 2014-03-01 06:22:53.652-0600

I will try rebuilding with reference counting, but there is definitely something fishy with the sequence of events when one leg of a Bridge fails / is terminated.

I have reduced my dialplan to the attached extensions.conf, you can try it yourself.

1. launch the broken-peer script I included

2. originate a call using to the "leg-a" context

channel originate SIP/alice@trunk-good extension s@leg-a

3. once it is connected originate a call to the "leg-b" context

channel originate SIP/bob@trunk-broken extension s@leg-b

4. the calls will be bridged, the bridge will fail, the first call returns to "leg-a"

5. wait 30 seconds for the "leg-b" dialog to be destroyed => bad magic number

By: Jeremy Lainé (sharky) 2014-03-01 08:21:04.641-0600

I have attached the full log (debug=3, verbose=3, sip debug=on) and the reference count log.

By: Kinsey Moore (kmoore) 2014-03-03 11:14:55.300-0600

Thanks for the logs, Jeremy. This should be enough to find out exactly what is happening.

By: Kinsey Moore (kmoore) 2014-03-03 14:34:27.523-0600

Hi Jeremy,
Could you test the attached patch?

It looks like what is happening is that when the bad side fails the remote bridging attempt, everything that was initialized for the attempt was not cleared out appropriately and the good channel ended up with a non-refcounted pointer to the bad channel's RTP instance which would normally be cleared out upon exit from the remote bridge attempt. This only caused things to blow up once the bad channel actually got destroyed since the bad channel was still holding a reference to the instance.

By: Jeremy Lainé (sharky) 2014-03-04 00:51:30.380-0600

Thanks Kinsey! Your patch seems to have fixed the issue, when I run my test case now I don't get any bad magic number errors anymore.

Is there any way I can help integrate an automated unit test for this scenario?

By: Kinsey Moore (kmoore) 2014-03-04 07:23:30.326-0600

I've actually been trying to work one up as my reproduction method, but haven't had any luck so far. With this verification that the patch works, I should be able to tweak the test I have written to reproduce the bug on unpatched code. Thanks for the offer, but I think I have a handle on it.

By: Jeremy Lainé (sharky) 2014-03-04 07:30:00.986-0600

I am on IRC (jeremy_laine) if you want a chat.

By: Kinsey Moore (kmoore) 2014-03-04 09:45:53.367-0600

The test for this is up at https://reviewboard.asterisk.org/r/3297/