ASTERISK-22079: Segfault: INTERNAL_OBJ (user_data=0x6374652f) at astobj2.c:120

[Home]

Summary: ASTERISK-22079: Segfault: INTERNAL_OBJ (user_data=0x6374652f) at astobj2.c:120

Reporter: Jamuel Starkey (jamuel) Labels:

Date Opened: 2013-07-15 13:47:01 Date Closed: 2014-03-25 11:20:46

Priority: Major Regression?

Status: Closed/Complete Components: Core/General

Versions: 1.8.23.0 Frequency of
Occurrence One Time

Related
Issues:
is related to ASTERISK-20551 Segfault when scheduled provisional keepalive is handled - dialog has already been destroyed

Environment: Linux CentOS 5.9 32-bit as a VM on XenServer 5.6 SP2 Attachments: ( 0) asterisk-22079-backtrace-leif.txt
( 1) asterisk-22079-gdb-output.txt
( 2) backtrace_asterisk.11.2.certified.txt
( 3) provisional_keepalive_fix.diff

Description: We discovered this segfault while running a debug build 1.8.23.0-rc1. Although not entirely clear it's likely that this issue was triggered during an asterisk "reload" operation that would have been sent over AMI.

Attaching the output of gdb (bt, bt full, thread apply all bt) as per the Debug guidelines found on the wiki.

Comments: By: Jamuel Starkey (jamuel) 2013-07-15 13:50:07.336-0500

gdb output for core. (bt, bt full, thread apply all bt).
By: Rusty Newton (rnewton) 2013-07-25 14:24:31.755-0500

May be the same issue as ASTERISK-20551 or similar.
By: Rusty Newton (rnewton) 2013-07-25 14:26:26.558-0500

Are you able to reproduce the issue?

Can you gather an Asterisk log with SIP debug (also VERBOSE and DEBUG messages set to level 5) and AMI debug leading up to the crash?

Also a ref count log would be very nice: https://wiki.asterisk.org/wiki/display/AST/Reference+Count+Debugging

By: Jamuel Starkey (jamuel) 2013-07-25 14:47:44.430-0500

We've only ever seen this crash once so getting the additional data that you've requested is impossible until we can figure out how to make it reproducible.
By: Rusty Newton (rnewton) 2013-07-26 14:38:31.407-0500

Ok. Suspending this until further debug can be provided, or until you figure out a way to reproduce the issue. Until then it doesn't look like there is anything we can do. Just comment on here when you have more data and we can re-open, or ping us on #asterisk-bugs (irc.freenode.net)
By: Steve Davies (one47) 2014-01-23 06:17:27.406-0600

Not especially useful, but I can add a "me too" - Sadly the production box in question running 1.8.24.0 does not have core dumps enabled, and it has happened exactly once.

asterisk[21356]: segfault at 1e9d ip 080829e6 sp b4509370 error 4 in asterisk[8048000+18e000]

# addr2line -e ./debian/tmp/usr/sbin/asterisk 080829e6
/usr/src/asterisk-1.8.24.0/main/astobj2.c:120

By: Matt Jordan (mjordan) 2014-01-23 06:46:14.885-0600

Steve: was yours triggered by a reload operation as well? Or do you think it might have been something different?
By: Steve Davies (one47) 2014-01-23 07:49:48.003-0600

There is no sign of a reload event in any of the logs, and that system has plentiful logs that should report a need to reload anything.

I'll go with "something different", though the asterisk logs are quiet for 3 minutes prior to the crash, so I have no clue what that something might be. Sorry I'm not being more helpful, except perhaps in excluding a reload as the cause?

I plan to go and read some code. I'll update here if I see anything.
By: Steve Davies (one47) 2014-01-23 12:17:27.130-0600

From a read-though of the code, if the following happens, you will double-unref. This is described in very generic terms, and might not be possible, but it is my working theory... Comments are most welcome.

Thread 1
1) provisional_keepalive_sched fires send_provisional_keepalive()
2) Locking starts to happen
3) pvt is locked for a moment... wait

Thread 2
1) I have the PVT lock that you are waiting for
2) Something happens and I cancel provisional_keepalive_sched
3) I unref the PVT for that sched
4) I unlock the PVT

Thread 1
1) I get my lock and send a provisional keepalive
2) I set sched_id to -1
3) I unref the PVT for that sched

*BANG* the PVT was already unref'ed by thread 2 above and I just went negative :( Of course I may not be negative quite yet, that might happen later...

One "fix" might be to check for an un-changed value in provisional_keepalive_sched_id after you've grabbed the locks, and bail if it is changed. I think you also get a problem if update_provisional_keepalive() is called after the sched event has started as that might cause a similar double-unref.

In fact, if this is real, how many other places might this happen in Asterisk??? Perhaps you can shoot this down in flames for me?
By: Leif Madsen (lmadsen) 2014-01-30 12:31:50.328-0600

Reopening this issue because it appears to be actively looked at, plus we just ran into it :)
By: Matt Jordan (mjordan) 2014-01-30 12:35:43.216-0600

I think our discussion on asterisk-dev is showing that we can't shoot it down :-( (Obligatory reference: [http://lists.digium.com/pipermail/asterisk-dev/2014-January/064885.html])

Even with your trace above, I think I'm still confused how the scheduler delete is occurring (that is, how it is un'refing the pvt). If the scheduler callback has fired, that should have removed the item from the scheduler, meaning the scheduler delete will never succeed. Some extra debug may help illuminate how it is getting into that spot.

Or else I'm just missing something :-)

By: Steve Davies (one47) 2014-01-30 12:44:29.593-0600

Matt,

Perhaps I am mistaken about how the scheduler fires. I thought it was:

{code}
task timer expires
res = run task()
if ( res <= 0 ) {
delete task
} else {
reschedule same task in res milliseconds
}
{code}

So the event cannot be deleted until AFTER the task has run because it needs the result of the run to know whether to delete it or resched it.

By: Rusty Newton (rnewton) 2014-01-30 13:29:42.331-0600

bq. Some extra debug may help illuminate how it is getting into that spot.

I'm not sure whether Matt meant debug in the code, or logging in general. However it may be helpful if those running into the issue could post logs captured before the crash. Maybe poking through those will give us an idea how to reproduce the issue.

Of course preferably with DEBUG type messages tweaked up to 5 or above if possible.

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
By: Steve Davies (one47) 2014-02-03 07:05:42.852-0600

Just FYI, my scheduler comments above can be ignored - my mistake. Matt is of course correct that the scheduled item is removed from the sched list while being run, and is then re-added or freed after the run based on the event's return value.

Two alternative thoughts to throw out there:

1) Is it possible that the scheduled send_provisional_keepalive_full() run is waiting on a lock that is being held by __sip_destroy? The sched event's own pvt ref will keep the pvt from being freed, but by the time it gets the lock, pvt->owner may be invalid, and the pvt itself may be a useless skeleton of its former self. This assumes that it is possible to destroy the chan but leave pvt->owner set. I see that __sip_destroy does clear owner->tech_pvt, but relies on something else to clean up pvt->owner, which may or may not be significant. All other chan_sip instances of cleaning owner->tech_pvt also clean pvt->owner.

2) If update_provisional_keepalive() is called while send_provisional_keepalive_full() is waiting on the PVT lock, then pvt->provisional_keepalive_sched_id will be changed to a new sched_id value by update_provisional_keepalive(), but that new sched_id then may be overwritten with -1 by send_provisional_keepalive_full(), killing the pvt's reference to a schedule and "leaking" the reference.

This is a rare crash, and I've not been able to cause it even by intentionally adding a Wait(61) to the dialplan to encourage the use of send_provisional_keepalive_full().

By: Leif Madsen (lmadsen) 2014-02-03 09:03:29.325-0600

Adding another backtrace that we generated on our own system that appears to be the same issue.
By: Matteo (mpiazzatnetbug) 2014-02-20 03:12:12.862-0600

It's happened also in a machine of mine. It's look the same issue. See the backtrace attached.

Asterisk versione 11.2.certified

By: Matteo (mpiazzatnetbug) 2014-02-20 03:17:12.032-0600

Few second before the segfault into messsage file there are these lines:

[2014-02-18 10:35:42] ERROR[1721] astobj2.c: bad magic number for 0x9f152a90. Object is likely destroyed.
[2014-02-18 10:35:42] WARNING[1721] chan_sip.c: Unable to get seqno from ''
[2014-02-18 10:35:42] ERROR[1721] astobj2.c: bad magic number for 0x9f152a90. Object is likely destroyed.
[2014-02-18 10:35:47] WARNING[1721] chan_sip.c: Retransmission timeout reached on transmission 3cf57a34c47c-14zp42q5mgep for seqno 2 (Critical Response) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions
Packet timed out after 31999ms with no response
[2014-02-18 10:36:33] NOTICE[4380] features.c: pickup SIP/4002-00012afb attempt by SIP/4003-00012b03
[2014-02-18 10:36:42] ERROR[1721] astobj2.c: bad magic number for 0x9f152a90. Object is likely destroyed.
[2014-02-18 10:36:42] WARNING[1721] chan_sip.c: Unable to get seqno from ''
[2014-02-18 10:36:42] ERROR[1721] astobj2.c: bad magic number for 0x9f152a90. Object is likely destroyed.
By: Steve Davies (one47) 2014-02-28 12:12:25.329-0600

I can now reproduce this issue - Not easily, but it can be done. There are 2 bugs:

1) The #if 0 block in send_provisional_keepalive_full() should not be disabled for the reasons described in the comment in the function. It enables the easy leaking of a pvt ref, and does not fix the crash.

2) With the above fix applied, the crash becomes a ref-counter assertion instead of an attempt to ref an invalid chunk of memory. My comment above describes the reason for this correctly:

{quote}
2) If update_provisional_keepalive() is called while send_provisional_keepalive_full() is waiting on the PVT lock, then pvt->provisional_keepalive_sched_id will be changed to a new sched_id value by update_provisional_keepalive(), but that new sched_id then may be overwritten with -1 by send_provisional_keepalive_full(), killing the pvt's reference to a schedule and "leaking" the reference.
{quote}

Basically if the keepalive fires at the same time as the keepalive is being updated, you can lose the sched_id value, preventing proper cleanup at call-end. Here is a DEBUG_REFS trace of the bug happening with the #if 0 removed already

{code}
Problem: net Refcount not zero for object b4ef1b50
Object b4ef1b50 history:
0xb4ef1b50 =1 chan_sip.c:8156:sip_alloc (allocate a dialog(pvt) struct)
0xb4ef1b50 +1 chan_sip.c:8279:sip_alloc (link pvt into dialogs table) [@1]
0xb4ef1b50 +1 chan_sip.c:4158:__sip_reliable_xmit (__sip_reliable_xmit: setting pkt->owner) [@2]
0xb4ef1b50 +1 chan_sip.c:4321:sip_scheddestroy (setting ref as passing into ast_sched_add for __sip_autodestruct) [@3]
0xb4ef1b50 -1 chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@4]
0xb4ef1b50 +1 chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@3]
0xb4ef1b50 -1 chan_sip.c:4399:__sip_ack (unref pkt cur->owner dialog from sip ack before freeing pkt) [@4]
0xb4ef1b50 -1 chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@3]
0xb4ef1b50 +1 chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@2]
0xb4ef1b50 -1 chan_sip.c:4340:sip_cancel_destroy (remove ref for autokillid) [@3]
0xb4ef1b50 +1 chan_sip.c:7647:sip_new (sip_new: set chan->tech_pvt to i) [@2]
0xb4ef1b50 +1 chan_sip.c:26987:start_session_timer (adding session timer ref) [@3]
0xb4ef1b50 +1 chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@4]
0xb4ef1b50 -1 chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@5]
0xb4ef1b50 -1 chan_sip.c:4531:update_provisional_keepalive (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@4]
0xb4ef1b50 +1 chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@3]
0xb4ef1b50 -1 chan_sip.c:4531:update_provisional_keepalive (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@4]
0xb4ef1b50 +1 chan_sip.c:4534:update_provisional_keepalive (Increment refcount to pass dialog pointer to sched callback) [@3]
0xb4ef1b50 -1 chan_sip.c:6753:sip_hangup (unref ast->tech_pvt) [@4]
0xb4ef1b50 +1 chan_sip.c:4321:sip_scheddestroy (setting ref as passing into ast_sched_add for __sip_autodestruct) [@3]
0xb4ef1b50 -1 chan_sip.c:26948:stop_session_timer (removing session timer ref) [@4]
0xb4ef1b50 -1 chan_sip.c:6794:sip_hangup (when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr) [@3]
0xb4ef1b50 +1 chan_sip.c:4158:__sip_reliable_xmit (__sip_reliable_xmit: setting pkt->owner) [@2]
0xb4ef1b50 +1 chan_sip.c:8634:find_call (pedantic ao2_find in dialogs) [@3]
0xb4ef1b50 -1 chan_sip.c:4399:__sip_ack (unref pkt cur->owner dialog from sip ack before freeing pkt) [@4]
0xb4ef1b50 -1 chan_sip.c:26396:handle_request_do (throw away dialog ptr from find_call at end of routine) [@3]
0xb4ef1b50 +1 chan_sip.c:3283:dialog_unlink_all (Let's bump the count in the unlink so it doesn't accidentally become dead before we are done) [@2]
0xb4ef1b50 -1 chan_sip.c:3285:dialog_unlink_all (unlinking dialog via ao2_unlink) [@3]
0xb4ef1b50 -1 chan_sip.c:3349:dialog_unlink_all (Let's unbump the count in the unlink so the poor pvt can disappear if it is time) [@2]
0xb4ef1b50 -1 chan_sip.c:4275:__sip_autodestruct (The ref to a dialog passed to this sched callback is going out of scope; unref it.) [@1]
0xb4ef1b50 **call destructor** chan_sip.c:4275:__sip_autodestruct (The ref to a dialog passed to this sched callback is going out of scope; unref it.)
0xb4ef1b50 -1 chan_sip.c:4511:send_provisional_keepalive_full (dialog ref for provisional keepalive) [@0]
{code}

I think there is a fundamental issue here with the scheduler - Perhaps the scheduler should pass the sched_id into all of its callbacks, along with the data so that the called function can verify that the sched_id matches what it thinks it is working on, and act differently if necessary?

A quick workaround that will probably reduce the problem to *almost* zero would be for send_provisional_keepalive_full to grab a copy of pvt->provisional_keepalive_sched_id before getting any locks, and check that it is unchanged after the sip_pvt_lock_full() delay has occurred.

If it has changed, then do the ao2_ref(pvt,-1) but do not kill the new value of pvt->provisional_keepalive_sched_id, and don't bother sending the keepalive.

Thoughts? My weekend is starting, I will submit patches next week if no-one does so before me.

P.S. here is the dialplan to cause the issue, just run calls at it constantly until it breaks. It can take several tens of minutes, or it can happen really quickly. I run a call per second at it and it takes between 2 and 30 minutes (usually.)

{code}
exten => 900,1,NoOp(Crash Generator)
same => n,Ringing
same => n,Wait(60)
same => n,Progress
same => n,Hangup
{code}

By: Matt Jordan (mjordan) 2014-02-28 12:36:24.917-0600

Awesome. Thanks for the reproduction, analysis, and ref_debug log.

I agree with the workaround, although it's ugly :-) Really, the scheduler should be "smart enough" to handle those details for the caller. That would necessitate some full blown changes to the scheduler API, which - barring no other solution - we'd probably try to avoid in Asterisk 1.8 and 11.
By: Steve Davies (one47) 2014-03-03 06:26:32.386-0600

Attached is my attempt at a fix. It could be shortened with a "goto out", but I try to avoid doing that where possible.

This has been compile tested, but not put through our test rig yet.
By: Kinsey Moore (kmoore) 2014-03-07 16:03:29.678-0600

Hi [~one47],
Thanks for the patch! It looks like it will solve the issue. Could you go ahead and post this to reviewboard (now authenticates via crowd for those with signed license agreements)?
By: Leif Madsen (lmadsen) 2014-03-12 13:27:19.344-0500

I'm going to build this in our RPM now to do some testing against it since we've seen 2 crashes in the last couple of weeks in this area of the code.

CC: [~one47]
By: Steve Davies (one47) 2014-03-12 13:38:59.412-0500

Hi, I've not been around and see I've been asked to do something ref the reviewboard. How does that work? Is there a FAQ on the subject?
By: Leif Madsen (lmadsen) 2014-03-12 13:44:04.819-0500

[~one47] you can start here: https://wiki.asterisk.org/wiki/display/AST/Code+Review

Basically all you need to do is login to https://reviewboard.asterisk.org with your same credentials as you use on JIRA, then upload your patch to a new review. There are a few fields that need to be filled in. I think the documentation I provided you should help you define the values for those fields.
By: Steve Davies (one47) 2014-03-17 07:29:57.495-0500

As requested

https://reviewboard.asterisk.org/r/3368/

By: Kinsey Moore (kmoore) 2014-03-25 11:20:46.851-0500

This has been committed to 1.8, 11, 12, and trunk. Thanks everyone for working on this and figuring out what was going on!