ASTERISK-18663: BLF Subscriptions Causes SIP Deadlock

[Home]

Summary: ASTERISK-18663: BLF Subscriptions Causes SIP Deadlock

Reporter: Ross Beer (rossbeer) Labels:

Date Opened: 2011-10-04 07:46:08 Date Closed: 2011-12-06 11:13:16.000-0600

Priority: Blocker Regression?

Status: Closed/Complete Components: Channels/chan_sip/Subscriptions

Versions: 1.8.7.0 1.8.8.0 Frequency of
Occurrence Constant

Related
Issues:
must be completed before resolving ASTERISK-18499 Asterisk 1.8.8.0 Release Blockers

must be completed before resolving ASTERISK-18608 Asterisk 10.0.0-rc1 Blockers

must be completed before resolving ASTERISK-18847 Asterisk 10.0.0 Release Blockers

is related to ASTERISK-18760 Deadlock in SVN 1.8 version 342359

Environment: Cent OS 5 Attachments: ( 0) backtrace.txt
( 1) backtrace-12-45.txt
( 2) backtrace-2011-11-19_07-14.txt
( 3) backtrace-2011-11-19_12-20.txt
( 4) backtrace-2011-11-25_22-59.txt
( 5) backtrace-2011-11-30_21-45.txt
( 6) blf_deadlock.diff
( 7) blf_mwi_diff_12_05_11.txt
( 8) blf_mwi_diff_12_06_11.txt
( 9) core-show-locks.txt
(10) core-show-locks.txt
(11) ref_counting_patch.diff
(12) review_1563_alt_approach_1.8_V2.diff
(13) SIPTrace_Crash.txt

Description: When using Asterisk 1.8.7.0 and 1.6.2 a deadlock occurs when setting 'allowsubscribe=yes', 'callercounter = yes' and setting the subscrivecontext.

After around 30 minutes all SIP requests are blocked and further calls can not be made. Any calls in progress continue and CLI still responds to commands.

The asterisk server is using MySQL SIP real-time and ODBC voicemail, with 'pollmailboxes=yes' and 'pollfreq=30'.

At the point of deadlock, there are no error message displayed.

SIP peers have 'subscribemwi=yes'.

This issue can be re-produced every time 'allowsubscribe=yes' if set to no, the issue does not occur.

Comments: By: Leif Madsen (lmadsen) 2011-10-04 10:18:25.440-0500

If this is a deadlock, you will need to provide 'core show lock' output as described below.

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install). This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!
By: Ross Beer (rossbeer) 2011-10-11 16:51:36.204-0500

core-show-locks.txt attached, back traces could not be generated. I'm not sure why 'bt' said no file.

Looks like the lock is on mwi
By: Gregory Hinton Nietsky (irroot) 2011-10-12 04:39:41.805-0500

the mwi events has been changed recently please try 1.8.8 and ill need a BT to fix this.
By: Ross Beer (rossbeer) 2011-10-21 13:58:50.586-0500

Attaching backtrace & core show locks for asterisk 1.8.8.0-rc2

Looks like the locks are in the same place, again when presence / blf subscriptions are set to yes the issue occurs.
By: Kinsey Moore (kmoore) 2011-11-02 16:15:37.692-0500

Could you give the attached patch a try? It should resolve the deadlock issue along with a little bit of cleanup.
By: Richard Mudgett (rmudgett) 2011-11-02 17:12:10.061-0500

The last change in [^blf_deadlock.diff] is checking mwipvt for NULL. That check is not safe to determine if p is the local dialog.
By: Kinsey Moore (kmoore) 2011-11-03 09:28:34.509-0500

Attached fixed diff. Thanks for the catch, Richard!
By: David Vossel (dvossel) 2011-11-03 11:30:40.244-0500

I've included a patch that removes the need to lock both the peer and the sip pvt at the same time. It has not been tested, I just wanted to include it on the issue so it could be discussed.
By: David Vossel (dvossel) 2011-11-03 18:30:00.375-0500

Okay, I have uploaded another patch regarding the approach I tried earlier. I tested this one and fixed all the issues with the old one. This one should deliver much better results.
By: Ross Beer (rossbeer) 2011-11-13 06:54:23.464-0600

I have been testing the fix and now asterisk 1.8.8-rc3 randomly halts. I'm attaching a core dump from the last crash.

Error message in the log file is:

[Nov 13 12:45:41] ERROR[25434] astobj2.c: bad magic number 0xa500746e for 0x2aaad03f6f20
By: Ross Beer (rossbeer) 2011-11-13 07:02:36.711-0600

Backtrace from Crash
By: Kinsey Moore (kmoore) 2011-11-14 09:16:27.255-0600

This issue should be fixed in latest 1.8 branch.
By: Matt Jordan (mjordan) 2011-11-14 16:00:32.409-0600

Ross -

We're going to roll 1.8.8-rc4, as r345063 in 1.8 should resolve the problem. This issue will stay open until you test with the 1.8 branch to confirm that you don't have the crash you were seeing with 1.8.8-rc3.
By: Leif Madsen (lmadsen) 2011-11-17 12:36:48.705-0600

I've just rolled and published Asterisk 1.8.8.0-rc4. Please test with that tag to verify we've resolved the issue for you.

Announcement at: http://www.asterisk.org/node/51692
By: Ross Beer (rossbeer) 2011-11-19 10:57:40.141-0600

After around 12 hours of testing asterisk has crashed twice. I attach both back traces.
By: Matt Jordan (mjordan) 2011-11-21 11:08:19.462-0600

Ross - would it be possible to get a debug Asterisk log (at least level 3) from around the time the crash occurs?
By: Ross Beer (rossbeer) 2011-11-21 14:25:43.561-0600

There isn't much that happens at the time of the crash, four minutes before this was logged:

[Nov 19 07:10:21] ERROR[27872] astobj2.c: bad magic number 0x0 for 0x2aaad04d7600

[Nov 19 07:13:46] NOTICE[27885] chan_sip.c: Received SIP subscribe for peer without mailbox:

[Nov 19 07:14:50] WARNING[27885] chan_sip.c: Received subscription for extension "11255*206" context "from-vh-internal" with Expire header less that 'minexpire' limit. Received "Expire: 30" min is 60

The last message was logged 3 times directly before the crash.

For the crash at 12:20 there is the following:

[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x6b736972 for 0x113d9ee0
[Nov 19 12:20:34] ERROR[1231] astobj2.c: bad magic number 0x37372050 for 0x2aaad43a4e50

Again just before this there was the message:

[Nov 19 12:20:08] WARNING[1245] chan_sip.c: Received subscription for extension "11255*206" context "from-vh-internal" with Expire header less that 'minexpire' limit. Received "Expire: 30" min is 60

Hope this helps. The 1.8.8-rc4 has been running with no crashes with 'allowsubscribe = no' so this does still appear to be an issue with presence.
By: Kinsey Moore (kmoore) 2011-11-21 16:43:40.740-0600

Hi Ross, is there any chance you could run asterisk under valgrind to see what the memory debugger can find? If that's not an option or doesn't crash, I'd like to get this same log with sip debugging turned on since the bad magic numbers appear to be printable characters (hopefully from sip messages).
By: Ross Beer (rossbeer) 2011-11-25 17:20:59.903-0600

After testing for around two hours the following was shown in the log:

[Nov 25 22:59:22] ERROR[9438] astobj2.c: bad magic number 0x0 for 0x13b519f0

The last SIP message sent was a NOTIFY for voicemail on a peer with 'subscribemwi=yes'.

Voicemail is stored in an ODBC database.

I attach a further backtrace, which also confirms that the issue is with MWI.

I hope this is enough to help resolve the issue.
By: Gregory Hinton Nietsky (irroot) 2011-11-27 03:00:12.605-0600

And you sure you are running 1.8.8.0-rc4 ?? the patch in rc4 should prevent this latest back trace.
By: Ross Beer (rossbeer) 2011-11-27 07:23:42.816-0600

Yes, "Connected to Asterisk 1.8.8.0-rc4".
By: Matt Jordan (mjordan) 2011-11-30 13:38:51.389-0600

As Kinsey and myself mentioned, a SIP trace (either from pcap, or from an Asterisk debug log with debug turned on) should help us determine why the BLF feature with MWI is having a problem. We have some ideas as to what is occurring, but having that packet capture - and ideally, an Asterisk debug log - would let us know whether or not we are on the correct track.
By: Ross Beer (rossbeer) 2011-11-30 14:57:45.819-0600

I will try and produce these for you, however it looks like the issue happens when a phone subscribes to voice mail and a message is waiting.

The subscribemwi = yes so notification messages are only send once the phone requests them. This is where I believe the issue lies.

I am doing further testing to see if I can reproduce this issue.
By: Ross Beer (rossbeer) 2011-11-30 16:00:20.206-0600

Attached is the SIP output at the point of crashing. It looks like the subscribe to voicemail is successful and the notify sent. However as the message is being sent Asterisk crashes.

This does not happen when presence is disabled.

By: Ross Beer (rossbeer) 2011-11-30 16:05:28.449-0600

Latest backtrace attached along with SIP Trace
By: Matt Jordan (mjordan) 2011-11-30 17:47:22.151-0600

Ross - I've attached a patch (ref_counting_patch.diff) to 1.8.8-rc4 (and specific to the release candidate) that does two things:
1) It is more aggressive in holding onto the peer during the handle_request_subscribe
2) It turns on ref count debugging. This will actually give us a trace of the reference counting of the SIP peers and dialogs in /tmp/refs - which will tell us why you get the bad magic number errors, etc. Note that the user your Asterisk instance runs under would have to have write permissions to the /tmp directory (I imagine that won't be too big of a problem)

If you could apply the patch to 1.8.8-rc4 and test, we'd appreciate it. If you still get the same failure, please attach up the /tmp/refs log file.

I've attempted to reproduce your situation using presence and MWI subscriptions, with Polycom phones that have messages waiting on startup. So far, no explosions - I'll let my test Asterisk run over night to see if this is more timing related. Thanks for all the feedback and testing!
By: Ross Beer (rossbeer) 2011-12-01 04:25:07.113-0600

I will patch Asterisk and come back to you. I managed to get it to crash using X-lite 4.
By: Ross Beer (rossbeer) 2011-12-02 16:18:36.026-0600

Hi Matt,

I've patched Asterisk and it looks like the issue may have been resolved. I'm going to leave it and see if it crashes any further.

One issue I think the patch has introduced is that X-lite when subscribing to voicemail now does not receive any updates for voicemail. The standard NOTIFY requests are still sent, so it's just when a phone subscribes to voicemail.

Once I have further feedback I will update you.

Thank you for your help with this issue.
By: Ross Beer (rossbeer) 2011-12-03 15:44:00.549-0600

Asterisk crashed after around 23 hours.

Refs log can be found here:

http://dl.dropbox.com/u/50827842/refs.zip

By: Matt Jordan (mjordan) 2011-12-05 08:41:11.415-0600

Ross - I noticed just before the crash you performed a module unload. Is there a reason why that was done?
By: Ross Beer (rossbeer) 2011-12-05 10:55:44.507-0600

Matt - No module reload was performed by myself. Asterisk safe does an automatic restart if the asterisk process dies. Could this be what you have seen in the log?
By: Matt Jordan (mjordan) 2011-12-05 12:05:56.068-0600

I don't think so. Its actually an explicit sequence of unload_module calls, which wouldn't happen if Asterisk crashed and then was restarted - we would actually see a sequence of load_module calls, as a crash won't allow chan_sip to unload cleanly. Since this is only the ref counts (which doesn't have timestamps), I'm not sure what caused it.

It does, however, give us an idea on exactly what causes your Asterisk crash. I've been able to reproduce it doing the following:
1. Allow a peer to register
2. Allow that peer to send at least two SUBSCRIBE messages
3. Reload chan_sip
4. Leave a voicemail for the peer

I think we'll have a patch put together addressing the root cause of this today.

(I also fixed the NOTIFY's not being sent out after a 200 OK for a SUBSCRIBE - thanks for catching that)
By: Ross Beer (rossbeer) 2011-12-05 13:03:04.432-0600

'asterisk -rx sip prune realtime <peer>' are carried out if I update some details for a peer, however the modules are not unloaded I believe.

A core restart when convenient is run when any changes are made to 'extensions.conf' or 'sip.conf'.

I do have a cron job that runs 'asterisk -rx logger rotate' over night however this shouldn't crash asterisk or reload the modules.

I'm a bit stumped as to why they have been reloaded. Could the restart when convenient have caused this? I probably used this when I patched the rc4!

Thank you, I will look forward to receiving the updated patch.

By: Matt Jordan (mjordan) 2011-12-05 14:32:07.215-0600

Ross - attached is a patch that we believe addresses the crash you're experiencing. Again, its a patch for the 1.8.8-rc4 release candidate - assuming this resolves your issue, we'll patch 1.8 on up.

Thanks for being patient and testing!
By: Ross Beer (rossbeer) 2011-12-05 15:38:55.868-0600

I've patched asterisk and will let you know if the issue is fixed.

I have noticed that the following messages are shown on the CLI now:

[Dec 5 21:35:34] NOTICE[5374]: chan_sip.c:24240 handle_request_subscribe: Received SIP subscribe for peer without mailbox:
[Dec 5 21:35:34] ERROR[5374]: astobj2.c:110 INTERNAL_OBJ: user_data is NULL

How do I resolve the mailbox issue, some peers do not need voicemail mailboxes?
By: Matt Jordan (mjordan) 2011-12-05 16:16:18.682-0600

That usually happens when we receive a SUBSCRIBE request for a peer that we don't know about. Sometimes that occurs when a SUBSCRIBE is sent before a REGISTER, or directly after a REGISTER but before we've authenticated the peer. The error message is a bit misleading here, as it is also bundled with the case where a SUBSCRIBE is sent but does not include a valid mailbox.

The former is not actually an error, as the peer should send a new SUBSCRIBE request after the REGISTER is finished processing. The latter should be logged, as its indicative of a configuration error. I'll fix that.

Usually I only have this happen when there are a large number of SUBSCRIBE messages being sent into Asterisk on initial start up, or when I have Asterisk being run under a debugger.

If you see these messages after initial start up, I'd be concerned.

By: Ross Beer (rossbeer) 2011-12-05 16:48:02.045-0600

Asterisk has been patched for around an hour and is still showing:

[Dec 5 22:37:50] ERROR[5374]: astobj2.c:110 INTERNAL_OBJ: user_data is NULL
[Dec 5 22:37:51] ERROR[5374]: astobj2.c:110 INTERNAL_OBJ: user_data is NULL
[Dec 5 22:37:53] ERROR[5374]: astobj2.c:110 INTERNAL_OBJ: user_data is NULL

The CLI only started showing these after applying the new patch. Asterisk uses realtime so could this be that a device hasn't registered and therefore the realtime cache does not have the user data?

Regarding the 'Received SIP subscribe for peer without mailbox', for peers that do no have mailboxes would setting the 'hasvoicemail=no' resolve this?

Thank you for your help
By: Matt Jordan (mjordan) 2011-12-06 08:26:33.127-0600

The ERROR message is a problem (and, in this case, misleading) that I'll resolve - the reason why you're seeing it now is due to one of two conditions:
1) You've received a SUBSCRIBE for a peer that you don't know about, either because it hasn't registered or because it isn't in sip.conf (or in your realtime cache)
2) You've received a SUBSCRIBE for a peer, but the mailbox specified doesn't exist

The hasvoicemail option in users.conf shouldn't affect this.

Either way, the ERROR message here is incorrect.
By: Matt Jordan (mjordan) 2011-12-06 09:00:10.135-0600

I've attached blf_mwi_diff_12_06_11 which fixes the misleading ERROR messages. This shouldn't affect Asterisk one way or the other with respect to the crash you were witnessing - so if you don't want to test this patch, that's okay.

If you haven't experienced any further crashes with Asterisk, we'll move ahead with including this into the 1.8 branch.
By: Ross Beer (rossbeer) 2011-12-06 09:22:56.794-0600

Matt - Asterisk has been running for the last 16 hours with no crashes. Please go ahead with the 1.8 branch and I will also test the new rc-5.

Thank you for resolving the issues, it is much appreciated.
By: Luke H (luckman212) 2011-12-06 11:27:27.709-0600

Congratulations to all for tracking down this very tricky scenario. Just curious if this patch will make it into 1.8.8.0?
By: Matt Jordan (mjordan) 2011-12-06 11:35:12.430-0600

This was a blocker for 1.8.8.0, and will be in the next release candidate.