[Home]

Summary:ASTERISK-17352: [patch] Deadlock of SIP takes out server
Reporter:James Van Vleet (jvanvleet)Labels:
Date Opened:2011-02-04 10:31:32.000-0600Date Closed:2011-06-28 15:31:52
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
is the original version of this clone:ASTERISK-18082 Deadlock of SIP or segfault when doing REFERs
Environment:Attachments:( 0) asterisk-restart-2011-04-11-1214.txt
( 1) backtrace.txt
( 2) backtrace.txt
( 3) backtrace1.txt
( 4) backtrace2.txt
( 5) backtrace3.txt
( 6) backtrace4.txt
( 7) bug18749.patch
( 8) core-show-locks.txt
( 9) core-show-locks2.txt
(10) core-show-locks3.txt
(11) core-show-locks4.txt
Description:This happens with both asterisk 1.8.0 and 1.8.2.3.  The server will take calls for some amount of time (less than a day - within a few hours) can all of the sudden will stop responding on the SIP.  I can see with a netstat -na that many of the receive queues (including 5060) are full of waiting packets.

I have compiled asterisk with DEBUG_THREADS and DONT_OPTIMIZE and was able to do both a core show locks and a backtrace for two of the crashes.  

The majority of our calls are SRTP and also the majority use fastAGI if that helps.

-James
Comments:By: James Van Vleet (jvanvleet) 2011-02-04 16:23:01.000-0600

looking around at some of the different issues with SIP out there I took a chance and tried out:

https://issues.asterisk.org/view.php?id=18403

to see if that was the fix for my issue.  I was running Asterisk SVN-trunk-r301791 for backtrace4 and coreshowlocks4.  Hope that helps.

-James

By: Jeff Peeler (jpeeler) 2011-02-07 13:15:34.000-0600

Yeah this isn't really related to 18403, but try the attached patch.

By: James Van Vleet (jvanvleet) 2011-02-07 16:22:58.000-0600

Thanks for the patch!  

I am working with the rest of our staff on when we will give it a try under load but I would expect it to be in the next 12-48 hours.   Or plan it is to run a patched 1.8.2.3 unless you have any objections (see a need to run against an SVN rev).

-James

By: Jeff Peeler (jpeeler) 2011-02-07 16:28:50.000-0600

I think any 1.8 version will be fine.



By: James Van Vleet (jvanvleet) 2011-02-09 09:58:25.000-0600

No love.

Same sort of deadlock without the no optimise and debug thread flags set.

When I tried to enable the flags the system would crash with

console output:
asterisk[31514]: segfault at c ip b75e2f37 sp b346c5a4 error 4 in libc-2.11.1.so[b7574000+157000]

and asterisk log:

[Feb  9 15:39:08] ERROR[31514] lock.c: pbx.c line 7816 (ast_async_goto): Error obtaining mutex: Invalid argument
[Feb  9 15:39:08] ERROR[31514] lock.c: pbx.c line 7826 (ast_async_goto): mutex 'chan' freed more times than we've locked!
[Feb  9 15:39:08] ERROR[31514] lock.c: pbx.c line 7826 (ast_async_goto): Error releasing mutex: Invalid argument


-James

By: James Van Vleet (jvanvleet) 2011-02-09 10:04:31.000-0600

Looking around it appears asterisk didn't dump core either. I suspect from reading I need to do a limit -c unlimited before it will?  

Let me know if you need me to make that change and get a back trace on the core file.  I want to know first as I am dropping about 25-30 calls in production each time it happens so I want them to count.

-James

By: Jeff Peeler (jpeeler) 2011-02-09 11:01:17.000-0600

No need to do that, if the same deadlock occurred then another patch will have to be made first.

By: James Van Vleet (jvanvleet) 2011-02-17 12:38:59.000-0600

Hi... just checking back in.  If no one is available to work the issue I am wondering if perhaps some quick analysis could be done on more or less what causes the deadlock?  

I am guessing it is around blind transfers - perhaps if I had some more clear idea I could devise a load test that would allow me to do some work on the issue.  

Thanks!

By: John Covert (jcovert) 2011-04-11 11:27:34

I'm seeing this as well, and it seems to happen as a result of almost any kind of error somewhere in the system.  Machine was almost idle when it most recently happened.  Console output in this case shows some sort of Jabber error, but I'm not actively using jabber.  In other cases, it seems like the last console output is often a failure to register with some peer that has gone briefly unreachable.

This is a serious problem for me (and many others as well), and I would hope that someone could try to see what is going on.

See upload file asterisk-restart-2011-04-11-1214.txt

I'm running 1.8.3.2 plus the 302314 version of chan_sip.c (which includes the 18403 patch referenced earlier) plus the two SIP channel patches I've submitted for issues 18596 and 18640.

Although the blind transfer patch (18403) is present, I do note that I had done a blind transfer some time before the hang occurred.



By: James Van Vleet (jvanvleet) 2011-07-01 09:48:02.874-0500

Backtrace running SVN-branch-1.8-r325545

By: James Van Vleet (jvanvleet) 2011-07-01 09:51:03.322-0500

This issue does not appear to be fixed, but it has been altered.  Where there used to be a deadlock on lots of unattended transfers, now a small number of them appear to setfault asterisk.  I have attached an updated backtrace (it is unclear how to change an issue to not fixed).

By: James Van Vleet (jvanvleet) 2011-07-01 10:39:47.698-0500

Tried 1.5.2rc1 and get what look like the same deadlocks that have been going on all along:
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0xb5cd0b70 (do_monitor           started at [25035] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 24550 handle_request_do &netlock 0xb6b96800 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xae) [0x811a3d2]
/usr/lib/asterisk/modules/chan_sip.so(+0x7ba8c) [0xb6b52a8c]
/usr/lib/asterisk/modules/chan_sip.so(+0x7b86e) [0xb6b5286e]
/usr/sbin/asterisk(ast_io_wait+0x144) [0x8114e34]
/usr/lib/asterisk/modules/chan_sip.so(+0x7d3cf) [0xb6b543cf]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== ---> Lock #1 (chan_sip.c): MUTEX 22837 handle_request_refer p 0xb5365178 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xae) [0x811a3d2]
/usr/sbin/asterisk(__ao2_lock+0x48) [0x8086df4]
/usr/lib/asterisk/modules/chan_sip.so(+0x74170) [0xb6b4b170]
/usr/lib/asterisk/modules/chan_sip.so(+0x7b1ba) [0xb6b521ba]
/usr/lib/asterisk/modules/chan_sip.so(+0x7bc65) [0xb6b52c65]
/usr/lib/asterisk/modules/chan_sip.so(+0x7b86e) [0xb6b5286e]
/usr/sbin/asterisk(ast_io_wait+0x144) [0x8114e34]
/usr/lib/asterisk/modules/chan_sip.so(+0x7d3cf) [0xb6b543cf]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== ---> Waiting for Lock #2 (pbx.c): MUTEX 9615 pbx_builtin_setvar_helper chan 0xf16dc50 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xae) [0x811a3d2]
/usr/sbin/asterisk(__ao2_lock+0x48) [0x8086df4]
/usr/sbin/asterisk(pbx_builtin_setvar_helper+0xea) [0x815565c]
/usr/lib/asterisk/modules/chan_sip.so(+0x741ce) [0xb6b4b1ce]
/usr/lib/asterisk/modules/chan_sip.so(+0x7b1ba) [0xb6b521ba]
/usr/lib/asterisk/modules/chan_sip.so(+0x7bc65) [0xb6b52c65]
/usr/lib/asterisk/modules/chan_sip.so(+0x7b86e) [0xb6b5286e]
/usr/sbin/asterisk(ast_io_wait+0x144) [0x8114e34]
/usr/lib/asterisk/modules/chan_sip.so(+0x7d3cf) [0xb6b543cf]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== --- ---> Locked Here: rtp_engine.c line 1263 (ast_rtp_instance_bridge)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xb52ffb70 (pbx_thread           started at [ 5079] pbx.c ast_pbx_start())
=== ---> Lock #0 (rtp_engine.c): MUTEX 1262 ast_rtp_instance_bridge c0 0xb5351bf0 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xae) [0x811a3d2]
/usr/sbin/asterisk(__ao2_lock+0x48) [0x8086df4]
/usr/sbin/asterisk(ast_rtp_instance_bridge+0x131) [0x815cf51]
/usr/sbin/asterisk(ast_channel_bridge+0xd25) [0x80be511]
/usr/sbin/asterisk(ast_bridge_call+0xd5f) [0x80f6978]
/usr/lib/asterisk/modules/app_dial.so(+0x106ce) [0xb61f86ce]
/usr/lib/asterisk/modules/app_dial.so(+0x11103) [0xb61f9103]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/lib/asterisk/modules/res_agi.so(+0x963c) [0xb6f2b63c]
/usr/lib/asterisk/modules/res_agi.so(+0xc0c5) [0xb6f2e0c5]
/usr/lib/asterisk/modules/res_agi.so(+0xcc18) [0xb6f2ec18]
/usr/lib/asterisk/modules/res_agi.so(+0xe0ef) [0xb6f300ef]
/usr/lib/asterisk/modules/res_agi.so(+0xe24d) [0xb6f3024d]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/sbin/asterisk() [0x8142a28]
/usr/sbin/asterisk(ast_spawn_extension+0x53) [0x814432b]
/usr/sbin/asterisk() [0x8144d2d]
/usr/sbin/asterisk() [0x8146a02]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== ---> Lock #1 (rtp_engine.c): MUTEX 1263 ast_rtp_instance_bridge c1 0xf16dc50 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_trylock+0xae) [0x811a73a]
/usr/sbin/asterisk(__ao2_trylock+0x48) [0x8086e88]
/usr/sbin/asterisk(ast_rtp_instance_bridge+0x1c9) [0x815cfe9]
/usr/sbin/asterisk(ast_channel_bridge+0xd25) [0x80be511]
/usr/sbin/asterisk(ast_bridge_call+0xd5f) [0x80f6978]
/usr/lib/asterisk/modules/app_dial.so(+0x106ce) [0xb61f86ce]
/usr/lib/asterisk/modules/app_dial.so(+0x11103) [0xb61f9103]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/lib/asterisk/modules/res_agi.so(+0x963c) [0xb6f2b63c]
/usr/lib/asterisk/modules/res_agi.so(+0xc0c5) [0xb6f2e0c5]
/usr/lib/asterisk/modules/res_agi.so(+0xcc18) [0xb6f2ec18]
/usr/lib/asterisk/modules/res_agi.so(+0xe0ef) [0xb6f300ef]
/usr/lib/asterisk/modules/res_agi.so(+0xe24d) [0xb6f3024d]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/sbin/asterisk() [0x8142a28]
/usr/sbin/asterisk(ast_spawn_extension+0x53) [0x814432b]
/usr/sbin/asterisk() [0x8144d2d]
/usr/sbin/asterisk() [0x8146a02]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== ---> Waiting for Lock #2 (chan_sip.c): MUTEX 28150 sip_get_rtp_peer p 0xb5365178 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8120fb2]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xae) [0x811a3d2]
/usr/sbin/asterisk(__ao2_lock+0x48) [0x8086df4]
/usr/lib/asterisk/modules/chan_sip.so(+0x92edf) [0xb6b69edf]
/usr/sbin/asterisk(ast_rtp_instance_bridge+0x33a) [0x815d15a]
/usr/sbin/asterisk(ast_channel_bridge+0xd25) [0x80be511]
/usr/sbin/asterisk(ast_bridge_call+0xd5f) [0x80f6978]
/usr/lib/asterisk/modules/app_dial.so(+0x106ce) [0xb61f86ce]
/usr/lib/asterisk/modules/app_dial.so(+0x11103) [0xb61f9103]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/lib/asterisk/modules/res_agi.so(+0x963c) [0xb6f2b63c]
/usr/lib/asterisk/modules/res_agi.so(+0xc0c5) [0xb6f2e0c5]
/usr/lib/asterisk/modules/res_agi.so(+0xcc18) [0xb6f2ec18]
/usr/lib/asterisk/modules/res_agi.so(+0xe0ef) [0xb6f300ef]
/usr/lib/asterisk/modules/res_agi.so(+0xe24d) [0xb6f3024d]
/usr/sbin/asterisk(pbx_exec+0x1d0) [0x81398c5]
/usr/sbin/asterisk() [0x8142a28]
/usr/sbin/asterisk(ast_spawn_extension+0x53) [0x814432b]
/usr/sbin/asterisk() [0x8144d2d]
/usr/sbin/asterisk() [0x8146a02]
/usr/sbin/asterisk() [0x819c9c1]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb72ea985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb74fe13e]
=== --- ---> Locked Here: chan_sip.c line 22837 (handle_request_refer)
=== -------------------------------------------------------------------
===
=======================================================================


Asterisk 1.8.5-rc1 built by root


By: James Van Vleet (jvanvleet) 2011-07-01 10:48:35.171-0500

Asterisk 1.8.5-rc1 backtrace for sip transfers segfaulting


By: James Van Vleet (jvanvleet) 2011-07-01 10:50:09.217-0500

Managed with more testing to segfault Asterisk 1.8.5-rc1 instead of deadlocking it.  backtrace attached.