Summary: | ASTERISK-24456: SIP deadlock in transfer scenario between Asterisk Servers | ||
Reporter: | Peter Katzmann (pk16208) | Labels: | |
Date Opened: | 2014-10-27 04:04:47 | Date Closed: | 2015-02-03 18:06:46.000-0600 |
Priority: | Critical | Regression? | |
Status: | Closed/Complete | Components: | Channels/chan_local Channels/chan_sip/General |
Versions: | 11.13.0 | Frequency of Occurrence | Constant |
Related Issues: | |||
Environment: | Ubuntu precise | Attachments: | ( 0) asterisk-sip-loc ( 1) backtrace.txt ( 2) backtrace-threads.txt ( 3) deadlock.txt ( 4) serverchef.txt ( 5) serversek.txt ( 6) threads.txt |
Description: | We have 3 Asterisk server (a,b and c)
we have also 3 different user (Caller, Chef, Sek) . User Caller on Server a User Chef on Server b User Sek on Server c Now Caller Dials to Chef Chef has a call rule via agi to only accept direct calls from Sek, all other calls are transferred to Sek. So the call from Caller on server a is tranfered to Sek on server c automatically Sek accepts the calls Now Sek will do a unattended transfer of caller to Chef Now the server b with Chef shows no action on the dialplan anymore and blocks Sometimes also crashes. | ||
Comments: | By: Peter Katzmann (pk16208) 2014-10-27 04:06:01.609-0500 gdb backtrace after system blocks By: Peter Katzmann (pk16208) 2014-10-27 04:16:57.207-0500 Dialplan execution By: Rusty Newton (rnewton) 2014-11-05 11:41:01.269-0600 Peter can you also provide the output of "core show locks" during the deadlock as mentioned in the instructions for [Getting Information for a Deadlock|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock] ? You said it also crashes sometimes? Can you provide a [backtrace|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace] from the next crash? For the logs, please provide new logs with the DEBUG logger channels also enabled. You can follow the instructions for [Collecting Debug Information|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information] By: Peter Katzmann (pk16208) 2014-11-06 01:59:25.575-0600 core show threads By: Peter Katzmann (pk16208) 2014-11-06 05:31:38.879-0600 Backtrace of cordump with asterisk 11.14rc1 By: Peter Katzmann (pk16208) 2014-11-06 05:33:22.027-0600 When i enabled all debug option for core show locks it was a littel bit harder to force the coredump. I also noticed a new Message during call flow: [Nov 6 11:19:07] WARNING[363][C-0000000c]: channel.c:1309 __ast_queue_frame: Exceptionally long voice queue length queuing to Local/1000@from-internal-4803977-acl-00000007;2 Also the load of the asterisk server skyrocket By: Peter Katzmann (pk16208) 2014-11-07 03:58:48.559-0600 Attached is a new coredump trace. Also it seems that it is not a sip deadlock but has something to do with local channels, because i get this error message about the exceptional long packages. During this time the server is very unresponsive By: Rusty Newton (rnewton) 2014-11-07 16:48:56.284-0600 {quote} When i enabled all debug option for core show locks it was a littel bit harder to force the coredump. {quote} I didn't understand whether you were able to get the output of "core show locks" or not. Did I overlook it? Thanks! By: Peter Katzmann (pk16208) 2014-11-09 13:40:32.214-0600 Well, core show locks showed nothing. But as said already, after building and using the debugging version, i got these new error messages about the large unexpected frames. During this Time the system was nearly unusable due to high load, the other connected asterisk servers reported server was gone. And mostly i get a coredump. By: Private Name (falves11) 2014-11-09 14:21:50.232-0600 Dear Rusty. I am the reporter of ASTERISK-24015. I am available to try a patch. My business greatly affected by this issue. By: Matt Jordan (mjordan) 2014-11-10 09:37:28.975-0600 [~falves11]: this issue has nothing to do with ASTERISK-24015. [~pk16208]: When this issue first occurred, did you have {{DEBUG_THREADS}} enabled? If you disable {{DEBUG_THREADS}} completely, does the issue go away? By: Peter Katzmann (pk16208) 2014-11-10 13:44:12.390-0600 Hmm, what do you mean exactly. The first backtrace is with a version without DEBUG_THREAD build. This shows same behavior but not the message about the frames in local channel. The second coredump is a DEBUG_THREAD build. This one shows the message and comes back to normal state sometimes, but mostly crashes. By: Rusty Newton (rnewton) 2014-11-13 10:39:05.354-0600 I'm unable to reproduce the issue. Peter can you provide simple SIP configuration and dialplan that would allow us to reproduce the issue? If we can reproduce , that would make this much easier to track down. Also, for your own reproduction, can you provide logs with the DEBUG channel enabled and turned up as I mentioned before? If possible, for both the crashing scenario and the locking scenario. By: Peter Katzmann (pk16208) 2014-11-14 08:12:14.761-0600 Well i'm trying to narrow down the problem. The server skyrocks load when a calls b but was forwarded to c. But only while a and c are connected not during ringing. Audio is delayed also. When i now try to transfer the call back from c to b, the system even loose sip connection or crashes. Is there any possibility to find the root cause of this high load inside asterisk ? Channel Location State Application(Data) SIP/asterisk-asteris (None) Up AppDial((Outgoing Line)) Local/1000@from-inte (None) Up AppDial((Outgoing Line)) Local/1000@from-inte s@to-node:9 Up Dial(SIP/asterisk asteriskdev1 SIP/2202-00000004 s@macro-executeagi:8 Up Dial(Local/1000@from-internal- By: Rusty Newton (rnewton) 2014-11-18 18:07:11.368-0600 {quote} Is there any possibility to find the root cause of this high load inside asterisk ? {quote} The traces provided and the debug requested will help a developer determine what is going on. To clarify - you have described that your server Sek, the final receiver of the transfer ends up in a blocking/locked state and sometimes crashes. To investigate both the crash and the potential deadlock, we need: * Backtraces from the core dump generated when crashing. * Backtrace when Asterisk is seemingly in a deadlock * Asterisk logs, including the DEBUG channel and SIP trace for both the crashing and locking scenarios. * Channel driver and dialplan configuration that would allow us to reproduce the issue. Once you have provided all of the information, please make sure to clarify which attached file contains what information. By: Peter Katzmann (pk16208) 2014-12-08 04:00:36.633-0600 Destination asterisk with user 2203 deadlock after transfer from user 1000 Backtrace and sip and debug logs By: Peter Katzmann (pk16208) 2014-12-08 04:03:47.035-0600 SIP Log and debug enabled. Backtrace of deadlocked asterisk. Last lines bevore death: <-------------> --- (16 headers 24 lines) --- == Setting global variable '130_10_14_249_is' to 'alive' == Setting global variable '130_10_15_15_is' to 'alive' lnx06-asteriskdev2*CLI> lnx06-asteriskdev2*CLI> lnx06-asteriskdev2*CLI> == Setting global variable '130_10_14_249_is' to 'alive' == Setting global variable '130_10_15_15_is' to 'alive' lnx06-asteriskdev2*CLI> core show channels Channel Location State Application(Data) SIP/asterisk-asteris 2203@macro-dial-loca Up Dial(SIP/2203&SIP/dect2203,,It SIP/asterisk-asteris (None) Up AppDial((Outgoing Line)) lnx06-asteriskdev2*CLI> core show channels lnx06-asteriskdev2*CLI> core show thre It's not possible to provide a simple dialplan szenario due to several agi's involved By: Matt Jordan (mjordan) 2014-12-24 09:03:49.115-0600 A few comments, and then a suggestion: # Your first backtrace ({{backtrace.txt}}) did have {{DEBUG_THREADS}} enabled. Looking at the core dump: {noformat} #0 0xb715bcb7 in pthread_mutex_lock () from /lib/i386-linux-gnu/libpthread.so.0 #0 0xb715bcb7 in pthread_mutex_lock () from /lib/i386-linux-gnu/libpthread.so.0 No symbol table info available. #1 0xb766d504 in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6 No symbol table info available. #2 0x0813c6c7 in ast_reentrancy_lock (lt=0x6b736972) at /usr/local/src/asterisk-11.14/asterisk-11.14.0-rc1/include/asterisk/lock.h:420 res = 135518668 {noformat} The {{ast_reentrancy_lock}} call does not get called unless you have {{DEBUG_THREADS}} enabled: {code} nt __ast_pthread_mutex_lock(const char *filename, int lineno, const char *func, const char* mutex_name, ast_mutex_t *t) { int res; #ifdef DEBUG_THREADS struct ast_lock_track *lt = NULL; int canlog = t->tracking && strcmp(filename, "logger.c"); #ifdef HAVE_BKTR struct ast_bt *bt = NULL; #endif if (t->tracking) { lt = ast_get_reentrancy(&t->track); } if (lt) { #ifdef HAVE_BKTR struct ast_bt tmp; /* The implementation of backtrace() may have its own locks. * Capture the backtrace outside of the reentrancy lock to * avoid deadlocks. See ASTERISK-22455. */ ast_bt_get_addresses(&tmp); ast_reentrancy_lock(lt); if (lt->reentrancy < AST_MAX_REENTRANCY) { lt->backtrace[lt->reentrancy] = tmp; bt = <->backtrace[lt->reentrancy]; } ast_reentrancy_unlock(lt); ast_store_lock_info(AST_MUTEX, filename, lineno, func, mutex_name, t, bt); #else ast_store_lock_info(AST_MUTEX, filename, lineno, func, mutex_name, t); #endif } #endif /* DEBUG_THREADS */ {code} There are severe problems with {{DEBUG_THREADS}} in that version of Asterisk 11 - so as far as I can tell, this issue only arises when {{DEBUG_THREADS}} is enabled. That's not surprising, since {{DEBUG_THREADS}} causes what will appear to be a deadlock when it is enabled. # Your second backtrace - {{backtrace-threads.txt}} - has symbols stripped out of it, which makes it hard to tell what is going on. Because of that, it is hard to tell why the locks are still being held. You will need to make sure that all of the symbols are in all of the modules that Asterisk is using. # The log file simply shows that a call was being processed. The lack of additional {{chan_sip}} messages after some period of time is concerning, but without further explanation as to what you feel should have been happening when, I'd be guessing at what the issue is. Again, however, if you are running with {{DEBUG_THREADS}}, you're sitting on a tick time bomb. It wouldn't shock me if {{chan_sip}} suddenly got locked up with that enabled. # If this is _not_ actually being caused by {{DEBUG_THREADS}}, _then_, if we cannot reproduce your issue, or you can't provide all of the configuration information to reproduce the issue, the issue is going to be closed. That means narrowing this down to a simple dialplan that reproduces the issue, and/or providing all of the information so that a bug marshal can reproduce the issue. It's pointless to have an issue that a developer can't fix. So, some suggestions: # Disable {{DEBUG_THREADS}}: ## It causes problems in your version of Asterisk. Don't run with it. ## Even if it didn't cause spurious problems that mimic a deadlock, it is a performance killer. Running a production server with {{DEBUG_THREADS}} enabled will severely cripple Asterisk's ability to handle calls. # If you want to run with {{DEBUG_THREADS}}, then, since we have now fixed the issue that {{DEBUG_THREADS}} is causing, please test out the latest from the 11 branch: {noformat} svn co http://svn.asterisk.org/svn/asterisk/branches/11 {noformat} If you continue to see deadlock like symptoms with the latest from the 11 branch, the please do the following: ## Make sure you have {{DONT_OPTIMIZE}} and {{BETTER_BACKTRACES}} enable in menuselect ## Get the output of {{core show locks}} and attach it to this issue ## Get a {{gdb}} backtrace similar to your {{backtrace-threads}} and attach it to this issue. Make sure all the symbols are there. Thanks! By: Peter Katzmann (pk16208) 2014-12-30 08:22:40.182-0600 I'm currently out of office due to season holidays but will report back as soon as possible By: Rusty Newton (rnewton) 2015-01-15 09:07:29.985-0600 Please let us know what you find. By: Peter Katzmann (pk16208) 2015-01-16 04:25:33.354-0600 Sorry can't get it compiled due to the bug related to ASTERISK-24358. This is also with gcc 4.6 By: Matt Jordan (mjordan) 2015-01-16 07:19:25.915-0600 Really? Are you using {{chan_vpb}}? If not, why not just disable that module in menuselect? By: Peter Katzmann (pk16208) 2015-01-30 06:26:56.411-0600 Hello, it seems that the problem does not occur anymore after some dialplan changes. So i have no chance to reproduce the problem anymre. You can close the case By: Rusty Newton (rnewton) 2015-02-03 18:06:46.956-0600 Ok. I'm thinking we probably wouldn't have been able to track this down regardless. Thanks for letting us know. |