[Home]

Summary:ASTERISK-24456: SIP deadlock in transfer scenario between Asterisk Servers
Reporter:Peter Katzmann (pk16208)Labels:
Date Opened:2014-10-27 04:04:47Date Closed:2015-02-03 18:06:46.000-0600
Priority:CriticalRegression?
Status:Closed/CompleteComponents:Channels/chan_local Channels/chan_sip/General
Versions:11.13.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Ubuntu preciseAttachments:( 0) asterisk-sip-loc
( 1) backtrace.txt
( 2) backtrace-threads.txt
( 3) deadlock.txt
( 4) serverchef.txt
( 5) serversek.txt
( 6) threads.txt
Description:We have 3 Asterisk server (a,b and c)
we have also 3 different user (Caller, Chef, Sek) .
User Caller on Server a
User Chef on Server b
User Sek on Server c

Now Caller Dials to Chef
Chef has a call rule via agi to only accept direct calls from Sek, all other calls are transferred to Sek.

So the call from Caller on server a is tranfered to Sek on server c automatically
Sek accepts the calls

Now Sek will do a unattended transfer of caller to Chef

Now the server b with Chef shows no action on the dialplan anymore and blocks

Sometimes also crashes.
Comments:By: Peter Katzmann (pk16208) 2014-10-27 04:06:01.609-0500

gdb backtrace after system blocks

By: Peter Katzmann (pk16208) 2014-10-27 04:16:57.207-0500

Dialplan execution

By: Rusty Newton (rnewton) 2014-11-05 11:41:01.269-0600

Peter can you also provide the output of "core show locks" during the deadlock as mentioned in the instructions for [Getting Information for a Deadlock|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock] ?

You said it also crashes sometimes? Can you provide a [backtrace|https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace] from the next crash?

For the logs, please provide new logs with the DEBUG logger channels also enabled. You can follow the instructions for [Collecting Debug Information|https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information]

By: Peter Katzmann (pk16208) 2014-11-06 01:59:25.575-0600

core show threads

By: Peter Katzmann (pk16208) 2014-11-06 05:31:38.879-0600

Backtrace of cordump with asterisk 11.14rc1

By: Peter Katzmann (pk16208) 2014-11-06 05:33:22.027-0600

When i enabled all debug option for core show locks it was a littel bit harder to force the coredump.

I also noticed a new Message during call flow:

[Nov  6 11:19:07] WARNING[363][C-0000000c]: channel.c:1309 __ast_queue_frame: Exceptionally long voice queue length queuing to Local/1000@from-internal-4803977-acl-00000007;2


Also the load of the asterisk server skyrocket

By: Peter Katzmann (pk16208) 2014-11-07 03:58:48.559-0600

Attached is  a new coredump trace.

Also it seems that it is not a sip deadlock but has something to do with local channels, because i get this error message about the exceptional long packages.
During this time the server is very unresponsive


By: Rusty Newton (rnewton) 2014-11-07 16:48:56.284-0600

{quote}
When i enabled all debug option for core show locks it was a littel bit harder to force the coredump.
{quote}

I didn't understand whether you were able to get the output of "core show locks" or not. Did I overlook it?

Thanks!

By: Peter Katzmann (pk16208) 2014-11-09 13:40:32.214-0600

Well,
core show locks showed nothing.
But as said already, after building and using the debugging version, i got these new error messages about the large unexpected frames.
During this Time the system was nearly unusable due to high load, the other connected asterisk servers reported server was gone.

And mostly i get a coredump.

By: Private Name (falves11) 2014-11-09 14:21:50.232-0600

Dear Rusty. I am the reporter of ASTERISK-24015. I am available to try a patch. My business greatly affected by this issue.

By: Matt Jordan (mjordan) 2014-11-10 09:37:28.975-0600

[~falves11]: this issue has nothing to do with ASTERISK-24015.

[~pk16208]: When this issue first occurred, did you have {{DEBUG_THREADS}} enabled? If you disable {{DEBUG_THREADS}} completely, does the issue go away?

By: Peter Katzmann (pk16208) 2014-11-10 13:44:12.390-0600

Hmm,
what do you mean exactly.
The first backtrace is with a version without DEBUG_THREAD build.
This shows same behavior but not the message about the frames in local channel.

The second coredump is a DEBUG_THREAD build. This one shows the message and comes back to normal state sometimes, but mostly crashes.

By: Rusty Newton (rnewton) 2014-11-13 10:39:05.354-0600

I'm unable to reproduce the issue.

Peter can you provide simple SIP configuration and dialplan that would allow us to reproduce the issue?

If we can reproduce , that would make this much easier to track down.

Also, for your own reproduction, can you provide logs with the DEBUG channel enabled and turned up as I mentioned before? If possible, for both the crashing scenario and the locking scenario.

By: Peter Katzmann (pk16208) 2014-11-14 08:12:14.761-0600

Well i'm trying to narrow down the problem.

The server skyrocks load when a calls b but was forwarded to c.
But only while a and c are connected not during ringing. Audio is delayed also.

When i now try to transfer the call back from c to b, the system even loose sip connection or crashes.

Is there any possibility to find the root cause of this high load inside asterisk ?

Channel              Location             State   Application(Data)            
SIP/asterisk-asteris (None)               Up      AppDial((Outgoing Line))      
Local/1000@from-inte (None)               Up      AppDial((Outgoing Line))      
Local/1000@from-inte s@to-node:9          Up      Dial(SIP/asterisk asteriskdev1
SIP/2202-00000004    s@macro-executeagi:8 Up      Dial(Local/1000@from-internal-



By: Rusty Newton (rnewton) 2014-11-18 18:07:11.368-0600

{quote}
Is there any possibility to find the root cause of this high load inside asterisk ?
{quote}
The traces provided and the debug requested will help a developer determine what is going on.

To clarify - you have described that your server Sek, the final receiver of the transfer ends up in a blocking/locked state and sometimes crashes.

To investigate both the crash and the potential deadlock, we need:

* Backtraces from the core dump generated when crashing.
* Backtrace when Asterisk is seemingly in a deadlock
* Asterisk logs, including the DEBUG channel and SIP trace for both the crashing and locking scenarios.
* Channel driver and dialplan configuration that would allow us to reproduce the issue.

Once you have provided all of the information, please make sure to clarify which attached file contains what information.

By: Peter Katzmann (pk16208) 2014-12-08 04:00:36.633-0600

Destination asterisk with user 2203 deadlock after transfer from user 1000

Backtrace and sip and debug logs

By: Peter Katzmann (pk16208) 2014-12-08 04:03:47.035-0600

SIP Log and debug enabled.
Backtrace of deadlocked asterisk.

Last lines bevore death:
<------------->
--- (16 headers 24 lines) ---
 == Setting global variable '130_10_14_249_is' to 'alive'
 == Setting global variable '130_10_15_15_is' to 'alive'
lnx06-asteriskdev2*CLI>
lnx06-asteriskdev2*CLI>
lnx06-asteriskdev2*CLI>
 == Setting global variable '130_10_14_249_is' to 'alive'
 == Setting global variable '130_10_15_15_is' to 'alive'
lnx06-asteriskdev2*CLI> core show channels
Channel              Location             State   Application(Data)            
SIP/asterisk-asteris 2203@macro-dial-loca Up      Dial(SIP/2203&SIP/dect2203,,It
SIP/asterisk-asteris (None)               Up      AppDial((Outgoing Line))      
lnx06-asteriskdev2*CLI> core show channels
lnx06-asteriskdev2*CLI> core show thre


It's not possible to provide a simple dialplan szenario due to several agi's involved

By: Matt Jordan (mjordan) 2014-12-24 09:03:49.115-0600

A few comments, and then a suggestion:

# Your first backtrace ({{backtrace.txt}}) did have {{DEBUG_THREADS}} enabled. Looking at the core dump:
{noformat}
#0  0xb715bcb7 in pthread_mutex_lock () from /lib/i386-linux-gnu/libpthread.so.0
#0  0xb715bcb7 in pthread_mutex_lock () from /lib/i386-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0xb766d504 in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#2  0x0813c6c7 in ast_reentrancy_lock (lt=0x6b736972) at /usr/local/src/asterisk-11.14/asterisk-11.14.0-rc1/include/asterisk/lock.h:420
       res = 135518668
{noformat}
The {{ast_reentrancy_lock}} call does not get called unless you have {{DEBUG_THREADS}} enabled:
{code}
nt __ast_pthread_mutex_lock(const char *filename, int lineno, const char *func,
const char* mutex_name, ast_mutex_t *t)
{
int res;

#ifdef DEBUG_THREADS
struct ast_lock_track *lt = NULL;
int canlog = t->tracking && strcmp(filename, "logger.c");
#ifdef HAVE_BKTR
struct ast_bt *bt = NULL;
#endif

if (t->tracking) {
lt = ast_get_reentrancy(&t->track);
}

if (lt) {
#ifdef HAVE_BKTR
struct ast_bt tmp;

/* The implementation of backtrace() may have its own locks.
* Capture the backtrace outside of the reentrancy lock to
* avoid deadlocks. See ASTERISK-22455. */
ast_bt_get_addresses(&tmp);

ast_reentrancy_lock(lt);
if (lt->reentrancy < AST_MAX_REENTRANCY) {
lt->backtrace[lt->reentrancy] = tmp;
bt = &lt->backtrace[lt->reentrancy];
}
ast_reentrancy_unlock(lt);

ast_store_lock_info(AST_MUTEX, filename, lineno, func, mutex_name, t, bt);
#else
ast_store_lock_info(AST_MUTEX, filename, lineno, func, mutex_name, t);
#endif
}
#endif /* DEBUG_THREADS */
{code}
There are severe problems with {{DEBUG_THREADS}} in that version of Asterisk 11 - so as far as I can tell, this issue only arises when {{DEBUG_THREADS}} is enabled. That's not surprising, since {{DEBUG_THREADS}} causes what will appear to be a deadlock when it is enabled.
# Your second backtrace - {{backtrace-threads.txt}} - has symbols stripped out of it, which makes it hard to tell what is going on. Because of that, it is hard to tell why the locks are still being held. You will need to make sure that all of the symbols are in all of the modules that Asterisk is using.
# The log file simply shows that a call was being processed. The lack of additional {{chan_sip}} messages after some period of time is concerning, but without further explanation as to what you feel should have been happening when, I'd be guessing at what the issue is. Again, however, if you are running with {{DEBUG_THREADS}}, you're sitting on a tick time bomb. It wouldn't shock me if {{chan_sip}} suddenly got locked up with that enabled.
# If this is _not_ actually being caused by {{DEBUG_THREADS}}, _then_, if we cannot reproduce your issue, or you can't provide all of the configuration information to reproduce the issue, the issue is going to be closed. That means narrowing this down to a simple dialplan that reproduces the issue, and/or providing all of the information so that a bug marshal can reproduce the issue. It's pointless to have an issue that a developer can't fix.

So, some suggestions:
# Disable {{DEBUG_THREADS}}:
## It causes problems in your version of Asterisk. Don't run with it.
## Even if it didn't cause spurious problems that mimic a deadlock, it is a performance killer. Running a production server with {{DEBUG_THREADS}} enabled will severely cripple Asterisk's ability to handle calls.
# If you want to run with {{DEBUG_THREADS}}, then, since we have now fixed the issue that {{DEBUG_THREADS}} is causing, please test out the latest from the 11 branch:
{noformat}
svn co http://svn.asterisk.org/svn/asterisk/branches/11
{noformat}
If you continue to see deadlock like symptoms with the latest from the 11 branch, the please do the following:
## Make sure you have {{DONT_OPTIMIZE}} and {{BETTER_BACKTRACES}} enable in menuselect
## Get the output of {{core show locks}} and attach it to this issue
## Get a {{gdb}} backtrace similar to your {{backtrace-threads}} and attach it to this issue. Make sure all the symbols are there.

Thanks!

By: Peter Katzmann (pk16208) 2014-12-30 08:22:40.182-0600

I'm currently out of office due to season holidays but will report back as soon as possible

By: Rusty Newton (rnewton) 2015-01-15 09:07:29.985-0600

Please let us know what you find.

By: Peter Katzmann (pk16208) 2015-01-16 04:25:33.354-0600

Sorry can't get it compiled due to the bug related to ASTERISK-24358.

This is also with gcc 4.6

By: Matt Jordan (mjordan) 2015-01-16 07:19:25.915-0600

Really? Are you using {{chan_vpb}}?

If not, why not just disable that module in menuselect?

By: Peter Katzmann (pk16208) 2015-01-30 06:26:56.411-0600

Hello,
it seems that the problem does not occur anymore after some dialplan changes.

So i have no chance to reproduce the problem anymre.

You can close the case

By: Rusty Newton (rnewton) 2015-02-03 18:06:46.956-0600

Ok. I'm thinking we probably wouldn't have been able to track this down regardless. Thanks for letting us know.