[Home]

Summary:ASTERISK-17436: random deadlocks - SIP messages not being processed
Reporter:Henry Fernandes (usinternet)Labels:
Date Opened:2011-02-18 15:28:13.000-0600Date Closed:2013-04-19 11:00:56
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General Resources/res_timing_pthread
Versions:Frequency of
Occurrence
Related
Issues:
is related toASTERISK-21389 res_timing_pthread fails to return from write, causing timer dependent operations to block indefinitely
Environment:Attachments:( 0) backtrace-threads.txt
( 1) backtrace-threads-11021911:55:58.txt
( 2) backtrace-threads-11021912:05:44.txt
( 3) core-show-locks.txt
( 4) core-show-locks-11021911:55:58.txt
( 5) core-show-locks-11021912:05:44.txt
Description:Just upgraded to Asterisk 1.8.2 for production and we're getting random deadlocks.  When this happens, it looks like Asterisk is not able to process new SIP messages.  Any INVITEs or REGISTERs that arrive are not being outputted to the console or logs.

This seems to happen randomly.  I have not been able to reproduce it at will although it does happen about 5-10 times per day.  The server is handling about 2000 calls/day, so it's not really that busy.  There are about 75 phones connected to the server.

Please see attached backtrace and lock output for more information.  It shows the deadlocks but I'm not sure which is the root issue here.

****** ADDITIONAL INFORMATION ******

I've been able to reproduce this on Asterisk 1.8.2.2 and 1.8.2.3.  OS is CentOS release 5.5 (Final).

Comments:By: Damian Minkov (damencho) 2011-02-19 13:55:13.000-0600

We are experiencing the same locks on asterisk 1.8.2.3, we experienced it and before using 1.8.0, but rarely than after upgrading to latest.
We've seen it on Ubuntu Server 10.10 and Ubuntu Server 10.04.
After not able to process the INVITES and all lock ends and normal operation resumes, fantom calls are experienced, the INVITES that were not processed during lock are than processed. The duration of the locks vary between 5 and 10 minutes.
Lately when trying to produce backtraces and list of locks, this immediately ends the lock period and normal operation continues. I'm attaching two such logs from today.

By: Luke H (luckman212) 2011-02-20 21:45:49.000-0600

I believe I'm also experiencing this.  CentOS 5.5, very small PBX (Atom230 cpu) and only a handful of phones.  Some phones are using TCP/TLS to register.  Deadlocks seem to be most common during REGISTER or INVITE.  

I'm curious, what hardware timer (clocksource) are you using on these machines? e.g. HPET, ACPI_PM, TSC, etc.  Also, what timer modules in Asterisk?  (pthread, dahdi, timerfd)  I think this may be key.

By: Damian Minkov (damencho) 2011-02-21 01:33:22.000-0600

Most of the software that connects to the machine is using TCP.

In asterisk we use the timer module timerfd.
*CLI> timing test
Attempting to test a timer with 50 ticks per second.
Using the 'timerfd' timing module for this test.
It has been 1000 milliseconds, and we got 50 timer ticks.

The hardware clocksource I beleave its tsc:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
# cat  /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

By: Henry Fernandes (usinternet) 2011-02-21 09:32:50.000-0600

Our phones connect via UDP.  Our timing source is DAHDI.

CLI> timing test
Attempting to test a timer with 50 ticks per second.
Using the 'DAHDI' timing module for this test.
It has been 1019 milliseconds, and we got 51 timer ticks

By: Casey Morford (cmorford) 2011-02-21 11:00:53.000-0600

I can confirm we are having this issue with 1.8.2.3 as well.  I doesn't seemed to crash for us a frequently, but it is definitely in deadlock.

Out of curiosity, are you seeing a unusually high number of chan_sip __sip_xmit warnings?

By: Luke H (luckman212) 2011-02-21 11:04:51.000-0600

Hmm ok, I'm using acpi_pm on my particular box, so I guess it's probably not related to the 'clocksource' -- but I do believe it is possibly a bug in pthreads.  I am in the process of swapping some things around to try to eliminate pthreads completely & just use dahdi timing as a test.  

But I believe this completely KILLS music on hold, at least on 1.8.x

By: Luke H (luckman212) 2011-02-22 11:16:55.000-0600

I can confirm that re-enabling res_timing_dahdi has stopped the deadlocks on my PBX.  MOH still seems to hook into res_timing_pthreads for something, my 'module show like timing' looks like this

pbx*CLI> module show like timing
Module                         Description                              Use Count
res_timing_dahdi.so            DAHDI Timing Interface                   1        
res_timing_pthread.so          pthread Timing Interface                 2        
2 modules loaded

anyway, as I expected, this has killed my MOH.  MOH is not working correctly with res_timing_dahdi enabled for some reason.

By: cmaj (cmaj) 2011-02-22 13:13:48.000-0600

luckman212, I'm running only res_timing_dahdi for timing, and MOH works fine. This is on 1.8.3-rc2 and a Sangoma card - couple T1's clocked from the telco - with ulaw across the board.

As for the last core show locks, core-show-locks-11021912:05:44.txt, that looks like Mysql lock contention, one lock being held to the Mysql backend for both CDR insertion and realtime config. If whoever posted that could try turning off Mysql CDR logging for a bit, then it might help them. Or try setting "batchmode=yes" in cdr.conf so you can reduce some contention for that lock (this would cause a post of the CDR in a separate thread from the channel Hangup routine.)

By: isrl (isrl) 2011-02-22 13:27:13.000-0600

Luckyman212: i also have a problem with moh and the dahdi timer but only if i use a external source like madplay for moh

I tried debugging that but i havent found anything yet

and when going to pthreads i have every few hours a deadlock which i reported at
https://issues.asterisk.org/view.php?id=18867

By: Luke H (luckman212) 2011-02-22 13:33:04.000-0600

Yes I am using mpg123 for my moh source to stream, so I guess that's the issue.  Has a separate bug been filed for that?  or is it fixed in SVN-- do you know?

By: isrl (isrl) 2011-02-22 13:53:38.000-0600

nope i tried from 1.8.1 all the way up to svn plus a few open patches in the bug tracker and the problem still exsits

i havent open'd a bug report as i didnt know how to debug the problem
now we know thats its when using a external program to stream moh

By: Damian Minkov (damencho) 2011-02-24 04:58:28.000-0600

We've disabled cdr_mysql.so and again we had the lock today.
I had to restart it to get it fixed.
The interesting was:

CLI> core show channels
Channel              Location             State   Application(Data)            
0 active channels
1 active call

CLI> sip show channels
I had one channel in BYE

And when trying
CLI> sip show channel + typing <tab> and the console locked.

Unfortunately I didn't manage to get backtrace and lock dumps cause we moved back the asterisk installation on other machine. And this one was not compiled with debug information. Will recompile now if happens again to trace it.

By: Luke H (luckman212) 2011-02-24 07:52:34.000-0600

How do we even begin to debug why using mpg123 for a MOH stream + using res_timing_dahdi causes no sound?  I wouldn't even begin to know where to start looking.  Because it's not crashing or coredumping.

By: Ronald Chan (loloski) 2011-02-24 13:12:01.000-0600

From luckman212:

Yes I am using mpg123 for my moh source to stream, so I guess that's the issue. Has a separate bug been filed for that? or is it fixed in SVN-- do you know?

Yes please file a separate bug for this one :D

By: isrl (isrl) 2011-02-24 13:30:00.000-0600

i file'd a bug yesterday
https://issues.asterisk.org/view.php?id=18877

By: Luke H (luckman212) 2011-02-24 13:49:21.000-0600

whoops I just filed one too
https://issues.asterisk.org/view.php?id=18884

By: Matt Jordan (mjordan) 2013-04-19 10:56:48.584-0500

There is a patch on ASTERISK-21389 that should prevent res_timing_pthread from blocking callers. That should resolve this issue.

If you test with that patch and this issue is still a problem, please let a bug marshal know in #asterisk-bugs and we will be happy to reopen this issue. Thanks!