[Home]

Summary:ASTERISK-21387: Asterisk unresponsive while waiting for lock MUTEX 6239 __ast_pbx_run c
Reporter:Duane Larson (osiris123d)Labels:
Date Opened:2013-04-06 15:42:20Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:Addons/res_config_mysql
Versions:11.0.1 11.2.1 11.3.0 13.18.4 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Debian 6.0.4Attachments:( 0) backtrace-threads.txt
( 1) core-show-locks.txt
( 2) gdb.txt
( 3) gdb_debug.txt
Description:I have tested on three different versions of Asterisk and I still have the same issue.  It appears to happen once or twice a day.  It appears to happen when a VoIP phone trys to make a call (SIP Invite).  I have two Asterisk servers that are both behind an OpenSIPS server that load balances between the two servers.  Sometimes a SIP INVITE will make only one server have a deadlock and the call can go out the second server, but most of the time both Asterisk servers will deadlock and calls cannot be made.  When I am notified that there is an issue I get on the box and execute "netstat -nap |grep 5060" add see that "Recv-Q" column value is building up.

I've posted on the Mailing list and was told that version 11.3 fixed a lot of deadlock issues but the upgrade did not help.
See Mailing List email here
http://lists.digium.com/pipermail/asterisk-users/2013-April/278436.html

It usually takes about 10 minutes before Asterisk becomes responsive again or else before 10 minutes is up I could restart Asterisk and everything will be back to normal.

I see in the message logs the following errors

On the 11.0.1 Asterisk server
WARNING[23723][C-00000010] chan_sip.c: Unable to cancel schedule ID 11473.  This is probably a bug (chan_sip.c: update_provisional_keepalive, line 4406).

On the 11.2.1 Asterisk server
WARNING[3493][C-0000001f] chan_sip.c: Unable to cancel schedule ID 30810.  This is probably a bug (chan_sip.c: update_provisional_keepalive, line 4683).


When I look in chan_sip.c on both servers I see that they are the same line of code

AST_SCHED_DEL_UNREF(sched, pvt->provisional_keepalive_sched_id, dialog_unref(pvt, "when you delete the provisional_keepalive_sched_id, you should dec the refcount for the stored dialog ptr"));


I've seen the following debug logs

Asterisk version 11.0.1
[Apr  3 21:39:42] DEBUG[12984] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 11805 instead

Asterisk version 11.2.1
[Apr  3 21:39:50] DEBUG[1854] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 12423 instead


Here is a gdb debug I grabbed when the issue occurred with version 11.3.0

http://pastebin.com/gd291Bqz

I am thinking this is a bug since it happens randomly.  Any help is appreciated.
Comments:By: Rusty Newton (rnewton) 2013-04-10 18:29:16.767-0500

Thanks for the report! We may need some additional info to move forward

Please read through the https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

Please follow https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace especially "It is recommended that you enable both DONT_OPTIMIZE and BETTER_BACKTRACES"

Then make sure to follow https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock for grabbing the traces.

Attach the trace for 11.3.0 to the issue using "More Actions > Attach Files"

By: Duane Larson (osiris123d) 2013-04-11 14:14:00.941-0500

Ok.  So I recompiled with BETTER_BACKTRACES, DONT_OPTIMIZE and DEBUG_THREADS and set up my init.d file so that when Asterisk is started it adds the "-g" arg.

root@MediaServer01:/tmp# ps -C asterisk u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
asterisk  8503  0.5 64.7 846564 165656 ?       Ssl  Apr10   5:21 /home/asterisk/asterisk-bin/sbin/asterisk -U asterisk -G asterisk -g

I had two 11.3 servers fail today and I can't find any core file.  The only file in /tmp is refs and there is no core file in /home/asterisk/asterisk-bin/sbin.

I will try to run asterisk directly from the /home/asterisk/asterisk-bin/sbin directory as root and see if that will produce a core file.

On a side note I just wanted to say that I spun up a new server and installed Asterisk 1.8.21 and this version does not appear to give me the bug I am seeing.



By: Jamuel Starkey (jamuel) 2013-04-11 14:19:18.803-0500

Duane: Don't forget to tickle the Waiting for Feedback status otherwise folks might miss your update.

By: Duane Larson (osiris123d) 2013-04-11 18:51:46.196-0500

Thanks Jamuel.  I just had a thought and wanted to make sure I am wrong.  With my issue Asterisk never crashes.  It just doesn't accept any SIP messages for about 10 or 15 minutes and then things go back to normal.  If Asterisk never crashes should I still expect to see a core dump???

By: Duane Larson (osiris123d) 2013-04-12 20:18:02.353-0500

Guys I don't think I can get a core dump since Asterisk doesn't actually crash.  I am going to add the latest gdb to the files since that is all I can grab when this happens.

By: Duane Larson (osiris123d) 2013-04-12 20:20:22.652-0500

Asterisk doesn't do a core dump when this happens so I don't know what else to provide.

By: Duane Larson (osiris123d) 2013-04-12 20:40:18.624-0500

I am about to attach a gdb.txt to this ticket.  I killed asterisk by doing a "kill -6" and creating a core, but now I see that I need to do the following in order to get info for a deadlock

gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk `pidof asterisk` > /tmp/backtrace-threads.txt

By: Matt Jordan (mjordan) 2013-04-13 20:14:20.866-0500

You're correct - since there is no crash, there won't be any core file. The backtraces you've attached indicate that the deadlock is occurring in {{res_config_mysql}}.

The article Rusty linked should also tell you how to get the output of a 'core show locks' - this will give us more information about the deadlock. If you can get the CLI output of this, that would be appreciated.

All of that being said, {{res_config_mysql}} is an extended support module and help for it typically comes from the Asterisk developer community. Response times on issues for extended support modules may reflect that.

Altneratively, you could use ODBC as your realtime backend driver, i.e., {{res_config_odbc}}. That may alleviate the deadlock you're encountering.

By: Duane Larson (osiris123d) 2013-04-16 14:12:54.922-0500

I was just able to capture the 'core show locks' info and also the backtrace-threads.  Please see the uploaded attachments.  In the past I used the odbc connector but it would timeout it's connection with mysql.  Because of this whenever someone would call voicemail there would be a long pause because Asterisk had to re-establish its connection with MySQL.  the Res_config_mysql solved this issue.

By: Rusty Newton (rnewton) 2013-04-17 18:21:37.702-0500

Duane, did you ever file the timeout issue with res_config_odbc ? Is it the same as described here ASTERISK-20037 ? If it was another issue, you should file it with the appropriate logs and debug.

Also, please remove and re-attach your debug output as .txt so that other users of JIRA can easily view it from their browser without having to download the files.

By: Duane Larson (osiris123d) 2013-04-17 22:42:54.200-0500

I just uploaded the attachments in .txt format.  I posted my issue in the Asterisk forum but never heard back from anyone.  Here is the post so you can see the details.

http://forums.asterisk.org/viewtopic.php?p=170284

My issue doesn't sound the same as the ASTERISK-20037 issue because my MySQL server never failed.  It was always Asterisk closing the ODBC session after 15 seconds and taking forever to re-establish a new connection when a call came in.  The time it took Asterisk to re-establish the connection to MySQL was too long and my OpenSIPS timeouts were setting the asterisk server to failed and moving onto another asterisk server.

By: Duane Larson (osiris123d) 2013-04-20 13:37:56.349-0500

I went ahead and also opened an issue for the ODBC problem I was having.

ASTERISK-21662

I tried to go back to using the ODBC connection but I see the same issue is still occurring.  Normally I would think that it might be an issue with my MySQL database but this database is a completely different database from the one I originally had the issue on.

By: Duane Larson (osiris123d) 2013-04-25 16:20:00.490-0500

Any updates by chance?