[Home]

Summary:ASTERISK-25468: Deadlock in chan_sip - core show locks shows do_monitor lock
Reporter:Barry Flanagan (barryf)Labels:
Date Opened:2015-10-16 03:36:15Date Closed:2016-09-23 17:00:52
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:13.6.0 Frequency of
Occurrence
Related
Issues:
is duplicated byASTERISK-26204 Locks in chan_sip
is duplicated byASTERISK-26371 Asterisk freezing randomly
is duplicated byASTERISK-26585 Asterisk certified 13.8-cert1 & 13.8-cert2 & 13.8-cert3 : freeze on 'sip reload'
Environment:Debian 8 and Ubuntu 14.04.3. Asterisk latest 13.6.0 from Git. Realtime using odbc/mysql Attachments:( 0) ast-13.5.0-gdb.txt
( 1) ASTERISK-25468_gdb-output.txt
( 2) backtrace-threads.txt
( 3) backtrace-threads-13.10.0-rc2.txt
( 4) backtrace-threads1409.txt
( 5) core-show-locks.txt
( 6) coreshowlocks-13.10.0-rc2.txt
( 7) core-show-locks-1409.txt
( 8) full-20160914.txt
( 9) full-log.txt.gz
Description:I am trying to bring a new server into an existing cluster of Asterisk boxes and I keep getting the same problem.

Servers are all behind a Kamailio LB, and when I add this new server to the dispatcher group, kamailio starts sending REGISTER and  SUBSCRIBE requests to the new server. After a few minutes chan_sip just hangs, no longer processing any traffic at all. Nothing shows up in the logs, and Asterisk itself is still running. I can see the incoming SIP packets using sngrep, but Asterisk does not see them.

I have tried this on KVM and Openvz virtual servers as well as physical servers, and have tried both Debian 8 and Ubuntu 14.04 with the exact same results.

When chan_sip freezes, 'core show locks' shows the following every time:

{code}
=======================================================================
=== GIT-13-f8707ae
=== Currently Held Locks
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x7fba21a4c700 LWP:13423 (do_monitor           started at [28932] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 28903 do_monitor &monlock 0x7fba319054a0 (1)
       main/backtrace.c:59 __ast_bt_get_addresses() (0x46777f+1D)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x5379ef+C7)
       channels/chan_sip.c:28904 do_monitor()
       main/utils.c:1237 dummy_start()
       :0 start_thread()
       libc.so.6 clone() (0x7fbab9e25410+6D)
=== -------------------------------------------------------------------
{code}

There is no core sump when this happens. SIP simply stops responding, peers do not expire, etc.

I have managed to get a gdb backtrace from the running process using
{code}
gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk <pid>
{code}

Hopefully that will give some clue. I will upload it as an attachment.

Any help much appreciated.
Comments:By: Asterisk Team (asteriskteam) 2015-10-16 03:36:18.725-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Barry Flanagan (barryf) 2015-10-16 03:38:48.236-0500

Output of

gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk  5897

By: Stefan Tichy (st) 2015-10-16 04:18:38.932-0500

It might be the same problem as described in issue ASTERISK-25388

By: Barry Flanagan (barryf) 2015-10-16 04:31:26.100-0500

Thanks Stephan. Could be. I am going to try version 13.2.0 as the other servers in this cluster are running that version and are stable. If that turns out to work, I'll try each release up to 13.6.0 and see when it breaks.


By: Rusty Newton (rnewton) 2015-10-16 12:01:20.860-0500

Please do report between which versions you find the problem beginning to occur. Thanks!

By: Barry Flanagan (barryf) 2015-10-16 15:45:54.596-0500

Hi Rusty, Stefan,

It appears to have happened between 13.4.0 and 13.5.0. I have tested 13.2.0 - 13.6.0 and the issue appears in .5 and .6

=======================================================================
=== 13.5.0
=== Currently Held Locks
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x7f4bc0f8d700 LWP:14221 (do_monitor           started at [28915] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 28886 do_monitor &monlock 0x7f4bcfffd4a0 (1)
       main/backtrace.c:59 __ast_bt_get_addresses() (0x467743+1D)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x537a4d+C7)
       channels/chan_sip.c:28887 do_monitor()
       main/utils.c:1237 dummy_start()
       nptl/pthread_create.c:312 start_thread()
       libc.so.6 clone() (0x7f4c58789410+6D)
=== -------------------------------------------------------------------
===
=======================================================================

I'll upload the full gdb backtrace for 13.5 as well, in case it is of any use.



By: Barry Flanagan (barryf) 2015-10-16 15:49:39.299-0500

Output of
gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk  14188

on Asterisk version 13.5.0


By: Rusty Newton (rnewton) 2015-10-21 09:21:02.663-0500

Thanks, that helps a lot! Can you also provide a debug log leading up to the locking situation?

We only need the last ~10 minutes or so up to the locking situation happening.

By: Barry Flanagan (barryf) 2015-10-21 09:54:00.450-0500

Hi Rusty.

I have uploaded full-log.txt.gz which is from that time. Sorry, but full debug was not on and I have now moved this server into production (running 13.4.0!)

Hope it helps.

-Barry



By: Bobby Hakimi (bobbymc) 2015-10-29 17:19:05.584-0500

looks like i have the same issue on version 11.17, ASTERISK-25456 you can see the last core show locks file i attached.




By: Leandro Dardini (ldardini) 2016-02-22 03:29:27.195-0600

I can say even asterisk 13.7.0 and asterisk 13.7.2 are affected by this problem. Here the core show locks and backtrace.

By: Leandro Dardini (ldardini) 2016-02-22 03:30:52.308-0600

Core show locks and backtrace from 13.7.2 locking up

By: Bobby Hakimi (bobbymc) 2016-03-16 23:41:41.595-0500

this is still going on with the latest 11 branch, any updates?

By: Richard Mudgett (rmudgett) 2016-03-17 12:10:17.469-0500

[~bobbymc] This is not the same issue as ASTERISK-25456 as the core show locks and backtraces are showing a problem specific to code after v11.

By: Leandro Dardini (ldardini) 2016-07-06 14:31:39.066-0500

More on this issue, I found a way to lock asterisk immediately using chan_sip on asterisk 13.10.0-rc2, by using the followme command with multiple channels. I am attaching the files...

By: Leandro Dardini (ldardini) 2016-07-06 14:35:17.002-0500

Backtrace for deadlock on asterisk 13.10.0-rc2. This backtrace and the "core show locks" output was made by using a "followme" command with multiple channels. The same dialplan and setup works perfectly on asterisk 13.2.0

By: Joshua C. Colp (jcolp) 2016-07-06 14:37:01.642-0500

[~ldardini] Your issue appears to be separate and is actually ODBC related. It should be under a separate issue.

By: Joshua C. Colp (jcolp) 2016-07-06 14:47:21.452-0500

[~ldardini] The complete console output (with debug going to console in logger.conf and core set debug 3) and configuration would also be useful on the new issue.

By: Antonis Psaras (apsaras) 2016-09-14 08:47:34.375-0500

We have the same issue on asterisk 13.11.
Please find attached backtrace and full log

By: George Joseph (gjoseph) 2016-09-21 18:50:58.365-0500

I've submitted a patch that should fix this issue.

https://gerrit.asterisk.org/#/c/3962/1

If someone could please try the patch over 13.11 or the current 13 git branch and provide feedback that's be great.