
Summary: ASTERISK-19463: Asterisk deadlocks during startup with mutex errors
Reporter: Matt Jordan (mjordan)
Labels:
Date Opened: 2012-03-02 08:30:02.000-0600
Date Closed: 2012-11-18 14:12:45.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: General Tests/testsuite
Versions: 11, 1.8.9.3, 10.1.3
Frequency of Occurrence:
Related Issues:
    causes ASTERISK-20945 "Unable to connect to remote asterisk" message on service asterisk start, even though service is running
    causes ASTERISK-20947 astcanary exits immediately because of wrong pid argument
Environment: CentOS 6 32-bit (2.6.32-220.el6.i686), 64-bit (2.6.32-220.el6.x86_64); Asterisk (1.8, 10, and trunk)
Attachments:
    ( 0) ast_19239_bt.txt
    ( 1) ast_19240_bt.txt
    ( 2) backtrace.txt
    ( 3) bt_full_shutdown.txt
    ( 4) logger_deadlock.txt
Description:

[Edit] 11/15/12

This particular issue has morphed slightly as time has gone on.  Once we solved the original logger thread problem, we quickly ran into another apparent deadlock.  This one has proven much more difficult to resolve.

On initial inspection (see ast_19239_bt), it appears as if something failed to unlock the recursive mutex that protects the lock tracking information in front of the thread registration list's read/write lock.  A deep inspection of the DEBUG_THREADS code surrounding the various read/write lock routines proved this wasn't the case.

Although the deadlock/lock failures happen only sporadically on the 64-bit test system, they happen much more frequently on a 32-bit system, which made the problem easier to reproduce.  Even though the failures are sporadic, one thing is consistent: they only happen on startup - once a test got past the thread registration step, the failure would never occur on any other rwlock.

This implied that something was wrong with the thread registration.  This also proved not to be the case - it's incredibly simple code with no possibility of a recursive rwlock/wrlock call.

On to plan C.  We put the following debug code into the ast_reentrancy_lock/unlock:

{noformat}
static inline int ast_reentrancy_lock(struct ast_lock_track *lt)
{
	int res;
	res = pthread_mutex_lock(&lt->reentr_mutex);
	if (res) {
		fprintf(stderr, "Failed to lock reentrancy mutex: %s[%d]\n", strerror(res), res);
		abort();
	}
	return res;
}

static inline int ast_reentrancy_unlock(struct ast_lock_track *lt)
{
	int res;
	res = pthread_mutex_unlock(&lt->reentr_mutex);
	if (res) {
		fprintf(stderr, "Failed to unlock reentrancy mutex: %s[%d]\n", strerror(res), res);
		abort();
	}
	return res;
}
{noformat}

And then we saw the following:
{noformat}
Running ['tests/apps/voicemail/check_voicemail_new_user/run-test'] ...
[Nov 09 19:00:13] WARNING[30837]: asterisk.asterisk:127 errReceived: Asterisk 127.0.0.1 received error: Failed to unlock reentrancy mutex: Operation not permitted[1]
{noformat}

[Rut roh|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-80]

With a recursive pthread mutex, a failure to unlock implies that the thread unlocking the mutex doesn't own it.  Except in this case, the call to unlock the mutex happens 5 lines of code after we lock it.

The only explanation that I have for this is that the thread ID that the mutex recorded when it locked the mutex is no longer the thread ID that is returned when it unlocks the mutex.

For reference: [pthread_mutex_unlock|http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_unlock.c;h=f9fe10b0f2b6f880de046df476082ecb9ffe3adf;hb=HEAD]
{noformat}
 55   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
 56     {
 57       /* Recursive mutex.  */
 58       if (mutex->__data.__owner != THREAD_GETMEM (THREAD_SELF, tid))
 59         return EPERM;
 60
 61       if (--mutex->__data.__count != 0)
 62         /* We still hold the mutex.  */
 63         return 0;
 64       go
{noformat}

SO... why would this happen?

Well, on startup, the Asterisk Test Suite executes Asterisk by sending it into the background.  Asterisk, when it starts with the appropriate option, calls [daemon|http://www.kernel.org/doc/man-pages/online/pages/man3/daemon.3.html], which forks the currently running process.  Forking does not carry threads over into the child - only the calling thread survives - which can result in undefined behavior for anything that interacts with the threads that no longer exist.
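
To see the mechanism in isolation, here is a minimal standalone sketch (my own illustration, not Asterisk code): a recursive mutex records the locking thread's TID, the child of a fork gets a different TID, and the unlock in the child fails the ownership check.

{noformat}
/* Illustration only: lock a recursive mutex, fork, and unlock in the child.
 * The owner TID recorded at lock time belongs to the parent's thread, so the
 * ownership check in pthread_mutex_unlock() returns EPERM in the child. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t mutex;
	int res;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&mutex, &attr);

	pthread_mutex_lock(&mutex);	/* owner TID = the parent's thread */

	if (fork() == 0) {
		/* Child: same memory image, but a different TID. */
		res = pthread_mutex_unlock(&mutex);
		fprintf(stderr, "child unlock: %s[%d]\n", strerror(res), res);
		_exit(0);
	}

	wait(NULL);
	pthread_mutex_unlock(&mutex);	/* the parent's unlock still succeeds */
	return 0;
}
{noformat}

Compiled with -pthread, the child prints "Operation not permitted[1]" - the same error string the test run logged above.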

To test this, we moved the call to daemon higher in the startup sequence and deliberately removed attempts to call into logger before the daemon happened.  This immediately resolved the locking issues - whereas [before|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-80] we'd have one to (many) failures, we now had [none|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-94] (at least not related to this problem).

A number of test runs were conducted without any recurrences of the problem.

Moral of the story: combining fork with threads == bad.

[Edit] 03/05/12

Uploaded two new backtraces, one from an instance of Asterisk that hung (ast_19239_bt), and one from the instance of Asterisk that was executing a remote command (ast_19240_bt).  Interestingly enough, in the backtrace ast_19239_bt, Asterisk was not in the logger locks, but in a rwlock that protects registering threads.

[Edit] 03/02/12

Originally, this issue appeared to occur during startup, when a remote Asterisk instance executed a "core wait fullybooted" against an Asterisk instance that was starting up.

Now, however, the same behaviour can occur when a remote Asterisk instance executes "core stop gracefully" - so the fact that this occurs while Asterisk is starting up does not seem to be related to the problem so much as a remote Asterisk instance sending a command.

Note that in both instances, the logger thread is waiting on the condition variable to be signalled, and a remote console command is being handled.

Additional note:
1) No log statements were ever logged for the instance of Asterisk corresponding to bt_full_shutdown
2) In the backtrace, both the logger_thread and the listener thread have the same thread data object passed to their thread startup routines:

Thread 44 (Thread 0xb74ffb70 (LWP 26037)):
...
#10 0x0807b698 in listener (unused=0x0) at asterisk.c:1387
#11 0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#12 0x0012ba09 in start_thread () from /lib/libpthread.so.0
#13 0x0059e15e in clone () from /lib/libc.so.6

Thread 43 (Thread 0xb74c3b70 (LWP 26038)):
...
#7  0x0812dba9 in logger_thread (data=0x0) at logger.c:1053
#8  0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#9  0x0012ba09 in start_thread () from /lib/libpthread.so.0
#10 0x0059e15e in clone () from /lib/libc.so.6

Since this should have been allocated in utils.c just prior to creating the thread, this is ... odd.  Note that this was the case in the other backtrace as well.

*Note:* This is a red herring.  We free the thread argument as the first action in dummy_start, so the memory is available for usage again.
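
As an aside, a small sketch of that pattern (illustration only, not the actual utils.c code) shows why identical data pointers prove nothing: once the wrapper frees its heap-allocated argument, the allocator is free to hand the same address to the next thread's argument.

{noformat}
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct thr_arg {
	void *(*start)(void *);
	void *data;
};

/* A wrapper in the spirit of dummy_start(): copy the argument to the stack
 * and free the heap copy immediately, before running the real routine. */
static void *wrapper_start(void *data)
{
	struct thr_arg a = *(struct thr_arg *) data;

	printf("thread argument was at %p\n", data);
	free(data);	/* this address can be reused by the very next malloc() */

	return a.start(a.data);
}

static void *do_nothing(void *unused)
{
	(void) unused;
	return NULL;
}

int main(void)
{
	pthread_t thread;
	struct thr_arg *arg;
	int i;

	/* Create two threads back to back; the second argument frequently
	 * lands at the address the first one just freed. */
	for (i = 0; i < 2; i++) {
		arg = malloc(sizeof(*arg));
		arg->start = do_nothing;
		arg->data = NULL;
		pthread_create(&thread, NULL, wrapper_start, arg);
		pthread_join(thread, NULL);
	}

	return 0;
}
{noformat}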

[Edit] - original issue description below. This is somewhat misleading, as the issue appeared to only occur on startup.

It is possible for Asterisk to enter into a deadlock if there are no log messages available to be logged.  This typically happens on startup, before logger.conf has been processed.  In the logger thread, if the list of log messages is empty, the logger thread waits on the condition logcond after it has acquired the list mutex:

{code}
AST_LIST_LOCK(&logmsgs);
if (AST_LIST_EMPTY(&logmsgs)) {
	if (close_logger_thread) {
		break;
	} else {
		ast_cond_wait(&logcond, &logmsgs.lock);
	}
}
next = AST_LIST_FIRST(&logmsgs);
AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
AST_LIST_UNLOCK(&logmsgs);
{code}

Unfortunately, if close_logger_thread is true, we fail to unlock the logmsgs list mutex.  This causes a deadlock on any thread that attempts to log a message.
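
One way to close that hole (a sketch only; the committed patch may differ) is to drop the list lock before leaving the loop:

{code}
AST_LIST_LOCK(&logmsgs);
if (AST_LIST_EMPTY(&logmsgs)) {
	if (close_logger_thread) {
		AST_LIST_UNLOCK(&logmsgs);	/* release the list lock before bailing out */
		break;
	} else {
		ast_cond_wait(&logcond, &logmsgs.lock);
	}
}
next = AST_LIST_FIRST(&logmsgs);
AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
AST_LIST_UNLOCK(&logmsgs);
{code}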

Backtrace attached.  Note that a 'core show locks' could not be obtained as the deadlock occurs in the logger and fairly early in Asterisk startup; however, the backtrace readily demonstrates the problem.

Thread 3: the logger thread holding the lock
Thread 4: blocked waiting on Thread 3
Thread 1: blocked waiting on Thread 3
Comments:

By: Matt Jordan (mjordan) 2012-03-02 11:39:52.319-0600

Apparently, this can happen elsewhere other than during startup - see bt_full_shutdown.  In this case, a remote Asterisk console attempted to send "core stop gracefully".  Will edit issue description.

By: Matt Jordan (mjordan) 2012-03-05 08:47:51.602-0600

Attached backtraces (ast_19239_bt, ast_19240_bt) taken from 64-bit machine

By: Matt Jordan (mjordan) 2012-04-02 09:55:17.922-0500

Much thanks to Richard for pointing out the first deadlock (the one first noticed in logger):

{noformat}
for (;;) {
	/* ... */
	AST_LIST_LOCK(&logmsgs);
	if (AST_LIST_EMPTY(&logmsgs)) {
		if (close_logger_thread) {
			break;
		} else {
			ast_cond_wait(&logcond, &logmsgs.lock);
		}
	}
	next = AST_LIST_FIRST(&logmsgs);
	AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
	AST_LIST_UNLOCK(&logmsgs);
{noformat}

If close_logger_thread is true, we break out of the loop without unlocking logmsgs.

By: Matt Jordan (mjordan) 2012-04-04 09:28:09.084-0500

After Richard's patch, the deadlock no longer occurs.

However, the Asterisk process does exit completely, with only a single log message sent to the logs and no indication as to why it exited.

Additionally, the 64-bit machine exhibits a deadlock while registering threads - new backtrace attached (backtrace.txt)

By: Matt Jordan (mjordan) 2012-05-11 16:37:15.798-0500

ast_register_thread update:

valgrind with drd detected the following race condition (amongst others):

==580== Conflicting store by thread 1 at 0x08219fc0 size 4
==580==    at 0x807C69D: really_quit (asterisk.c:1744)
==580==    by 0x807BFB1: quit_handler (asterisk.c:1632)
==580==    by 0x8080DC1: monitor_sig_flags (asterisk.c:3187)
==580==    by 0x808392A: main (asterisk.c:4011)
==580== Allocation context: Data section of /usr/sbin/asterisk
==580== Other segment start (thread 3)
==580==    at 0x402E871: pthread_rwlock_unlock (drd_pthread_intercepts.c:1225)
==580==    by 0x81263BD: __ast_rwlock_unlock (lock.c:868)
==580==    by 0x807886E: ast_register_thread (asterisk.c:406)
==580==    by 0x81AC112: dummy_start (utils.c:1001)
==580==    by 0x402F5AC: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==580==    by 0x420DA08: start_thread (in /lib/libpthread-2.12.so)
==580==    by 0x412443D: clone (in /lib/libc-2.12.so)
==580== Other segment end (thread 3)
==580==    at 0x41195F6: poll (in /lib/libc-2.12.so)
==580==    by 0x807AF56: listener (asterisk.c:1329)
==580==    by 0x81AC164: dummy_start (utils.c:1004)
==580==    by 0x402F5AC: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==580==    by 0x420DA08: start_thread (in /lib/libpthread-2.12.so)
==580==    by 0x412443D: clone (in /lib/libc-2.12.so)

The ast_socket file descriptor is not protected by anything.  If really_quit is fired, it can change the file descriptor value to -1 out from under the listener thread when we wouldn't expect it to (for example, between the initial check for ast_socket > 0 and the ast_poll).  This isn't terribly likely, but again, neither is the behavior we're seeing.

Note that for this to occur, something would still have had to call really_quit.  Since this appears to only occur when a remote console failed to connect, it's possible there's contention on ast_config_AST_SOCKET which causes Asterisk to bail out initially.
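
As a standalone illustration of that window (a simplified model, none of the real asterisk.c logic): one thread does an unsynchronized check-then-use on a shared descriptor while another closes it and stores -1, which is the shape of the conflicting store drd reported.

{noformat}
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int shared_fd = -1;	/* stands in for ast_socket: nothing protects it */

static void *listener_like(void *unused)
{
	(void) unused;
	while (shared_fd > -1) {	/* check ... */
		struct pollfd pfd = { .fd = shared_fd, .events = POLLIN };
		/* ... the quitting thread can close the fd and store -1 here ... */
		poll(&pfd, 1, 100);	/* ... then use */
	}
	return NULL;
}

static void *quit_like(void *unused)
{
	(void) unused;
	usleep(1000);
	close(shared_fd);
	shared_fd = -1;		/* the unsynchronized store drd complains about */
	return NULL;
}

int main(void)
{
	int fds[2];
	pthread_t listener_thread, quit_thread;

	pipe(fds);
	shared_fd = fds[0];
	pthread_create(&listener_thread, NULL, listener_like, NULL);
	pthread_create(&quit_thread, NULL, quit_like, NULL);
	pthread_join(quit_thread, NULL);
	pthread_join(listener_thread, NULL);
	close(fds[1]);
	return 0;
}
{noformat}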

By: Matt Jordan (mjordan) 2012-11-16 09:54:13.231-0600

... and the above comment, while true, had nothing to do with the problem either.  Issue description updated.