
Summary: ASTERISK-24287: Race conditions and other problems in res_config_pgsql
Reporter: Etienne Lessard (hexanol)
Labels:
Date Opened: 2014-08-29 14:48:03
Date Closed: 2017-02-28 08:42:32.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_config_pgsql
Versions: 11.12.0
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
Description: While looking to fix a deadlock in res_config_pgsql, I found that there were quite a few problems with the module. Here's what I found:

In the find_table function, if database is NULL and the table is not found in the cache, then the lock on the psql_tables list is not released. An easy way to reproduce such a deadlock is to run the command "realtime show pgsql cache foobar" from the CLI; if your database doesn't have a table named "foobar", then the next thread that tries to acquire the psql_tables lock will deadlock.
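As a rough illustration of the pattern that avoids this kind of leak, here is a minimal sketch of a cache lookup that releases its list lock on every return path. It is not the module's actual code: `cached_table`, `cache_lock`, `cache_head` and `lookup_table` are hypothetical names, and a plain pthread mutex stands in for the Asterisk list lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached table entry; not the real Asterisk type. */
struct cached_table {
    const char *name;
    struct cached_table *next;
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static struct cached_table *cache_head = NULL;

/* Returns the cached entry, or NULL if it is not cached.
 * Every path leaves through the single unlock at "done". */
static struct cached_table *lookup_table(const char *database, const char *name)
{
    struct cached_table *found = NULL;

    assert(name != NULL);

    pthread_mutex_lock(&cache_lock);
    for (struct cached_table *t = cache_head; t != NULL; t = t->next) {
        if (strcmp(t->name, name) == 0) {
            found = t;
            break;
        }
    }
    if (found == NULL && database == NULL) {
        /* Cache miss with no database to query: give up, but still unlock. */
        goto done;
    }
    /* On a cache miss with a database, a real module would query it here. */
done:
    pthread_mutex_unlock(&cache_lock);
    return found;
}
```

Routing the early exit through the same label as the normal exit means a later change cannot add a return that forgets the unlock.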

The find_table function sometimes calls the pgsql_exec function without locking the pgsql_lock mutex first, and find_table itself is not always called with the pgsql_lock mutex held. This can cause undefined behaviour if other threads are executing in res_config_pgsql, ranging from a crash to a deadlock. I've personally observed both.

The command "realtime show pgsql status" uses the pgsql connection without obtaining the lock first.

The ESCAPE_STRING macro references the pgsql connection, but once again, the lock is not systematically acquired before the macro is used, so this can lead to undefined behaviour.
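The last three problems all come down to the shared connection being touched without pgsql_lock held. One common way to make that harder to get wrong is to expose only a wrapper that takes the lock itself. The sketch below shows the idea under stated assumptions: `fake_conn`, `conn_lock`, `exec_unlocked` and `exec_locked` are hypothetical names, and the fake connection stands in for the real libpq handle.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared connection handle. */
struct fake_conn {
    int queries_run;
};

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_conn shared_conn;

/* Internal helper: the caller must already hold conn_lock. */
static int exec_unlocked(struct fake_conn *conn, const char *sql)
{
    assert(sql != NULL);
    conn->queries_run++;   /* a real module would run the query here */
    return 0;
}

/* Entry point for other code: takes and releases the lock itself,
 * so no caller can reach the connection without it. */
static int exec_locked(const char *sql)
{
    int res;

    pthread_mutex_lock(&conn_lock);
    res = exec_unlocked(&shared_conn, sql);
    pthread_mutex_unlock(&conn_lock);
    return res;
}
```

Code that needs several operations under one lock acquisition (for example escaping a string and then running the query) would lock once and call the unlocked helpers directly.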

I also found other, less critical problems.

I know the res_config_pgsql module is an "extended support" module. I would have liked to provide a patch to fix the numerous problems, but I didn't have the time, and I switched to res_config_odbc instead. That said, I've still decided to open this bug to let other people know that there are quite a few issues with the module.
Comments:

By: Steve Davies (one47) 2015-06-11 08:57:10.861-0500

I am using Asterisk 11.18.0 and have also spent some time trying to find a deadlock in res_config_pgsql. From the debug output I've managed to grab, it appears that the call to PQexec() blocks on a poll() in libpq.so forever, and is in fact stuck in the kernel somewhere! As a result, the thread holds many locks (listed below) and never releases them, causing chaos.

Postgres Bug#6342 from back in December 2011 seems to refer to this issue, but no resolution was ever found.
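For reference, a blocking wait like the one described above can in principle be bounded by using libpq's asynchronous calls (PQsendQuery, PQsocket, PQconsumeInput, PQisBusy, PQgetResult) and doing the wait on the socket yourself with a timeout. The sketch below shows only that timed wait, on an arbitrary file descriptor so that it stays self-contained. `wait_readable` is a hypothetical helper, not part of libpq or Asterisk, and this is not what the module does. It also would not help if the thread is genuinely wedged inside the kernel.

```c
#include <errno.h>
#include <poll.h>

/* Waits until fd has something to read or timeout_ms elapses.
 * Returns 1 if poll() reported an event, 0 on timeout, -1 on error.
 * Note: a retry after EINTR restarts the full timeout. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        return -1;
    }
    return rc > 0 ? 1 : 0;
}
```

On a timeout the caller can release its locks and report a failure instead of holding them indefinitely.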

My current solution is to not use realtime :(

Steve

=== Thread ID: 0xb2ec3b70 (handle_tcptls_connection started at [  747] tcptls.c ast_tcptls_server_root())
=== ---> Lock #0 (chan_sip.c): MUTEX 29129 handle_request_do &netlock 0xb6557a40 (1)
       main/logger.c:1701 ast_bt_get_addresses() (0x81506d2+19)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x8148a96+94)
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       res_http_websocket.so <unknown>()
       main/http.c:754 handle_uri()
       main/http.c:991 httpd_helper_thread()
       main/tcptls.c:696 handle_tcptls_connection()
       main/utils.c:1223 dummy_start()
       :0 start_thread()
       libc.so.6 clone() (0xb7701010+5E)
=== ---> Lock #1 (chan_sip.c): MUTEX 9083 sip_pvt_lock_full pvt 0xb2d60f98 (1)
       main/logger.c:1701 ast_bt_get_addresses() (0x81506d2+19)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x8148a96+94)
       main/astobj2.c:198 __ao2_lock() (0x8094df4+7C)
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       res_http_websocket.so <unknown>()
       main/http.c:754 handle_uri()
       main/http.c:991 httpd_helper_thread()
       main/tcptls.c:696 handle_tcptls_connection()
       main/utils.c:1223 dummy_start()
       :0 start_thread()
       libc.so.6 clone() (0xb7701010+5E)
=== ---> Lock #2 (chan_sip.c): MUTEX 17308 register_verify peer 0xad1af50 (1)
       main/logger.c:1701 ast_bt_get_addresses() (0x81506d2+19)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x8148a96+94)
       main/astobj2.c:198 __ao2_lock() (0x8094df4+7C)
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       res_http_websocket.so <unknown>()
       main/http.c:754 handle_uri()
       main/http.c:991 httpd_helper_thread()
       main/tcptls.c:696 handle_tcptls_connection()
       main/utils.c:1223 dummy_start()
       :0 start_thread()
       libc.so.6 clone() (0xb7701010+5E)
=== ---> Lock #3 (res_config_pgsql.c): MUTEX 1126 config_pgsql &pgsql_lock 0xb6e57160 (1)
       main/logger.c:1701 ast_bt_get_addresses() (0x81506d2+19)
       main/lock.c:258 __ast_pthread_mutex_lock() (0x8148a96+94)
       res_config_pgsql.so <unknown>()
       main/config.c:2693 ast_config_internal_load() (0x80ed39c+1FF)
       main/config.c:2714 ast_config_load2() (0x80ed62c+43)
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       chan_sip.so <unknown>()
       res_http_websocket.so <unknown>()
       main/http.c:754 handle_uri()
       main/http.c:991 httpd_helper_thread()
       main/tcptls.c:696 handle_tcptls_connection()
       main/utils.c:1223 dummy_start()
       :0 start_thread()
       libc.so.6 clone() (0xb7701010+5E)
=== -------------------------------------------------------------------

Thread back-trace (truncated)

#0  0xb76f380c in poll () from /lib/i386-linux-gnu/libc.so.6
#1  0xb6e307d6 in ?? () from /usr/lib/libpq.so.5
#2  0xb6e308cb in ?? () from /usr/lib/libpq.so.5
#3  0xb6e30953 in ?? () from /usr/lib/libpq.so.5
#4  0xb6e2e7a2 in PQgetResult () from /usr/lib/libpq.so.5
#5  0xb6e2ea48 in ?? () from /usr/lib/libpq.so.5
#6  0xb6e4adc9 in _pgsql_exec (database=0xb6e574a0 "asterisk", tablename=0xb2ebffbc "ast_config",
etc...

Process kernel stack for thread that is stuck:

[<c109777b>] __generic_file_aio_write+0x25e/0x282
[<c1043a90>] __dequeue_signal+0xf/0xce
[<c10447cd>] ptrace_stop+0x10c/0x1a0
[<c1045440>] get_signal_to_deliver+0x1e9/0x44d
[<c100b36b>] do_signal+0x2f/0x4c2
[<c10cccd5>] do_sync_write+0x0/0xdc
[<c10ccd7d>] do_sync_write+0xa8/0xdc
[<c10f30fb>] fsnotify+0x1d1/0x1e8
[<c121428b>] sys_send+0x19/0x1d
[<c1214a23>] sys_socketcall+0xf2/0x1cd
[<c100b988>] do_notify_resume+0x1e/0x65
[<c12c4850>] work_notifysig+0x13/0x1b
[<ffffffff>] 0xffffffff


By: Sean Bright (seanbright) 2017-02-28 08:42:32.315-0600

Etienne,

I believe that all of the issues that you enumerated were resolved by [this commit|https://gerrit.asterisk.org/#/c/5076]. It will be in the next release of Asterisk 13 and 14.