
Summary: ASTERISK-19011: crashing res_odbc because of use of obj->con while reconnecting
Reporter: Walter Doekes (wdoekes)
Labels:
Date Opened: 2011-12-13 02:16:40.000-0600
Date Closed: 2012-02-27 09:18:20.000-0600
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: Resources/res_odbc
Versions: 1.8.8.0
Frequency of Occurrence: Constant
Related Issues:
Environment:
Attachments: (0) issueA19011_combine_read_and_write_locks_WORK_IN_PROGRESS.patch
Description: Hi,

there is potential for res_odbc to crash under the right circumstances: (1) it is reconnecting, and (2) another thread is attempting to execute a query at the same time.

There is this mutex:

{code}
struct odbc_obj {
       ast_mutex_t lock;
       SQLHDBC  con;                   /*!< ODBC Connection Handle */
{code}

But that lock is only taken by the reconnecting party, not by the users of obj->con. This means other threads can still attempt to execute queries on the SQLHDBC that is in the middle of being closed.
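To make the hazard concrete, here is a minimal, self-contained model of that pattern (plain pthreads and a stand-in connection struct instead of a real SQLHDBC; all names are made up for illustration, this is not the res_odbc.c code). One thread tears the connection down under the mutex, the other dereferences it without any lock, which is essentially what the query path does today:

{code}
/* Minimal model of the race (NOT the actual res_odbc.c code). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct fake_con {
	char dsn[32];
};

struct fake_odbc_obj {
	pthread_mutex_t lock;
	struct fake_con *con;
};

static struct fake_odbc_obj obj = { PTHREAD_MUTEX_INITIALIZER, NULL };

static void *reconnector(void *arg)
{
	(void) arg;
	for (;;) {
		pthread_mutex_lock(&obj.lock);          /* only this path takes the lock */
		free(obj.con);                          /* "odbc_obj_disconnect" */
		obj.con = calloc(1, sizeof(*obj.con));  /* "odbc_obj_connect" */
		strcpy(obj.con->dsn, "asterisk");
		pthread_mutex_unlock(&obj.lock);
	}
	return NULL;
}

static void *query_thread(void *arg)
{
	(void) arg;
	for (;;) {
		/* no lock taken here: we may read a handle that is being freed
		 * and reallocated by the other thread right now */
		printf("executing on [%s]\n", obj.con->dsn);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	obj.con = calloc(1, sizeof(*obj.con));
	strcpy(obj.con->dsn, "asterisk");
	pthread_create(&a, NULL, reconnector, NULL);
	pthread_create(&b, NULL, query_thread, NULL);
	sleep(2);  /* undefined behaviour: typically crashes or prints garbage long before this */
	return 0;
}
{code}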

h3. Steps to reproduce

Note that we explicitly *do* *not* create the necessary sippeer table. One could argue that it's my own fault for not having the right tables. But (a) there are cases when a table is gone for a short while (table re-creation) and (b) the issue could still occur with the right tables, but it's harder to reproduce.

{noformat}
>>> extconfig.conf <<<

[settings]
sippeers => odbc,asterisk,sippeer

>>> modules.conf <<<

[modules]
autoload=no
load => chan_sip.so
load => res_odbc.so
load => res_config_odbc.so

>>> res_odbc.conf <<<

[asterisk]
enabled => yes
dsn => asterisk
pre-connect => yes

>>> sip.conf <<<

[general]

{noformat}

Start asterisk.
Start a shell and run this:
{noformat}
$ while true; do sipsak -U -s sip:100@localhost; done
{noformat}
Type this in the console:
{noformat}
*CLI> sip show peer 123 load
{noformat}

Result:
{noformat}
...
res_odbc: Connected to asterisk [asterisk]
SQL Execute returned an error -1: 42S02: [MySQL][ODBC 3.51 Driver][mysqld-5.0.51a-24+lenny5-log]Table 'asterisk.sippeer' doesn't exist (100)
SQL Execute error -1! Verifying connection to asterisk [asterisk]...
Connection is down attempting to reconnect...
SQL Execute returned an error -1: 42S02: [MySQL][ODBC 3.51 Driver][mysqld-5.0.51a-24+lenny5-log]Table 'asterisk.sippeer' doesn't exist (100)
SQL Execute error -1! Verifying connection to asterisk [asterisk]...
Error in my_thread_global_end(): 1 threads didn't exit
Disconnected 0 from asterisk [asterisk]
Segmentation fault
{noformat}

h3. Backtrace
{noformat}
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4845eb55ae in ?? () from /usr/lib/libodbc.so.1
{noformat}

{noformat}
(gdb) back
#0  0x00007f4845eb55ae in ?? () from /usr/lib/libodbc.so.1
#1  0x00007f4846103eb6 in ast_odbc_prepare_and_execute (obj=0x2aebb28, prepare_cb=0x7f483f2ff04a <custom_prepare>, data=0x7f48441a49e0) at res_odbc.c:647
#2  0x00007f483f2ffea3 in realtime_odbc (database=0x7f48441a6160 "asterisk", table=0x7f48441a6060 "sippeer", ap=0x7f48441a62a0) at res_config_odbc.c:209
#3  0x000000000049f89f in ast_load_realtime_helper (family=0x7f483f5c7bd7 "sippeers", ap=0x7f48441a62a0) at config.c:2332
#4  0x000000000049fa7d in ast_load_realtime (family=0x7f483f5c7bd7 "sippeers") at config.c:2363
#5  0x00007f483f52202c in realtime_peer_by_name (name=0x7f48441a6458, addr=0x0, ipaddr=0x7f48441a64a0 "", var=0x7f48441a6478, varregs=0x0) at chan_sip.c:4743
#6  0x00007f483f522674 in realtime_peer (newpeername=0x7f48441a6c54 "100", addr=0x0, devstate_only=0, which_objects=2) at chan_sip.c:4895
#7  0x00007f483f522d50 in find_peer (peer=0x7f48441a6c54 "100", addr=0x0, realtime=1, which_objects=2, devstate_only=0, transport=0) at chan_sip.c:5009
#8  0x00007f483f559123 in register_verify (p=0x2b1a158, addr=0x7f48441a7cc0, req=0x7f48441a7260, uri=0x2b27ca1 "sip:localhost") at chan_sip.c:14575
#9  0x00007f483f593bee in handle_request_register (p=0x2b1a158, req=0x7f48441a7260, addr=0x7f48441a7cc0, e=0x2b27ca1 "sip:localhost") at chan_sip.c:24736
#10 0x00007f483f5954df in handle_incoming (p=0x2b1a158, req=0x7f48441a7260, addr=0x7f48441a7cc0, recount=0x7f48441a721c, nounlock=0x7f48441a7218) at chan_sip.c:25008
#11 0x00007f483f595e64 in handle_request_do (req=0x7f48441a7260, addr=0x7f48441a7cc0) at chan_sip.c:25168
#12 0x00007f483f595a84 in sipsock_read (id=0x2b26460, fd=6, events=1, ignore=0x0) at chan_sip.c:25102
#13 0x00000000004ef950 in ast_io_wait (ioc=0x7f4840006da0, howlong=1000) at io.c:288
#14 0x00007f483f5979f9 in do_monitor (data=0x0) at chan_sip.c:25631
#15 0x00000000005866ed in dummy_start (data=0x2b263f0) at utils.c:1004
#16 0x00007f4846ef19ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#17 0x00007f484774270d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#18 0x0000000000000000 in ?? ()
{noformat}

{noformat}
(gdb) thread 11
[Switching to thread 11 (Thread 13152)]#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
(gdb) back
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007f4846ef45d9 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007f4846ef43fb in __pthread_mutex_lock (mutex=0x7f48460fe580) at pthread_mutex_lock.c:61
#3  0x00007f4845eb56d0 in ?? () from /usr/lib/libodbc.so.1
#4  0x00007f4846108209 in odbc_obj_disconnect (obj=0x2aebb28) at res_odbc.c:1451
#5  0x00007f4846104445 in ast_odbc_sanity_check (obj=0x2aebb28) at res_odbc.c:745
#6  0x00007f4846103ee1 in ast_odbc_prepare_and_execute (obj=0x2aebb28, prepare_cb=0x7f483f2ff04a <custom_prepare>, data=0x7fff59a093e0) at res_odbc.c:655
#7  0x00007f483f2ffea3 in realtime_odbc (database=0x7fff59a0ab60 "asterisk", table=0x7fff59a0aa60 "sippeer", ap=0x7fff59a0aca0) at res_config_odbc.c:209
#8  0x000000000049f89f in ast_load_realtime_helper (family=0x7f483f5c7bd7 "sippeers", ap=0x7fff59a0aca0) at config.c:2332
#9  0x000000000049fa7d in ast_load_realtime (family=0x7f483f5c7bd7 "sippeers") at config.c:2363
#10 0x00007f483f5220be in realtime_peer_by_name (name=0x7fff59a0ae58, addr=0x0, ipaddr=0x7fff59a0aea0 "", var=0x7fff59a0ae78, varregs=0x0) at chan_sip.c:4749
#11 0x00007f483f522674 in realtime_peer (newpeername=0x260e1be "123", addr=0x0, devstate_only=0, which_objects=2) at chan_sip.c:4895
#12 0x00007f483f522d50 in find_peer (peer=0x260e1be "123", addr=0x0, realtime=1, which_objects=2, devstate_only=0, transport=0) at chan_sip.c:5009
#13 0x00007f483f56a5f7 in _sip_show_peer (type=0, fd=1, s=0x0, m=0x0, argc=5, argv=0x7fff59a0bac8) at chan_sip.c:17183
#14 0x00007f483f56a001 in sip_show_peer (e=0x7f483f7ed008, cmd=-4, a=0x7fff59a0bcd0) at chan_sip.c:17076
#15 0x00000000004984ea in ast_cli_command_full (uid=-1, gid=-1, fd=1, s=0x260e700 "sip show peer 123 load") at cli.c:2502
#16 0x0000000000440b68 in consolehandler (s=0x260e700 "sip show peer 123 load") at asterisk.c:1810
#17 0x0000000000447f80 in main (argc=2, argv=0x7fff59a0e1e8) at asterisk.c:3896
{noformat}

h3. Cause
There are two causes for this:

Cause (1): there is a sanity check (a SELECT 1 query) meant to verify that the DB is still up, but the check is never actually performed; an unconditional reconnect is done instead.
{noformat}
ast_odbc_prepare_and_execute sets obj->up to 0 before calling ast_odbc_sanity_check,
result: ast_odbc_sanity_check skips the SELECT 1 and performs the reconnect regardless
{noformat}
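In code terms, the flow looks roughly like this (a simplified, compilable paraphrase of the logic quoted above; the names mirror res_odbc.c, but the bodies are reduced to the part that matters here):

{code}
/* Simplified paraphrase of the flow described above -- not the literal
 * res_odbc.c source; bodies are reduced to the logic that matters. */
#include <stdio.h>

struct odbc_obj {
	int up;  /* 1 = connection believed healthy */
};

static void odbc_obj_disconnect(struct odbc_obj *obj) { (void) obj; puts("disconnect"); }
static void odbc_obj_connect(struct odbc_obj *obj) { obj->up = 1; puts("connect"); }

static int sanity_check(struct odbc_obj *obj)  /* stands in for ast_odbc_sanity_check */
{
	if (obj->up) {
		/* only reached when obj->up is still set: this is where the
		 * "SELECT 1" probe would run, clearing obj->up if it fails */
		puts("SELECT 1");
	}
	if (!obj->up) {
		/* but the caller already cleared obj->up, so we always land here:
		 * unconditional disconnect + reconnect, no probe ever performed */
		odbc_obj_disconnect(obj);
		odbc_obj_connect(obj);
	}
	return obj->up;
}

int main(void)
{
	struct odbc_obj obj = { .up = 1 };

	/* ast_odbc_prepare_and_execute, after a failed SQLExecute: */
	obj.up = 0;           /* the connection is marked down before the check */
	sanity_check(&obj);   /* prints "disconnect"/"connect", never "SELECT 1" */
	return 0;
}
{code}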
Is there a reason why obj->up was set to 0? Removing that makes this a lot harder to reproduce:
{code}

Index: res/res_odbc.c
===================================================================
--- res/res_odbc.c	(revision 348047)
+++ res/res_odbc.c	(working copy)
@@ -647,7 +647,6 @@
 SQLFreeHandle(SQL_HANDLE_STMT, stmt);
 stmt = NULL;
 
-obj->up = 0;
 /*
  * While this isn't the best way to try to correct an error, this won't automatically
  * fail when the statement handle invalidates.
{code}

Cause (2): using obj->con appears to be perfectly thread safe, except while we're reconnecting.

To fix this, one might use an rwlock instead of a mutex:
(a) when we're going to use obj->con, take a (shared) read lock on obj->lock;
(b) when we're going to close obj->con, take an (exclusive) write lock on obj->lock. (It had better be recursive too, so we can upgrade the lock from read to write.)

It may affect performance a bit, but I'm hoping the shared nature of the read lock will keep the impact minimal.
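To illustrate the idea, here is a minimal, self-contained sketch of that pattern using plain pthread_rwlock_t (the struct and names are made up for illustration; this is not a patch against res_odbc.c):

{code}
/* Sketch of the proposed read/write locking (illustrative only):
 * query paths hold the lock shared while they use the connection handle,
 * the reconnect path holds it exclusive while tearing the handle down. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct conn_obj {
	pthread_rwlock_t lock;
	void *con;  /* stand-in for the SQLHDBC handle */
};

/* (a) query path: shared lock around every use of obj->con */
static int run_query(struct conn_obj *obj, const char *sql)
{
	int res;

	pthread_rwlock_rdlock(&obj->lock);
	res = obj->con ? 0 : -1;  /* pretend to SQLExecute on obj->con */
	printf("query '%s' -> %d\n", sql, res);
	pthread_rwlock_unlock(&obj->lock);
	return res;
}

/* (b) reconnect path: exclusive lock while the handle is closed and reopened */
static void reconnect(struct conn_obj *obj)
{
	pthread_rwlock_wrlock(&obj->lock);
	free(obj->con);        /* "odbc_obj_disconnect" */
	obj->con = malloc(1);  /* "odbc_obj_connect" */
	pthread_rwlock_unlock(&obj->lock);
}

int main(void)
{
	struct conn_obj obj;

	pthread_rwlock_init(&obj.lock, NULL);
	obj.con = malloc(1);

	run_query(&obj, "SELECT 1");
	/* a query path that finds the connection dead would first drop its
	 * read lock, then call this; readers block until the reconnect finishes */
	reconnect(&obj);
	run_query(&obj, "SELECT 1");

	pthread_rwlock_destroy(&obj.lock);
	free(obj.con);
	return 0;
}
{code}

Note that POSIX rwlocks cannot be upgraded from read to write in place, so a query path that detects a dead connection would have to drop its read lock before taking the write lock; in-tree, the ast_rwlock_t wrappers from lock.h would presumably be used rather than raw pthread calls.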

Thoughts?
Comments:

By: Walter Doekes (wdoekes) 2011-12-13 03:33:57.954-0600

I was thinking of something along the lines of this (the attached work-in-progress patch), but it's not all it could be.

(1) The assertion fails. (FRACK!, Failed assertion obj->con == NULL)
(2) It introduces a lot of extra locking and possibly locking issues.

It does, however, resolve the crash without removing obj->up = 0.

By: Matt Jordan (mjordan) 2011-12-13 12:01:43.770-0600

I'd say put it up on reviewboard if nothing else.