Summary: | ASTERISK-19011: crashing res_odbc because of use of obj->con while reconnecting | ||
Reporter: | Walter Doekes (wdoekes) | Labels: | |
Date Opened: | 2011-12-13 02:16:40.000-0600 | Date Closed: | 2012-02-27 09:18:20.000-0600 |
Priority: | Minor | Regression? | No |
Status: | Closed/Complete | Components: | Resources/res_odbc |
Versions: | 1.8.8.0 | Frequency of Occurrence: | Constant
Related Issues: | |||
Environment: | Attachments: | ( 0) issueA19011_combine_read_and_write_locks_WORK_IN_PROGRESS.patch | |
Description: | Hi, there is potential for res_odbc to crash in the right circumstances: (1) it's reconnecting (2) someone else is attempting to do a query. There is this mutex: {code} struct odbc_obj { ast_mutex_t lock; SQLHDBC con; /*!< ODBC Connection Handle */ {code} But it's only used by the reconnecting party, not by the users of obj->con. This means that other threads can attempt to execute queries on the SQLHDBC that is being closed. h3. Steps to reproduce Note that we explicitly *do* *not* create the necessary sippeer table. One could argue that it's my own fault for not having the right tables. But (a) there are cases when a table is gone for a short while (table re-creation) and (b) the issue could still occur with the right tables, but it's harder to reproduce. {noformat} >>> extconfig.conf <<< [settings] sippeers => odbc,asterisk,sippeer >>> modules.conf <<< [modules] autoload=no load => chan_sip.so load => res_odbc.so load => res_config_odbc.so >>> res_odbc.conf <<< [asterisk] enabled => yes dsn => asterisk pre-connect => yes >>> sip.conf <<< [general] {noformat} Start asterisk. Start a shell and run this: {noformat} $ while true; do sipsak -U -s sip:100@localhost; done {noformat} Type this in the console: {noformat} *CLI> sip show peer 123 load {noformat} Result: {noformat} ... res_odbc: Connected to asterisk [asterisk] SQL Execute returned an error -1: 42S02: [MySQL][ODBC 3.51 Driver][mysqld-5.0.51a-24+lenny5-log]Table 'asterisk.sippeer' doesn't exist (100) SQL Execute error -1! Verifying connection to asterisk [asterisk]... Connection is down attempting to reconnect... SQL Execute returned an error -1: 42S02: [MySQL][ODBC 3.51 Driver][mysqld-5.0.51a-24+lenny5-log]Table 'asterisk.sippeer' doesn't exist (100) SQL Execute error -1! Verifying connection to asterisk [asterisk]... Error in my_thread_global_end(): 1 threads didn't exit Disconnected 0 from asterisk [asterisk] Segmentation fault {noformat} h3. 
Backtrace {noformat} Program terminated with signal 11, Segmentation fault. #0 0x00007f4845eb55ae in ?? () from /usr/lib/libodbc.so.1 {noformat} {noformat} (gdb) back #0 0x00007f4845eb55ae in ?? () from /usr/lib/libodbc.so.1 #1 0x00007f4846103eb6 in ast_odbc_prepare_and_execute (obj=0x2aebb28, prepare_cb=0x7f483f2ff04a <custom_prepare>, data=0x7f48441a49e0) at res_odbc.c:647 #2 0x00007f483f2ffea3 in realtime_odbc (database=0x7f48441a6160 "asterisk", table=0x7f48441a6060 "sippeer", ap=0x7f48441a62a0) at res_config_odbc.c:209 #3 0x000000000049f89f in ast_load_realtime_helper (family=0x7f483f5c7bd7 "sippeers", ap=0x7f48441a62a0) at config.c:2332 #4 0x000000000049fa7d in ast_load_realtime (family=0x7f483f5c7bd7 "sippeers") at config.c:2363 #5 0x00007f483f52202c in realtime_peer_by_name (name=0x7f48441a6458, addr=0x0, ipaddr=0x7f48441a64a0 "", var=0x7f48441a6478, varregs=0x0) at chan_sip.c:4743 #6 0x00007f483f522674 in realtime_peer (newpeername=0x7f48441a6c54 "100", addr=0x0, devstate_only=0, which_objects=2) at chan_sip.c:4895 #7 0x00007f483f522d50 in find_peer (peer=0x7f48441a6c54 "100", addr=0x0, realtime=1, which_objects=2, devstate_only=0, transport=0) at chan_sip.c:5009 #8 0x00007f483f559123 in register_verify (p=0x2b1a158, addr=0x7f48441a7cc0, req=0x7f48441a7260, uri=0x2b27ca1 "sip:localhost") at chan_sip.c:14575 #9 0x00007f483f593bee in handle_request_register (p=0x2b1a158, req=0x7f48441a7260, addr=0x7f48441a7cc0, e=0x2b27ca1 "sip:localhost") at chan_sip.c:24736 #10 0x00007f483f5954df in handle_incoming (p=0x2b1a158, req=0x7f48441a7260, addr=0x7f48441a7cc0, recount=0x7f48441a721c, nounlock=0x7f48441a7218) at chan_sip.c:25008 #11 0x00007f483f595e64 in handle_request_do (req=0x7f48441a7260, addr=0x7f48441a7cc0) at chan_sip.c:25168 #12 0x00007f483f595a84 in sipsock_read (id=0x2b26460, fd=6, events=1, ignore=0x0) at chan_sip.c:25102 #13 0x00000000004ef950 in ast_io_wait (ioc=0x7f4840006da0, howlong=1000) at io.c:288 #14 0x00007f483f5979f9 in do_monitor (data=0x0) 
at chan_sip.c:25631 #15 0x00000000005866ed in dummy_start (data=0x2b263f0) at utils.c:1004 #16 0x00007f4846ef19ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #17 0x00007f484774270d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #18 0x0000000000000000 in ?? () {noformat} {noformat} (gdb) thread 11 [Switching to thread 11 (Thread 13152)]#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136 (gdb) back #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136 #1 0x00007f4846ef45d9 in _L_lock_953 () from /lib/libpthread.so.0 #2 0x00007f4846ef43fb in __pthread_mutex_lock (mutex=0x7f48460fe580) at pthread_mutex_lock.c:61 #3 0x00007f4845eb56d0 in ?? () from /usr/lib/libodbc.so.1 #4 0x00007f4846108209 in odbc_obj_disconnect (obj=0x2aebb28) at res_odbc.c:1451 #5 0x00007f4846104445 in ast_odbc_sanity_check (obj=0x2aebb28) at res_odbc.c:745 #6 0x00007f4846103ee1 in ast_odbc_prepare_and_execute (obj=0x2aebb28, prepare_cb=0x7f483f2ff04a <custom_prepare>, data=0x7fff59a093e0) at res_odbc.c:655 #7 0x00007f483f2ffea3 in realtime_odbc (database=0x7fff59a0ab60 "asterisk", table=0x7fff59a0aa60 "sippeer", ap=0x7fff59a0aca0) at res_config_odbc.c:209 #8 0x000000000049f89f in ast_load_realtime_helper (family=0x7f483f5c7bd7 "sippeers", ap=0x7fff59a0aca0) at config.c:2332 #9 0x000000000049fa7d in ast_load_realtime (family=0x7f483f5c7bd7 "sippeers") at config.c:2363 #10 0x00007f483f5220be in realtime_peer_by_name (name=0x7fff59a0ae58, addr=0x0, ipaddr=0x7fff59a0aea0 "", var=0x7fff59a0ae78, varregs=0x0) at chan_sip.c:4749 #11 0x00007f483f522674 in realtime_peer (newpeername=0x260e1be "123", addr=0x0, devstate_only=0, which_objects=2) at chan_sip.c:4895 #12 0x00007f483f522d50 in find_peer (peer=0x260e1be "123", addr=0x0, realtime=1, which_objects=2, devstate_only=0, transport=0) at chan_sip.c:5009 #13 0x00007f483f56a5f7 in _sip_show_peer (type=0, fd=1, s=0x0, m=0x0, argc=5, argv=0x7fff59a0bac8) at 
chan_sip.c:17183 #14 0x00007f483f56a001 in sip_show_peer (e=0x7f483f7ed008, cmd=-4, a=0x7fff59a0bcd0) at chan_sip.c:17076 #15 0x00000000004984ea in ast_cli_command_full (uid=-1, gid=-1, fd=1, s=0x260e700 "sip show peer 123 load") at cli.c:2502 #16 0x0000000000440b68 in consolehandler (s=0x260e700 "sip show peer 123 load") at asterisk.c:1810 #17 0x0000000000447f80 in main (argc=2, argv=0x7fff59a0e1e8) at asterisk.c:3896 {noformat} h3. Cause There are two causes for this: Cause (1): there is a sanity check (SELECT 1) meant to verify that the DB is still up. However, the check is not performed; an unconditional reconnect is done instead. {noformat} ast_odbc_prepare_and_execute sets obj->up to 0 before calling ast_odbc_sanity_check, result: ast_odbc_sanity_check skips the SELECT 1 and performs the reconnect regardless {noformat} Is there a reason why obj->up was set to 0? Removing that makes this a lot harder to reproduce: {code} Index: res/res_odbc.c =================================================================== --- res/res_odbc.c (revision 348047) +++ res/res_odbc.c (working copy) @@ -647,7 +647,6 @@ SQLFreeHandle(SQL_HANDLE_STMT, stmt); stmt = NULL; - obj->up = 0; /* * While this isn't the best way to try to correct an error, this won't automatically * fail when the statement handle invalidates. {code} Cause (2): concurrent use of obj->con appears to be thread safe, except while we are reconnecting. To fix this, one might use an rwlock instead of a mutex: (a) when we're going to use obj->con, take a (shared) read lock on obj->lock; (b) when we're going to close obj->con, take an (exclusive) write lock on obj->lock. (Preferably a recursive one too, so we can upgrade our lock from read to write.) It may affect performance a bit, but I'm hoping the shared nature of the read lock will keep the impact minimal. Thoughts? | ||
Comments: | By: Walter Doekes (wdoekes) 2011-12-13 03:33:57.954-0600 I was thinking something along the lines of this. But it's not all it can be. (1) The assertion fails. (FRACK!, Failed assertion obj->con == NULL) (2) It introduces a lot of extra locking and possibly locking issues. It does however resolve the crash, without having removed obj->up=0. By: Matt Jordan (mjordan) 2011-12-13 12:01:43.770-0600 I'd say put it up on reviewboard if nothing else. |