Summary: ASTERISK-27615: Dialplan deadlock when connection to external SQL server is lost
Reporter: Jared Hull (fortytwo)
Labels:
Date Opened: 2018-01-23 18:17:09.000-0600
Date Closed: 2018-02-06 14:58:31.000-0600
Priority: Critical
Regression?:
Status: Closed/Complete
Components: CDR/cdr_adaptive_odbc, Functions/func_cdr
Versions: 14.6.0
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) core-asterisk-running-2018-01-26T17-11-24+0000-brief.txt
( 1) core-asterisk-running-2018-01-26T17-11-24+0000-full.txt
( 2) core-asterisk-running-2018-01-26T17-11-24+0000-locks.txt
( 3) core-asterisk-running-2018-01-26T17-11-24+0000-thread1.txt
( 4) core-asterisk-running-2018-01-26T20-21-53+0000-brief.txt
( 5) core-asterisk-running-2018-01-26T20-21-53+0000-full.txt
( 6) core-asterisk-running-2018-01-26T20-21-53+0000-locks.txt
( 7) core-asterisk-running-2018-01-26T20-21-53+0000-thread1.txt
Description: We have a cluster of SQL servers for CDR and realtime states that are used in dialplan. Recently we had a single SQL server lose network connectivity, and all Asterisk instances which used this server as their primary started to hang in dialplan.

If I stop the SQL service while leaving the server pingable, Asterisk will continue to work and simply return a few errors when the CDR is committed to the database.
{code}
res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
{code}
If I run 'service network stop' on the SQL server to simulate a network failure, Asterisk stops executing dialplan related to func_odbc and cdr_adaptive_odbc. It is as if it still thinks the SQL connection is there, and in the case of func_odbc it refuses to fail over to another DSN. cdr_adaptive_odbc doesn't even have failover connections (this would be a very useful feature), so I don't know what can be done about that other than to skip the CDR and throw an error.

Dialplan to reproduce:
{code}
exten => 101,1,noop(${CDR(anything)})
exten => 102,1,noop(${ODBC_blacklist_global(42)})
{code}
Example of cdr_adaptive_odbc.conf entry:
{code}
[default]
connection=dev-dallas-sql1
table=cdr_event_log
{code}
Example of func_odbc.conf entry:
{code}
[blacklist_global]
dsn=dev-dallas-sql1,dev-dallas-sql2,dev-dallas-sql3
readsql=SELECT COUNT(*) FROM blacklist_global WHERE cid_number='${SQL_ESC(${ARG1})}'
{code}
Comments:By: Asterisk Team (asteriskteam) 2018-01-23 18:17:10.596-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: George Joseph (gjoseph) 2018-01-24 07:34:53.035-0600

Hi Jared,

A few things we need:
* Linux distribution and version.
* unixODBC and MySQL ODBC connector version.
* /etc/odbc.ini and odbcinst.ini.

Also, can you try the failure scenario with only func_odbc, then with only cdr_adaptive_odbc and see if anything changes if only one of them is trying to get to the server?  Maybe both of them together are causing a deadlock.



By: Jared Hull (fortytwo) 2018-01-24 13:01:46.865-0600

Same issue when loading one or the other.

CentOS 7.3.1611
Kernel 3.10.0-514.21.1.el7.x86_64
unixODBC version 2.3.1 release 11.el7.x86_64

My odbc.ini entries all look like this:
{noformat}
[dev-dallas-sql1]
Description=MySQL for Asterisk CDR
Driver=MySQL
Driver64=MySQL
Server=dev-dallas-sql1.myserver.net
UserName=root
Password=password
Database=container_999999
{noformat}

odbcinst.ini
{noformat}
[MySQL]
Description=ODBC for MySQL
Driver64=/usr/lib64/libmyodbc5.so
Setup64=/usr/lib64/libodbcmyS.so
FileUsage=1
DontDLClose=1
CPTimeout=1000
CPTimeToLive=100
Threading=2
Pooling=yes
{noformat}

Here is something that might indicate where the trouble lies.
{noformat}exten => 102,1,noop(${ODBC_blacklist_global(42)}){noformat}
{noformat}
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx_variables.c:474 ast_str_substitute_variables_full: Evaluating 'SQL_ESC(${ARG1})' (from 'SQL_ESC(${ARG1})}'' len 16)
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx_variables.c:474 ast_str_substitute_variables_full: Evaluating 'ARG1' (from 'ARG1})' len 4)
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx_variables.c:381 ast_str_retrieve_variable: Result of 'ARG1' is '42'
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx_variables.c:508 ast_str_substitute_variables_full: Function SQL_ESC(42) result is '42'
-----when locked up, Asterisk does not get farther than this-----
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: res_odbc.c:931 odbc_obj_connect: Connecting dev-dallas-sql1(0x25f62e8)
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: res_odbc.c:969 odbc_obj_connect: res_odbc: Connected to dev-dallas-sql1 [dev-dallas-sql1 (0x25f62e8)]
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: res_odbc.c:848 _ast_odbc_request_obj2: Created ODBC handle 0x25f62e8 on class 'dev-dallas-sql1', new count is 1
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: func_odbc.c:910 acf_odbc_read: Got collength of 8 and maxcol of 19 for column 'COUNT(*)' (offset 0)
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: func_odbc.c:957 acf_odbc_read: Got coldata of '0'
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: func_odbc.c:981 acf_odbc_read: buf is now set to '0'
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: func_odbc.c:983 acf_odbc_read: buf is now set to '0'
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: res_odbc.c:718 ast_odbc_release_obj: Releasing ODBC handle 0x25f62e8 into pool
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx_variables.c:708 pbx_substitute_variables_helper_full: Function ODBC_blacklist_global(42) result is '0'
[Jan 24 15:42:39] DEBUG[10075][C-00000026]: pbx.c:2914 pbx_extension_helper: Launching 'NoOp'
   -- Executing [102@from-internal:1] NoOp("PJSIP/103a1-0000002d", "0") in new stack
{noformat}

Entries in res_odbc.conf:
{noformat}
[dev-dallas-sql1]
enabled => yes
dsn => dev-dallas-sql1
username => root
password => password
pre-connect => yes
{noformat}


By: George Joseph (gjoseph) 2018-01-24 14:05:12.908-0600

A few things...
You have connection pooling turned on.  The CentOS7 default versions of unixodbc and mysql-connector-odbc are not stable with connection pooling.

Here's some history...
http://lists.digium.com/pipermail/asterisk-dev/2016-June/075580.html

We recommend unixodbc 2.3.4 and whatever the latest mysql-connector-odbc version is, both built from source if the distribution doesn't provide the latest packages.

If you can't upgrade, you can try turning pooling off in odbcinst.ini as a workaround.
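For example, starting from the odbcinst.ini posted above, the [MySQL] section with pooling disabled would look roughly like this (a sketch; the connection-pool timeout options only matter with pooling on, so they can be dropped as well):
{noformat}
[MySQL]
Description=ODBC for MySQL
Driver64=/usr/lib64/libmyodbc5.so
Setup64=/usr/lib64/libodbcmyS.so
FileUsage=1
DontDLClose=1
Threading=2
Pooling=no
{noformat}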

Whichever route you take, if the problem persists, we'd like a backtrace taken from the running asterisk instance while it's locked up.

You can get that by running
{{ sudo /var/lib/asterisk/scripts/ast_coredumper --running}}
but it will effectively halt asterisk altogether while it's doing the dump.

Also, Asterisk 14 is no longer supported so going down to 13.19 or up to 15.2 might be a good idea.


By: Jared Hull (fortytwo) 2018-01-24 18:24:26.562-0600

I've spun up a fresh instance with:
unixODBC 2.3.5-1.el7
mysql-connector-odbc 5.2.5-6.el7
I've tried pooling=no and no pooling option at all in odbcinst.ini, and I'm still having problems.

If I start networking on the SQL server while the dialplan is locked up, calls will start getting through again and I will see something like this.
{noformat}res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1{noformat}
If I restart networking and SQL at the same time (since SQL stops functioning when I kill the network), I instead see this as dialplan starts executing again.
{noformat}func_odbc.c:486 execute: SQL Exec Direct failed (-1)![SELECT COUNT(*) FROM blacklist_global WHERE cid_number='03a1']{noformat}

When bringing networking back up, lines which get/set with the function CDR() will sometimes do this.
{noformat}
res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
[Jan 24 23:31:30] WARNING[39]: res_odbc.c:962 odbc_obj_connect: res_odbc: Error SQLConnect=-1 errno=2003 [unixODBC][MySQL][ODBC 5.2(w) Driver]Can't connect to MySQL server on 'dev-dallas-sql1
[Jan 24 23:31:30] WARNING[39]: cdr_adaptive_odbc.c:436 odbc_log: cdr_adaptive_odbc: Unable to retrieve database handle for 'dev-dallas-sql1:cdr_event_log'.  CDR failed: INSERT INTO cdr_event_log (
{noformat}

An interesting pattern: for the first call that the dialplan processes after the SQL server goes offline, everything is fine. All of the gets with Set(var=${CDR(something)}) and the sets with Set(CDR(something)=${var}) work, and the first call is OK all the way through:
{noformat}
   -- Executing [*97@from-internal:1] Goto("PJSIP/103a1-00000024", "vm103,1") in new stack
   -- Goto (from-internal,vm103,1)
   -- Executing [vm103@from-internal:1] Gosub("PJSIP/103a1-00000024", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
   -- Executing [s@subroutine-cdr:1] Set("PJSIP/103a1-00000024", "saverecpath=") in new stack
   -- Executing [s@subroutine-cdr:2] Set("PJSIP/103a1-00000024", "savebillable=") in new stack
   -- Executing [s@subroutine-cdr:3] ForkCDR("PJSIP/103a1-00000024", "e") in new stack
   -- Executing [s@subroutine-cdr:4] Set("PJSIP/103a1-00000024", "CDR(cdr_event_type)=11") in new stack
   -- Executing [s@subroutine-cdr:5] Set("PJSIP/103a1-00000024", "CDR(cdr_eventID)=58de7c34bfec4") in new stack
   -- Executing [s@subroutine-cdr:6] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_num)=103a1") in new stack
   -- Executing [s@subroutine-cdr:7] Set("PJSIP/103a1-00000024", "CDR(cdr_callerid_name)=x103 &") in new stack
   -- Executing [s@subroutine-cdr:8] Set("PJSIP/103a1-00000024", "CDR(cdr_record_file_location)=") in new stack
   -- Executing [s@subroutine-cdr:9] Set("PJSIP/103a1-00000024", "CDR(cdr_billable)=") in new stack
   -- Executing [s@subroutine-cdr:10] Return("PJSIP/103a1-00000024", "") in new stack
   -- Executing [vm103@from-internal:2] StopMixMonitor("PJSIP/103a1-00000024", "") in new stack
   -- Executing [vm103@from-internal:3] VoiceMailMain("PJSIP/103a1-00000024", "103@voicemail,s") in new stack
 == Spawn extension (from-internal, vm103, 3) exited non-zero on 'PJSIP/103a1-00000024'
   -- Executing [h@from-internal:1] Set("PJSIP/103a1-00000024", "CDR(cdr_hangupcause)=16") in new stack
   -- Executing [h@from-internal:2] GotoIf("PJSIP/103a1-00000024", "0?txfax") in new stack
   -- Executing [h@from-internal:3] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
   -- Executing [h@from-internal:4] GotoIf("PJSIP/103a1-00000024", "0?rg") in new stack
   -- Executing [h@from-internal:5] GotoIf("PJSIP/103a1-00000024", "0?queue") in new stack
   -- Executing [h@from-internal:6] Goto("PJSIP/103a1-00000024", "done") in new stack
   -- Goto (from-internal,h,28)
   -- Executing [h@from-internal:28] NoOp("PJSIP/103a1-00000024", "") in new stack
{noformat}
All calls after the first start to hang at the first CDR() get:
{noformat}
   -- Executing [*97@from-internal:1] Goto("PJSIP/103a1-00000025", "vm103,1") in new stack
   -- Goto (from-internal,vm103,1)
   -- Executing [vm103@from-internal:1] Gosub("PJSIP/103a1-00000025", "subroutine-cdr,s,1(11,58de7c34bfec4)") in new stack
---presumably it is hanging in line 1 of this subroutine which is Set(saverecpath=${CDR(cdr_record_file_location)})
{noformat}

I can start 3 calls which execute the voicemail dialplan, as long as I don't hang any of them up.
When I hang them up, the first call to be hung up finishes the hangup handler, but the rest stop after the first line:
{noformat}
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: channel.c:2590 ast_softhangup_nolock: Soft-Hanging (0x80) up channel 'PJSIP/103a1-0000002d'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx_variables.c:381 ast_str_retrieve_variable: Result of 'HANGUPCAUSE' is '16'
[Jan 25 00:13:11] DEBUG[31908][C-0000002e]: pbx.c:2914 pbx_extension_helper: Launching 'Set'
   -- Executing [h@from-internal:1] Set("PJSIP/103a1-0000002d", "CDR(cdr_hangupcause)=16") in new stack
--the next lines are not in the verbose debug log for the other calls being hung up; I think they are still stuck in the first line, or maybe it can't finalise the CDR--
[Jan 25 00:13:11] DEBUG[39]: cdr.c:1296 cdr_object_finalize: Finalized CDR for PJSIP/103a1-0000002d - start 1516839184.775040 answer 1516839184.804170 end 1516839191.362747 dispo ANSWERED
same => n,GotoIf($["${CDR(cdr_event_type)}" = "2"]?txfax)
{noformat}
All of this seems to suggest that the first Set(CDR(...)=...) blocks all CDR gets and sets thereafter.
{noformat}exten => h,1,Set(CDR(cdr_hangupcause)=${HANGUPCAUSE}){noformat}

I recompiled Asterisk with BETTER_BACKTRACES and DEBUG_THREADS, but I still cannot get ast_coredumper to write anything other than headers to the 4 files it creates.
There isn't anything I can see in the verbose debug log that looks any different from the working calls; all I see is missing lines, because the dialplan stops executing.

I can try to get some logs for you and figure out how to dump the core, but I think the quick fix for us right now might be to implement an HTTP API instead of ODBC for our CDR and dialplan func_odbc queries.

By: George Joseph (gjoseph) 2018-01-26 07:35:15.025-0600

We're taking a look, but in the meantime if {{ast_coredumper --running}}
isn't giving you output, make sure a "core-asterisk-running-*" file was created in /tmp.
You can also use the {{\--no-default-search}} option to prevent ast_coredumper from processing any core file that may already be in the directory.


By: Jared Hull (fortytwo) 2018-01-26 11:23:26.075-0600

I've updated to Asterisk 15.2 and the problem still persists.
Attached is the output generated by ast_coredumper. For this test case I:
# stopped networking on the SQL server
# dialed my *97 voicemail dialplan and hung up, which should have finalised the CDR and attempted to insert it into the unreachable SQL server
# dialed *97 again, and the dialplan stopped at saverecpath=CDR(cdr_record_file_location)

By: Jared Hull (fortytwo) 2018-01-26 15:09:53.203-0600

Ok, I've got a good idea about what is going on now.
When Asterisk finalises the CDR with {{odbc_log()}} from cdr_adaptive_odbc.c or executes a dialplan ODBC_ function from func_odbc.c, it calls {{connection_dead()}} from res_odbc.c. That function calls {{SQLGetConnectAttr()}} from libmyodbc5.so to read the SQL_ATTR_CONNECTION_DEAD attribute, and that in turn calls {{mysql_ping()}} from libmysqlclient.so.18, which doesn't return unless I restore networking on the SQL server. There might be a bug in one of the SQL libraries I am using.

That aside, I don't think asterisk should be hanging on subsequent ODBC_ dialplan functions and CDR() gets/sets.

By: Jared Hull (fortytwo) 2018-01-26 16:51:43.633-0600

I'm also attaching the core dump for the ODBC_ function hang.
core-asterisk-running-2018-01-26T20-21-53+0000

By: Jared Hull (fortytwo) 2018-02-06 15:38:43.339-0600

A bit of an update on this: I've recompiled just about everything related, and the problem persists.

From this configuration
{noformat}Asterisk 14.6
unixODBC-2.3.1-11.el7.x86_64
Standard Centos7 mysql libraries (MariaDB 5.5.56-2.el7)
mysql-connector-odbc.x86_64   5.2.5-6.el7{noformat}
To this configuration
{noformat}Asterisk 15.2
unixODBC-2.3.5-1.el7.x86_64
MariaDB-shared (mysql libraries from the official MariaDB 10.2 repo)
mariadb-connector-odbc-3.0.2{noformat}

If the MySQL libraries refuse to return from mysql_ping(), maybe it would be best for Asterisk to try a {{SELECT 1}} and look at the result to determine whether the connection is alive?

By: Richard Mudgett (rmudgett) 2018-02-07 16:47:56.429-0600

I see why calls to the CDR function can block.  The CDR code has one thread to process CDR events from the system.  CDR function requests to read or write to CDRs are handled by that one CDR processing thread.  Once the first call completes, the CDR processing thread tries to write completed CDRs to the back ends and blocks trying to access the database.  Subsequent CDR function requests are now blocked because the CDR processing thread is blocked.

You can avoid the CDR processing thread blocking by enabling batch mode.  In batch mode, when CDRs are finalized they get queued to another thread to write them to the CDR back ends.  There is a comment in cdr.c that not using batch mode is legacy behavior.  See the CDR options documentation \[1] about configuring batch mode.
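For reference, a minimal cdr.conf sketch with batch mode enabled might look like this (option names as in cdr.conf.sample; tune {{size}} and {{time}} for your load):
{noformat}
[general]
enable = yes
; Queue finalized CDRs and post them to the backends from a separate thread,
; so the core CDR processing thread never blocks on the database.
batch = yes
; Flush the queue at 100 records or every 300 seconds, whichever comes first.
size = 100
time = 300
; Post any queued CDRs before shutting down.
safeshutdown = yes
{noformat}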

Now for the thread blocking in the database connector library.  This looks more like a bug in the library.  The {{connection_dead()}} routines in func_odbc.c and res_odbc.c check {{SQL_ATTR_CONNECTION_DEAD}} before trying a {{SELECT 1}}.  To see if your {{SELECT 1}} idea would even help, as a test you can remove the {{SQL_ATTR_CONNECTION_DEAD}} check to do the {{SELECT 1}} check instead.
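To illustrate the idea, here is a minimal standalone sketch of such a {{SELECT 1}} probe using the plain ODBC API (this is not the actual res_odbc.c code, and the driver can still block inside the query if the network is gone):
{code}
/* Sketch only: probe a connection handle with "SELECT 1" instead of
 * trusting SQL_ATTR_CONNECTION_DEAD.  If the driver blocks inside the
 * query on a dead network, this check will block too. */
#include <sql.h>
#include <sqlext.h>

static int connection_is_dead(SQLHDBC dbc)
{
    SQLHSTMT stmt;
    SQLRETURN res;

    if (!SQL_SUCCEEDED(SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt))) {
        return 1; /* can't even allocate a statement; treat as dead */
    }

    res = SQLExecDirect(stmt, (SQLCHAR *) "SELECT 1", SQL_NTS);
    SQLFreeHandle(SQL_HANDLE_STMT, stmt);

    return SQL_SUCCEEDED(res) ? 0 : 1;
}
{code}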

\[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+15+Configuration_cdr

By: Jared Hull (fortytwo) 2018-02-07 18:40:41.839-0600

Batch mode does seem to help avoid the CDR problem, but certain bits of my dialplan still hang when I use ODBC_ functions. They must also be handled by a single thread, which is being blocked by the dead connection check.

I've patched out the {{SQL_ATTR_CONNECTION_DEAD}} checks in favour of {{SELECT 1}} but it still blocks the thread.
It is hanging in the same place.
{noformat}#0  0x00007fe65ac8e70d in read () from /lib64/libpthread.so.0
#1  0x00007fe5ebb4daf0 in vio_read () from /usr/lib64/mysql/libmysqlclient.so.18
#2  0x00007fe5ebb4db71 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient.so.18
#3  0x00007fe5ebb31d0a in my_real_read(st_net*, unsigned long*) () from /usr/lib64/mysql/libmysqlclient.so.18
#4  0x00007fe5ebb32b7c in my_net_read () from /usr/lib64/mysql/libmysqlclient.so.18
#5  0x00007fe5ebb257ac in cli_safe_read () from /usr/lib64/mysql/libmysqlclient.so.18
#6  0x00007fe5ebb26ca6 in cli_read_query_result () from /usr/lib64/mysql/libmysqlclient.so.18
#7  0x00007fe5ebb27e56 in mysql_real_query () from /usr/lib64/mysql/libmysqlclient.so.18
#8  0x00007fe5f0434244 in do_query () from /usr/lib64/libmyodbc5.so
#9  0x00007fe5f043626a in my_SQLExecute () from /usr/lib64/libmyodbc5.so
#10 0x00007fe5f3ba6439 in SQLExecute () from /lib64/libodbc.so.2
#11 0x00007fe5f3dfb6d1 in connection_dead (connection=0x1e37c88, class=0x1e33218) at res_odbc.c:797
{noformat}

There must be something very wrong with my environment; I guess Asterisk is doing all it can.

By: Richard Mudgett (rmudgett) 2018-02-08 09:34:21.928-0600

Batch mode just shifts the database access to another thread to write the CDR records.  The ODBC functions need to access the database right then so they would still have to block waiting for a result.

From my reading about {{SQL_ATTR_CONNECTION_DEAD}}, it is not supposed to query or ping the database but simply return the last known connection status.  I think this is a bug in the database connector library and is outside asterisk's control.
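For what it's worth, the non-blocking check that {{SQL_ATTR_CONNECTION_DEAD}} is supposed to be amounts to something like this (a sketch; a conforming driver should return its cached state immediately and never touch the network here):
{code}
SQLUINTEGER dead = SQL_CD_FALSE;
SQLRETURN res = SQLGetConnectAttr(dbc, SQL_ATTR_CONNECTION_DEAD,
                                  &dead, 0, NULL);
/* A driver that pings or reads from the socket inside this call is
 * misbehaving; it should only report the last known connection state. */
int is_dead = SQL_SUCCEEDED(res) && dead == SQL_CD_TRUE;
{code}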

By: Jared Hull (fortytwo) 2018-02-08 10:32:09.126-0600

Thanks for the help. I will experiment with my libraries to see if I can work out the bug.