[Home]

Summary:ASTERISK-27314: chan_sip: Crash/Deadlock with realtime peers (MySQL)
Reporter:Cyril Ramière (Cyril.r)Labels:crash deadlock
Date Opened:2017-10-04 18:05:30Date Closed:2020-01-14 11:13:34.000-0600
Priority:CriticalRegression?
Status:Closed/CompleteComponents:General
Versions:14.5.0 14.6.2 15.0.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Linux Ubuntu 16.04.3 LTS (4.4.0-96-generic)Attachments:( 0) asterisk.zip
( 1) asterisk-2017-10-04T15-15-31+0200.coredumps.tar.gz
( 2) core_show_locks.txt
( 3) core-asterisk-running-2017-10-04T15-15-31+0200.zip
( 4) core-asterisk-running-2017-10-04T15-15-31+0200-brief.txt
( 5) core-asterisk-running-2017-10-04T15-15-31+0200-full.txt
( 6) core-asterisk-running-2017-10-04T15-15-31+0200-full.txt
( 7) core-asterisk-running-2017-10-04T15-15-31+0200-locks.txt
( 8) core-asterisk-running-2017-10-04T15-15-31+0200-thread1.txt
Description:Hi,

I think I found a problem with asterisk > 13.x (tested on 14.5.0 & 14.6.2 & 15.0.0) & MySQL realtime.

When there are two (or more) incoming calls made almost simultaneously, or the server is "under load", the asterisk server stops responding to INVITEs almost instantaneously, locks one CPU to 100%, then acts strange (it is not fully locked but it doesn't accept new calls, while keeping active calls up)

Found out that when asterisk is "stuck", doing "realtime mysql status" in the console causes a segfault or freeze the console input.

After digging through the problem, it seems that the "sippeers => mysql,asterisk,sipUsers" line in extconfig.conf is causing all the troubles.

Note : sql schema should be fine, I used the file path/to/src/asterisk-x.x.x/contrib/realtime/mysql/mysql_config.sql, connection to the server is fine too.

Even with an empty table the problem happens.

I made my tests with SIPP to reproduce the issue, here is the command (or you can just "load" the server with real calls):

(limit=10 calls at rate 3calls/sec to exten 1000)
sipp -sn uac -d 600000 -s "1000" 10.1.1.104 -l 10 -r 3 -mp 5606

Configuration is :

Ubuntu 16.04.3 LTS (4.4.0-96-generic)

- sip.conf : my trunk
- extensions.conf : static dialplan for test + switch realtime (not used for the test)
- extconfig.conf : use for MOH, extensions, peers (problem here)
- res_mysql.conf : our SQL server

The exact same configurations is fine on our older asterisks V13.9.1 (no problems).

Here are the "core show locks" result, dumps attached to the ticket.

core show locks
[Edit by Rusty - Removed and attached to issue as core_show_locks.txt - Please attached all lengthy debug to the ticket]
Comments:By: Asterisk Team (asteriskteam) 2017-10-04 18:05:30.903-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2017-10-05 07:33:31.759-0500

Please attach a concise configuration (whatever is needed in /etc/asterisk/) that can be used to reproduce the issue with your sipp test.

By: Cyril Ramière (Cyril.r) 2017-10-05 18:20:26.886-0500

Hello,

I've attached the configuration I used, this is a "minimum" configuration just to reproduce the problem. [asterisk.zip]

The problem seems linked to the latency/response time of the sql server.

I was able to run about 100 simultaneous calls with 5 calls/sec fine with a local sql server, but it locked up after a while (seems that just a little lag/delay can trigger the issue).
When RDS is used (remote SQL server with about 10-20ms connection delay) the problem happens almost instantaneously.

You can use the following command to add some latency to your network adapter, then the problem appears instantaneously on the second call.

{{tc qdisc add dev ethXX root netem delay 30ms}}



By: Cyril Ramière (Cyril.r) 2017-10-05 18:20:52.136-0500

Configuration files for asterisk

By: Joshua C. Colp (jcolp) 2017-10-20 11:20:17.731-0500

This issue appears to be a result of the MySQL library itself, and not Asterisk:

{noformat}
Thread 27 (Thread 0x7f3558152700 (LWP 62061)):
#0  0x00007f3517b6ec7c in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#1  0x00007f3517b7ca8b in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#2  0x00007f3517b61f0d in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#3  0x00007f3517b49316 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#4  0x00007f3517b25588 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#5  0x00007f3517b25827 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#6  0x00007f3517b265ac in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#7  0x00007f3517b1af07 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#8  0x00007f3517b2182b in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#9  0x00007f3517b14b26 in mysql_ping () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20
No symbol table info available.
#10 0x00007f3518107bc1 in mysql_reconnect (conn=0x3049f80) at res_config_mysql.c:1374
       __PRETTY_FUNCTION__ = "mysql_reconnect"
#11 0x00007f35181018ff in realtime_mysql (database=0x7f355814f850 "asterisk", table=0x7f355814f950 "sippeersnew", rt_fields=0x7f3530058030) at res_config_mysql.c:348
       dbh = 0x3049f80
       result = 0x0
       row = 0xcdf350
       fields = 0x7f351830c620 <mysql_engine>
       numFields = 0
       i = 5226433
       sql = 0x7f353000e4e0
       buf = 0x7f353000e410
       stringp = 0x7f355814f810 "`372024X5177"
       chunk = 0x4fc188 <find_engine+452> "H211330H213]350dH3034%("
       op = 0x0
       escape = 0x7f3518108df8 ""
       field = 0x7f3530058030
       var = 0x0
       prev = 0x0
       __PRETTY_FUNCTION__ = "realtime_mysql"
#12 0x00000000004fca00 in ast_load_realtime_all_fields (family=0x7f35178b2247 "sippeers", fields=0x7f3530058030) at config.c:3257
{noformat}

The mysql_ping function is blocking and never returns, which then causes everything else wanting to use MySQL to block.

What MySQL client library are you using? Is it the latest?

By: Cyril Ramière (Cyril.r) 2017-10-23 08:15:59.039-0500

Hi Joshua,

I thought of this and tried removing mysql-client (was 5.7.19-0ubuntu0.16.04.1 amd64) and replaced with mariadb-client (10.0.31-0ubuntu0.16.04.2 amd64), no difference.

OS is Ubuntu 16.04.3 LTS with latest updates, fresh install.

I reproduced the issue on AWS with the AWS provided ubuntu AMI, and on a VmWare vm that I installed myself from scratch.

On the very same machine, doing full uninstall + reinstall of asterisk 13.9.1, it works without problems.
Reinstalling newer version of asterisk causes the issue to be back.

Best regards

By: Joshua C. Colp (jcolp) 2017-10-23 08:25:07.279-0500

It's possible that somehow a change that was made caused it to happen, but I don't see how since it appears to be library level. Very strange indeed.

By: Cyril Ramière (Cyril.r) 2017-10-23 08:30:28.020-0500

Is there is anything I can do to help you?

By: Joshua C. Colp (jcolp) 2017-10-23 08:35:32.770-0500

I was just doing triage to ensure everything that is needed for someone to look at this is present. I don't have anything else to ask for, and I'm not actively working on it. Both chan_sip and res_config_mysql are community supported so it is up to the community to look into things.

By: Kevin Harwell (kharwell) 2017-10-23 17:22:47.211-0500

[~Cyril.r] since you state that going back to 13.9.1 works fine it's possible that a problem was introduced sometime between that version and the current one. If you could narrow down which version specifically it might help find the change that may be causing this. So for instance 13.17.2 is the current version, but does it happen in 13.13.0? If so then the change was before that version, 13.11.0?, etc...

Another thing you could try as well is see if this happens with the pjsip channel driver.

By: Asterisk Team (asteriskteam) 2017-11-07 12:00:01.555-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines