
Summary: ASTERISK-26601: res_odbc: Deadlock when getting connection
Reporter: Carl Fortin (phonefxg)
Labels:
Date Opened: 2016-11-15 10:30:34.000-0600
Date Closed: 2017-02-28 06:53:40.000-0600
Priority: Major
Regression?:
Status: Closed/Complete
Components: Applications/app_voicemail Resources/res_odbc Resources/res_pjsip
Versions: 14.1.1
Frequency of Occurrence:
Related Issues: is a clone of ASTERISK-26859 res_odbc: Deadlock when getting connection
Environment: Asterisk Realtime 14.1.0-rc1, PJSIP, mysql driver Ver 5.1.73, pjproject 2.5.5, spandsp 0.0.6, jansson 2.7, CentOS 6.6 64-bit on VMware, number of phones: 700, average concurrent calls: 16
Attachments:( 0) backtrace-threads_2.txt
( 1) backtrace-threads.txt
( 2) backtrace-threads.txt
( 3) backtrace-threads.txt
( 4) Channel_list.txt
( 5) Channel_list.txt
( 6) Channel_list.txt
( 7) Console_output.txt
( 8) Console_output.txt
( 9) Console_output.txt
(10) core-show-locks.txt
(11) core-show-locks.txt
(12) core-show-locks.txt
(13) core-show-locks.txt
(14) Debug_Asterisk_Backtrace.sh
(15) Debug_log.txt
(16) extconfig.conf
(17) full
(18) full.txt
(19) full-log-before-_restarting.txt
(20) messages.txt
(21) messages.txt
(22) ODBC_pool.txt
(23) res_odbc.conf
(24) sorcery.conf
(25) task_procesor.txt
(26) Task_Processor_Starting_Asterisk.txt
(27) Task_processors_list.txt
(28) Task_processors_list.txt
(29) Task_processors_list.txt
(30) Task_processors_list.txt
(31) Task_processors_list2.txt
(32) top-output.txt
(33) top-output.txt
(34) voicemail_messages.sql
(35) voicemail_users.sql
(36) voicemail.conf
Description: I had Asterisk 14.1.0-rc1 running for 3 weeks, and all of a sudden PJSIP stopped working. Nothing in the console.
I had time to save the task processor output before restarting Asterisk.

After restarting to get the system back up, I can see this in the log files:

taskprocessor.c: The 'app_voicemail' task processor queue reached 500 scheduled tasks.

I did not find any messages concerning the taskprocessor before the system stopped functioning.


I'm aware that I am running an RC release, but looking at the release notes, there was nothing concerning deadlocks, so I was planning to update later.


Comments:By: Asterisk Team (asteriskteam) 2016-11-15 10:30:35.565-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Carl Fortin (phonefxg) 2016-11-15 10:32:44.293-0600

Task processor output right after PJSIP stopped working.


By: Carl Fortin (phonefxg) 2016-11-15 10:42:17.743-0600

This is my debug log before PJSIP stopped working.
Calls stopped at 10:33:57 and I restarted Asterisk at 10:37:33.

By: Joshua C. Colp (jcolp) 2016-11-16 19:23:56.630-0600

Nothing stands out in the log and while the taskprocessors list shows some things in queue, there's nothing huge.

What we'd need when Asterisk is in this state is a backtrace[1] to confirm that something internally is hung. With the information provided there's nothing really we can narrow down or look into.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock
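
For reference, a minimal sketch of the commands that page describes for a deadlock (this assumes the build has DONT_OPTIMIZE, BETTER_BACKTRACES and DEBUG_THREADS enabled; the output paths are illustrative):

# Lock state is only available when Asterisk is compiled with DEBUG_THREADS.
asterisk -rx "core show locks" > /tmp/core-show-locks.txt
# Backtrace of every thread in the running (hung) process.
gdb --batch -ex "thread apply all bt" -p $(pidof asterisk) > /tmp/backtrace-threads.txt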

By: Carl Fortin (phonefxg) 2016-11-17 11:56:12.964-0600

Hi Joshua,

Obviously, we will not be able to find this bug in our test environment, since we need a certain number of active calls
combined with a hard-to-catch event. Our production system is compiled with DONT_OPTIMIZE and BETTER_BACKTRACES, but not with DEBUG_THREADS.
I will have to recompile and re-enable it.

It could take 3 or 4 weeks for the deadlock to occur, maybe more.
I made a script that extracts the backtrace before restarting Asterisk when the taskprocessors start to pile up and everything falls apart.

Take a look at the script I made.


Could something like this work?
Any recommendations will be appreciated.
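
A minimal sketch of what such a watchdog could look like (this is an illustration, not the attached Debug_Asterisk_Backtrace.sh; the threshold, output paths and the exact column layout of "core show taskprocessors" are assumptions and may differ between Asterisk versions):

#!/bin/bash
# Watchdog: if any task processor backlog grows past THRESHOLD, dump debug
# data for the bug report and then restart Asterisk to restore service.
THRESHOLD=400
OUT=/var/log/asterisk/deadlock-$(date +%Y%m%d-%H%M%S)

# Assumes the in-queue count is the third whitespace-separated column of the CLI output.
MAX_QUEUED=$(asterisk -rx "core show taskprocessors" | awk 'NR>2 && $3 ~ /^[0-9]+$/ {print $3}' | sort -n | tail -1)

if [ "${MAX_QUEUED:-0}" -gt "$THRESHOLD" ]; then
    mkdir -p "$OUT"
    asterisk -rx "core show taskprocessors" > "$OUT/Task_processors_list.txt"
    asterisk -rx "core show locks"          > "$OUT/core-show-locks.txt"
    asterisk -rx "core show channels"       > "$OUT/Channel_list.txt"
    gdb --batch -ex "thread apply all bt" -p "$(pidof asterisk)" > "$OUT/backtrace-threads.txt"
    service asterisk restart
fi

Something like this could run from cron every minute, with the restart line commented out during working hours so the restart can be done manually.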

By: Carl Fortin (phonefxg) 2016-11-17 11:59:22.992-0600

My script to automatically create backtraces when taskprocessors pile up.

By: Rusty Newton (rnewton) 2016-11-23 15:23:48.417-0600

From what I can tell that should work to get the debug requested by Josh. Putting this in Waiting on Feedback for now.

By: Asterisk Team (asteriskteam) 2016-12-08 12:00:01.479-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Carl Fortin (phonefxg) 2016-12-08 13:46:19.825-0600

Please reopen the ticket. My bash file worked. It just crashed an hour ago after 2 weeks.
I'm uploading my backtraces.

By: Asterisk Team (asteriskteam) 2016-12-08 13:46:20.148-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Carl Fortin (phonefxg) 2016-12-08 13:47:55.589-0600

Backtraces

By: Carl Fortin (phonefxg) 2016-12-08 13:53:26.873-0600

Backtrace of crash

By: Carl Fortin (phonefxg) 2016-12-08 13:55:05.081-0600

In fact the system crashed once, and my script detected a high task processor count on startup, which is normal, so the crash could be unrelated to the deadlock.


By: Carl Fortin (phonefxg) 2016-12-08 14:00:21.090-0600

My debug log

By: Rusty Newton (rnewton) 2016-12-13 14:19:08.542-0600

The crash doesn't appear related. Is this the first time you have had the crash? Can you open a new issue for the crash and post any information and traces you have for it there? We don't want to muddy this issue, which is about the locking.

By: Carl Fortin (phonefxg) 2016-12-13 14:22:30.967-0600

It is the first time I have had this crash; you are right, I don't think it is related. I will open a new issue for it. Meanwhile we will leave this one open, because the lock will come up eventually.


By: Carl Fortin (phonefxg) 2016-12-14 12:17:12.590-0600

I have created another issue with my backtraces for the crash:  ASTERISK-26662



By: Carl Fortin (phonefxg) 2016-12-22 07:30:17.773-0600

I have modified my script to make sure I do not restart Asterisk if I detect a high number of processes in queue after a normal restart.

By: Joshua C. Colp (jcolp) 2017-01-03 07:08:33.838-0600

I have looked over the previous information again and am no closer to understanding what happened. A backtrace is still required for this deadlock scenario.

By: Carl Fortin (phonefxg) 2017-01-10 11:09:33.677-0600

Backtrace and locks

By: Carl Fortin (phonefxg) 2017-01-10 11:12:25.935-0600

taskprocessor list

By: Carl Fortin (phonefxg) 2017-01-10 11:15:59.261-0600

My debug log

By: Carl Fortin (phonefxg) 2017-01-10 11:19:46.106-0600

OK, I just got the deadlock after running Asterisk for about a month.
I hope we can catch the bug with these new debug logs. I had about 26 active calls when it happened.




By: Carl Fortin (phonefxg) 2017-01-10 11:21:29.002-0600

Here is another task processor list that I extracted just before restarting the service.

By: Carl Fortin (phonefxg) 2017-01-10 11:26:19.517-0600

My debug log

By: Rusty Newton (rnewton) 2017-01-10 12:46:26.981-0600

Thanks for the additional info and thanks for attaching them with txt extensions. That makes it easier for everyone.

By: Rusty Newton (rnewton) 2017-01-10 12:52:45.121-0600

In addition, it might be time to go ahead and test with the latest 14 release. There have been a significant number of issues fixed since last Oct.

By: Rusty Newton (rnewton) 2017-01-10 13:06:25.869-0600

[~rmudgett] took a quick look at the recent data. This doesn't appear to be a deadlocking issue. The issue appears to be the app_voicemail taskprocessor getting backed up and denying service. We are unsure what is going on that gets it into that state.

Tell us more about your app_voicemail configuration and backend. Attach any relevant configuration files or database dumps.

By: Carl Fortin (phonefxg) 2017-01-10 13:19:07.773-0600

I'm using a database for storing messages as well as for voicemail users. I have made an SQL dump of my two voicemail databases.

Most of the time I'm using the voicemail app this way:

same  => n,VoiceMail(2228@default,su)

What else... We have about 800 users and 75 of them subscribe to MWI; the rest all receive their voicemail by email.

I'm also using a trigger in SQL for backing up the voicemail messages, but it has never been a problem over the years.
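
For context, ODBC-backed voicemail in a setup like this is typically wired together roughly as follows (a hedged sketch: the connection class, database and table names are assumptions based on the attachments, and storing messages in the database also requires app_voicemail to be built with ODBC_STORAGE):

; res_odbc.conf - the connection class Asterisk uses
[asterisk]
enabled => yes
dsn => asterisk-connector
pre-connect => yes

; extconfig.conf - realtime voicemail users read from the database
voicemail => odbc,asterisk,voicemail_users

; voicemail.conf [general] - store message audio/metadata through the same class
odbcstorage=asterisk
odbctable=voicemail_messages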




By: Carl Fortin (phonefxg) 2017-01-10 13:22:20.869-0600

My settings related to my voicemail.

By: Carl Fortin (phonefxg) 2017-01-11 08:57:03.677-0600

I wonder if increasing threadpool_max_size from 50 to, let's say, 100 would help.
What about increasing the max_connections setting in res_odbc.conf?
I'm not sure about increasing max_connections because, for some reason, the docs say
not to set the value higher than 1 if using a unixODBC version lower than 2.3.1.
I cannot update unixODBC in CentOS 6; version 2.2.14 seems to be the latest, at least according to yum.

I'm logging slow queries, so I know that I don't have any in SQL.



By: Richard Mudgett (rmudgett) 2017-01-11 10:20:47.545-0600

Your bottleneck is in app_voicemail's task processor so increasing pjsip's max thread pool size won't help.  Since it appears that you are using ODBC for voicemail storage, increasing the max_connections should help, but you must have a newer version of unixODBC to do that.  I know that unixODBC 2.3.1 will cause memory corruption [1] if you increase the max_connections and that is why the voicemail sample file says you need a newer version to increase the number of connections.

[1] http://lists.digium.com/pipermail/asterisk-dev/2016-June/075582.html
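
For reference, the two knobs being discussed live in different files (the values shown are purely illustrative):

; pjsip.conf - size of the PJSIP threadpool
[system]
type=system
threadpool_max_size=50

; res_odbc.conf - size of the ODBC connection pool for this class
; (only raise this above 1 with unixODBC newer than 2.3.1)
[asterisk]
max_connections => 50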

By: Carl Fortin (phonefxg) 2017-01-11 10:27:48.180-0600

I'm running out of ideas. I'm monitoring my SQL server and everything seems to be running fine.
What else can I do?



By: Rusty Newton (rnewton) 2017-01-11 14:46:55.473-0600

I'm unsure at this time... you can always check on https://community.asterisk.org and the users mailing list to see if others have ideas. For now I'm going to open up the issue since this is odd behavior that could be a bug.

By: Carl Fortin (phonefxg) 2017-01-13 07:45:48.008-0600

I would like to add that every time Asterisk starts I get the following warning about the voicemail:
taskprocessor.c: The 'app_voicemail' task processor queue reached 500 scheduled tasks.
What is Asterisk checking about voicemail that demands so much processing at startup?

The warning goes away and comes back when PJSIP stops functioning.

Even though 1000 phones are registering and the database is heavily used by Asterisk for registrations, endpoints and so on, it seems that only app_voicemail is complaining about it.

My unixODBC version is 2.2.14, so it supports pooling and might not have the bug, so I'll give it a try and see if it helps.

Here's the output of my console when pooling is activated:

Name:   asterisk
DSN:    asterisk-connector
   Number of active connections: 35 (out of 50)
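
If the goal is unixODBC's own driver-manager pooling (which is separate from the res_odbc max_connections pool shown in the console output above), it is normally switched on in odbcinst.ini along these lines (the driver section name, library path and timeout are illustrative):

[ODBC]
Pooling = Yes

[MySQL]
Description = MySQL ODBC driver
Driver      = /usr/lib64/libmyodbc5.so
CPTimeout   = 120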

I'll post back when I find out more.





By: Richard Mudgett (rmudgett) 2017-01-13 13:10:08.896-0600

{quote}
I would like to add that every time asterisk starts I get the following warning about the voicemail:
taskprocessor.c: The 'app_voicemail' task processor queue reached 500 scheduled tasks.
What is Asterisk checking about voicemail that demands so much processing at startup?
{quote}
I haven't looked at the app_voicemail code but I would think asterisk is getting MWI for your 1000 mailboxes at startup so I can see why there could be 1000 tasks queued to the task processor.  The fact that you have 1000 mailboxes and the message happens at startup are useful pieces of information.

When any task processor queue reaches its high water limit it causes pjsip to stop accepting any _new_ requests (by ignoring/discarding them) until the queue works off the backlog.  Requests from current calls/dialogs should continue to be accepted in this state.  Once the backlog is worked off enough to reach the low water limit then pjsip will accept new requests again.  In other words this is a transient condition.  It should also be a short lived condition so retransmitted packets would then get processed.

The task processor reached X scheduled tasks message is logged _only_ once per task processor.  In your task processor list attachments, the app_voicemail task processor is the only one that has hit its high water limit and there is a relatively small or no backlog in any of the other queues.  Since you say you get the message at start up for app_voicemail this cause may actually be a red herring.  There is currently no way to tell if app_voicemail hits the high water limit again so we cannot rule out any _other_ causes of pjsip not working later.

So when your issue happens, is it transient or does it persist until you intervene?
Could something like fail2ban and iptables be blocking packets?
Are you using UDP, TCP, or TLS for your SIP transport?

By: Carl Fortin (phonefxg) 2017-01-13 13:41:12.714-0600

So when your issue happens, is it transient or does it persist until you intervene?

It is transient when Asterisk first starts, and it will keep running fine until it happens a second time a couple of weeks later. Then I have to restart Asterisk because nothing appears in the console and PJSIP stops working.

Could something like fail2ban and iptables be blocking packets?
iptables is disabled and I do not use fail2ban. It's a local network behind a hardware firewall.

Are you using UDP, TCP, or TLS for your SIP transport?

I'm using UDP.


By: Richard Mudgett (rmudgett) 2017-01-13 22:30:19.720-0600

I don't think the app_voicemail task processor is backing up a second time when PJSIP stops working.  However, I have created a patch to log the warning each time a task processor hits its high water mark \[1] to create a high water alert instead of only once per Asterisk restart.  Without this patch, you would actually have to enable at least {{core set debug 3}} logging to know if the high water limit was hit again.  Given that this takes a couple weeks to happen that could fill up a hard drive with logs.

Have you verified if packets are reaching the Asterisk server?
A pcap would show the packets reaching the server.
{{pjsip set logger on}} will show if Asterisk is even seeing packets.
At least {{core set debug 3}} will output messages \[2] if it ignores incoming messages due to a task processor high water alert.

\[1] https://gerrit.asterisk.org/#/c/4721/
\[2] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
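
A hedged example of capturing such a pcap on the server itself (the interface name and port are assumptions for a standard UDP/5060 transport):

# Capture SIP traffic so it can be proven whether packets reach the box at all.
tcpdump -i eth0 -n -s 0 -w /tmp/sip-deadlock.pcap udp port 5060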

By: Carl Fortin (phonefxg) 2017-01-16 09:45:12.788-0600

I will install the patch as soon as I can. It will make it easier to find which task processor can't keep up.

Have you verified if packets are reaching the Asterisk server?

Unfortunately no, but next time I will for sure. It's running in a production environment, so every time it happens, I have to restart asterisk quickly. Hence the script I made to do it for me.


By: Carl Fortin (phonefxg) 2017-01-16 19:10:42.922-0600

I'm now running Asterisk with the patch.
It's working: I have never seen that many messages about task processors when starting Asterisk.

I will know right away if another warning comes up. I'm forwarding the Asterisk syslog to our Graylog server.
It will send me an email when a regular expression matches the task processor WARNING.

You were right about MWI. Take a look at the file I uploaded. Even though 85% of our users get their voicemail messages by email, Asterisk is checking to send unsolicited MWI when starting.
Please note that some of our endpoints in ps_endpoints do not have aor and auth set. That's normal in our case.
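
For anyone reproducing this monitoring setup, the syslog forwarding can be wired up roughly like this (the facility, Graylog host name and match string are assumptions):

; /etc/asterisk/logger.conf - send notices/warnings/errors to syslog
[logfiles]
syslog.local0 => notice,warning,error

# /etc/rsyslog.d/30-asterisk.conf - relay that facility to the Graylog server,
# which then alerts on messages matching "task processor queue reached".
local0.* @graylog.example.com:514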



By: Carl Fortin (phonefxg) 2017-01-16 19:11:36.853-0600

Many warnings about task processors upon starting Asterisk with the new patch.

By: Carl Fortin (phonefxg) 2017-01-20 10:11:33.994-0600

I just got the bug 15 minutes ago. I had the following warning sent to me by Graylog:

The 'pjsip/mwi-00000501' task processor queue reached 500 scheduled tasks...

I can now confirm that nothing is happening in the console and packets are not reaching the server (see Console_output.txt).
Check my bash script to see how I extracted the backtraces and other files.

The problem points toward MWI.

Could this be because our Cisco phones behave the way described in this issue:

ASTERISK-26548






By: Carl Fortin (phonefxg) 2017-01-20 11:23:57.220-0600

Just a quick correction about the warning message. It came after restarting Asterisk, but I had processes in queue that made Asterisk stop processing packets. The problem is that I do not know which one; all I know is that my script was executed because some taskprocessors had items queued.

By: Richard Mudgett (rmudgett) 2017-01-20 11:30:33.043-0600

This is the first run you have shown the pjsip/mwi task processors hitting their limit.  I don't know if the pjsip/mwi task processor backlog is because of ASTERISK-26548.  When you gathered the data the task processor backlog had cleared itself, there were no tasks queued in any other task processors, and the system only had about 10 calls on it.  I also think your turning "pjsip set logger on" and then monitoring activity for only 20 seconds is too short to determine if pjsip is not processing packets before killing asterisk to restart it.  You don't even know if any packets were actually sent to the server.

Nothing is showing that this is anything other than a brief transient event that self corrects.

By: Carl Fortin (phonefxg) 2017-01-20 11:41:24.999-0600

I get packets continuously as soon as I turn the logger on during the day; that's why 20 seconds seemed good enough.
I'll comment out the 'restart asterisk' line in my script during my working hours, so I'll be able to do it manually in the console.

 

By: Carl Fortin (phonefxg) 2017-01-23 14:22:42.411-0600

I found something interesting by changing pollmailboxes to no in voicemail.conf.
The warning message about 500 scheduled tasks disappears at Asterisk startup.
If I set pollmailboxes=yes, the warning comes back.

It does not explain why PJSIP stops working after a couple of weeks, but since we do not change mailboxes outside of app_voicemail, we can leave it set to no.
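
For reference, the relevant voicemail.conf [general] options look roughly like this (the polling interval shown is illustrative):

[general]
; Polling is only needed when mailboxes are changed outside app_voicemail
; (e.g. direct database writes); with it off, MWI is only refreshed when
; app_voicemail itself touches a mailbox.
pollmailboxes=no
;pollfreq=60    ; seconds between polls when pollmailboxes=yes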




By: Carl Fortin (phonefxg) 2017-02-28 06:49:07.723-0600

Asterisk has been running for 6 weeks without any problem since I have disabled the mailbox polling.
I guess we can close the ticket for now. I'll post back if something comes up.


By: Joshua C. Colp (jcolp) 2017-02-28 06:53:41.109-0600

Closed per reporter.

By: Carl Fortin (phonefxg) 2017-03-02 10:39:15.079-0600

Please reopen the issue. I just had the problem a few minutes ago. I was able to extract the locks held by asterisk.
I can also confirm that pjsip is processing packets. You will see the SIP packets in my full.txt log.
Right after receiving this warning, only SIP packets are shown in the CLI and we cannot make any calls:

[Mar  2 11:08:36] WARNING[18295] taskprocessor.c: The 'pjsip/distributor-000004e4' task processor queue reached 500 scheduled tasks.

It took over six weeks for the problem to occur.

By: Asterisk Team (asteriskteam) 2017-03-02 10:39:15.496-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Joshua C. Colp (jcolp) 2017-03-02 10:51:38.966-0600

On initial analysis this appears to be the result of the ODBC database connection. It looks as though it blocked, causing everything needing database access to block and wait. A backtrace (I don't see one) would confirm whether this is inside the ODBC driver or not and where it was stuck.

By: Joshua C. Colp (jcolp) 2017-03-02 10:55:16.191-0600

There's nothing in here to show it is a PJSIP problem or problem within Asterisk itself. Do you have a backtrace to confirm it's not the ODBC connector having a problem?

By: Carl Fortin (phonefxg) 2017-03-02 11:06:39.436-0600

Unfortunately I had no time for a backtrace. The whole building was yelling about the phones not working.
Maybe I could enable unixODBC tracing for next time.
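
unixODBC tracing is normally enabled in odbcinst.ini along these lines (a sketch; the trace file path is an assumption, and tracing is verbose enough that it should only be left on temporarily):

[ODBC]
Trace     = Yes
TraceFile = /var/log/odbc-trace.log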


By: Carl Fortin (phonefxg) 2017-03-02 11:10:33.145-0600

Have you noticed that no packets are going out? We only see '<--- Received SIP request', which indicates that Asterisk is not even trying to respond.

By: Joshua C. Colp (jcolp) 2017-03-02 11:16:38.231-0600

Incoming requests are matched to an endpoint which in your setup is done using a database query. Each thread trying to identify who the request is from was blocked waiting for the database, which meant messages piled up, and no responses could be sent (as the message hadn't even been processed yet).

By: Carl Fortin (phonefxg) 2017-03-02 12:19:03.704-0600

All right, I guess we'll have to wait for another month.
I'll take my time to do a backtrace next time. Nobody will be able to call me anyway because the system will be down... :)

By: Joshua C. Colp (jcolp) 2017-03-02 12:21:38.815-0600

I'll leave this in suspended until you get the backtrace, or the automation will close it out. Commenting reopens it automatically like before.

By: Carl Fortin (phonefxg) 2017-03-10 12:52:27.795-0600

OK, I now have backtraces as well as the console output showing all ODBC connections used.
As you can see, I was still able to read from the DB in the CLI (ODBC_pool.txt).


By: Asterisk Team (asteriskteam) 2017-03-10 12:52:28.203-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Joshua C. Colp (jcolp) 2017-03-12 18:28:58.663-0500

I'm not seeing a precise reason why it's deadlocking, but there does appear to be a problem. What version of unixODBC are you using?

By: Carl Fortin (phonefxg) 2017-03-12 18:42:39.486-0500

We are using unixODBC 2.2.14. It's the latest available with yum under CentOS 6. However, I was able to install the latest version from source on my test machine. I was thinking about updating it on my production system too. What about the MySQL engine? Could it make a difference? We are using MyISAM.



By: Joshua C. Colp (jcolp) 2017-03-12 19:10:14.904-0500

Unless you are using a unixODBC version greater than 2.3.1 I do not recommend enabling multiple connections like you've done. The older versions do not handle it well at all and can crash as well as corrupt memory. Using the latest version of everything would work.

By: Joshua C. Colp (jcolp) 2017-03-12 19:16:01.325-0500

In fact that's not a configuration I can say is safe to run at all. You either need to upgrade, or set max_connections to 1. If the problem still occurs then we can investigate further (sorry for repeating this). I was the one who fixed all the pooling stuff and investigated the unixODBC code itself. It's just really, really not safe and in undefined behavior territory.

By: Carl Fortin (phonefxg) 2017-03-13 06:47:21.808-0500

I had the problem with a max_connections of 1 as well, but since I'm now able to update unixODBC, I'll go ahead.

Done. I have updated to the latest version (2.3.4). Let's wait and see how it goes.
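
For anyone following along, a typical from-source upgrade on CentOS 6 looks roughly like this (a sketch, not the reporter's exact procedure; the prefix and libdir are assumptions for a 64-bit install):

# Download unixODBC-2.3.4.tar.gz from unixodbc.org, then:
tar xzf unixODBC-2.3.4.tar.gz
cd unixODBC-2.3.4
./configure --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib64
make && make install
ldconfig
odbcinst -j    # should now report unixODBC 2.3.4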

By: Carl Fortin (phonefxg) 2017-04-18 06:55:23.340-0500

You can close this ticket. The problem did not reoccur after 5 weeks of operation.
I have also written up a small procedure for anyone who wants to upgrade unixODBC in CentOS 6:

https://blogs.asterisk.org/2016/06/15/asterisk-odbc-connections/




By: Joshua C. Colp (jcolp) 2017-04-18 07:10:50.147-0500

Thanks for the response and confirmation!