Summary: | ASTERISK-22093: Deadlock due to locking inversion between PBX context lock and channel lock while reloading dialplan through pbx_lua | ||
Reporter: | David Brillert (aragon) | Labels: | |
Date Opened: | 2013-07-17 15:28:32 | Date Closed: | 2016-08-19 14:16:32 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/PBX PBX/pbx_lua |
Versions: | 11.5.0 | Frequency of Occurrence | Occasional |
Related Issues: | |||
Environment: | Attachments: | ( 0) 7_26_2013_gdb_thread_apply_all_bt_full.txt ( 1) gdb_thread_apply_all_bt.txt ( 2) gdb_thread_apply_all_bt_full.txt ( 3) valgrind.rar | |
Description: | Third deadlock in two weeks.
Asterisk not built with NOT_OPTIMIZED or thread debug options. The best I could do was attach GDB to Asterisk PID and get a thread apply all BT (attached) | ||
Comments: | By: David Brillert (aragon) 2013-07-18 08:26:10.037-0500 DEBUG_THREADS is not enabled in the build and the gdb shows a lot of <value optimized out>. The system is a very busy production system and would not survive DEBUG_THREADS. Is there another build option I could use to get the required locking data? Or is there enough info in the GDB trace to help with a patch? This is a subversion build of Asterisk 11 r391794 By: Matt Jordan (mjordan) 2013-07-18 08:45:23.062-0500 I don't think this has anything to do with Local channels. Rather, it looks like a locking inversion between reloading the dialplan from {{pbx_lua}} (and one of the dialplan context locks) and a channel lock. Unfortunately, without a {{core show locks}} it's difficult to find the culprit. A {{bt full}} might give a bit more information such that the culprit can be found. By: David Brillert (aragon) 2013-07-18 08:58:51.946-0500 I'll collect that bt full as soon as I see another deadlock :D I have updated to SVN 393929 By: David Brillert (aragon) 2013-07-22 14:37:25.067-0500 Here is the gdb thread apply all bt full from today's deadlock By: David Brillert (aragon) 2013-07-26 09:56:55.475-0500 Another deadlock today. Reproducible on a reload. So I modified modules.conf to remove lua. noload => pbx_lua.so Now the system doesn't completely deadlock (processes calls) but the CPU load goes through the roof on any reload. Attached gdb during the reload By: Rusty Newton (rnewton) 2013-07-26 10:21:35.838-0500 Mark Michelson triaged the new trace and it looks like it should be helpful. Though "core show locks" output could reduce hours of work to minutes of work. pbx_lua is under extended support so I've put the issue into an open state for a community dev to take a look when they are available. By: David Brillert (aragon) 2013-07-26 10:32:13.631-0500 Thank you kindly, that is good news. By: David Brillert (aragon) 2013-07-26 10:36:08.693-0500 We aren't using pbx_lua so I changed modules.conf so it won't load. I hoped that would be enough to fix the problem but the gdb trace still doesn't look 'happy, happy, happy'. The load gets crazy and I still see some <Address 0x6 out of bounds> that look like memory corruption in the gdb trace. By: Rusty Newton (rnewton) 2013-07-26 14:00:14.032-0500 Sorry I had missed your 26/Jul/13 9:56 AM before I commented. Can you get [valgrind|https://wiki.asterisk.org/wiki/display/AST/Valgrind] output? To confirm, after noloading pbx_lua you don't get a full deadlock, but CPU usage spikes. How long does the CPU usage hit the roof for? By: David Brillert (aragon) 2013-07-26 14:07:03.708-0500 I have downloaded the customer's config to a lab server and starting valgrind. It's taking a long time to start under Valgrind ;) To confirm, after noloading pbx_lua you don't get a full deadlock, but CPU usage spikes. - So far so good on the full deadlock How long does the CPU usage hit the roof for? - a long time, until I restart Asterisk The trace = gdb thread apply all bt full.txt 22/Jul/13 3:37 PM Is a gdb attachment during a full deadlock with lua loaded. The trace = 7_26_2013 gdb thread apply all bt full.txt 26/Jul/13 10:56 AM Is a gdb attachment during the high CPU load with noload => pbx_lua.so By: David Brillert (aragon) 2013-07-26 14:47:02.287-0500 I hope this valgrind attachment will help. It's not from the production server and had zero voice traffic on it, and no registered extensions. By: Rusty Newton (rnewton) 2013-07-29 16:18:21.168-0500 Since you are not getting the deadlock it sounds like we have two possibly separate issues. To avoid confusion please create a new issue for the high CPU consumption. Attach your new trace and valgrind output to that one, then we'll look at it there. Since the locking was related to pbx_lua and your not even using that, we won't expect this new debug to help for the previous deadlock. Thanks! By: David Brillert (aragon) 2013-07-29 16:32:46.148-0500 New ticket is opened ASTERISK-22208 By: David Brillert (aragon) 2013-09-16 14:22:58.856-0500 Why am I the assignee on this bug report? |