
Summary: ASTERISK-21228: Deadlock in pbx_find_extension when attempting an autoservice stop due to holding the context lock
Reporter: Dare Awktane (awktane)
Labels:
Date Opened: 2013-03-08 10:14:07.000-0600
Date Closed:
Priority: Major
Regression?:
Status: Open/New
Components: Core/PBX, PBX/pbx_realtime
Versions: 1.8.20.1, 1.8.21.0, 11.2.1, 11.3.0, 13.18.4
Frequency of Occurrence: Frequent
Related Issues:
duplicates ASTERISK-18249 Deadlocks when using a switch statement
duplicates ASTERISK-18764 Asterisk stopped accepting sip registrations, and stopped logging to full.
duplicates ASTERISK-20537 Asterisk deadlocks between looking up extension from process_sdp and bridge execution from pbx_realtime
duplicates ASTERISK-21040 Deadlock involving chan_sip.c, pbx.c and autoservice.c, locking on chan and &conclock
duplicates ASTERISK-22835 pbx_realtime: deadlock with channel in autoservice while calling realtime switch
duplicates ASTERISK-23325 Asterisk is Running but not processing anything
duplicates ASTERISK-24989 Deadlock between chan_sip, channels, and pbx_realtime
duplicates ASTERISK-25602 chan_sip deadlocks after INVITE processing while calling sip_report_security_event
is duplicated by ASTERISK-23496 SIP Channel Stops Accepting Packets
is duplicated by ASTERISK-24360 Asterisk deadlock
is duplicated by ASTERISK-25046 Chan_sip deadlock persistent between Asterisk 1.8 versions, using Realtime.
is duplicated by ASTERISK-26052 Asterisk hangs with a T.38 call
is duplicated by ASTERISK-26435 Asterisk freezing randomly
is duplicated by ASTERISK-28893 pbx_realtime: Cascading deadlock due to ast_autoservice_stop blocking
is related to ASTERISK-16365 [patch] dialplan reload deadlocks in ast_rdlock_contexts when calling ast_hint_state_changed
Environment: Linux 3.2.0-38-generic #61-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Attachments: ( 0) 1366481920-backtrace_threads.txt
( 1) 1366481920-core_show_locks.txt
( 2) 1366495718-backtrace_threads.txt
( 3) 1366495718-core_show_locks.txt
( 4) 1367129605-backtrace_threads.txt
( 5) 1367129605-core_show_locks.txt
( 6) 1367187707-backtrace_threads.txt
( 7) 1367187707-core_show_locks.txt
( 8) 20130425__AST-21228.patch
( 9) 20130427__AST-21228.patch
(10) asterisk11_3-1367049271-backtrace_threads.txt
(11) asterisk11_3-1367049271-core_show_locks.txt
(12) backtrace-threads.txt
(13) core-show-locks.txt
Description: Due to the way locks are maintained, it appears that the realtime switch calls in our extensions.conf, which is also stored in the database via ODBC, are causing a conflict. We have 5500 contexts, any of which can be called at any time; each one exists to maintain separation between our customers.
Comments:
By: Dare Awktane (awktane) 2013-03-08 16:38:34.385-0600

Since the dumps seemed to be related to our dialplan reload calls, we've modified our system to reload a little less often - it now only reloads when there is definitely a change. This appears to be more stable than forcing a reload every 5 minutes, though I have to assume the deadlock is still possible.

By: Matt Jordan (mjordan) 2013-03-12 12:15:21.654-0500

The deadlock occurs between Thread {{0x7f4d981e2700}} and Thread {{0x7f4d7bbb7700}}:

{noformat}
Thread 25 (Thread 0x7f4d981e2700 (LWP 25741)):
#0  0x00007f4ddfc9189c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f4ddfc8d080 in _L_lock_903 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f4ddfc8cf19 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x0000000000506fe1 in __ast_pthread_mutex_lock (filename=0x5df3d4 "pbx.c", lineno=11809, func=0x5e5900 "ast_rdlock_contexts", mutex_name=0x5e574f "&conlock", t=0x828220) at lock.c:248
#4  0x0000000000554c36 in ast_rdlock_contexts () at pbx.c:11809
#5  0x000000000053b909 in pbx_extension_helper (c=0x7f4d940dbf08, con=0x0, context=0x7f4d940dcd58 "default", exten=0x7f4da1d76144 "fax", priority=1, label=0x0, callerid=0x7f4d9421d2d0 "2895620092", action=E_MATCH, found=0x0, combined_find_spawn=0) at pbx.c:4600
#6  0x000000000053f12f in ast_exists_extension (c=0x7f4d940dbf08, context=0x7f4d940dcd58 "default", exten=0x7f4da1d76144 "fax", priority=1, callerid=0x7f4d9421d2d0 "2895620092") at pbx.c:5756
#7  0x00007f4da1cd9101 in sip_read (ast=0x7f4d940dbf08) at chan_sip.c:8398
#8  0x000000000047d92c in __ast_read (chan=0x7f4d940dbf08, dropaudio=0) at channel.c:4004
#9  0x000000000047f79e in ast_read (chan=0x7f4d940dbf08) at channel.c:4358
#10 0x0000000000454ad2 in autoservice_run (ign=0x0) at autoservice.c:136
#11 0x00000000005a1a97 in dummy_start (data=0x7f4d7c001980) at utils.c:1028
#12 0x00007f4ddfc8ae9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f4de0db9cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 14 (Thread 0x7f4d7bbb7700 (LWP 22878)):
#0  0x00007f4de0d8583d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4de0db3774 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00000000004552b7 in ast_autoservice_stop (chan=0x7f4d940dbf08) at autoservice.c:288
#3  0x00000000005365a0 in pbx_find_extension (chan=0x7f4d940dbf08, bypass=0x0, q=0x7f4d7bbb2520, context=0x7f4d940dcd58 "default", exten=0x7f4d940dcda8 "8776227713", priority=8, label=0x0, callerid=0x7f4d9421d2d0 "2895620092", action=E_SPAWN) at pbx.c:3347
#4  0x000000000053b970 in pbx_extension_helper (c=0x7f4d940dbf08, con=0x0, context=0x7f4d940dcd58 "default", exten=0x7f4d940dcda8 "8776227713", priority=8, label=0x0, callerid=0x7f4d9421d2d0 "2895620092", action=E_SPAWN, found=0x7f4d7bbb6c74, combined_find_spawn=1) at pbx.c:4604
#5  0x000000000053f328 in ast_spawn_extension (c=0x7f4d940dbf08, context=0x7f4d940dcd58 "default", exten=0x7f4d940dcda8 "8776227713", priority=8, callerid=0x7f4d9421d2d0 "2895620092", found=0x7f4d7bbb6c74, combined_find_spawn=1) at pbx.c:5781
#6  0x0000000000540a9d in __ast_pbx_run (c=0x7f4d940dbf08, args=0x0) at pbx.c:6256
#7  0x000000000054258c in pbx_thread (data=0x7f4d940dbf08) at pbx.c:6586
#8  0x00000000005a1a97 in dummy_start (data=0x7f4d9403ec00) at utils.c:1028
#9  0x00007f4ddfc8ae9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f4de0db9cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()
{noformat}

Thread {{0x7f4d981e2700}} - the autoservice thread - calls {{ast_exists_extension}} from {{sip_read}}, which tries to obtain a read lock on the contexts lock. Because it blocks there, the autoservice list won't be rebuilt, and that in turn blocks Thread {{0x7f4d7bbb7700}} in {{ast_autoservice_stop}} - the thread that actually owns the contexts lock.
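
To make the cycle concrete, here is a minimal standalone sketch (plain pthreads, not Asterisk code - the names are illustrative stand-ins) of the same wait pattern: one thread takes the "contexts" mutex and then busy-waits for the autoservice thread to drop its channel, while the autoservice thread is itself blocked on that same mutex. It builds with {{cc -pthread}} and hangs by design.

{noformat}
/*
 * deadlock_demo.c - minimal standalone illustration of the wait cycle
 * described above (plain pthreads, not Asterisk code).  One thread takes
 * the "contexts" mutex and busy-waits for the autoservice thread, while
 * the autoservice thread is blocked on that same mutex.
 * Build with: cc -pthread deadlock_demo.c   (hangs by design)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t conlock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for pbx.c's conlock */
static volatile int channel_removed = 0;  /* stand-in for the autoservice list rebuild */

static void *autoservice_thread(void *unused)
{
    (void) unused;
    /* Like ast_exists_extension() called from sip_read(): needs the
     * contexts lock before it can get back to servicing channels. */
    pthread_mutex_lock(&conlock);   /* blocks forever: the PBX thread holds it */
    channel_removed = 1;
    pthread_mutex_unlock(&conlock);
    return NULL;
}

static void *pbx_thread(void *unused)
{
    (void) unused;
    /* Like pbx_extension_helper(): take the contexts lock ... */
    pthread_mutex_lock(&conlock);
    /* ... then, like ast_autoservice_stop(), wait for the autoservice
     * thread to drop the channel - which it never can. */
    while (!channel_removed) {
        usleep(1000);
    }
    pthread_mutex_unlock(&conlock);
    return NULL;
}

int main(void)
{
    pthread_t pbx, autosvc;

    pthread_create(&pbx, NULL, pbx_thread, NULL);
    sleep(1); /* let the PBX thread grab the lock first */
    pthread_create(&autosvc, NULL, autoservice_thread, NULL);
    pthread_join(pbx, NULL);      /* never returns */
    pthread_join(autosvc, NULL);
    puts("not reached");
    return 0;
}
{noformat}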

What's interesting here is that these are both supposed to be read locks - each thread is merely holding or trying to hold a read lock, which should not conflict. Unfortunately, that isn't the case: the context rdlock/wrlock functions are actually thin wrappers around a straight call to {{ast_mutex_lock}}/{{ast_mutex_unlock}}.
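
Paraphrased into a standalone sketch (not the literal pbx.c source), the current arrangement amounts to the following: both the "read" and "write" entry points take the same recursive mutex, so two readers exclude each other exactly as a reader and a writer would.

{noformat}
/* Standalone paraphrase of the post-r280982 context locks - not pbx.c. */
#include <pthread.h>

static pthread_mutex_t conlock;

static void init_conlock(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&conlock, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Both the "read" and the "write" paths take the same recursive mutex. */
static int rdlock_contexts(void) { return pthread_mutex_lock(&conlock); }
static int wrlock_contexts(void) { return pthread_mutex_lock(&conlock); }
static int unlock_contexts(void) { return pthread_mutex_unlock(&conlock); }
{noformat}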

That, in and of itself, appears to be necessary:

{noformat}
   ........
     r280982 | tilghman | 2010-08-05 02:28:33 -0500 (Thu, 05 Aug 2010) | 8 lines
     
     Change context lock back to a mutex, because functionality depends upon the lock being recursive.
     
     (closes issue #17643)
      Reported by: zerohalo
      Patches:
            20100726__issue17643.diff.txt uploaded by tilghman (license 14)
      Tested by: zerohalo
   ........

{noformat}

ASTERISK-16365 indicates that device state callbacks will attempt to obtain the context lock recursively.

The solution here is most likely to not hold the context lock when calling {{ast_autoservice_stop}}. How to do that, however, is tricky.
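
As an illustration of that ordering only (in terms of the standalone sketch above, not an actual pbx.c patch), the shape of the change is to drop the lock before waiting on the autoservice thread and re-take it afterwards. The tricky part in the real code is that whatever was found under the old lock - the context and switch being evaluated - may be stale by the time the lock is re-acquired.

{noformat}
/* Drop-in replacement for pbx_thread() in the sketch above: release the
 * lock before waiting on the autoservice thread, then re-acquire it.
 * In pbx.c the hard part is that everything found under the old lock
 * (the context and switch pointers) may be stale after re-locking. */
static void *pbx_thread_fixed(void *unused)
{
    (void) unused;
    pthread_mutex_lock(&conlock);
    /* ... evaluate the switch ... */
    pthread_mutex_unlock(&conlock);   /* drop it before waiting */
    while (!channel_removed) {
        usleep(1000);
    }
    pthread_mutex_lock(&conlock);     /* re-acquire and re-validate */
    pthread_mutex_unlock(&conlock);
    return NULL;
}
{noformat}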


By: Dare Awktane (awktane) 2013-04-20 23:54:45.611-0500

As volume increases this happens more often. We're up to 6-7 deadlocks a day per Asterisk machine running 11.3; it appears to happen slightly less often on 1.8.21. I don't fully understand what causes it (or what we're doing differently from everyone else). CPU is fine, RAM is fine, and I've seen the deadlock happen with as few as two channels and as many as 60. I'm not sure what to do, but we're getting desperate, and I can provide plenty of examples. We ran for another day in debug mode with one machine on 1.8.21 and the other on 11.3; it occurs on both versions far too often. What's worse, for reasons unknown to us, both machines running side by side always go down within a few minutes of each other. I've attached more examples under Asterisk 11.3 and 1.8.21.

By: Dare Awktane (awktane) 2013-04-22 00:32:39.560-0500

I wanted to add something I've noticed in the messages log when this happens. I believe it is likely caused by the deadlock rather than being the cause of it, but it is still worth mentioning. These messages get spammed for a while, until 'sip show channels' shows nothing.

[Apr 19 08:20:28] WARNING[30995][C-000002ca] res_rtp_asterisk.c: RTP Read too short
[Apr 19 08:20:41] WARNING[31021][C-000002cb] res_rtp_asterisk.c: RTP Read too short

(Confirmed that this is a consequence of the issue, not the cause of it.)

By: Dare Awktane (awktane) 2013-04-24 03:26:50.788-0500

A $700 bounty has been offered on this issue. See asterisk-dev for details or message me.

By: Matt Jordan (mjordan) 2013-04-25 15:27:28.213-0500

Do you have a switch in your dialplan?

By: Dare Awktane (awktane) 2013-04-25 15:38:57.852-0500

Yes - we have about 5500 switches in a realtime ODBC version of the extensions.conf file. Sadly, I have yet to see a way to tell Asterisk that every context is in the database. We have to create 1-3 contexts for every customer we have, to make sure that each customer only has access to their own extensions. An example of one of the entries is as follows:

{noformat}
| id     | cat_metric | var_metric | commented | filename        | category  | var_name | var_val    | stamp               |
| 248702 |          2 |        319 |         0 | extensions.conf | 310_319_0 | switch   | realtime/@ | 2013-04-19 02:26:26 |
{noformat}

Do you think it would help to look into another method of storing the extensions? I could write a script that takes the entries in the database and outputs them to a static extensions.conf file, which could be reloaded each time we add a new context. Alternatively, are you aware of a way to specify that all contexts, or even a pattern match of contexts, are stored in the realtime ODBC version of extensions?

By: Tilghman Lesher (tilghman) 2013-04-25 19:14:48.799-0500

I was thinking something along these lines... change conlock back into a special type of rwlock, with recursive tracking.  I've additionally made both read locks and write locks recursive, with the special case that you can put a readlock on top of a write lock, though the reverse causes a DEADLOCK error.  Depending upon how well this works, it may be generalizable and useful in other areas.
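
A rough standalone sketch of the kind of lock being described (general shape only - not the attached patch): both the read and write sides are recursive per thread, a read lock may be stacked on top of a held write lock, and a write request on top of a plain read lock is refused with EDEADLK. Per-thread ownership is tracked with {{__thread}} counters, and a single lock instance is assumed, which is enough for conlock.

{noformat}
/*
 * Sketch of a recursive read-write lock with per-thread tracking.
 * Assumptions: one lock instance, GCC/Clang __thread storage.
 */
#include <errno.h>
#include <pthread.h>

struct recursive_rwlock {
    pthread_rwlock_t lock;
};
#define RECURSIVE_RWLOCK_INIT { PTHREAD_RWLOCK_INITIALIZER }

/* Read/write ownership depth held by *this* thread. */
static __thread int rd_depth;
static __thread int wr_depth;

static int rec_rdlock(struct recursive_rwlock *l)
{
    int res;

    if (wr_depth > 0 || rd_depth > 0) {
        /* Already own it (as writer or reader): just recurse. */
        rd_depth++;
        return 0;
    }
    res = pthread_rwlock_rdlock(&l->lock);
    if (!res) {
        rd_depth = 1;
    }
    return res;
}

static int rec_wrlock(struct recursive_rwlock *l)
{
    int res;

    if (wr_depth > 0) {
        wr_depth++;             /* recursive write lock */
        return 0;
    }
    if (rd_depth > 0) {
        return EDEADLK;         /* write on top of read: refuse */
    }
    res = pthread_rwlock_wrlock(&l->lock);
    if (!res) {
        wr_depth = 1;
    }
    return res;
}

static int rec_unlock(struct recursive_rwlock *l)
{
    if (rd_depth > 0) {
        /* Reads (including reads stacked on a write) unwind first. */
        if (--rd_depth > 0 || wr_depth > 0) {
            return 0;
        }
    } else if (wr_depth > 0) {
        if (--wr_depth > 0) {
            return 0;
        }
    } else {
        return EPERM;           /* unlocking a lock we do not hold */
    }
    return pthread_rwlock_unlock(&l->lock);
}
{noformat}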

By: Matt Jordan (mjordan) 2013-04-25 20:26:50.796-0500

Dare:

Well, I can tell you that you're running into this due to the {{switch}} statements in your dialplan - that's why you're seeing this when few others are. Every time a switch is executed, the channel is put into {{autoservice}}; taking it back out of {{autoservice}} while holding the context lock is what causes this issue. That, and {{pbx_realtime}} is extended support - it does not receive a lot of developer attention.

Tilghman:

That might work, but it'd be nice if we didn't have to rely on such a measure. Unfortunately, that would probably require a wholesale move to ao2 in pbx.c, as well as a significant restructuring of the dialplan logic.

By: Dare Awktane (awktane) 2013-04-25 22:30:16.338-0500

So if it is the switch calls themselves, then putting it all into a static file would make little to no difference. I will try to come up with a way to remove them from our dialplans. Thank you for providing direction. I will leave the bounty in effect to help the Asterisk community even if we resolve it by working around it.

By: Tilghman Lesher (tilghman) 2013-04-25 23:07:38.542-0500

Matt:  I agree that ao2 would be a nice thing to do to the pbx core, but that is a massive undertaking.  I think this patch will solve the immediate problem in short order.  I was actually thinking about this solution when I wrote the patch that converted the rwlocks back to mutexes, but I abandoned that approach, because I thought that mutexes would be a simpler solution.  So as it turned out, mutexes were too simple.

By: Dare Awktane (awktane) 2013-04-26 03:59:11.310-0500

I'm unable to run Asterisk with debug flags enabled during the week, but it looks like your patch caused a different issue: an apparent deadlock, with the following output on the console:

[Apr 26 07:23:48] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog '72764eac4838da9f5548311344a28c46@edm.trk.tprm.ca' with owner SIP/thinktel-000001fd in place (Method: BYE). Rescheduling destruction for 10000 ms
 == Using SIP RTP CoS mark 5
[Apr 26 07:23:52] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog '637C27D4@159.18.161.101' with owner SIP/thinktel-00000203 in place (Method: BYE). Rescheduling destruction for 10000 ms
[Apr 26 07:23:55] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog '485c420a3c3256fc6fdb80c92f561cb7@edm.trk.tprm.ca' with owner SIP/thinktel-000001ff in place (Method: BYE). Rescheduling destruction for 10000 ms
[Apr 26 07:24:01] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog 'E4601269@159.18.161.101' with owner SIP/thinktel-00000205 in place (Method: BYE). Rescheduling destruction for 10000 ms
[Apr 26 07:24:01] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog 'B7F78109@159.18.161.101' with owner SIP/thinktel-00000202 in place (Method: BYE). Rescheduling destruction for 10000 ms
 == Using SIP RTP CoS mark 5
[Apr 26 07:24:04] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog '44d59aa72a5a26ec5e2358ca7ae9f858@edm.trk.tprm.ca' with owner SIP/thinktel-000001fe in place (Method: BYE). Rescheduling destruction for 10000 ms
[Apr 26 07:24:08] WARNING[16179]: chan_sip.c:4210 __sip_autodestruct: Autodestruct on dialog '58679d5c02285c22209df7b1360e9309@edm.trk.tprm.ca' with owner SIP/thinktel-000001fc in place (Method: BYE). Rescheduling destruction for 10000 ms

By: Dare Awktane (awktane) 2013-04-26 08:30:59.936-0500

Confirmed - both machines went down in a similar fashion. I'll get you the thread details over the weekend if you would like.

By: Tilghman Lesher (tilghman) 2013-04-26 19:11:09.101-0500

Yes, I'd like to see the thread details there.

By: Dare Awktane (awktane) 2013-04-27 02:59:16.269-0500

Here are the asterisk 11.3 dumps. Waiting for it to happen on the asterisk 1.8 system again as well.

By: Dare Awktane (awktane) 2013-04-27 03:05:33.098-0500

As a note I believe issue ASTERISK-21040 is a duplicate of this one.

By: Tilghman Lesher (tilghman) 2013-04-27 07:54:52.447-0500

Looking at your 'core show locks' output, I don't see an actual deadlock condition.  There might be starvation due to a rdlock not getting unlocked, but that's a different bug.  What are the symptoms when this condition occurs?

By: Tilghman Lesher (tilghman) 2013-04-27 08:11:51.605-0500

This is the same basic patch, but with detection of other possible errors coming back from the lock implementation.

By: Dare Awktane (awktane) 2013-04-27 14:33:28.116-0500

No new calls are created, and calls in progress do not appear to close properly. I cannot really confirm what happens with the calls in progress, as I'm not on one of the lines and cannot create new ones. The 'core show channels' and 'sip show channels' output appears to remain the same. I will try your new patch shortly on both versions of Asterisk. Is there anything apart from the locks that you would like me to try in case it happens again? This time I will also leave it running longer to see if the channels ever time out.

By: Dare Awktane (awktane) 2013-04-28 17:30:34.493-0500

Two more deadlocks.

'core show channels' shows almost all channels with a status of "Ring". 'sip show channels' shows many channels up. The message log shows the same as above.

By: Maciej Maurycy Bonin (mmbonin) 2015-05-05 04:34:11.140-0500

We've been experiencing this issue as well; it seems a bug I submitted last week turned out to be a duplicate of this problem.
We've tried 1.8.23.1 and 1.8.32.3 (the latest 1.8 release as of this moment) - no improvement.

By: Mikael Fredin (Zedekul) 2015-05-26 03:44:17.633-0500

I was under the impression, after reading other bug reports as well as this one, that the pbx_realtime module might be the culprit, but I just hit the same problem even after disabling pbx_realtime and using only regular configuration files. I don't know what else to try; if it is only the reloading of configuration, then I am out of luck, since we need to do this quite often.

It happens more often when we have more calls active (after 50-70+ calls it becomes a lot more common).

By: Antonis Psaras (apsaras) 2016-10-04 13:37:57.340-0500

Hello

Is there any progress on this issue? We are having the same problem with 11.23.1.

Is there any workaround? If we remove the switch statements from extensions.conf and work with static extension files, would that resolve the issue?

By: Martin Nyström (martin.nystrom) 2020-05-14 08:06:35.359-0500

As my issue ASTERISK-28893 was closed I will post here instead.

Would it be realistic to think that this issue will never be fixed? What would be the alternative?

By: Joshua C. Colp (jcolp) 2020-05-14 08:11:22.616-0500

I know of no one actively working on it, and I don't expect many people use the functionality in the first place. Most people either do static configuration, generate configuration, or use func_odbc to query a database at specific points, instead of placing the entire dialplan in the database.

By: Dare Awktane (awktane) 2020-05-14 15:01:33.527-0500

Confirmed: our solution went from expecting Asterisk to track its place in a database-stored dialplan to a static dialplan that uses ODBC calls to do what is needed. It took a rewrite, but everything became much more reliable thereafter.

By: Joshua C. Colp (jcolp) 2020-05-14 15:06:08.194-0500

[~awktane] Glad to hear that!