
Summary: ASTERISK-24850: chan_lcr: Deadlock caused by chan_lock interaction with channel container lock and individual channel lock
Reporter: Birger "WIMPy" Harzenetter (wimpy)
Labels:
Date Opened: 2015-03-06 06:19:15.000-0600
Date Closed: 2015-03-09 11:05:38
Priority: Major
Regression?: No
Status: Closed/Complete
Components: General
Versions: 11.16.0
Frequency of Occurrence: Occasional
Related Issues:
Environment: Two Asterisks connected via IAX
Attachments: ( 0) Asterisk-debug.txt
Description: The issue starts with warnings like the following one
[Mar  6 09:32:04] WARNING[9013] channel.c: Exceptionally long voice queue length queuing to IAX2/hbc-hah-73
about 1-2 times per second.
After some random time, ranging from half a minute to 4 minutes, it changes to
[Mar  6 09:36:23] WARNING[9009] chan_iax2.c: Max retries exceeded to host 144.76.51.196 on IAX2/hbc-hah-73 (type = 6, subclass = 2, ts=377730, seqno=114)
every 10 seconds until I kill (-9) Asterisk.
In that state Asterisk still answers IAX packets so that it still qualifies OK on the calling instance.
However it completely stops processing any calls.
On the calling box, all subsequent Dial()s to the locked-up one end with DIALSTATUS=CANCEL and HANGUPCAUSE=0, so they even seem to be accepted, just not processed.

This particular issue started 21 days after upgrading from 11.15.0, but occurred 3 times within two weeks so far. So I'm unsure if it's related to the upgrade.
Comments:

By: Birger "WIMPy" Harzenetter (wimpy) 2015-03-06 06:21:17.782-0600

This is the output of:
{noformat}
gdb -ex "thread apply all bt"
asterisk -x "core show locks"
asterisk -x "core show threads"
asterisk -x "core show taskprocessors"
asterisk -x "iax2 show stats"
{noformat}


By: Matt Jordan (mjordan) 2015-03-09 11:01:11.585-0500

Since we discussed this a lot in {{#asterisk-dev}}, I'm including the conversation here.

tl;dr: this is actually a locking inversion in {{chan_lcr}}.

{quote}
{noformat}
21:22.10 mjordan WIMPy: nope.
21:22.30 mjordan WIMPy: chan_lcr is doing something wrong by causing a locking inversion
21:23.19 WIMPy Hmm. And that causes IAX to go to zombie state?
21:23.34 mjordan lcr_request is holding a channel lock while attempting to get the channels container lock.
21:24.05 mjordan It would, because lcr_request is holding a channel lock that causes everyone else to pile up
21:24.21 WIMPy Ok.
21:24.42 mjordan at least that's what I can tell from the 'core show locks' output.
21:24.55 WIMPy Now the interesting thing is where to look for that lock.
21:25.25 mjordan some line number in lcr_request.
21:25.36 mjordan I would suspect it is attempting to link it into the channels container.
21:25.43 WIMPy Or why it only seems to happen if IAX is involved.
21:26.19 mjordan because chan_iax2 has to grab the channels container?
21:26.30 mjordan Honestly, if you ran a 'core show channels' it would probably block that as well.
21:27.03 mjordan it also looks like you've got IAX channels in a bridge with the LCR channels
21:27.25 mjordan which would cause those IAX channels to block while their thread attempts to write to the LCR channel (which is locked by chan_lcr)
21:27.31 WIMPy The result looks extremely similar to the issue I ran into on the upgrade from 11.7.0 to 11.7.1, but that did not involve lcr, only iax and sip.
21:27.47 mjordan that core show locks doesn't show anything that I would blame on chan_iax.
21:28.10 WIMPy But it seems to have trouble writing to IAX, not reading from it.
21:28.46 mjordan WIMPy: " * \note Absolutely _NO_ channel locks should be held before calling this function."
21:28.57 mjordan that's ast_channel_alloc, which chan_lcr is calling into, with a channel lock held
21:29.00 WIMPy All these locking things don't really mean much to me :-(
21:29.02 mjordan that will absolutely cause a deadlock.
21:29.15 mjordan Once a channel deadlock happens, everyone gets screwed
21:29.30 mjordan the channels container is permanently locked, and everyone grinds to a halt
21:29.48 WIMPy alloc? So that would be when a new channel is set up?
21:30.26 mjordan WIMPy: that is when you create a new channel, and I can look at chan_lcr and see them doing it, which I should *not* be doing since that is code that isn't licensed back to the project.
21:30.47 mjordan line 2103, followed by line 2125. https://github.com/osmobuntu/lcr/blob/master/chan_lcr.c
21:30.59 mjordan It violates the contract of the API, and will absolutely cause deadlocks willy nilly.
21:31.41 WIMPy Ok. Thanks for pointing me in the right direction.
21:33.02 WIMPy just wonders what version that is you found, BTW
21:33.42 mjordan er
21:34.00 mjordan WIMPy: that lock isn't a channel lock. They globally lock every time they make a channel. Weird.
21:34.03 mjordan Hm.
21:34.32 mjordan they are still deadlocking it.
21:34.37 mjordan but it's a weird one.
21:34.44 mjordan mostly because of their stupid global lock.
21:34.57 mjordan they grab their global lock (chan_lock), then attempt to grab the channel container lock.
21:35.05 WIMPy The way you describe it makes it interesting that it works at all.
21:35.21 mjordan that causes a locking inversion when the core - which holds a channel lock - calls ast_write on them
21:35.38 mjordan that channel lock blocks another thread which has grabbed the global channels lock and is attempting to grab that channel lock
21:35.47 mjordan that global lock is still the problem.
21:35.56 mjordan but it is a bit more complex than I thought.
21:36.37 mjordan WIMPy: generally, you shouldn't hold locks all over the place, and the locking order is supposed to be "channel container, then channel"
21:37.07 mjordan what they've done is thrown a third lock into the mix (generally a bad idea), then complicated that by having it grabbed when a channel lock may be held.
21:37.29 mjordan They create the inversion when they hold that global lock and grab the channel container lock by creating a new channel (when the new channel is done being created, it has to go into the container).
21:38.05 mjordan Why they have a global lock and are locking it left & right, I don't know. That seems like a terrible idea for performance. They probably thought they were solving some concurrency problem, and stepped right into creating worse ones.
21:38.12 WIMPy That again sounds like it should happen regularly.
21:38.25 mjordan well, kind of. It involves at least three threads.
21:38.43 WIMPy The shit is horribly old :-(
21:50.41 WIMPy Do you think there needs to be a bridged lcr channel when a new lcr channel is created for the issue to happen?
21:56.24 WIMPy Could it be that the whole lock isn't necessary at all?
22:35.07 WIMPy is confused.
22:35.51 WIMPy While chan_misdn (which obviously has been the template for chan_lcr at some point) doesn't seem to do locking, chan_dahdi looks like chan_lcr does now.
22:38.07 WIMPy Also, alsa does it and console doesn't.
23:05.17 mjordan chan_dahdi doesn't have a global lock.
23:06.13 mjordan it has an interface lock for its private structures, but it doesn't use it in the same way.
23:06.30 mjordan chan_misdn is garbage, so I can't comment on that.
23:07.12 mjordan granted, everything has issues, but chan_misdn does some horrible stuff with holding onto memory in unsafe ways
23:07.24 WIMPy I have to admit that I'm too clueless about that locking stuff to see the difference :-(
23:07.30 mjordan and generally can't be compiled any more, thanks to libmisdn's reliance on an old version of the kernel.
23:08.03 mjordan WIMPy: lock inversions are hard. In this case, I'm positive it is chan_lcr. They are holding onto their global lock irrespective of the locking contract that a channel driver must adhere to.
23:08.05 WIMPy Well, it has been "old and deprecated" for many years.
23:08.43 mjordan to be frank, while deadlocks do still happen in Asterisk from time to time, in the core channel drivers, we have been aggressive about killing them.
23:09.03 WIMPy Do I get it right that this lock would prevent other threads from accessing an unfinished structure?
23:09.13 mjordan Richard's latest fix for one in chan_sip is one of the few I've seen crop up in the last year.
23:09.58 WIMPy Maybe that was the one I got with SIP transfers?
23:10.37 mjordan WIMPy: maybe. It occurs rarely when you poke a peer while inside of certain messages. He just committed the fix, I think (I was out most of last week)
23:11.14 mjordan WIMPy: I'm not sure what they are doing with that global lock. You don't need to protect something that is allocated on the stack, so I'm not sure why they are aggressively holding it.
23:11.41 WIMPy Actually that one was a lot more interesting on the IAX side, which was also involved.
23:12.14 mjordan keep in mind that once something deadlocks the channels container, all channel drivers are screwed
23:12.25 mjordan what is on the other side of a bridge usually doesn't matter.
23:12.29 WIMPy I have tried to just remove the locking which does work. But I guess that doesn't really mean much.
23:12.43 mjordan heh, yeah, now all you know is that you won't deadlock. May blow up instead.
23:12.58 WIMPy For the sip transfer issue it did.
23:12.58 mjordan honestly, we get very few deadlocks on the issue tracker when using just the core channel drivers.
23:13.08 mjordan shrugs
23:13.17 mjordan I can't recall what that issue was, so I'm not sure
23:14.29 WIMPy That was the very strange thing that my box would lock up in the same way as on this issue when I tried to transfer a call that came in via IAX on the sip phone.
23:15.19 WIMPy The thing that made it interesting was that it would happen again on the next try until I restarted the other Asterisk box where the call came from.
23:15.55 WIMPy That's why I asked that time if there is anything in IAX in any way related to realtime.
23:20.17 WIMPy But back to this one: Can you think of some action that might have an issue with the lock removed?
23:20.45 WIMPy Something like 'core show channels' displaying incomplete information would be obvious, but harmless.
23:21.05 mjordan which lock?
23:21.08 mjordan the one in chan_lcr?
23:21.17 WIMPy yes
23:23.24 mjordan I'd say go for it. Particularly the one in the request callback.
23:23.40 mjordan the other option would be to unlock around the ast_channel_alloc call and then re-lock.
23:23.48 WIMPy Yes, that's the one I removed.
23:23.58 mjordan not knowing all of their code, I'm really not sure, but that would prevent the locking inversion.
23:26.43 WIMPy alloc_call is the only thing happening before. Doesn't sound like something that needs any locks. Maybe I just move the lock to after ast_channel_alloc? That would seem to be the safe bet.
23:27.00 WIMPy That was actually my first idea until I looked in to the other channels.
23:40.14 WIMPy OTOH, if other channels work without locking anything, there can't be a need to do it, can there?
23:40.49 file that's an implementation detail of the channel driver itself
23:41.07 WIMPy I will try to find out if Jolly remembers why it was there.
23:41.33 WIMPy But can it be called again for the channel to be created before the function returns?
23:50.07 file I don't understand the question
23:52.04 WIMPy Is there any chance the channel will be called for the channel that might be incompletely created, before that very function completes?
23:53.11 file it is possible for something to get the channel after it is allocated.
23:54.38 WIMPy Yes, but is there anything that could have an issue with the incomplete channel?
23:55.03 file no?
23:55.35 WIMPy That's what I hoped.
23:56.16 WIMPy And in the end there would always remain the possibility to hit the very moment between ast_channel_alloc and any locking.
23:56.52 WIMPy I think I'm positive that the lock can just go.
23:59.19 WIMPy I'll keep it running without here and see what Jolly can tell about it.
{noformat}
{quote}
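
To make the pattern discussed above concrete, here is a simplified sketch of the shape that creates the problem. This is not a literal excerpt from {{chan_lcr}}; the {{ast_channel_alloc()}} arguments and the surrounding setup are elided. The point is that the driver-global {{chan_lock}} is held while {{ast_channel_alloc()}} links the new channel into the channels container, so the container lock is taken while {{chan_lock}} is held.

{noformat}
/*
 * Simplified illustration only -- not the actual chan_lcr source.
 * Arguments and helper calls are elided.
 */
static struct ast_channel *lcr_request(/* ... */)
{
	struct ast_channel *ast;

	ast_mutex_lock(&chan_lock);          /* driver-global lock taken first */

	/* ... allocate and set up the private call structure ... */

	ast = ast_channel_alloc(/* ... */);  /* links the new channel into the
	                                      * channels container, so the
	                                      * container lock is grabbed while
	                                      * chan_lock is still held ->
	                                      * locking order violated */

	/* ... more setup ... */

	ast_mutex_unlock(&chan_lock);
	return ast;
}
{noformat}

Meanwhile, a bridge thread calling {{ast_write()}} on an LCR channel holds that channel's lock and blocks on {{chan_lock}}, which is what completes the cycle described in the next comment.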



By: Matt Jordan (mjordan) 2015-03-09 11:05:06.796-0500

The locking inversion occurs due to the following:

* Thread ID: 0xb6a87b40 holds the channel container lock ({{0x89fa2c0}}) while waiting for a channel lock ({{0xb4ec5f70}})
* Thread ID: 0xb50beb40 holds the channel lock ({{0xb4ec5f70}}) while waiting for the {{chan_lcr}} global lock ({{0xb6503ae0}})
* Thread ID: 0xb5082b40 holds the {{chan_lcr}} global lock ({{0xb6503ae0}}) while waiting for the channel container lock ({{0x89fa2c0}})

There is an explicit relationship between the channel container (and its lock) and the channels contained in it (and their locks). Inserting a third lock into that locking order causes the lock inversion and the deadlock on this issue.
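
As a rough sketch of the remediation discussed on IRC (illustrative only, not a tested patch; the surrounding {{chan_lcr}} code is elided), the inversion goes away if the driver does not hold its global lock across {{ast_channel_alloc()}}:

{noformat}
/*
 * Illustrative only -- any real fix belongs in chan_lcr and needs review
 * by its maintainer.
 */
static struct ast_channel *lcr_request(/* ... */)
{
	struct ast_channel *ast;

	/* Create the channel first, with no driver lock held, so the
	 * channels container lock is never taken while chan_lock is held. */
	ast = ast_channel_alloc(/* ... */);
	if (!ast) {
		return NULL;
	}

	ast_mutex_lock(&chan_lock);
	/* ... attach the private call structure to the channel ... */
	ast_mutex_unlock(&chan_lock);

	return ast;
}
{noformat}

The other option mentioned in the conversation is to keep the existing structure but unlock {{chan_lock}} immediately before the {{ast_channel_alloc()}} call and re-lock it afterwards.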

Since this is an issue in {{chan_lcr}} and not in Asterisk, I'm going to close this out as Not a Bug. This information should be useful for the maintainer of {{chan_lcr}}, and should help others fix that channel driver.