Summary: ASTERISK-00975: [patch] Manager Redirect with two parties sometimes gets second party hungup
Reporter: Matt Florell (mflorell)
Labels:
Date Opened: 2004-02-02 09:38:28.000-0600
Date Closed: 2008-01-15 14:44:04.000-0600
Priority: Minor
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) channel.c.comprehensive.patch
( 1) channel.c.may.patch
( 2) channel.c.patch
( 3) channel.c.zombiefix_delta.patch
( 4) channel.h.may.patch
( 5) manager.c.may.patch
( 6) pbx.c.patch
( 7) zomb_hangup.txt
( 8) zomb_no_hangup.txt
Description: Using the manager action below, I have an application transferring a SIP and a Zap channel into a meetme room:

Action: Redirect
Channel: Zap/1-1
ExtraChannel: SIP/ab123-f4d3
Exten: 1234
Context: default
Priority: 1

About 33% of the time, even on a non-loaded machine, the channel listed in ExtraChannel will just drop and not transfer to the meetme room, no matter if it is the Zap or SIP channel. It seems like the Redirect is happening for the Channel, and when it gets around to transferring the ExtraChannel, Asterisk is saying that the call has ended and assumes that it is hung up. Is there any way to get this working more reliably?
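For anyone scripting this, here is a minimal sketch of how an application might assemble the Redirect action shown above before writing it to the manager socket. The function name build_redirect_action is made up for illustration and is not part of Asterisk; the sketch assumes the manager protocol's usual "Key: value" lines terminated by CRLF, with an empty line ending the action.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a manager "Redirect" action into buf (hypothetical helper).
 * extra_channel may be NULL for a single-channel redirect.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if an argument is missing or the buffer is too small.
 */
int build_redirect_action(char *buf, size_t len,
                          const char *channel, const char *extra_channel,
                          const char *exten, const char *context, int priority)
{
    int n;

    if (!buf || !channel || !exten || !context)
        return -1;

    if (extra_channel)
        n = snprintf(buf, len,
                     "Action: Redirect\r\n"
                     "Channel: %s\r\n"
                     "ExtraChannel: %s\r\n"
                     "Exten: %s\r\n"
                     "Context: %s\r\n"
                     "Priority: %d\r\n"
                     "\r\n",
                     channel, extra_channel, exten, context, priority);
    else
        n = snprintf(buf, len,
                     "Action: Redirect\r\n"
                     "Channel: %s\r\n"
                     "Exten: %s\r\n"
                     "Context: %s\r\n"
                     "Priority: %d\r\n"
                     "\r\n",
                     channel, exten, context, priority);

    /* snprintf reports the length it wanted; reject truncated output. */
    if (n < 0 || (size_t)n >= len)
        return -1;

    assert(buf[n] == '\0');
    return n;
}
```

Calling it with "Zap/1-1", "SIP/ab123-f4d3", "1234", "default" and 1 should produce the same six header lines as the action above, followed by the blank line that ends the action.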
Comments:

By: matt (matt) 2004-02-04 14:39:42.000-0600

I see the same thing, except it's more like 100% of the time.  This only started happening with a recent CVS update.

I'm trying to do the same thing you are:  Take the two connected ends of an existing call and move them to a meetme room.  This single action has been more of a headache than all of Asterisk's other quirks for me.  It used to be that it would crash all the time (see "ExtraChannel in transfer causes crash").  I'm still not convinced that's fixed.

Looking at the logs, I see these two messages appear the moment that one of the call legs is dropped:

Feb  4 15:13:40 ERROR[851986]: Unable to re-open master file /var/log/asterisk//cdr-csv//Master.csv
Feb  4 15:13:40 DEBUG[851986]: Asked to hangup channel not connected

In future days I hope to dig some more into this.

I wonder, though, is there another way to accomplish the same thing, and just forget the ExtraChannel feature?  You can't transfer one at a time because the second one will get hung up (like it's doing now).

Maybe with call parking somehow, but I'm unfamiliar with that; it seems like it might be another can of worms.

By: Matt Florell (mflorell) 2004-02-04 14:58:15.000-0600

Isn't this strange? Somehow mine got better when I upgraded to today's CVS; it's only dropping about 10% of the time now :)

Sadly I have tried many ways of accomplishing the same thing and none of them work. Asterisk ALWAYS assumes that once one of the parties of a two party call is gone somewhere else, the other party must be hung up immediately. Redirect with ExtraChannel is the only way of sending a two party call into a meetme conference.

I guess I'll have to live with a few dropped connections where the user has to go back into the conference manually.

By: matt (matt) 2004-02-05 12:54:41.000-0600

With a little investigation, it appears the culprit is chan_sip.c.  Unfortunately, this file has changed so much recently that it may be difficult to isolate.

By: matt (matt) 2004-02-05 15:19:54.000-0600

I updated to cvs again today 2/5/04, and the behavior changed yet again.  Now the call doesn't get dropped anymore, but the success or failure of the transfer depends on the order of the channels.

I set up a certain scenario, where A calls B.  Then I try to do a dual-redirect to a conf. room.  If "B" is listed as Channel and A is ExtraChannel it's fine.  Reverse the parameters and it crashes.  This afternoon it's 100% reproducible.

This seems to be the way it was behaving many months ago, like 6-8 months ago.  I suspect there's a bug that's always been in there.

By: matt (matt) 2004-02-19 11:33:52.000-0600

After much more work on this, I have learned a lot about it, and mostly fixed it.  I would really really really appreciate someone helping me resolve the zombie issue.

Symptoms:

Set up a call between two SIP phones on the local system ("A" calls "B").  Use the Management Interface to command a Redirect with ExtraChannel to a MeetMe extension.  5 times out of 15 (33%), if "A" was listed as Channel and "B" as ExtraChannel, "B" would get dropped and "A" would end up in the conference room by itself.  1 time out of 15 (7%), if "B" was listed as Channel and "A" as ExtraChannel, "A" was dropped and "B" was in the conference room by itself.  This agrees with the symptom rate reported by the original poster.

Earlier I reported that it was happening 100% of the time in one direction.  That is true, when the transfers were being commanded using our automated software.  In order to take that out of the equation, everything else I talk about here was typed manually into the management interface.  The problem could never be replicated with GDB running.


Also Relates To:

Bug ASTERISK-491 "ExtraChannel in Transfer Causes Crash".  I wrote in there that I wasn't sure it was fixed.  I believe this is related and that these patches finally fix that bug as well.  The crash reported in that bug was caused by the 'fixup' routine in chan_sip.c.  Because the channel was already hungup by the time fixup was called, chan->pvt->pvt (aka "p") was NULL, resulting in the crash.


Outline, Patch files and Bugs Fixed (and Introduced - Doh!)

Section 1. pbx.c.patch fixes "Dropped Channels"
Section 2. pbx.c.patch introduces "Zombies!"
Section 3. channel.c.patch fixes "Garbled Channel Names"

Patches are relative to Asterisk CVS, 18 Feb 2004, 6pm Eastern US


Retesting:

Before making these patches, I ran the dual-redirect 30 times in a row (15 times each with a given Channel/ExtraChannel combination, then swapped).  I verified that both channels were directed to a MeetMe room and that audio worked both ways, in every case.


----------------------------------

Section 1. Dropped Channels

The cause of the dropped channels can be traced to app_dial.c, line 699, where we see:

if (res != AST_PBX_NO_HANGUP_PEER)
    ast_hangup(peer);

This hangup was being called in the midst of the masquerade process (in the cases where the call was dropped).  It's in a different thread, but it was hanging up the call before masquerade was finished.  The best solution so far has been simply to move the unlock() until after the masquerade is complete. The ast_hangup from app_dial.c still gets called, but it blocks on the channel lock until masquerade can finish.

This is what pbx.c.patch does:
(pbx.c:3562):

                       /* Setup proper location */
                       if (context && strlen(context))
                               strncpy(tmpchan->context, context, sizeof(tmpchan->context) - 1);
                       else
                               strncpy(tmpchan->context, chan->context, sizeof(tmpchan->context) - 1);
                       if (exten && strlen(exten))
                               strncpy(tmpchan->exten, exten, sizeof(tmpchan->exten) - 1);
                       else
                               strncpy(tmpchan->exten, chan->exten, sizeof(tmpchan->exten) - 1);
                       if (priority)
                               tmpchan->priority = priority;
                       else
                               tmpchan->priority = chan->priority;
Old Location, remove--> //if (needlock)
                       //        ast_mutex_unlock(&chan->lock);

                       /* Masquerade into temp channel */
                       ast_channel_masquerade(tmpchan, chan);

New Location, add--> if (needlock)
                               ast_mutex_unlock(&chan->lock);

                       /* Make the masquerade happen by reading a frame from the tmp channel */
                       f = ast_read(tmpchan);
                       if (f)
                               ast_frfree(f);
                       /* Start the PBX going on our stolen channel */

That eliminates the dropped channels.
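To illustrate why moving the unlock helps, here is a small stand-alone sketch using plain pthreads. It is not Asterisk code: struct chan, hangup_thread and redirect_with_late_unlock are invented names, and setting masquerade_done stands in for the real masquerade call. The point is only that a hangup running in another thread cannot get past the channel lock until the redirecting thread releases it, so it always finds the masquerade already finished.

```c
#include <assert.h>
#include <pthread.h>

struct chan {
    pthread_mutex_t lock;
    int masquerade_done;   /* set once the simulated masquerade has finished */
    int seen_by_hangup;    /* value of masquerade_done observed by the hangup */
};

/* Stand-in for the hangup issued from the dialing thread. */
void *hangup_thread(void *arg)
{
    struct chan *c = arg;

    pthread_mutex_lock(&c->lock);      /* blocks while the redirect holds the lock */
    c->seen_by_hangup = c->masquerade_done;
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Returns what the hangup thread observed: 1 means the masquerade was complete. */
int redirect_with_late_unlock(void)
{
    struct chan c;
    pthread_t tid;
    int rc;

    pthread_mutex_init(&c.lock, NULL);
    c.masquerade_done = 0;
    c.seen_by_hangup = -1;

    pthread_mutex_lock(&c.lock);
    rc = pthread_create(&tid, NULL, hangup_thread, &c);
    assert(rc == 0);
    (void)rc;

    c.masquerade_done = 1;             /* simulated masquerade, done under the lock */
    pthread_mutex_unlock(&c.lock);     /* "new location": unlock only afterwards */

    pthread_join(tid, NULL);
    pthread_mutex_destroy(&c.lock);
    return c.seen_by_hangup;
}
```

With the unlock placed before the simulated masquerade instead, the hangup thread could run in the gap and observe 0, which corresponds to the dropped-channel case.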

------------------------------------

Section 2.  Zombies!

Unfortunately, the pbx patch also results in some ZOMBIE channels remaining.  Out of 30 retests, I was left with 11 Zombies that never seem to go away.  I have tried several ast_hangups, ast_softhangups, etc. but don't know how to fix this.  Hopefully someone can help; I've been working on this bug for 2 weeks and now I'm very close to the answer.  If I can get rid of the zombies, this will be 100% done.  In the meantime, though, I would rather live with zombies than dropped channels (or crashes!)

The file zomb_hangup.txt shows the manager events of a proper zombie hangup.
The file zomb_no_hangup.txt shows a zombie remaining.

The "Missing Action in Request" is where I hit enter a couple of times to show where the duration of the call was.


------------------------------------

Section 3. Garbled Channel Names

Throughout this process I noticed that sometimes (6 times out of 30) the channel names as they existed during the fixup routine were garbled.  Examination of channel.c shows that the 'clone' channel was being used after potentially being destroyed.  The solution was to move the clone-destruction code block after the fixup routine.

This is what channel.c.patch does
channel.c:2092


Move this block of clone zombie/destruction code down.  The "Destroying Clone" path was taken 6 times out of 30.

.         /* Now, at this point, the "clone" channel is totally F'd up.  We mark it as
.           a zombie so nothing tries to touch it.  If it's already been marked as a
.           zombie, then free it now (since it already is considered invalid). */
.        if (clone->zombie) {
.                ast_log(LOG_DEBUG, "Destroying clone '%s'\n", clone->name);
.                ast_mutex_unlock(&clone->lock);
.                ast_channel_free(clone);
.                manager_event(EVENT_FLAG_CALL, "Hangup", "Channel: %s\r\n", zombn);
.        } else {
.                ast_log(LOG_DEBUG, "Released clone lock on '%s'\n", clone->name);
.                clone->zombie=1;
.                ast_mutex_unlock(&clone->lock);
.        }


       /* Set the write format */
       ast_set_write_format(original, wformat);

       /* Set the read format */
       ast_set_read_format(original, rformat);

       ast_log(LOG_DEBUG, "Putting channel %s in %d/%d formats\n", original->name, wformat, rformat);

       /* Okay.  Last thing is to let the channel driver know about all this mess, so he
          can fix up everything as best as possible */
       if (original->pvt->fixup) {

Look!  Clone is being used after possibly being destroyed! (above, original version!)

               res = original->pvt->fixup(clone, original);
               if (res) {
                       ast_log(LOG_WARNING, "Driver for '%s' could not fixup channel %s\n",
                               original->type, original->name);
                       return -1;
               }
       } else
               ast_log(LOG_WARNING, "Driver '%s' does not have a fixup routine (for %s)!  Bad things may happen.\n",
                       original->type, original->name);
       /* Signal any blocker */
       if (original->blocking)
               pthread_kill(original->blocker, SIGURG);

Move the "clone destruction" block to here ->

       ast_log(LOG_DEBUG, "Done Masquerading %s (%d)\n",
               original->name, original->_state);
       return 0;

That fixes the occasionally-garbled channel names.  It seems lucky that more serious side effects weren't observed from that.
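The ordering can be shown in isolation with a simplified sketch. The types and names below (struct chan, fixup_fn, record_fixup, finish_masquerade) are illustrative stand-ins rather than the real channel.c definitions, and all locking is left out. The driver callback runs first, while the clone is still valid, and only afterwards is the clone freed or marked as a zombie.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct chan {
    char name[80];
    int zombie;
};

/* Hypothetical driver callback; 'seen' receives the clone name it observed. */
typedef int (*fixup_fn)(struct chan *clone, struct chan *original,
                        char *seen, size_t len);

int record_fixup(struct chan *clone, struct chan *original,
                 char *seen, size_t len)
{
    (void)original;
    snprintf(seen, len, "%s", clone->name);   /* reads the clone: must still be valid */
    return 0;
}

/*
 * Finish a simulated masquerade.
 * Returns 1 if the clone was freed, 0 if it was marked as a zombie,
 * and -1 if the fixup failed (the clone is then left untouched).
 */
int finish_masquerade(struct chan *original, struct chan *clone,
                      fixup_fn fixup, char *seen, size_t len)
{
    assert(original != NULL && clone != NULL);

    /* 1. Let the driver fix things up while the clone still exists. */
    if (fixup && fixup(clone, original, seen, len) != 0)
        return -1;

    /* 2. Only now dispose of the clone. */
    if (clone->zombie) {
        free(clone);                          /* already a zombie: free it */
        return 1;
    }
    clone->zombie = 1;                        /* otherwise mark it for later */
    return 0;
}
```

Swapping steps 1 and 2 reproduces the original problem: the callback would read clone->name from memory that may already have been freed.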

By: Matt Florell (mflorell) 2004-02-19 11:46:58.000-0600

Wow, looks like you've done a lot of work on this. I hope we can get the attention of an Asterisk core coder to take a look at what you've discovered and fix this annoying bug.

Have you tried manually hanging up the ZOMBIE channels with the manager interface Hangup action?

Action: Hangup
Channel: AsyncGoto/SIP/3854-de60

I actually wrote a small script to hangup congested Local channels generated through one of my applications. Let me know if manager interface Hangup of ZOMBIEd channels works and I'll send you the script.

edited on: 02-19-04 10:33

edited on: 02-19-04 10:34

By: matt (matt) 2004-02-19 11:59:31.000-0600

No, you can't hangup the zombies.  I tried it and the management interface said success, but when you do an Action: Status it's still there.

By: matt (matt) 2004-02-19 15:24:13.000-0600

Sweeeeeeeeeet!  I got it!  Stand by for another patch.  Man, this is the third bug I've found for this one issue.

By: matt (matt) 2004-02-19 15:41:51.000-0600

Got it!

I have uploaded two new files:

channel.c.zombiefix_delta.patch  -  Is a 'delta' to the patches previously submitted containing only the zombie fix.

channel.c.comprehensive.patch  - Is the all-inclusive patch for channel.c.

If you have not yet applied any patches, use channel.c.comprehensive.patch and pbx.c.patch and that's all you need.


It turns out the zombie problem was quite simple.  The routine ast_hangup was erroneously trying to set chan->zombie=1 AFTER releasing the lock.  This resulted in a race condition where the zombie flag may or may not be set in time.  The simple solution is to set the flag before releasing the lock.  This seems to be the complete solution.  So far I have tested it 10 times with perfect results.  I will retest some more but I think this is it!


In channel.c in the routine ast_hangup, line 630:

       /* If this channel is one which will be masqueraded into something,
          mark it as a zombie already, so we know to free it later */
       if (chan->masqr) {

put it here->   chan->zombie=1;

               ast_mutex_unlock(&chan->lock);

used to be here, after the unlock:

// chan->zombie=1;
               return 0;
       }
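A stripped-down sketch of the same idea, outside of Asterisk: struct chan, chan_init, chan_hangup and clone_must_be_freed are made-up names, and the fields only mimic the ones discussed here. The zombie flag is written while the lock is still held, so any other thread that later takes the lock is guaranteed to see it.

```c
#include <assert.h>
#include <pthread.h>

struct chan {
    pthread_mutex_t lock;
    int masqr;     /* non-zero: this channel is about to be masqueraded */
    int zombie;    /* set when the hangup must be deferred to the masquerade */
    int hung_up;   /* set when the channel was torn down immediately */
};

void chan_init(struct chan *c, int masqr)
{
    pthread_mutex_init(&c->lock, NULL);
    c->masqr = masqr;
    c->zombie = 0;
    c->hung_up = 0;
}

/* Returns 0 if the hangup was deferred (zombie set), 1 if done immediately. */
int chan_hangup(struct chan *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    if (c->masqr) {
        c->zombie = 1;                    /* set the flag BEFORE releasing the lock */
        pthread_mutex_unlock(&c->lock);
        return 0;
    }
    c->hung_up = 1;
    pthread_mutex_unlock(&c->lock);
    return 1;
}

/* Masquerade side: read the flag under the same lock. */
int clone_must_be_freed(struct chan *c)
{
    int z;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    z = c->zombie;
    pthread_mutex_unlock(&c->lock);
    return z;
}
```

If the assignment came after the unlock, a masquerade thread could take the lock in between, read zombie == 0, and merely mark the clone instead of freeing it, which would match the leftover zombies.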

By: matt (matt) 2004-02-20 12:54:57.000-0600

I just finished testing it 70 times and it was perfect.

The test breakdown:

20 Dual Redirects involving 2 SIP phones on the local system

20 Dual Redirects involving a SIP phone and an IAX2 connection

20 Dual Redirects involving a SIP phone and a Zap channel (POTS phone calling in)

10 assorted normal (single) transfers involving SIP phones, IAX channels, and the Zap channel.

Is there an administrator in the house?  I say "ship it".  As far as I'm concerned, channel.c.comprehensive.patch and pbx.c.patch should be put into CVS and the bug closed.  Of course if the admins want 3rd party verification or more testing that's up to them, but this concludes my part of it.

By: Matt Florell (mflorell) 2004-02-25 17:21:24.000-0600

Is there a resolution on adding this to the CVS yet? Anyone know how to get a hold of the code maintainer for channel.c and pbx.c so we can close this?

By: Malcolm Davenport (mdavenport) 2004-02-25 18:30:52.000-0600

Maybe I'm wrong, but this appears to already be in the unstable branch??

By: matt (matt) 2004-02-25 23:35:11.000-0600

It appears that the channel.c changes are now in CVS, but not the pbx.c

I don't know if that's an oversight or if it's intentional until some 3rd party verification can be done.

The errors in channel.c were pretty obvious and 'safe' to correct.  The pbx.c change involves moving a lock and I can see where some might be "nervous" about that.  Or maybe they just forgot. I dunno.

By: Malcolm Davenport (mdavenport) 2004-02-26 12:47:03.000-0600

Fixed in CVS.  Thank you :)

By: Malcolm Davenport (mdavenport) 2004-05-19 11:02:56

Reopening per request.  Have at it :)

By: Matt Florell (mflorell) 2004-05-19 11:23:05

I have noticed that this has reared its ugly head again. What has changed in the code to bring this back?

By: matt (matt) 2004-05-19 11:44:05

After much work and testing, I discovered that this needed more work.  The previous patches are all keepers, and are already merged into cvs.  They fix some things, but still don't resolve this issue 100%.

So what I have now is what I believe to be the final and definitive fix for this.  I have had these changes merged into my Asterisk for several months with no ill effects (although I have not been able to test as much as I want).  I got a request from someone on the list for the solution and figured it was time to post the rest of it.

The patch files presented here (with the word 'may' in them, because we are in May) are relative to the main CVS branch as of yesterday, 18 May 2004.  They have been tested by one other person already who claims that they work for him as well.  You don't need to worry about the other patch files, they are still applicable and are already in cvs.

A Little Not-Great News:
This fix is more invasive than the previous ones.  It's about 100 lines of code including two new subroutines in manager.c and, worst of all, an additional flag added to channel.h.  I really, really, really, didn't want to have to modify any header files.  I had a no-header-change fix that I tried but it just didn't work.  This is the only one that works.  I finally decided, though, that even if the changes are deemed too much to merge into cvs, at least this fix will be recorded here for those who want to use it, because seriously at this point it's In Danger Of Being Lost Forever, and I don't want that to happen.

Bonus:
One of the new routines, get_chan_by_name, might be useful to other parts of Asterisk.  Feel free to call it as you wish!

The Basic Idea
A lot of the problems with the dual transfer are race conditions, which usually involves one of the channels being hung up before the dual redirect is finished.  This can result in an Asterisk crash, or it can result in the symptomatic behavior of one channel of the dual-redirect being hung up.  The fix is conceptually simple.  Add a flag to each channel that means "I'm somehow involved with a dual-redirect in progress, so don't hang up this channel until I'm finished with that."  A hangup-inhibit flag, which, obviously, must be treated carefully.

The Way This Was Tried Before and Why It Didn't Work
In channel.c you'll see starting at line 645 a test for chan->masq and then a test for chan->masqr.  I believe this was the Old Way of trying to detect this condition, to prevent the hangup during masquerading.  It's possible that these two tests could be removed with the addition of my patch, but I have not removed them.  The reason this doesn't (always) work is a matter of timing.  chan->masq and chan->masqr don't get set until you're already some portion of the way into the dual-redirect/masquerading process.  The way mine works is the flags are set just as soon as they possibly can be, the moment you realize you're going to be doing a dual-redirect.  And they're not released until it's sure that it's done.  That little bit of timing difference makes ALL the difference.  The masq and masqr fields simply aren't adequate for this.

Changes to channel.h:
Added the hangup inhibit flag, aka used_by_dual_xfer.  Tried to be clear about what it's used for.

Changes to channel.c:
1. Initialize the new flag to 0 when the channel is allocated.

2. Don't perform an ast_hangup if the flag is set.


Changes to manager.c:

1.  Added the function get_chan_by_name.  It's important to look things up by name, because when you go through the masquerade process, the channel that started out as "SIP/abcd" has actually been done away with and a new one has been created that took over its name and properties (and therefore will be in a different memory location).  Same name, but it's not the same instance you started with.

2.  Added the function set_dual_xfer_flag.

  In action_redirect:

3.  Test for dual-redirect.  If so, set the flags on the two involved channels ASAP.

4.  When the dual-redirect is finished:

     A. Clear the flags on the two original channel names
     B. If there happen to be any AsyncGoto/<name><ZOMBIE> channel names still hanging around, make sure they're clear to hangup and go ahead and issue a hangup for them.

That last step, 4B, might not be strictly necessary.  Although I did have some unpleasant encounters with zombies in my previous work on this, I remember thinking that the zombie hangups were to some extent "just to be sure" but I don't remember 100%.
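The flow described in the steps above can be sketched roughly as follows. The names used_by_dual_xfer, get_chan_by_name and set_dual_xfer_flag come from the description, but the signatures, the simple linked list, and the try_hangup helper are guesses made for illustration; they are not copied from the "may" patch files, and the locking a real implementation needs is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct chan {
    const char *name;
    int used_by_dual_xfer;   /* hangup-inhibit flag */
    int hung_up;
    struct chan *next;
};

/* Look a channel up by name; the instance behind a name can change
 * during a masquerade, so callers should not cache the pointer. */
struct chan *get_chan_by_name(struct chan *head, const char *name)
{
    for (struct chan *c = head; c; c = c->next)
        if (strcmp(c->name, name) == 0)
            return c;
    return NULL;
}

/* Set or clear the flag on whichever channel currently owns the name.
 * Returns 0 on success, -1 if no such channel exists. */
int set_dual_xfer_flag(struct chan *head, const char *name, int value)
{
    struct chan *c = get_chan_by_name(head, name);

    if (!c)
        return -1;
    c->used_by_dual_xfer = value;
    return 0;
}

/* Hypothetical hangup path: refuses while the flag is set.
 * Returns 0 if the channel was hung up, -1 if the hangup was inhibited. */
int try_hangup(struct chan *c)
{
    assert(c != NULL);
    if (c->used_by_dual_xfer)
        return -1;
    c->hung_up = 1;
    return 0;
}
```

In this sketch a redirect would set the flag on both names first, do its work, then clear both flags, after which a pending hangup goes through normally.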

That's about it.  Feel free to help out by testing it.  Comments are welcome, please no flames about the header.

- Matt

By: nicolasg (nicolasg) 2004-05-19 18:28:48

I'm the one who asked Matt for the solution to this problem. I tested his latest patch and it worked great.

Without the patch the dual redirect does not work reliably. Many times the second channel is hung up before the redirect. In one of my many tests asterisk crashed altogether.

I will incorporate a function to my 'flash operator panel' that will use the dual redirect feature, so we might have more users testing the patch so we can get more feedback.

Thanks Matt!

By: Mark Spencer (markster) 2004-05-20 12:55:24

This doesn't look like the right way to do this.  I'll put some thought into it and see what I can do.

By: flavour (flavour) 2004-05-20 15:06:51

Patch fails on current CVS, because ast_channel_walk has become ast_channel_walk_locked.

Applies (with offset), but then:

manager.o(.text+0x2db4): In function `action_redirect':
/home/flavour/asterisk/manager.c:454: undefined reference to `ast_channel_walk'
manager.o(.text+0x2ddc):/home/flavour/asterisk/manager.c:458: undefined reference to `ast_channel_walk'
manager.o(.text+0x2dee):/home/flavour/asterisk/manager.c:454: undefined reference to `ast_channel_walk'

edited on: 05-20-04 14:15

By: Mark Spencer (markster) 2004-05-21 12:42:01

I'm working on fixing this the right way so the attached patch will not be necessary.  Just sit tight :)

By: gdalgliesh (gdalgliesh) 2004-05-21 12:57:32

When using this "may" patch, asterisk crashes quite often when I use the management interface through 'flash operator panel' to join 3 calls in a conference.
   -- Remote UNIX connection
   -- Remote UNIX connection disconnected
   -- Executing Dial("SIP/5003-8303", "ZAP/2|25|t") in new stack
   -- Called 2
   -- Zap/2-1 is ringing
   -- Zap/2-1 answered SIP/5003-8303
   -- Executing MeetMe("Zap/2-1", "|E") in new stack
 == Parsing '/etc/asterisk/meetme.conf':   == Parsing '/etc/asterisk/meetme.conf': Found
   -- Playing 'conf-enteringno' (language 'en')
May 21 12:37:24 WARNING[1200884528]: channel.c:677 ast_hangup: Hard hangup called by thread 1200884528 on AsyncGoto/Zap/2-1<ZOMBIE>, while fd is blocked by thread 1226062640 in procedure ast_waitfor_nandfds!  Expect a failure
pbx*CLI>
Disconnected from Asterisk server
Executing last minute cleanups
Asterisk cleanly ending (0).

By: Mark Spencer (markster) 2004-05-21 13:06:19

Don't use any of the attached patches.  If you update to latest CVS it should fix the original problem in this bug report.

By: nicolasg (nicolasg) 2004-05-21 14:17:34

I just did a fresh checkout of CVS, and it does not work. When performing a dual redirect between SIP/11 as Channel and SIP/15 as ExtraChannel, the latter gets hung up before the redirect is performed. This is the debug output at the console:

DEBUG[213006]: manager.c:720 process_message: Manager received command 'Redirect'
DEBUG[213006]: channel.c:2063 ast_channel_masquerade: Planning to masquerade SIP/11-ac5b into the structure of AsyncGoto/SIP/11-ac5b
DEBUG[213006]: channel.c:2076 ast_channel_masquerade: Done planning to masquerade AsyncGoto/SIP/11-ac5b into the structure of SIP/11-ac5b
DEBUG[213006]: channel.c:2109 ast_do_masquerade: Actually Masquerading SIP/11-ac5b(6) into the structure of AsyncGoto/SIP/11-ac5b(6)
DEBUG[213006]: channel.c:2120 ast_do_masquerade: Got clone lock on 'SIP/11-ac5b' at 0x90d6f30
DEBUG[213006]: channel.c:2261 ast_do_masquerade: Putting channel SIP/11-ac5b in 4/16 formats
DEBUG[213006]: channel.c:2286 ast_do_masquerade: Released clone lock on 'AsyncGoto/SIP/11-ac5b<ZOMBIE>'
DEBUG[213006]: channel.c:2294 ast_do_masquerade: Done Masquerading SIP/11-ac5b (6)
DEBUG[393232]: channel.c:2542 ast_channel_bridge: Didn't get a frame from channel: AsyncGoto/SIP/11-ac5b<ZOMBIE>
DEBUG[393232]: channel.c:2612 ast_channel_bridge: Bridge stops bridging channels SIP/15-9e79 and AsyncGoto/SIP/11-ac5b<ZOMBIE>
 == Spawn extension (sip, 11, 3) exited non-zero on 'SIP/15-9e79'
   -- Executing Hangup("SIP/15-9e79", "") in new stack
 == Spawn extension (sip, h, 1) exited non-zero on 'SIP/15-9e79'
DEBUG[393232]: chan_sip.c:1490 sip_hangup: update_user_counter(15) - decrement inUse counter

By: gdalgliesh (gdalgliesh) 2004-05-21 18:50:06

Dual redirect seems to be working for me now with latest CVS. I will continue to test it.

By: nicolasg (nicolasg) 2004-05-21 20:02:41

Follow-up from my tests: 95% of the time it does not work.

On a successful attempt, the first event I see on the manager console is this:

<- Event: Newchannel
<- Channel: AsyncGoto/SIP/17-3bf3
<- State: Up
<- Callerid: <unknown>
<- Uniqueid: 1085183384.178

On a failed attempt is this:

<- Event: Unlink
<- Channel1: SIP/16-5813
<- Channel2: SIP/17-c2af
<- Uniqueid1: 1085183244.172
<- Uniqueid2: 1085183244.173

It seems that Matt is right when he says "The reason this doesn't (always) work is a matter of timing". It appears that sometimes (in my case most of the time) the hangup is processed before the redirect.

To make things worse (or at least more wacky), if I enter the redirect command by hand in a telnet window, it works 95% of the time (but not always).

By: Mark Spencer (markster) 2004-05-21 23:08:06

nicolasg: are you sure you're testing with latest CVS *head* and not using the patches in the bug report???  the locking is now done in such a way that both channels have to be processed before a thread can possibly receive a hangup on one of the channels.

By: nicolasg (nicolasg) 2004-05-22 08:47:50

Hi Mark,

Yes, I tried with CVS-HEAD. I did a fresh checkout (not an update) of asterisk, zaptel and libpri, and tried without any of the other custom patches that I normally use (mostly internationalization patches). I will try again today; I'm seeing changes to channel.c committed to CVS-HEAD after I did my tests, about 8 hours ago. I will report back my result later.

Just tested with a fresher CVS-HEAD: now when the redirect is received, asterisk stops responding. It does not crash, but it stops responding. I'm not at the office so I cannot test it deeply... I'm using Fedora Core with LD_ASSUME_KERNEL, maybe the problem is related to this. I will try with a stock kernel.org kernel on Monday and report back.

edited on: 05-22-04 08:43

By: nicolasg (nicolasg) 2004-05-24 09:47:18

Well, I have just tried with stock kernel.org version 2.4.26.

I'm using Fedora Core I on an Athlon XP 2000+. Output from /proc/version:

Linux version 2.4.26 (root@dns1.houseware.com.ar) (gcc version 3.3.2 20031022 (Red Hat Linux 3.3.2-1)) #2 Sun May 23 21:32:48 ART 2004

Output of show version from asterisk cli:

Asterisk CVS-HEAD-05/24/04-10:02:24 built by root@dns1.houseware.com.ar on a i686 running Linux

When I try to do a dual redirect, if I specify the originating party as the Channel parameter, asterisk crashes:

Asterisk ended with exit status 139
Asterisk exited on signal 11.

(gdb) bt
#0  0x40019582 in pthread_mutex_lock () from /lib/i686/libpthread.so.0
#1  0x08075637 in ast_async_goto (chan=0x0, context=0x4556f3d1 "conferences",
   exten=0x4556f1cf "900", priority=1) at pbx.c:3535
#2  0x0808dbef in action_redirect (s=0x80ff5a0, m=0x4556eec4) at manager.c:472
#3  0x080929d1 in process_message (s=0x80ff5a0, m=0x4556eec4) at manager.c:780
#4  0x08091a61 in session_do (data=0x80ff5a0) at manager.c:851
#5  0x40018db2 in pthread_start_thread () from /lib/i686/libpthread.so.0
#6  0x4016279a in clone () from /lib/i686/libc.so.6

If I specify the originating party as the ExtraChannel parameter, asterisk stops responding, with no crash. But no further calls are allowed.

Is there anything I can do to try to solve this problem? Maybe there is a specific kernel option to select? or a different version of gcc?

By: Mark Spencer (markster) 2004-05-24 11:16:22

Segfault should be gone in latest CVS,  I'm confused about the lock though.  Just find me on IRC today and we'll work on it.

By: Mark Spencer (markster) 2004-05-25 00:40:33

Found a small but uber substantial typo.  Should be fixed in CVS now!

By: matt (matt) 2004-05-25 10:55:40

Mark, could you please tell us what you fixed, and will it be going into the stable branch as well?

Many times there are fixes that we would like to see/test/incorporate into our version/check to see if they're in certain versions/etc....  but when people say "oh it's fixed don't worry about it" but don't describe what they did, that is very frustrating because we can't check on it, or learn from it or anything.  At the very least it ought to be standard practice to list which files were changed.

Don't get me wrong, though, I appreciate the effort.  And I don't give a rat's tail whether it was my fix or someone else's.  I'll be perfectly happy if this bug really is fixed, never to be seen again.

By: Mark Spencer (markster) 2004-05-25 11:09:59

the channel_walk_locked was being called with NULL each time instead of the previous channel, causing it to loop rapidly in that routine and not make progress.  It is not possible to back-port this fix to -stable since -stable does not support recursive mutexes.
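The kind of mistake described here is easy to reproduce with any "give me the one after prev" iterator. The sketch below uses a hypothetical walk_next over a plain linked list; it is not the real ast_channel_walk_locked and does no locking. Passing the previous element advances the walk, whereas passing NULL on every call would hand back the first element forever.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct chan {
    const char *name;
    struct chan *next;
};

/* Hypothetical walker: NULL means "start at the head",
 * otherwise return the channel that follows prev. */
struct chan *walk_next(struct chan *head, struct chan *prev)
{
    return prev ? prev->next : head;
}

/* Correct use of the walker: each call is given the previous channel. */
struct chan *find_by_name(struct chan *head, const char *name)
{
    struct chan *c;

    assert(name != NULL);
    c = walk_next(head, NULL);
    while (c) {
        if (strcmp(c->name, name) == 0)
            return c;
        /* The buggy form would be walk_next(head, NULL) here, which
         * keeps returning the head and never makes progress. */
        c = walk_next(head, c);
    }
    return NULL;
}
```

With the buggy call, the loop only terminates when the very first channel happens to be the one being searched for.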

By: Mark Spencer (markster) 2004-05-25 11:10:52

I guess more fundamentally the issue is this: You have to start both masquerades at the same time, because if the two channels are related (which they generally are) you don't want one receiving the hangup while you're masquerading the first.

By: nicolasg (nicolasg) 2004-05-25 16:30:14

Matt: Maybe you will find it useful to subscribe to asterisk-cvs if you are not already subscribed, so you can see what changes are being committed, related to which bug, etc.

Mark: Thanks for your time and effort. Today is like July 4 in my country, so I'm not at the office to test it deeply, but the latest change seems to fix the problem. I tried the dual redirect twice and it worked ok, with no deadlocks nor crash. Thanks again!

By: Mark Spencer (markster) 2004-05-26 01:24:11

Fixed in CVS

By: Digium Subversion (svnbot) 2008-01-15 14:44:03.000-0600

Repository: asterisk
Revision: 2206

U   trunk/channel.c

------------------------------------------------------------------------
r2206 | markster | 2008-01-15 14:44:03 -0600 (Tue, 15 Jan 2008) | 2 lines

Fix minor ordering issue (bug ASTERISK-975)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=2206

By: Digium Subversion (svnbot) 2008-01-15 14:44:04.000-0600

Repository: asterisk
Revision: 2207

U   branches/v1-0_stable/channel.c

------------------------------------------------------------------------
r2207 | markster | 2008-01-15 14:44:04 -0600 (Tue, 15 Jan 2008) | 2 lines

Minor reordering for bug ASTERISK-975

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=2207