[Home]

Summary:ASTERISK-19455: SIP channels permanently stuck in system after BYE message received
Reporter:Dean Vesvuio (deanvesuvio)Labels:
Date Opened:2012-03-01 13:41:34.000-0600Date Closed:2012-05-09 11:22:19
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:10.1.0 10.1.1 10.1.2 10.1.3 Frequency of
Occurrence
Constant
Related
Issues:
is related toASTERISK-19425 Calls not released after BYE
is related toASTERISK-19642 SIP channels permanently stuck in system after BYE message received
Environment:Linux: 2.6.32-220.4.1.el6.x86_64 #1 SMP Tue Jan 24 02:13:44 GMT 2012 x86_64Attachments:( 0) 13034709861-filter.txt
( 1) errorlog.gz
Description:We are seeing some percentage of our SIP channels stuck in the system after BYE is received. We see the following in our logs:

[Mar  1 11:11:23] WARNING[16927]: chan_sip.c:3879 __sip_autodestruct: Autodestruct on dialog '4f59fe4879706230286df93b21335ea9@216.40.252.146' with owner in place (Method: BYE)
[Mar  1 11:11:24] WARNING[16927]: chan_sip.c:3879 __sip_autodestruct: Autodestruct on dialog '7edfb438409765196966d7550515d8d3@76.164.225.138' with owner in place (Method: BYE)
[Mar  1 11:11:24] WARNING[16927]: chan_sip.c:3879 __sip_autodestruct: Autodestruct on dialog '0e1e400609c5925b292ad8ae10fe9942@216.40.252.146' with owner in place (Method: BYE)

The channels do NOT clear. Asterisk must be totally restarted. If you do a sip show channels in the CLI you see lines that look like this:

76.164.225.138   XXXXXXXXXX       16e63b231707cb9  (nothing)        No       Rx: BYE                    p3        
76.164.225.138   XXXXXXXXXX       569511d04c78b4d  (nothing)        No       Rx: BYE                    p3  



If you inspect a specific stuck channel with sip show channel you see this:

* SIP Call
 Curr. trans. direction:  Outgoing
 Call-ID:                569511d04c78b4d319fb2914183bc17a@76.164.225.138
 Owner channel ID:       <none>
 Our Codec Capability:   (ulaw)
 Non-Codec Capability (DTMF):   0
 Their Codec Capability:   (gsm|ulaw|alaw)
 Joint Codec Capability:   (ulaw)
 Format:                 (nothing)
 T.38 support            No
 Video support           No
 MaxCallBR:              384 kbps
 Theoretical Address:    XX.XX.XX.XX:5060
 Received Address:       XX.XX.XX.XX:5060
 SIP Transfer mode:      open
 Force rport:            Yes
 Audio IP:               XX.XX.XX.XX (local)
 Our Tag:                as2f5a8f37
 Their Tag:              as65be27db
 SIP User agent:         Asterisk PBX 1.6.1.20
 Peername:               p3
 Original uri:           sip:XXXXXXXXXX@XX.XX.XX.XX
 Caller-ID:              XXXXXXXXXX
 Need Destroy:           No
 Last Message:           Rx: BYE
 Promiscuous Redir:      No
 Route:                  sip:XXXXXXXXXX@XX.XX.XX.XX
 DTMF Mode:              inband
 SIP Options:            replaces replace timer
 Session-Timer:          Inactive


We do NOT have a hangup extension in our dialplan. It is incredibly simple. Here it is:

exten => _cXX#.,1,Answer()
exten => _cXX#.,n,AGI(confagi.php,${EXTEN})
exten => _cXX#.,n,ConfBridge(${INSTANCE},default_bridge,default_user,${MENU_NAME})
exten => _cXX#.,n,Hangup()

The AGI script is ALWAYS already complete when this issue happens. It only takes about 300ms to run. The issue occurs when the user hangs up while in the ConfBridge app.

In the log I've uploaded, you can see the entire situation as regards the following SIP channel: 7fa1b98e207951a85328a51c7d539e58

The file had a zillion copies of the following two messages, which I redacted for readability:

[Mar  1 11:27:06] DEBUG[2668] bridging.c: Going into a multithreaded waitfor for bridge channel 0x7f39d50388c8 of bridge 0x7f39d40d0878

-and-

[Mar  1 11:27:06] DEBUG[2934] audiohook.c: Failed to get 160 samples from read factory 0x7f39d41fc2b8
[Mar  1 11:27:06] DEBUG[2934] audiohook.c: Failed to get 160 samples from write factory 0x7f39d41fcdf8
[Mar  1 11:27:06] DEBUG[2934] audiohook.c: Read factory 0x7f39d41fc2b8 and write factory 0x7f39d41fcdf8 both fail to provide 160 samples


One thing I noticed is that the autodestruct time for this channel gets set to 6400ms. I had thought it was supposed to be 32000ms. Maybe that's the problem?
Comments:By: Dean Vesvuio (deanvesuvio) 2012-03-01 13:42:54.372-0600

These are the logs mentioned in the description.

By: Dean Vesvuio (deanvesuvio) 2012-03-01 13:44:22.305-0600

Note that above I have replaced phone numbers and IP addresses with Xs. I didn't do that in the logfile.

By: Dean Vesvuio (deanvesuvio) 2012-03-01 13:49:08.341-0600

And, just so we don't lose any data here, this is what that channel looks like now, 15 minutes later. It will look like this until I restart Asterisk sometime tonight. Please let me know if I should not do that -- if I can be helpful by running some commands I'm more than willing to do so.


Connected to Asterisk 10.1.3 currently running on i2075 (pid = 16895)
i2075*CLI> sip show channel 7fa1b98e207951a85328a51c7d539e58

 * SIP Call
 Curr. trans. direction:  Incoming
 Call-ID:                7fa1b98e207951a85328a51c7d539e58@216.40.252.146
 Owner channel ID:       <none>
 Our Codec Capability:   (ulaw)
 Non-Codec Capability (DTMF):   0
 Their Codec Capability:   (gsm|ulaw|alaw)
 Joint Codec Capability:   (ulaw)
 Format:                 (nothing)
 T.38 support            No
 Video support           No
 MaxCallBR:              384 kbps
 Theoretical Address:    216.40.252.146:5060
 Received Address:       216.40.252.146:5060
 SIP Transfer mode:      open
 Force rport:            Yes
 Audio IP:               72.46.132.146 (local)
 Our Tag:                as5a366fd6
 Their Tag:              as5b45d16e
 SIP User agent:         Asterisk PBX 1.6.1.20
 Peername:               p2
 Original uri:           sip:9546356061@216.40.252.146
 Caller-ID:              9546356061
 Need Destroy:           No
 Last Message:           Rx: BYE
 Promiscuous Redir:      No
 Route:                  sip:9546356061@216.40.252.146
 DTMF Mode:              inband
 SIP Options:            replaces replace timer
 Session-Timer:          Inactive


By: Sarikov Gabriel (gabriel) 2012-03-13 04:26:15.101-0500

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.4 and 1.6.x branches has ended. For continued maintenance support please move to the 1.8 branch which is a long term support (LTS) branch. For more information about branch support, please see https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions.  After testing with Asterisk 1.8, if you find this problem has not been resolved, please open a new issue against Asterisk 1.8.

its also recurring on all asterisk 1.8 versions till now 1.8.10

By: Dean Vesvuio (deanvesuvio) 2012-03-15 02:46:17.526-0500

This isn't in 1.4 or 1.6. It's in all versions of 10.X. And, it appears also to exist in 1.8.

By: Dean Vesvuio (deanvesuvio) 2012-03-15 11:48:24.649-0500

I upgraded a server to 10.2.0 and the bug is still there. Also, I tried this:

#define DEFAULT_TRANS_TIMEOUT 32000

The theory was that maybe letting the server have more time to close down SIP sessions would fix it. No dice.

Really, I'm game to look at anything anyone suggests that might be the source of this problem. Could it be because the server that's sending calls to this server is running Asterisk 1.6? Could it be because we're doing a SIP reinvite to move the RTP stream?

By: Dan McKeehan (dmckeehan) 2012-04-04 15:55:31.740-0500

I too have the same problem.  Let me know if you need any additional info.

By: Dan McKeehan (dmckeehan) 2012-04-04 16:09:49.569-0500

Looks like this might be a duplicate of
ASTERISK-19425
ASTERISK-17267

And possibly
https://reviewboard.asterisk.org/r/1077/

By: Dean Vesvuio (deanvesuvio) 2012-04-05 21:30:32.275-0500

I do think it's a dupe of 19425

Please note that as I pointed out on 1945, if you set DEFAULT_TRANS_TIMEOUT to 128000 it helps even more than 32000. I know this is a workaround, but it's very useful for our use case until the engineers start looking at it.

By: Dean Vesvuio (deanvesuvio) 2012-04-05 21:30:55.372-0500

Oh and, though I think it's a dupe, I suspect my logs are pretty useful, so please don't delete this bug!

By: Dan McKeehan (dmckeehan) 2012-04-06 11:15:40.488-0500

Bounty for this ticket ASTERISK-19642

By: Dean Vesvuio (deanvesuvio) 2012-04-19 20:40:19.164-0500

I found a workaround for the case where the calls are in ConfBridge. It involves an external sweeper script, not a change to Asterisk. I can share it with people if that's the problem they're having. Especially if there's a bounty involved... ;-)

By: Mark Michelson (mmichelson) 2012-04-30 15:51:00.310-0500

I've taken a look at your logs, and it isn't especially clear why the owner channel is sticking around during autodestruction of the SIP dialog.

To answer your early question about why you had a 6400 ms timeout, the DEFAULT_TRANS_TIMEOUT is set to -1, meaning to calculate the timeout based on timer T1. Specifically, the timeout is set to timer T1 * 64. By default, timer T1 is set to 500 ms. In your case, I'm guessing it has been set to 100 ms. This can happen either through setting the timer directly in your sip.conf file, or it can get set low because you have "qualify" enabled and Asterisk has detected a short round-trip time for transactions with this peer.

I agree that your AGI is completely irrelevant in this case. Entering the ConfBridge *has* to occur after the AGI has concluded. This seems like something I could attempt to reproduce using a simple SIPp script. I will set my timer T1 very low and automate some calls and see what happens. I feel like this will be easier to reproduce than ASTERISK-19425. While I think that issue has an issue with the way that the AGI is coded, I think that both issues have the same problem at their hearts: if a SIP dialog autodestructs while a channel is still present, the SIP dialog does not get properly destroyed, even if the channel is later hung up.

By: Mark Michelson (mmichelson) 2012-05-01 13:44:16.354-0500

I believe I was able to reproduce your issue. At first, I was not. I ran about 1000 calls through confbridge and could not make your issue happen. Then I noticed that the "stuck" channel either had a prompt being played on it or had music on hold being played. I adjusted my test so that only one caller would be in the conference bridge at a time and would have music being played to him. With this, I was able to make the issue happen. I may be wrong about this, but I believe that your problem may be fixed by revision 357762 of the Asterisk 10 branch. Try the following command in your Asterisk 10.1.3 checkout: {noformat} svn diff https://origsvn.digium.com/svn/asterisk/branches/10@357761 https://origsvn.digium.com/svn/asterisk/branches/10@357762 | patch -p0 {noformat} Recompile and see if you still get the stuck channels.

When I reran my test against the tip of the 10 branch, I did not get any stuck channels.

By: Dan McKeehan (dmckeehan) 2012-05-01 14:24:05.763-0500

My issue is a bit different but I believe might be the same problem.  I am not using an AGI, conference bridge.  I am also running revision 363269.  I am using pbx_spool.  When channels get stuck they say last command RX:BYE in sip show channels.  Let me know if this is a different issue and I'll create a new bug.

By: Mark Michelson (mmichelson) 2012-05-01 16:35:32.606-0500

Just in case this does not work for you, there is a patch attached to ASTERISK-19425 you can try as well.

By: Dean Vesvuio (deanvesuvio) 2012-05-04 02:22:22.213-0500

I'm actually running 10.2.0 on one box and 10.3.0 on another one. Both still have this problem. I looked at the changelogs, however, and it does look like maybe 10.4.0 has this patch while my current versions do not. I'll try on one box tomorrow and report back. Thanks!

By: Mark Michelson (mmichelson) 2012-05-04 12:41:30.233-0500

@Dan, I'm not sure if you have the same issue or not. I'm going to wait to see what Dean reports. If his issue is fixed, I'm going to close this and recommend you open a new issue since you're having something different happen. If Dena says that the issue isn't fixed, then it's possible that you do have the same issue. We'll see.

By: Dean Vesvuio (deanvesuvio) 2012-05-04 17:25:28.641-0500

Ok I've tested this on a server. It does not address the problem. We still see dozens of calls getting stuck. I've reverted to using our earlier fix, which BTW is to kick users whose channels are in this BYE state out of ConfBridge using the kick command in the console via an automated process. That fix does actually alleviate the problem, if that helps in diagnosing.

By: Mark Michelson (mmichelson) 2012-05-07 10:35:08.956-0500

Okay, thank you for the feedback. I will work on trying to get this figured out.

By: Dean Vesvuio (deanvesuvio) 2012-05-07 13:15:18.831-0500

Let me know if there's anything I can do to help. I'd even be willing to give you access to our server for a few hours if you think it would help. And, needless to say I'm willing to try out whatever patches you come up with.

By: Mark Michelson (mmichelson) 2012-05-07 14:38:51.214-0500

@Dean, I think I may have (once again) figured out your issue.

In the errorlog.gz attachment, I took a look through the logs, and one line jumped out at me as being a bit odd:

[Mar  1 11:30:01] VERBOSE[3821] app.c:     -- Took too long, cutting it short...

This message means that when the conference participant was asked for his name, he took longer than 10 seconds to say it. The result is that a 10 second file was recorded for the participant's name. When the participant left the call, his channel thread then played that 10 second file into the conference bridge to let the other participants know that he had left the call (this happens even if he is the only participant in the conference). During the playback of this 10 second file, the SIP transaction timed out (since it times out after 6.4 seconds).

Essentially what you have here is the same situation that is happening in ASTERISK-19425. Something is occupying the channel for longer than the SIP transaction timeout. In your case, it's playing back the participant's name to the conference bridge. I suspect that the patch I have attached on ASTERISK-19425 will help with your problem. Note that even with the patch from ASTERISK-19425, you will see some warning messages about "Auto destruct called with owner in place" but the channels should eventually clear.

The only thing that I'm hesitant about is that you stated on that issue that even with DEFAULT_TRANS_TIMEOUT set as high as 128000, you still have a few stuck channels.

It may be that back when you made that adjustment, you were getting a combination of the problem I just described AND the issue where hangups were dropped that was fixed in revision 357762. What I suggest you do is to apply both the patch on ASTERISK-19425 and apply the change from revision 357762. If, with both of these applied, you still have stuck SIP channels, then please upload a new debug log and let me know which SIP call-ID is the one involved. In addition, if you find that you still have stuck SIP channels, let me know if the corresponding channels also show up in a "core show channels" call.

@Dan: Given that you are already running a new enough version of Asterisk and you have tried the patch at ASTERISK-19425 with no success, I'm beginning to think that you either have a different issue than what Dean has, or you have the same issue that Dean has for a very small percent of Dean's stuck SIP channels. For now anyway, I'm fine with using this issue as a place for you to upload your logs that demonstrate the stuck SIP channels. If you upload a log, be sure to let me know the SIP call-ID of the stuck dialog, and also let me know if the corresponding channel still shows up if you issue a "core show channels" command.

By: Dan McKeehan (dmckeehan) 2012-05-07 15:05:39.499-0500

Added an example of the logs from my stuck calls.  

By: Dean Vesvuio (deanvesuvio) 2012-05-07 15:11:04.614-0500

@Mark

I will apply the patch tonight and update you tomorrow midday.

Thanks!!!!

By: Dean Vesvuio (deanvesuvio) 2012-05-09 03:50:27.844-0500

That fixed it! Many thanks! You should definitely try to get that patch put into 10.5.0 if possible. We had zero stuck calls today on the server where we deployed it (down from dozens).

By: Mark Michelson (mmichelson) 2012-05-09 09:04:47.105-0500

Okay, great to hear. I'm going to commit the patch and get this and ASTERISK-19425 closed.

@Dan: It's hard to see much from the debug you published. The message

{quote}
[May  1 13:10:20] DEBUG[9170] channel.c: Didn't get a frame from channel: SIP/2.2.2.84-00027be3
{quote}

means that the bridge detected the hangup on the SIP channel properly. However, it's hard to tell what happened beyond that, if anything. The fact that the channel still showed up in your "core show channel SIP/2.2.2.84-00027be3" command indicates the Asterisk channel is still in the system. It's not just the SIP dialog that got stuck. I suspect that there is either

1) A channel reference leak
2) A deadlock

that is preventing the channel from being freed properly. If I were a betting man, I'd guess that the use of chan_local has something to do with the issue.

I suggest that you open a new bug report since it has turned out this issue is not directly related to yours. On that issue, please upload a full log, not just selected bits. Also be sure to indicate the version of Asterisk you are using so that the problem can be more easily reproduced if possible.

By: Virtua Klem (virtuaklem) 2012-07-10 02:51:04.271-0500

Hi,

I have the same problem with 10.5.1 versions.

And i can't apply the patch.

Thanks