ASTERISK-19425: Calls not released after BYE

[Home]

Summary: ASTERISK-19425: Calls not released after BYE

Reporter: David Cunningham (dcunningham) Labels:

Date Opened: 2012-02-23 00:49:54.000-0600 Date Closed: 2012-05-09 11:22:19

Priority: Major Regression?

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: 1.8.7.2 Frequency of
Occurrence Frequent

Related
Issues:
is related to ASTERISK-19455 SIP channels permanently stuck in system after BYE message received

is related to ASTERISK-19714 No hungup after BYE.

Environment: Ubuntu 8.04.4 LTS Attachments: ( 0) ASTERISK-19425.patch
( 1) full-2012-02-23.gz

Description: We have systems on various releases of Asterisk 1.8 which do not release calls properly after a BYE. That is, Asterisk receives a BYE from the other party. It sends an ACK. Then quite a bit later if you do "sip show channels" the call is still listed with a "Last Message" of "Rx: BYE" - example below.
What information do you need to investigate please? We have an Asterisk log with "sip set debug on" for such a call.
Thanks.

Peer User/ANR Call ID Format Hold Last Message Expiry Peer
12.34.56.78 1234567898 613977632-0-308 0x0 (nothing) No Rx: BYE 12.34.56

Comments: By: Matt Jordan (mjordan) 2012-02-23 08:53:06.565-0600

Are the channels associated with the SIP dialog destroyed when the SIP dialog is destroyed? Or are they persisting after the SIP dialog has been completely destroyed? Note that a SIP dialog can remain in the system for up to 32 seconds after the final response has been sent, to account for retransmissions. Currently, the channel associated with the dialog can remain in the system in certain circumstances. That behavior is something we're looking to improve on in Asterisk 1.8.10.

If the channels are remaining in the system after the SIP dialog has been destroyed, that's a much larger issue.
By: David Cunningham (dcunningham) 2012-02-23 17:36:13.255-0600

The "sip show channels" still shows the call about an hour after the call finished. It eventually disappears some time after, but an hour seems like a long wait.

I can see the dialogue destruction happening at the end of the call in the Asterisk log as below. I'm not sure why it then still shows in "sip show channels".
Please let me know what you need to continue, and thanks.

[Feb 23 00:25:39] VERBOSE[11057] chan_sip.c: Scheduling destruction of SIP dialog '613977632-0-3080578460@12.34.56.78' in 32000 ms (Method: BYE)
...
[Feb 23 00:26:11] WARNING[11057] chan_sip.c: Autodestruct on dialog '613977632-0-3080578460@12.34.56.78' with owner in place (Method: BYE)

By: David Woolley (davidw) 2012-02-24 05:54:00.256-0600

"with owner in place" is the critical detail here. It is saying that there is a channel still up for that SIP connection. Do you have a very long running h extension, or an AGI script that has not returned?
By: Matt Jordan (mjordan) 2012-02-24 09:13:27.046-0600

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: David Cunningham (dcunningham) 2012-02-24 16:58:56.317-0600

Debug log attached. We do run an AGI script from the h extension, but it should have finished within seconds.
Thank you!
By: Dean Vesvuio (deanvesuvio) 2012-02-28 12:36:23.357-0600

We are seeing this same issue with the latest 10.1 version. In our case it happens some percentage of the time when a user hangs up while in the Confbridge app. We also have an AGI script in the "h" dialplan which takes about 100ms to run.

Out of about 3600 calls into a server on a day, we have 400 of the calls stuck in this state reliably every day.

I do not think this is a minor issue. We have the exact same workflow working properly on the latest 1.6 version of Asterisk with zero problems.
By: Dean Vesvuio (deanvesuvio) 2012-03-01 12:20:50.138-0600

Just wanted to add that we completely got rid of our "h" dialplan entry and this still happens constantly. It's a terrible bug because it causes the Confbridge app to get stuck. So, that means if you're recording a conference, the file grows and grows since it thinks the conference is still ongoing. I can't understand why this is marked as minor. It makes Asterisk nearly unusable.
By: Matt Jordan (mjordan) 2012-03-05 14:05:06.062-0600

This is marked as 'minor' because that's what the original issue reporter marked it as. I've bumped it up to Major.
By: Dean Vesvuio (deanvesuvio) 2012-03-06 13:46:25.262-0600

Ok. I think this is probably the same issue as the bug I filed, ASTERISK-19455. I have some experience with coding in Asterisk. If you have any suggestions for where to look to help diagnose this, I'm willing to give it a try.
By: David Cunningham (dcunningham) 2012-03-21 07:10:51.186-0500

Would anyone have an update on this? Thanks!
By: Matt Jordan (mjordan) 2012-03-21 08:03:47.115-0500

Your issue is in queue, please be patient, and we will get to it as time permits and developer resources become available.

By: Yoherman Jo (yojo) 2012-04-04 22:02:13.199-0500

same issue, my server (asterisk 10.2) connect to Huawei softx3000. Stuck call every night (around 10 calls).

sip.conf
[general]
nat=yes
allowguest=no
bindport=5060
bindaddr=0.0.0.0
disallow=all
allow=alaw
allow=ulaw
canreinvite=no
rtptimeout=120
relaxdtmf=yes
trustrpid=yes
session-timers=originate
session-refreshe=uac
ignoresdpversion=yes

[trunk-softswitch]
type=friend
host= 176.x.x.x
context=context-softswitch
nat=no
canreinvite=no
insecure=invite
dtmfmode=inband
qualify=yes
language=id

By: Dean Vesvuio (deanvesuvio) 2012-04-05 20:55:26.948-0500

I wanted to add that when I do this in the code:

#define DEFAULT_TRANS_TIMEOUT 32000

the problem gets better, meaning maybe 20 stuck calls a day instead of 200.

When I do this:

#define DEFAULT_TRANS_TIMEOUT 128000

the problem gets a lot better, meaning 2 instead of 20.

So, people might want to try that in the interim until the engineers are able to get a look at this. Maybe if you set that number to something crazy like 5 minutes (300000), it would go away completely.
By: Dan McKeehan (dmckeehan) 2012-04-06 11:15:54.777-0500

Bounty for this ASTERISK-19642
By: Mark Michelson (mmichelson) 2012-04-30 14:28:06.131-0500

Hey folks. Just thought I'd let you know that I am looking into this issue now.

Based on the full-2012-02-23.gz file, I believe the problem is in line with what was suggested with regards to AGI scripts. Using the data in the upload, here's what I see:

At 00:25:39, Asterisk receives a BYE.
During that same second, The h@from-internal:1 AGI begins running.
From here, you can grep through the file looking at the PBX thread ID (15090) to see the AGI's execution.
At line 208315, Asterisk transmits the value of the ENSWITCH-FAX-CALLED variable. The timestamp is still the same.
At 00:26:11, you can see that the BYE transaction has timed out, and that Asterisk has attempted to destroy the dialog.
At 00:26:40 (line 210325), you can see that the channel is being soft hungup due to lack of RTP activity. This makes sense since the call was disconnected a minute ago.
After this the AGI starts running again. Based on what it's doing, it seems to think that the call has just started.
There is a 3 second gap where the AGI stops running and then it starts again at line 210895.
There is another 3 second gap and then the AGI starts again at line 211655.
There is another 3 second gap and then the AGI starts again at line 211674.
It is finally after this run that it appears that the channel finally gets hung up.

I took a further look into your log and found a later call that appeared to be of the same nature (fax to e-mail). Like the "stuck" call, it is a SIP channel (Call-ID 8b032422-b6572adf@774.81.114.167) bridged to an IAX2 channel (channel IAX2/127.0.0.1-1624). In the stuck call, the SIP leg is the one that ends the call with a BYE. After this, the AGI runs in dead mode. In the non-stuck call, the IAX2 leg is the one that ends the call, and the AGI runs in live mode. While I don't have access to the AGI in question, it would appear that the AGI is written in a way such that it expects to run in live mode and detect the hangup itself. Of course, if the hangup has already occurred, then the AGI will wait for quite a while. In the "stuck" call, the AGI is presumably still waiting to receive a hangup when the SIP BYE transaction times out. Taking Dean's advice of setting the DEFAULT_TRANS_TIMEOUT to a higher number means that it will take longer for the SIP BYE transaction to time out. In David's case, I presume that setting the DEFAULT_TRANS_TIMEOUT greater than the configured RTP timeout might be a temporary fix for this issue.

Anyway, while I have possibly added some insight into *what* is going on, I still don't have an answer as to *why* the SIP channel does not go away. I'm going to dive into the code a bit and see if I can find a clue as to what is going on. I'll also have a look into ASTERISK-19455 to see if Dean's situation appears to be the same or if there is something different there that may give me some clue as to what's going on.

As an aside, Dean, is your last name actually "Vesvuio" or is it "Vesuvio" ? If it's the latter, go into your user settings and change your "Full name" since it appears to be misspelled currently.

By: Mark Michelson (mmichelson) 2012-05-01 13:48:47.621-0500

I would like to point out that I do *not* believe that this issue is the same one being observed in ASTERISK-19455. I will work on setting up a situation that mimics this setup and see if I can figure out what's going on.
By: Mark Michelson (mmichelson) 2012-05-01 14:16:04.774-0500

Okay, I have now managed to reproduce this. Quite easily in fact. I simply set my timert1 in sip.conf to 100 ms, then I added an AGI in my 'h' extension that calls a shell script that sleeps for 20 seconds. I see the "Autodestruct on dialog ... with owner in place" message, then the AGI completes. After that, 'sip show channels' shows the channel stuck forever. Issuing "core show channels" results in no channels being shown. This means the issue localized to chan_sip and is not a general channel reference leak. I will report back when I have a working patch.
By: Mark Michelson (mmichelson) 2012-05-01 14:32:55.661-0500

I think I have figured out what the problem is now.

chan_sip is written under the assumption that the sip_hangup() callback will be called before the final transaction times out. If events occur in this order, then in the __sip_autodestruct() function, the dialog gets unlinked from all containers it is in. However, if the events occur out of order, then the dialog never gets unlinked from its containers. This is why the "sip show channels" CLI command still shows the dialogs, even though they are inactive.

I believe that the proper way to fix this is to just have the __sip_autodestruct() function unlink the dialog from its containers even if there is still an owner channel present. This way, when the channel finally hangs up, the last reference to the dialog will be removed and the memory will be freed properly. I will test this hypothesis out and prepare a patch for review.

Edit: After taking a closer look at the code, my initial plan is likely not a good one. Instead, I will set a flag on the dialog in __sip_autodestruct() and allow the sip_hangup() to do the unlinking if the flag is set.
By: Mark Michelson (mmichelson) 2012-05-01 16:32:10.972-0500

I have uploaded ASTERISK-19425.patch.

This patch works by rescheduling destruction of a dialog if the dialog still has an owner channel. The result of this is that you will likely see more warning messages in the cases where your dialog would have become stuck before. However, the dialog should eventually clear out after a while (in your case, I expect it to occur once your RTP timeout occurs).

Please test the patch and let me know how it goes for you.
By: David Cunningham (dcunningham) 2012-05-01 17:33:39.420-0500

Thank you, I will ask the customer to try the patch.

By: Dan McKeehan (dmckeehan) 2012-05-03 11:51:31.243-0500

I can confirm that this fix does not work for me all I get is a whole lot of:

[May 3 09:50:45] WARNING[5255]: chan_sip.c:3899 __sip_autodestruct: Autodestruct on dialog '2d50aa087022486d42a47f561e41d75a@1.1.1.23:5060' with owner in place (Method: BYE). Rescheduling destruction for 10000 ms

By: Mark Michelson (mmichelson) 2012-05-04 12:38:32.135-0500

@Dan: Your problem may be a channel reference leak, or if you have an AGI running in the 'h' extension, it may just never be completing. If that's the case, then you'll just see lots of the "Autodestruct on dialog ... with owner in place... Rescheduling ... " since the channel is never going away. Note that if this patch is working, you *still* will see this message sometimes, but the dialog will eventually auto-destruct once its corresponding channel gets hung up.

I can see how this message could get annoying, so I may go down the route I initially described and set a flag in auto-destruct for the hangup to take care of cleanup when it is called. Note that if my current patch is not working for you, then my new one won't work either, but you'll see far fewer annoying warning messages.
By: David Cunningham (dcunningham) 2012-07-08 23:16:47.637-0500

Can you please advise what version of Asterisk this patch will be put into? Thanks.

By: Leif Madsen (lmadsen) 2012-07-09 09:48:43.179-0500

Answer is always, "next one after commit is in the branch"
By: Matt Jordan (mjordan) 2012-07-30 11:21:29.942-0500

In this particular case, 1.8.14.1.
By: Richard Hamnett (rhamnett) 2014-08-28 20:35:59.050-0500

Hi we have the same issue happening many times a day out of thousands of calls. I am running 1.8.30.0 which already contains the above patch. We too have an AGI script that executes in the 'h' context and takes a few hundred ms.

Please let me know what information I can give you to help resolve this issue. Thanks.

Happy to try patches.

Richard
By: Matt Jordan (mjordan) 2014-08-28 20:56:39.623-0500

This issue is closed, as the patch that was attached to this issue has been merged. If you are still seeing this problem, it is a different bug.

That being said, if the channels *do* eventually clean themselves up once your AGI has properly exited, then there isn't a bug here. Having your AGI hold onto the channel for an extended period of time in the {{h}} extension will cause {{chan_sip}} to eventually complain that things are sticking around longer than it wants.

If, on the other hand, you have channels that are never being destroyed in Asterisk (which can be verified by running "core show channels"), then please open a new issue. Attach:
# Your sip.conf
# A full DEBUG log showing with 'sip set debug on' enabled that shows a call flow that reproduces the problem
# Step by step instructions for reproducing the problem
# The relevant portion of the dialplan
By: Richard Hamnett (rhamnett) 2014-08-28 21:08:08.910-0500

Thanks, i'll gather what you asked for. FYI the channels do hang around in 'core show channels verbose' with nothing bridged. Also the AGI does definitely return (with return code 0). After some time, maybe 30 minutes the hung calls go away but the CDRs are not getting written on hangup which is a big problem for me.
By: Richard Hamnett (rhamnett) 2014-08-28 21:39:17.176-0500

@Matt Jordan - I've captured the debug do you want me to post everything here or new ticket?