[Home]

Summary:ASTERISK-20902: Asterisk waits for AGI to complete, even though we've daemonized it
Reporter:David Cunningham (dcunningham)Labels:
Date Opened:2013-01-07 16:28:08.000-0600Date Closed:2013-02-06 03:55:03.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/PBX PBX/pbx_realtime
Versions:1.8.11.1 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Ubuntu 10.10Attachments:( 0) 0000003-notworking.txt
( 1) 0000003-working.txt
( 2) call_phone.agi
( 3) extensions.conf
( 4) notworking_2013-01-14.txt
( 5) sip.conf
( 6) test_fork.agi
( 7) test_fork.agi
( 8) trace-20902-1.txt
( 9) working_2013-01-14.txt
Description:Three phones on one network: 0000001, 0000003, and 0000004.
When phone 0000001 is NOT plugged in, and 0000004 calls 0000003, when 0000003 answers Asterisk does NOT send a 200 OK to 0000004.
When phone 0000001 IS plugged in, under the same call scenario and with no other changes, when 0000003 answers Asterisk DOES send a 200 OK to 0000004.
Bizarre I know. Traces of both working and not working scenarios attached. Happens on both 1.8.11.1 and 1.8.7.1. Thank you.
Comments:By: David Cunningham (dcunningham) 2013-01-07 16:28:31.348-0600

Not working trace

By: David Cunningham (dcunningham) 2013-01-07 16:28:43.206-0600

Working trace

By: Michael L. Young (elguero) 2013-01-07 21:43:06.151-0600

David, can you help point out where Asterisk does not send a 200?  I am having a hard time finding it... I see what looks like a 200 being sent to Asterisk and I see Asterisk (at least in the debug it says) send an ACK to the 200 received.  Then it runs the AGI script... in the one that is working it looks like the AGI finishes while in the other, there appears that the AGI doesnt' finish?  I then see a BYE is received by Asterisk.

From: "A C" <sip:0000004@xx.xx.xx.25:5070>;tag=as608d9223
To: <sip:0000003@xx.xx.xx.25:5060>;tag=99dde23b4ffe7c8ei0

Same IP address?

Thanks for pointing out what I am missing.

Could we get a full debug log?  Maybe there is something missing that will help us see what is happening.  https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: David Cunningham (dcunningham) 2013-01-07 21:58:34.834-0600

Hi Michael,

The first leg is from 0000004 to Asterisk with the following INVITE:

INVITE sip:0000003@xx.xx.xx.25:5070;transport=udp SIP/2.0
Record-Route: <sip:xx.xx.xx.25;lr=on;ftag=2ff16670b5e728e8o0>
Via: SIP/2.0/UDP xx.xx.xx.25;branch=z9hG4bK93d7.d45c9a11.0
Via: SIP/2.0/UDP 10.0.0.214:5060;rport=19050;received=yy.yy.yy.241;branch=z9hG4bK-14b230c
From: "A C" <sip:0000004@hostname.example.com>;tag=2ff16670b5e728e8o0
To: "P M" <sip:0000003@hostname.example.com>
Call-ID: 4ca5bb10-b9dddcc8@10.0.0.214

Then Asterisk makes a call to 0000003 with this INVITE:

INVITE sip:0000003@xx.xx.xx.25:5060 SIP/2.0
Via: SIP/2.0/UDP xx.xx.xx.25:5070;branch=z9hG4bK6e086edc;rport
Max-Forwards: 70
From: "A C" <sip:0000004@xx.xx.xx.25:5070>;tag=as608d9223
To: <sip:0000003@xx.xx.xx.25:5060>
Contact: <sip:0000004@xx.xx.xx.25:5070>
Call-ID: 449b7c72780f528852148631747d1d79@xx.xx.xx.25:5070

Then phone 0000003 responds with a 200 OK:

SIP/2.0 200 OK
To: <sip:0000003@xx.xx.xx.25:5060>;tag=99dde23b4ffe7c8ei0
From: "A C" <sip:0000004@xx.xx.xx.25:5070>;tag=as608d9223
Call-ID: 449b7c72780f528852148631747d1d79@xx.xx.xx.25:5070

However Asterisk then never sends the 200 OK to 0000004 on SIP session with call ID 4ca5bb10-b9dddcc8@10.0.0.214.
Thanks.


By: Michael L. Young (elguero) 2013-01-07 22:13:46.930-0600

Okay... now I see what you are looking at... thanks.  For some reason this sounds familiar to me... I will have to take a look later to see if I can find something similar or not.

By: Rusty Newton (rnewton) 2013-01-11 13:38:47.869-0600

David, that looks weird.  We would like to see VERBOSE and DEBUG messages between the dialog.

Can you get the same working and not working debugs but with VERBOSE and DEBUG messages as well?

Can you provide a sanitized sip.conf as well?

By: David Cunningham (dcunningham) 2013-01-11 14:04:02.077-0600

Rusty, I will ask for the new traces.

We use Asterisk realtime, so the phones are not configured there. Will the sip.conf still be any use to you?


By: David Cunningham (dcunningham) 2013-01-14 12:21:28.377-0600

Not working with debug.

By: David Cunningham (dcunningham) 2013-01-14 12:21:54.027-0600

Working with debug.

By: David Cunningham (dcunningham) 2013-01-14 12:27:37.392-0600

sip.conf attached.

By: David Cunningham (dcunningham) 2013-01-14 12:28:09.748-0600

Traces attached. As mentioned before, realtime is used so the sip.conf does not contain individual entries. Thanks.

By: Walter Doekes (wdoekes) 2013-01-20 04:33:07.475-0600

Does anyone else notice that the AGI "end" has no timely equivalent:

{noformat}
[Jan 14 08:59:45] DEBUG[5183] chan_sip.c:  Header  0 [ 58]: INVITE sip:0000003@xx.xx.46.25:5070;transport=udp SIP/2.0
[Jan 14 08:59:45] DEBUG[5183] chan_sip.c:  Header  6 [ 36]: Call-ID: 69ee90ac-1b46636@10.0.0.214
...
[Jan 14 08:59:46] DEBUG[7537] chan_sip.c:  Header  0 [ 44]: INVITE sip:0000003@xx.xx.46.25:5060 SIP/2.0
[Jan 14 08:59:46] DEBUG[7537] chan_sip.c:  Header  6 [ 59]: Call-ID: 768af463211d72c301aa3c0270a1bd70@xx.xx.46.25:5070
...
[Jan 14 08:59:49] DEBUG[5183] chan_sip.c:  Header  0 [ 14]: SIP/2.0 200 OK
...
[Jan 14 08:59:49] VERBOSE[7537] res_agi.c:        > agi://127.0.0.1/answered?screen=0&puniqueid=1358153985.13474&pcallid=&snumber=: AGI starting at Mon Jan 14 08:59:49 2013 (1358153989)
...
[Jan 14 08:59:49] VERBOSE[7537] res_agi.c:     -- <SIP/xx.xx.46.25:5060-0000170d>AGI Script agi://127.0.0.1/answered?screen=0&puniqueid=1358153985.13474&pcallid=&snumber= completed, returning 0
[Jan 14 08:59:49] VERBOSE[7537] res_agi.c: <SIP/xx.xx.46.25:5060-0000170d>AGI Tx >> HANGUP
...
[Jan 14 08:59:49] DEBUG[7514] chan_sip.c: Trying to put 'SIP/2.0 200' onto UDP socket destined for xx.xx.46.25:5060
{noformat}

In non-working, it's at:

{noformat}
[Jan 14 09:03:47] DEBUG[5183] chan_sip.c:  Header  0 [ 58]: INVITE sip:0000003@xx.xx.46.25:5070;transport=udp SIP/2.0
[Jan 14 09:03:47] DEBUG[5183] chan_sip.c:  Header  6 [ 37]: Call-ID: 472ee090-fe2c15f1@10.0.0.214
...
[Jan 14 09:03:47] DEBUG[9558] chan_sip.c:  Header  0 [ 44]: INVITE sip:0000003@xx.xx.46.25:5060 SIP/2.0
[Jan 14 09:03:47] DEBUG[9558] chan_sip.c:  Header  6 [ 59]: Call-ID: 73d4569b2b7b918136a3bb4e55316db9@xx.xx.46.25:5070
...
[Jan 14 09:03:49] DEBUG[5183] chan_sip.c:  Header  0 [ 14]: SIP/2.0 200 OK
...
[Jan 14 09:03:49] VERBOSE[9558] res_agi.c:        > agi://127.0.0.1/answered?screen=0&puniqueid=1358154227.13485&pcallid=&snumber=: AGI starting at Mon Jan 14 09:03:49 2013 (1358154229)
...
[Jan 14 09:03:54] DEBUG[9558] res_agi.c: SIP/xx.xx.46.25:5060-00001714 hungup
[Jan 14 09:03:54] VERBOSE[9558] res_agi.c: <SIP/xx.xx.46.25:5060-00001714>AGI Tx >> HANGUP
...
[Jan 14 09:03:56] DEBUG[5183] chan_sip.c:  Header  0 [ 58]: CANCEL sip:0000003@xx.xx.46.25:5070;transport=udp SIP/2.0
[Jan 14 09:03:56] DEBUG[5183] chan_sip.c:  Header  3 [ 37]: Call-ID: 472ee090-fe2c15f1@10.0.0.214
{noformat}

By: David Cunningham (dcunningham) 2013-01-22 16:05:30.806-0600

If it helps, we left a phone unplugged and calls started working between 10 and 15 minutes after a phone was unplugged.

By: Rusty Newton (rnewton) 2013-01-23 16:01:51.917-0600

Walter: I noticed that, but not sure what to make of it.

David, we are at a bit of a loss for ideas on whats going on.  Is there any way you can test with 1.8.20.1 ? There has been numerous fixes since the version you are using, not to mention some serious security vulnerabilities fixed.

If someone gets a chance to look into this then they'll have to reproduce it in order to debug it, and that means they'll likely need most of your config files. That means that anything you can do to reproduce the situation in a simpler environment will help. I.E. reproduce without AGI or Realtime involved if possible. Perhaps in attempting to eliminate one of those components, you may find where the problem exists.



By: David Cunningham (dcunningham) 2013-01-23 16:39:36.123-0600

Rusty, thanks. We will try 1.8.20.1.

By: David Cunningham (dcunningham) 2013-01-24 15:10:47.773-0600

We have tried 1.8.20.1, but unfortunately this did not solve the problem. How should we proceed from here?


By: Michael L. Young (elguero) 2013-01-24 15:26:42.025-0600

David,

{quote}
If someone gets a chance to look into this then they'll have to reproduce it in order to debug it, and that means they'll likely need most of your config files. That means that anything you can do to reproduce the situation in a simpler environment will help. I.E. reproduce without AGI or Realtime involved if possible. Perhaps in attempting to eliminate one of those components, you may find where the problem exists.
{quote}

If you could provide what Rusty mentioned above, that would help.  None of us have your AGI scripts or exact environment to reproduce this and debug it.  So, if you can make things as simple as possible (no agi or realtime, if possible), provide the necessary config files and reproduce the problem, it will help us to be able to attempt to reproduce the problem on our side and then come up with a fix.

Thanks

By: David Cunningham (dcunningham) 2013-01-25 04:17:13.795-0600

The scenario in the trace is the simplest we can provide for now. If we give one of your engineers access to the production server while the scenario is being repeated would that be helpful?


By: Walter Doekes (wdoekes) 2013-01-25 04:39:40.929-0600

For starters I'd like to see it not-work if you disable the AGI bit that's taking so long.

(1) Does not-working and AGI-slowness always coincide?
(2) Can you disable the AGI call?
(3) Can you replace the AGI call with a different script that always completes quickly?

By: Rusty Newton (rnewton) 2013-01-27 18:59:26.057-0600

{quote}
If we give one of your engineers access to the production server while the scenario is being repeated would that be helpful?
{quote}

Not really, that would likely consume significant time for an Engineer who already has more defined core supported issues allocated to them. We have a multitude of issues already in progress. Any community developer is of course free to help you out.

I'd follow Walter and Michael's suggestions for now.

By: David Cunningham (dcunningham) 2013-02-04 19:37:01.577-0600

Trace using simple AGI.

By: David Cunningham (dcunningham) 2013-02-04 19:37:47.727-0600

extensions.conf for simple AGI

By: David Cunningham (dcunningham) 2013-02-04 19:38:23.459-0600

AGI to call destination

By: David Cunningham (dcunningham) 2013-02-04 19:39:08.876-0600

AGI called from test-answered macro

By: David Cunningham (dcunningham) 2013-02-04 19:45:40.785-0600

I've just attached a simple AGI scenario to recreate the problem.

When 1000101 is called the call goes to AGI call_phone.agi.
The AGI dials 1000101, and instructs that macro test-answered be run when it answers.
When phone 1000101 answers, the macro is run and calls AGI test_fork.agi.
The AGI test_fork.agi does the important bit in re-creating the problem, which is to fork a child process. The parent continues after the fork and exits. We do everything we can around the fork to ensure the child process becomes independent and that the parent definitely exits immediately.
However, the 200 OK from 1000101 answering is delayed by Asterisk until after the child process exists.
We can't work out how Asterisk even knows that the child process exists.
A trace is attached, as well as the AGI scripts, and the extensions.conf used.
Please let us know if you need anything more. Thanks.

You'll need to ensure perl module Asterisk::AGI is installed with command:
cpan Asterisk::AGI


By: Walter Doekes (wdoekes) 2013-02-05 01:59:55.469-0600

So, without diving into your uploaded files, can I assume the following?

(1) The problem is with AGI calls. When there is no AGI, there is no problem?

(2) The problem has nothing to do with the call_phone.agi, and everything with test_fork.agi. Did you try putting Dial(SIP/1000101@x.x.x.45/n,20,ortM(test-answered)) in extensions.conf and skipping the the call_phone.agi?

(3) The problem really lies with M(test-answered), because the Macro calls a script which does not end in time? I.e. does the same happen if you System(sleep 10) instead of AGI(test_fork.agi)?

(4) If all of the above is true, then you only need to figure out why your 'disown' doesn't work. (For starters, the waitpid() should go. The point was that you didn't want to wait, right? You may need to double-fork. Look at perl daemonize documentation.)

By: David Cunningham (dcunningham) 2013-02-05 02:10:30.431-0600

test_fork.agi which does problem all in one

By: David Cunningham (dcunningham) 2013-02-05 02:15:30.996-0600

Hi Walter,

1. Correct.

2. Correct. I've attached a new test_fork.agi which can be called directly from extensions.conf with:
exten => 1000101, 1, AGI(test_fork.agi)

3. The problem is that the macro calls an AGI, and somehow manages to know that the AGI has a child process which hasn't finished even thought the parent AGI has.

4. We have tested with a double fork, and without the waitpid(). Neither of these helps. The waidpid() has options so that it doesn't block. We are certain that the parent has gone past that and exited, but Asterisk does not forward on the 200 OK until the child process exits.


By: Walter Doekes (wdoekes) 2013-02-05 02:52:41.156-0600

3. But the problem is exactly the same as if you did a System(sleep 10), right? So the topic at hand should change into: """Asterisk waits for AGI to complete, even though we've daemonized it."""

4. Well.. you should not only double fork but really use a pre-made daemonize function and not hand-hack your own. If I'm not mistaken, asterisk may very well be waiting for the filedescriptors which you didn't close.

By: David Cunningham (dcunningham) 2013-02-06 03:05:32.441-0600

Hi Walter,
Your words "asterisk may very well be waiting for the filedescriptors which you didn't close" proved very helpful. By closing everything we solved the problem. Thank you!


By: Walter Doekes (wdoekes) 2013-02-06 03:54:34.397-0600

No thanks, just send a book :P

Closing this one as wontfix (or something).