[Home]

Summary:ASTERISK-25635: run_agi() while() loop loops indefinitely because of fgets() returns EAGAIN
Reporter:Oleksandr Natalenko (post-factum)Labels:
Date Opened:2015-12-17 03:05:14.000-0600Date Closed:
Priority:MajorRegression?No
Status:Open/NewComponents:Resources/res_agi
Versions:13.6.0 13.7.2 13.10.0 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-20061 PHP Agi Socket Server stopped working - socket_read spamming Resource temporarily unavailable
Environment:CentOS 7.1.1503, KVMAttachments:
Description:We've migrated our corporate PBX setup from Asterisk 1.8.7.1 (hardware server) to 13.6.0 (KVM VM), and faced the following issue.

We use 2 VM instances, one is for Asterisk itself, another is for PBX business logic. Asterisk uses FastAGI to call remote functions, PBX logic server is based on latest pystrix code with custom handlers.

The issue is that occasionally (with no visible reason several times per several days) one or two Asterisk threads starts to consume 100% of CPU time. We've got the stacktrace of such a thread via gdb (in all cases stacktrace was the same pointing to the same source code line):

{code}
0x00007f39ea02d9fd in read () from /lib64/libc.so.6
#0  0x00007f39ea02d9fd in read () from /lib64/libc.so.6
#1  0x00007f39e9fbe448 in __GI__IO_file_underflow () from /lib64/libc.so.6
#2  0x00007f39e9fbf3be in __GI__IO_default_uflow () from /lib64/libc.so.6
#3  0x00007f39e9fb3454 in __GI__IO_getline_info () from /lib64/libc.so.6
#4  0x00007f39e9fb237f in fgets () from /lib64/libc.so.6
#5  0x00007f3981cc53e9 in fgets (__stream=0x7f39a42d07c0, __n=2048, __s=0x7f392eba3da0 "") at /usr/include/bits/stdio2.h:263
#6  run_agi (argv=<optimized out>, argc=<optimized out>, dead=<optimized out>, status=0x7f392eba3520, pid=<optimized out>, agi=0x7f392eba3540, request=<optimized out>,
chan=0x7f39902e9f98) at res_agi.c:3830
#7  agi_exec_full (chan=<optimized out>, data=<optimized out>, enhanced=<optimized out>, dead=<optimized out>) at res_agi.c:4154
#8  0x0000000000565159 in pbx_exec (c=c@entry=0x7f39902e9f98, app=app@entry=0x1fcfbf0, data=data@entry=0x7f392eba4b60 "agi://x.y.z.c:4573/some_script, 130469, , 2, ru") at
pbx.c:1720
#9  0x000000000056fc29 in pbx_extension_helper (c=c@entry=0x7f39902e9f98, context=0x7f39902ea950 "preprocessing", exten=exten@entry=0x7f39902ea9a0 "38044xxxxxxx",
priority=priority@entry=24, label=label@entry=0x0, callerid=callerid@entry=0x7f39a426fd20 "38050xxxxxxx", action=action@entry=E_SPAWN, found=found@entry=0x7f392eba6be0,
combined_find_spawn=combined_find_spawn@entry=1, con=0x0) at pbx.c:4992
#10 0x0000000000575d8b in ast_spawn_extension (combined_find_spawn=1, found=0x7f392eba6be0, callerid=0x7f39a426fd20 "38050xxxxxxx", priority=24, exten=0x7f39902ea9a0 "38044xxxxxxx",
context=<optimized out>, c=0x7f39902e9f98) at pbx.c:6162
#11 __ast_pbx_run (c=c@entry=0x7f39902e9f98, args=args@entry=0x0) at pbx.c:6579
#12 0x000000000057727b in pbx_thread (data=data@entry=0x7f39902e9f98) at pbx.c:6899
#13 0x00000000005d598a in dummy_start (data=<optimized out>) at utils.c:1237
#14 0x00007f39ead5bdf3 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f39ea03c1ad in clone () from /lib64/libc.so.6
{code}

The reason the thread is consuming 100% of CPU time is that while() loop stuck looping because of fgets() (res/res_agi.c:3830) constantly returns EAGAIN (we've checked that with strace). We've enabled AGI debug and got the following:

{code}
[Dec 16 19:50:01] VERBOSE[6767][C-00002c40] res_agi.c: AGI Rx << temp buffer  - errno Resource temporarily unavailable
No \n received, checking again.
{code}

There is also another debug log snippet showing warnings regarding the call in question:

{code}
[Dec 16 19:01:44] VERBOSE[6767][C-00002c40] pbx.c: Executing [s@recall_www:28] AGI("IAX2/mtc-16527", "agi://x.y.z.c:4573/some_script, 050xxxxxxx") in new stack
[Dec 16 19:01:46] WARNING[29968] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
[Dec 16 19:01:47] WARNING[29964] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
[Dec 16 19:01:49] WARNING[29970] channel.c: Exceptionally long voice queue length queuing to IAX2/mtc-16527  
[Dec 16 19:02:25] WARNING[29965] chan_iax2.c: Max retries exceeded to host 194.50.85.126 on IAX2/mtc-16527 (type = 6, subclass = 11, ts=10009, seqno=2)
{code}

EAGAIN looping could be stopped if we restart PBX AGI server. After AGI server restart Asterisk log shows us that the call hangs up normally:

{code}
[Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: AGI Rx << temp buffer  - errno Resource temporarily unavailable

...here we restart AGI server

[Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: <IAX2/mtc-16527>AGI Script agi://x.y.z.c:4573/some_script completed, returning 0
[Dec 16 19:50:47] VERBOSE[6767][C-00002c40] res_agi.c: <IAX2/mtc-16527>AGI Tx >> HANGUP
[Dec 16 19:50:47] VERBOSE[6767][C-00002c40] chan_iax2.c: Hungup 'IAX2/mtc-16527'
{code}

We tried to find out whether there is similar bug report, and have found this one:

ASTERISK-20061

But the difference is that this bug report tells about AGI not working at all due to message format inconsistency (no new line). In our case everything works OK, but occasionally fails. Another difference is that each time looping occurs on different AGI script.

Also, after looping indefinitely for some time (an hour or so) Asterisk stops processing calls at all.

So, my questions are:

* how could we debug this issue further?
* is there any protection against misbehaving AGI scripts for Asterisk not to stuck in infinite loop because of malformed AGI server message (if that is the reason)?
Comments:By: Asterisk Team (asteriskteam) 2015-12-17 03:05:17.170-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Matt Jordan (mjordan) 2015-12-28 10:08:02.356-0600

The AGI code that runs the loop hasn't changed between 1.8 and 13, particularly with respect to this handling:

{code}
while (len > 1) {
res = fgets(buf + buflen, len, readf);
if (feof(readf))
break;
if (ferror(readf) && ((errno != EINTR) && (errno != EAGAIN)))
break;
if (res != NULL && !agi->fast)
break;
buflen = strlen(buf);
if (buflen && buf[buflen - 1] == '\n')
break;
len = sizeof(buf) - buflen;
if (agidebug)
ast_verbose("AGI Rx << temp buffer %s - errno %s\nNo \\n received, checking again.\n", buf, strerror(errno));
}
{code}

So your running into issues with this is not a result of upgrading to 13.

That aside, I'm not sure how the socket is getting into a state where it constantly reports {{EAGAIN}}. I would suspect that something in the client library has failed to close the socket properly, or else has done something improper.

Regardless, I'm a little suspicious of our handling of {{EAGAIN}} here. I would expect that we would want to clear the error on the stream once we've read it, otherwise it will always be set:

{quote}
The function ferror() tests the error indicator for the stream pointed to by stream, returning nonzero if it is set. The error indicator can only be reset by the clearerr() function.
{quote}

Which may explain the constant looping, once it has gotten into this state.



By: Rusty Newton (rnewton) 2015-12-28 10:29:24.858-0600

This doesn't appear to be a regression based on Matt's analysis so I'm changing that field to No.

By: Oleksandr Natalenko (post-factum) 2016-01-11 02:54:13.200-0600

So, should we add

{code}
clearerr(readf);
{code}

at the end of while() block?

By: Oleksandr Natalenko (post-factum) 2016-04-05 16:45:25.038-0500

As rmudgett suggested on IRC, clearerr() does not clear errno, so probably we need to do this:

{code}
diff --git a/res/res_agi.c b/res/res_agi.c
index ff33580..94f791e 100644
--- a/res/res_agi.c
+++ b/res/res_agi.c
@@ -3840,6 +3840,8 @@ static enum agi_result run_agi(struct ast_channel *chan, char *request, AGI *agi
len = sizeof(buf) - buflen;
if (agidebug)
ast_verbose("AGI Rx << temp buffer %s - errno %s\nNo \\n received, checking again.\n", buf, strerror(errno));
+ clearerr(readf);
+ errno = 0;
}

if (!buf[0]) {
{code}

Will test that.

By: Oleksandr Natalenko (post-factum) 2016-04-26 04:49:04.921-0500

Unfortunately, the patch presented above does not help.

I believe, the error in fact gets cleared, but fgets() on line 3830 still returns EAGAIN, so break does not occur on line 3834.

What else should we try?

By: Oleksandr Natalenko (post-factum) 2016-04-26 05:34:45.010-0500

Next attempt — we will try to limit amount of retries.

{code}
diff --git a/res/res_agi.c b/res/res_agi.c
index ff33580..8fce2a3 100644
--- a/res/res_agi.c
+++ b/res/res_agi.c
@@ -1006,6 +1006,7 @@ ASTERISK_FILE_VERSION(__FILE__, "$Revision$")
#define AGI_NANDFS_RETRY 3
#define AGI_BUF_LEN 2048
#define SRV_PREFIX "_agi._tcp."
+#define MAX_SOCKET_RETRIES 32

static char *app = "AGI";

@@ -3821,6 +3822,7 @@ static enum agi_result run_agi(struct ast_channel *chan, char *request, AGI *agi
} else if (outfd > -1) {
size_t len = sizeof(buf);
size_t buflen = 0;
+ size_t retries = 0;
enum agi_result cmd_status;

retry = AGI_NANDFS_RETRY;
@@ -3832,6 +3834,8 @@ static enum agi_result run_agi(struct ast_channel *chan, char *request, AGI *agi
break;
if (ferror(readf) && ((errno != EINTR) && (errno != EAGAIN)))
break;
+ if (ferror(readf) && ((errno == EINTR) || (errno == EAGAIN)) && (++retries > MAX_SOCKET_RETRIES))
+ break;
if (res != NULL && !agi->fast)
break;
buflen = strlen(buf);
@@ -3840,6 +3844,8 @@ static enum agi_result run_agi(struct ast_channel *chan, char *request, AGI *agi
len = sizeof(buf) - buflen;
if (agidebug)
ast_verbose("AGI Rx << temp buffer %s - errno %s\nNo \\n received, checking again.\n", buf, strerror(errno));
+ clearerr(readf);
+ errno = 0;
}

if (!buf[0]) {
{code}

By: David Cunningham (dcunningham) 2016-09-13 00:04:07.313-0500

We use AGI extensively and have a customer reporting high CPU usage which corresponds with a huge number of the following errors:

[Sep 12 19:32:20] VERBOSE[17279][C-0002f2df] res_agi.c: AGI Rx << temp buffer VERBOSE "Already collected digits * - errno Resource temporarily unavailable

An Asterisk full log with AGI, verbose and debug logging doesn't seem to provide a reason. They are using Asterisk 11.3.0 on CentOS 6.5 64 bit . Can you please advise how we can debug the issue? Thank you.


By: Oleksandr Natalenko (post-factum) 2017-05-22 06:05:06.932-0500

Updated meta to indicate that this bug is still present on 13.10.0.