[Home]

Summary:ASTERISK-21065: Asterisk 11 IPv6 - FastAGI fail
Reporter:Jeremy Kister (jkister)Labels:
Date Opened:2013-02-12 00:36:00.000-0600Date Closed:2013-02-22 13:44:18.000-0600
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Resources/res_agi
Versions:11.2.1 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Attachments:( 0) 21065-1.patch
( 1) asterisk-21065_poll_correctly_v4.diff
( 2) console_debug.txt
( 3) console.txt
( 4) extensions.conf
Description:exten => 1,1,AGI(agi://localhost/foo) always fails with "Connect failed"
Comments:By: Jeremy Kister (jkister) 2013-02-12 00:47:10.794-0600

21065-1.patch removes the new asterisk 11 connection pieces and reverts to asterisk 1.8 connect.  this is in no way intended to be the real fix, just showing the problem.


By: Matt Jordan (mjordan) 2013-02-12 07:07:14.696-0600

So, your patch reverts r357610, which added IPv6 support to FastAGI. So that's probably not good :-)

I think we need to understand why your connection is failing, as we run FastAGI on IPv4 connections all the time. Your log file shows:

{noformat}
   -- Executing [s@VXML:3] AGI("SIP/143-00000000", "agi://localhost/url=http%3A%2F%2Fexample.com%2Fcgi-bin%2Favxml") in new stack
[Feb 12 01:50:04] DEBUG[3634][C-00000000]: netsock2.c:138 ast_sockaddr_split_hostport: Splitting 'localhost' into...
[Feb 12 01:50:04] DEBUG[3634][C-00000000]: netsock2.c:192 ast_sockaddr_split_hostport: ...host 'localhost' and port ''.
{noformat}

What happens if:
# You put in the explicit IPv4 address/port that you are listening on?
# You put in localhost:port/url?


By: Jeremy Kister (jkister) 2013-02-12 12:06:47.511-0600


{noformat}
2. You put in localhost:port/url
{noformat}

that fails [mostly] the same way:
{code}
[Feb 12 12:55:46] DEBUG[14462][C-00000002]: netsock2.c:138 ast_sockaddr_split_hostport: Splitting 'localhost:4573' into...
[Feb 12 12:55:46] DEBUG[14462][C-00000002]: netsock2.c:192 ast_sockaddr_split_hostport: ...host 'localhost' and port '4573'.
[Feb 12 12:55:46] ERROR[14462][C-00000002]: utils.c:1187 ast_carefulwrite: write() returned error: Connection refused
[Feb 12 12:55:46] WARNING[14462][C-00000002]: res_agi.c:1528 launch_netscript: Connect to 'agi://localhost:4573/url=http%3A%2F%2Fexample.net%3A8081%2Fcgi-bin%2Favxml' failed: Connection refused
{code}

{noformat}
1. You put in the explicit IPv4 address/port that you are listening on?
{noformat}

that works..  okay, so it's some sort of ipv6 resolver issue?

{code}
pbx1> dig a localhost
[...]
;; ANSWER SECTION:
localhost.              0       IN      A       127.0.0.1
{code}

hmm.

{code}
pbx1> grep localhost /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
{code}

okay, interesting.  my host has some ipv6 on the interfaces-- but it's all default:
{code}
pbx1> ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:00:29:07:00:8e  
         inet addr:10.9.1.3  Bcast:10.9.1.255  Mask:255.255.255.0
         inet6 addr: fe80::20c:29ff:fe07:398e/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:25200579 errors:5 dropped:2 overruns:0 frame:0
         TX packets:21607036 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:4288193164 (3.9 GiB)  TX bytes:3481644916 (3.2 GiB)
         Interrupt:18 Base address:0x2000

lo        Link encap:Local Loopback  
         inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:16436  Metric:1
         RX packets:501151 errors:0 dropped:0 overruns:0 frame:0
         TX packets:501151 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:60945765 (58.1 MiB)  TX bytes:60945765 (58.1 MiB)
{code}

commenting out the ipv6 localhost in /etc/hosts hides the problem.


By: Jeremy Kister (jkister) 2013-02-12 12:10:56.786-0600

oh, i see.  my daemon is not listening on ipv6, and asterisk 11 is trying ipv6 only.

i wonder what the behavior is for web browsers that run into the same pattern.  firefox 18.0.2 sees www.google.com AAAA 2607:f8b0:400c:c01::93.  I have no ipv6 transport on my network, and firefox backs down to ipv4:80.

perhaps asterisk should do the same ?

By: cloos (cloos) 2013-02-12 13:05:35.927-0600

Falling back to v4 when v6 cannot route will not help this case.

He has a route to ::1, it is just that nothing is listen(2)ing there.

Most dists these days have hosts(5) configures such that localhost only maps to 127.0.0.1 and ::1 maps to ipv6-localhost, for exactly this reason.  Too many complaints from their users that things started breaking as soon as they enabled ipv6 in their kernels.

That said, asterisk should get a list of addresses back from the getaddrinfo(3) call.  It would not be a bad idea to try each in turn until connect(2) succeeds or it runs out of addrinfo structs.

The reporter’s other solution would be to listen(2) to ::1 either in addition to 127.0.0.1 if his net.ipv6.bindv6only==1 or instead of 127.0.0.1 if net.ipv6.bindv6only==0.


By: cloos (cloos) 2013-02-12 13:37:43.642-0600

Falling back to v4 when v6 cannot route will not help this case.

He has a route to ::1, it is just that nothing is listen(2)ing there.

Most dists these days have hosts(5) configures such that localhost only maps to 127.0.0.1 and ::1 maps to ipv6-localhost, for exactly this reason.  Too many complaints from their users that things started breaking as soon as they enabled ipv6 in their kernels.

That said, asterisk should get a list of addresses back from the getaddrinfo(3) call.  It would not be a bad idea to try each in turn until connect(2) succeeds or it runs out of addrinfo structs.

The reporter’s other solution would be to listen(2) to ::1 either in addition to 127.0.0.1 if his net.ipv6.bindv6only==1 or instead of 127.0.0.1 if net.ipv6.bindv6only==0.


By: Michael L. Young (elguero) 2013-02-12 15:51:08.821-0600

Just like cloos states, I find it interesting that your hosts file lists localhost for both the ipv4 and ipv6 interfaces.  Most dists that I have seen label them differently.  In RHEL/Fedora it is normally localhost for ipv4 and localhost6 for ipv6.

I took a look at the res_agi.c code and it does look like it already does get a list of addresses and goes through them until connect succeeds.  In this case, it would appear that Asterisk does its job in checking if it can create a socket and connect to, in this case, localhost using the ipv6 address that was returned.  This is successful since the loopback adapter is setup to handle ipv6.  What is failing is when Asterisk tries to send a message over the socket that has been setup that Asterisk gets the message, "Connection refused", since the daemon is not listening on ipv6.

I think the best way to fix this is to change the hosts file so that localhost does not resolve to the ipv6 address.

By: Jeremy Kister (jkister) 2013-02-12 17:00:24.907-0600

rusty, "Waiting for Feedback" for what?  i've identified the problem and we know that removing localhost from ::1 hides the issue.

it's arguable how to fix the issue, and depending on how it's fixed a blurb in UPGRADE.txt may be needed.

I'm not versed on IPv4 -> IPv6 transition semantics, so I don't know if the "proper" or "well accepted" way of fixing this is to loop through  getaddrinfo until connect succeeds or it runs out of hosts as cloos mentioned.  Reading CHANGES for r357610, it does appear an attempt was made for asterisk to have this behavior (but it's not working in this case).

If we do not make asterisk connect to each host for a given RR, I would suggest adding a comment in UPGRADE.TXT.  It should mention ipv6 taking precedence over ipv4 and "localhost" is the most common thing that may need tweaking if the daemon is not ipv6 enabled.  Then, perhaps something in configure could also check for localhost in /etc/hosts to see if it is multihomed (send a well visible warning, but not an error)


By: Michael L. Young (elguero) 2013-02-12 17:57:23.234-0600

Jeremy, the problem here, I believe, is that you are using the hostname "localhost" and "localhost" is resolving to the ipv6 address on your loopback adapter.  Since the IPv6 address is being returned first by the kernel and then a call to create a socket() and then connect() are successful on this valid loopback adapter using ipv6, everything is working properly based on this configuration.  So, removing "localhost" from ::1 is not "hiding" anything.  It is actually correcting the configuration problem.

As far as I am concerned, there is no bug in Asterisk.  The current code already handles things just as cloos outlined for determining to use IPv4 or IPv6.  This issue is not a result of adding IPv6 capabilities to AGI.  This is a configuration issue, which presented itself based on the fact that we now are able to use IPv6.

I believe that Rusty just wanted you to confirm that our conclusions were correct about the problem lying in the hosts file by having "localhost" resolving to both ipv4 and ipv6 loopback addresses, something that is not commonly found in the default hosts files on current distros as far as I am aware.

Thanks for confirming this for us.

By: Jeremy Kister (jkister) 2013-02-12 20:47:49.573-0600

Debian 5 and 6 both use localhost for ipv4 and ipv6.  see /var/lib/dpkg/info/netbase.postinst for reference.

More than this being a local configuration issue, I think this is a documentation or behavior problem.

In Asterisk 1.8, my config was working fine.  In Asterisk 11, my config broke.  All I'm saying is something should be changed in code or a N.B. for the poor asterisk administrator to read should be made.

By: Matt Jordan (mjordan) 2013-02-12 21:42:20.042-0600

I'm fine with documentation being updated, but it feels like the appropriate place for that is on the Asterisk wiki. Maybe on [this page|https://wiki.asterisk.org/wiki/display/AST/Upgrading+to+Asterisk+11]?

By: Michael L. Young (elguero) 2013-02-13 10:25:38.995-0600

Matt, I agree that the wiki page you refer to would be the appropriate place for that info.  Shall I go ahead and update it or should I post something here first to get approval?

By: Jeremy Kister (jkister) 2013-02-13 10:33:07.747-0600

i like the idea of the wiki being noted.  but honestly, i've never read the wiki during upgrade; i only read the UPGRADE*.txt files in the tarball.  is there someplace else that tells the user to read the upgrade wiki?

Also, It'd be great if someone could review the code that loops through the addr structs.  If the intent was to try all possibilities until one works, then this loop is flawed as it does not account for ast_agi_send error.

By: Jeremy Kister (jkister) 2013-02-13 10:59:44.335-0600

[RFC6555|https://tools.ietf.org/html/rfc6555] actually suggests running the IPv6 and IPv4 requests in parallel.  not sure if that's interesting for Asterisk.


By: Michael L. Young (elguero) 2013-02-13 11:17:33.041-0600

Jeremy,

What you need to understand is there is the OS involved here and then there is the application.  Asterisk uses the standard kernel libraries for setting up the socket for FastAGI.  The kernel is reporting to Asterisk that the host can be found at ::1 and 127.0.0.1.  Asterisk then asks the kernel to create a socket based on the returned IP address.  The kernel responds back that it was successful.  Then Asterisk proceeds to ask the kernel to connect to ::1.  The kernel responds back that it was successful.  Asterisk then attempts to send data over the established connection.  It receives an error, not due to Asterisk's fault or the kernel's fault, but due to the fact that there is nothing on the host to handle the data being sent to the host.

The RFC you refer to, based on a quick look through it, seems to be for handling when IPv6 fails which is not what is happening here.  IPv6 to the host is successful but there is no daemon listening or responding.

I am not sure how else we can correct what is plainly a configuration problem, which is not under Asterisk's control.  Everything is working properly.

I did review the code and everything looks good from my perspective.  I don't think it would be very beneficial to create a socket, connect and then try to send data over the socket to then have to tear it all down and do it all over again for every address resolved for that host due to a configuration problem on an OS level.  The configuration problem should be corrected (whether by changing the hostname, removing the hostname in hosts or by setting up the daemon to listen on both IPv4 and IPv6), not masked.

The wiki page is probably more for issues that people have come across when upgrading Asterisk in general.  For this situation, that would be the best place to put this information.  UPGRADE*.txt is more for when there are Asterisk related code changes that directly affect Asterisk configurations, not necessarily, changes that are related to the configuration in the distro of choice.  That is just my opinion... Matt will correct me if I am wrong.

By: Jeremy Kister (jkister) 2013-02-13 11:24:01.803-0600

please don't take my comments the wrong way;
i completely agree that the correct solution is to either
# separate ipv4 and ipv6 localhost
# make the daemon listen on all addresses necessary

all i'm doing (like with the rfc blurb) is adding more information to the topic; i'm not suggesting any course of action.

By: Michael L. Young (elguero) 2013-02-13 11:27:33.834-0600

Jeremy,

Okay.  I apologize if I am harping on this too much.  I just wanted to make sure you understood that Asterisk is not really the issue here.

Thanks for your help and the information.

By: cloos (cloos) 2013-02-13 12:57:38.389-0600

Does the connect(2) really succeed.

I should write a test to confirm one way or the other, but based on the man page the connect(2) should fail with errno set to ECONNREFUSED.

By: Michael L. Young (elguero) 2013-02-13 13:43:53.959-0600

cloos,

I see what you are saying.

Feel free to look at the code: http://svnview.digium.com/svn/asterisk/branches/11/res/res_agi.c?revision=374428&view=markup - goto line 1449.  

Based on the attached debug log, it looks to me like it does.  It doesn't display an error until line 1528.  Am I missing something?

By: cloos (cloos) 2013-02-13 14:53:43.206-0600

I see why conenct(2) doesn’t fail right away; ast sets O_NONBLOCK.

When I tried a quick test with netcat(1) listen(2)ing on 127.0.0.1 and nc6(1) (which also uses O_NONBLOCK) trying to connect(2) to ::1, I got EINPROGRESS.

The fact that res_agi.c ignores EINPROGRESS leads to the connection miss showing up later than it otherwise would.

Giving up O_NONBLOCK is too much of a loss, so that clearly is not an option.

connect(2) suggests poll(2)ing for writeability on the socket(7) whenever EINPROGRESS is set; I do not know whether doing so will help this case.  That would need a test.

This is an interesting issue.

(And I see that deb still defaults localhost to both 127.0.0.1 and ::1 in wheezy and sid, so this will be an issue long term.  Or at least until everyone uses v6 by default. ☺)


By: Michael L. Young (elguero) 2013-02-14 15:45:50.550-0600

Okay, Jeremy and cloos, thanks for sticking with this.  This has been an interesting issue, as cloos said.

Based on cloos' findings, I started messing around with the code.  It looks like we were attempting to poll but never checking if we were actually connected.  The polling was also happening later on then bailing out so the other addresses available for the hostname were never checked.

Please try out the following patch, [^asterisk-21065_poll_correctly_v2.diff], and confirm if that takes care of it.

I did some local testing and it appears to be working well.

I took a look at a fresh install of CentOS 6 and see that they are also using localhost for both 127.0.0.1 and ::1.

cloos, do you think this is the right approach?

Thank you for your persistence... hopefully this takes care of it!

By: cloos (cloos) 2013-02-14 16:44:32.892-0600

If I’m reading the patch correctly, it will log a warning for each ip which failed, so the reporter could expect a warning every time the agi is called.  Of course, maybe that is a Good Thing™.

Also, if getsockopt(2) returns negative then errno will be its error and not SO_ERROR.  Perhaps that cannot happen in this case?  I’m too tired to be sure just now.

+1 to ship it (as a bug fix and not as a new feature).

By: Michael L. Young (elguero) 2013-02-14 17:15:25.361-0600

Yep, it will log a warning every time the connect fails... I thought it might help a user to know that it is trying multiple addresses.  If it is too verbose, we can always change it.

The patch is kind of using errno for two purposes.  Like you mentioned, getsockopt(2) can return negative and errno will be the error.  If there is no error, then errno will be set to SO_ERROR.  I am only checking errno since I am using it for a dual purpose.  I could always change that if it is confusing.  I just may do that anyway for code readability purposes.

I created this patch against 11 and will put it up for review as a bug fix.  I can see that the intent of the code was to have this behavior so I think it is just a behavior that was there but not working properly, rather than a new feature.

By: Michael L. Young (elguero) 2013-02-14 17:39:23.015-0600

Updated patch.

By: Jeremy Kister (jkister) 2013-02-14 21:24:27.373-0600

I added localhost back to ::1 so that localhost is multihomed on ipv4 and ipv6.  my daemon is still only listening on ipv4.

the new code makes asterisk try to connect to ::1:4573 which fails then tries 127.0.0.1:4573, succeeds and continues.

{code}
[Feb 14 22:18:10] WARNING[2861][C-00000000]: res_agi.c:1525 launch_netscript: Connecting to '[::1]:4573' failed for url 'agi://localhost/url=http%3A%2F%2Fexample.com%2Fcgi-bin%2Favxml': Connection refused
   -- Playing 'local/tts/bbWHEJViRO68xuFMQnHPRDQMIvkKTr8R4________For_fur' (escape_digits=0123456789*#) (sample_offset 0)
   -- <SIP/143-00000000>AGI Script agi://localhost/url=http%3A%2F%2Fexample.com%2Fcgi-bin%2Favxml completed, returning 4
 == Spawn extension (VXML, s, 3) exited non-zero on 'SIP/143-00000000'
{code}

this is the behavior i thought asterisk was trying to have when [i asked someone to check out the code|https://issues.asterisk.org/jira/browse/ASTERISK-21065?focusedCommentId=202868&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-202868].  good work..


By: Michael L. Young (elguero) 2013-02-22 10:39:51.018-0600

Adding final patch here that will be committed.

By: Jeremy Kister (jkister) 2013-02-23 01:07:49.876-0600

applied final patch to 11.2.1.  works as intended!  thanks.