ASTERISK-02617: sip stops communicating when gethostbyname() temporarily fails

Summary: ASTERISK-02617: sip stops communicating when gethostbyname() temporarily fails

Reporter: orei (orei) Labels:

Date Opened: 2004-10-16 07:34:51 Date Closed: 2011-06-07 14:10:07

Priority: Minor Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) gethostbyname.patch

Description: Hi,
the SIP connection to my provider silently stops (no communication, although the status still shows connected) when gethostbyname_r returns with the error code TRY_AGAIN. That is not actually a problem; the program should just try again. But instead, the error is returned out of ast_gethostbyname(), where it is not properly handled (that's another problem I didn't look into).
To solve this particular problem I simply added a loop around the call to gethostbyname_r(). Patch is attached.

Ole

Comments: By: Mark Spencer (markster) 2004-10-16 10:56:42

According to the gethostbyname_r man page, TRY_AGAIN means "A temporary error occurred on an authoritative name server. Try again later." It doesn't say "try again immediately". gethostbyname is already blocking; I don't especially want to make it more blocking than it already is.

Can you explain why you feel we should *immediately* retry the lookup? How long is the gethostbyname taking to fail in your environment?

Also, this is OBVIOUSLY not a major bug because it's basically only affecting you in your very limited set of circumstances.
By: orei (orei) 2004-10-16 14:08:15

TRY_AGAIN indicates a temporary error, but does not give any information about when to try again. So I think it's pretty reasonable to try again immediately. In my case a second retry is never needed. Every failing call takes a long time (seconds, probably a timeout). How would you want to solve that problem?
There is a certain probability that this error is returned in any environment, not only in mine. When it is, user-facing service is sometimes affected in a way that is outside the specification. In the environments I work in, this _is_ called a major bug!
By: Mark Spencer (markster) 2004-10-16 15:57:03

I'd like to know more about the nature of these failures. Seems pretty bizarre to say "Well, it didn't work once, let's try indefinitely", don't you think? Especially with no further explanation?

Before I'm willing to consider *any* change to the current behavior, I want to understand the nature of this failure. Is it a downed DNS server that is timing out? What is *really* causing the error?

If, in fact, it never requires more than a second pass, why not just loop no more than twice, or add a new option for the max # of dns retries? These seem like MUCH better solutions than just generally forcing a second lookup on a "temporary failure".
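
The "loop no more than twice" idea could look roughly like the sketch below. It is not the attached patch and not Asterisk code: it assumes the glibc six-argument gethostbyname_r(), and the names lookup_with_retry, glibc_lookup, lookup_ctx and flaky_lookup are hypothetical, introduced only for illustration. The retry count is a parameter so it could come from a "max DNS retries" option.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes gethostbyname_r() in glibc headers */
#endif
#include <netdb.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success, else an h_errno-style code. */
typedef int (*lookup_fn)(const char *host, void *ctx);

/* Call fn up to max_tries times in total, repeating only on TRY_AGAIN. */
int lookup_with_retry(lookup_fn fn, const char *host, void *ctx, int max_tries)
{
    int err = NO_RECOVERY;   /* reported if max_tries is less than 1 */
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        err = fn(host, ctx);
        if (err != TRY_AGAIN)
            break;           /* success or a non-temporary failure */
    }
    return err;
}

/* Buffers for one lookup; the 8192-byte size is an arbitrary assumption. */
struct lookup_ctx {
    struct hostent he;
    char buf[8192];
    struct hostent *result;
};

/* Real resolver step, assuming the glibc six-argument gethostbyname_r(). */
int glibc_lookup(const char *host, void *vctx)
{
    struct lookup_ctx *c = vctx;
    int herr = 0;
    int rc = gethostbyname_r(host, &c->he, c->buf, sizeof(c->buf),
                             &c->result, &herr);

    if (rc == 0 && c->result != NULL)
        return 0;
    return herr != 0 ? herr : NO_RECOVERY;
}

/* Stand-in resolver for trying the loop offline: fails with TRY_AGAIN
 * until the counter behind ctx reaches zero, then succeeds. */
int flaky_lookup(const char *host, void *ctx)
{
    int *failures_left = ctx;

    (void)host;
    if (*failures_left > 0) {
        (*failures_left)--;
        return TRY_AGAIN;
    }
    return 0;
}
```

With a struct lookup_ctx named ctx, calling lookup_with_retry(glibc_lookup, host, &ctx, 2) gives at most one immediate retry. Each attempt still blocks for as long as the resolver takes, so the worst case is max_tries times the resolver timeout.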
By: Olle Johansson (oej) 2004-10-17 04:52:05

A side note on our DNS support:

What we really need is an asynchronous DNS, like http://www.chiark.greenend.org.uk/~ian/adns/ or http://daniel.haxx.se/projects/c-ares/

Seems like Squid handles Async DNS by calling external software through pipes.

-----------
Sorry for the interruption... :-)
By: orei (orei) 2004-10-20 14:39:03

I read from
http://www.acm.uiuc.edu/bug/Be%20Book/The%20Network%20Kit/NetworkDB.html
that the error code means that some server inside the DNS chain didn't answer. So it isn't clear whether the name really exists or not. A following request may be successful. The system has (obviously) no clue when the next request may be successful. So a direct re-request is perfectly fine, though admittedly a bit simplistic.
So I propose to
- keep DNS out of anything having to do with realtime (which I hope is anyway the case with asterisk)
- retry the request a fixed maximum number of times (5?)
- cache previous responses and return immediately to subsequent requests with the cached value, triggering an asynchronous request to the DNS to update the cached value
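
The caching part of this proposal could be sketched as below. This is only an illustration under stated assumptions, not Asterisk code: the names dns_cache_put, dns_cache_get, CACHE_SLOTS and HOST_MAX are hypothetical, only one IPv4 address per name is kept, and there is no locking, which a multithreaded caller would need. A hit always returns immediately; the stale flag tells the caller that a background refresh should be started.

```c
#include <netinet/in.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS 16     /* arbitrary size for the sketch */
#define HOST_MAX    256    /* longest name stored, including the NUL */

struct dns_cache_entry {
    char host[HOST_MAX];
    struct in_addr addr;
    time_t stored_at;
    int in_use;
};

static struct dns_cache_entry cache[CACHE_SLOTS];

/* Store or overwrite the address for host.
 * Returns 0 on success, -1 if the name is too long or the table is full. */
int dns_cache_put(const char *host, struct in_addr addr, time_t now)
{
    struct dns_cache_entry *slot = NULL;
    int i;

    if (strlen(host) >= HOST_MAX)
        return -1;
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].in_use && strcmp(cache[i].host, host) == 0) {
            slot = &cache[i];          /* refresh the existing entry */
            break;
        }
        if (!cache[i].in_use && slot == NULL)
            slot = &cache[i];          /* remember the first free slot */
    }
    if (slot == NULL)
        return -1;
    strcpy(slot->host, host);
    slot->addr = addr;
    slot->stored_at = now;
    slot->in_use = 1;
    return 0;
}

/* Returns 1 and fills *out on a hit, 0 on a miss.
 * On a hit, *stale is set to 1 when the entry is older than max_age. */
int dns_cache_get(const char *host, time_t now, time_t max_age,
                  struct in_addr *out, int *stale)
{
    int i;

    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].in_use && strcmp(cache[i].host, host) == 0) {
            *out = cache[i].addr;
            *stale = (now - cache[i].stored_at) > max_age;
            return 1;
        }
    }
    return 0;
}
```

A caller would use the returned address right away and, when stale is set, hand the name to whatever asynchronous resolver is available, which calls dns_cache_put() once it has an answer. A temporary failure such as TRY_AGAIN then only delays the refresh instead of interrupting service.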

OTOH the callers of ast_gethostbyname should be reviewed to check whether they adequately handle the error(s) returned.
What do you think?
Ole

BTW: I'm away until the 1st of November. I guess I cannot be of much help until then.
By: pfn (pfn) 2004-10-21 15:58:26

I may also be having a similar problem. At random, about 15-20 minutes into some calls, inbound audio to me will drop. I don't have the details of how this happens for Ole, so I don't know if my circumstances are similar.

My symptoms: on a SIP call, through broadvoice or direct to X-lite, randomly after 10-20 minutes, inbound audio to me will disappear, preceded by some sort of "clicking" sound. RTP is still up, audio is gone, and the remote side can still hear me. I haven't been able to rule my 7960 out of the equation, as I am loath to use my FXS-plugged phone (echo).

I will try this patch and see if this rectifies the problem, or I get the warning message added in the patch.
By: pfn (pfn) 2004-10-24 14:36:26

OK, never mind, this is not the problem I am experiencing. This patch doesn't change anything for me (similar symptoms, wrong problem).
By: Mark Spencer (markster) 2004-11-07 19:04:38.000-0600

Would be nice to see the network dump during the period of loss of audio. That can tell whether Asterisk is getting the packets (and getting them in a timely fashion) or not.
By: Russell Bryant (russell) 2004-11-14 21:41:53.000-0600

The author seems to have lost interest in this bug.