ASTERISK-2617

sip stops communicating when gethostbyname() temporarily fails

    Details

    • Type: Bug
    • Status: Closed
    • Severity: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Target Release Version/s: None
    • Component/s: Core/General
    • Labels:
      None
    • Mantis ID:
      2662
    • Regression:
      No

      Description

      Hi,
      the SIP connection to my provider silently stops (no communication, although the status is still "connected") when gethostbyname_r returns with the error code TRY_AGAIN. That is not actually a problem; the program should just try again. Instead, the error is returned out of ast_gethostbyname(), where it is not properly handled (that's another problem I didn't look into).
      To solve this particular problem I simply added a loop around the call to gethostbyname_r(). Patch is attached.

      Ole

        Activity

        Mark Spencer added a comment -

        According to the gethostbyname_r man page, TRY_AGAIN means "A temporary error occurred on an authoritative name server. Try again later." It doesn't say "try again immediately". gethostbyname is already blocking; I don't especially want to make it more blocking than it already is.

        Can you explain why you feel we should immediately retry the lookup? How long is the gethostbyname taking to fail in your environment?

        Also, this is OBVIOUSLY not a major bug because it's basically only affecting you in your very limited set of circumstances.

        orei added a comment -

        TRY_AGAIN indicates a temporary error, but does not give any information about when to try again. So I think it's pretty reasonable to try again immediately. In my case a second retry has never been needed. Every failing call takes a long time (seconds, probably a timeout). How would you want to solve that problem?
        There is a certain probability that this error gets returned in any environment, not only in mine. When it does, it affects service to users, which is outside the specification. In the environments I work in, this is called a major bug!

        Mark Spencer added a comment -

        I'd like to know more about the nature of these failures. Seems pretty bizarre to say "Well, it didn't work once, let's try indefinitely", don't you think? Especially with no further explanation?

        Before I'm willing to consider any change to the current behavior, I want to understand the nature of this failure. Is it a downed DNS server that is timing out? What is really causing the error?

        If, in fact, it never requires more than a second pass, why not just loop no more than twice, or add a new option for the max # of dns retries? These seem like MUCH better solutions than just generally forcing a second lookup on a "temporary failure".
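
The bounded retry suggested here could look roughly like the sketch below. It is not the attached patch and not Asterisk's ast_gethostbyname(); the names lookup_fn, lookup_with_retry, resolve_once, struct lookup_result and flaky_lookup are hypothetical, and the gethostbyname_r() call assumes the six-argument glibc signature. The retry count is passed in, so it could come from a configuration option for the maximum number of DNS retries.

```c
#define _DEFAULT_SOURCE          /* exposes gethostbyname_r() and TRY_AGAIN on glibc */
#include <assert.h>
#include <netdb.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success or an h_errno-style code. */
typedef int (*lookup_fn)(const char *host, void *ctx);

/*
 * Call fn up to max_tries times, repeating only while it reports TRY_AGAIN.
 * Returns 0 on success, otherwise the last error code.
 * With max_tries <= 0 no lookup is made and TRY_AGAIN is returned.
 */
int lookup_with_retry(lookup_fn fn, const char *host, void *ctx, int max_tries)
{
    int err = TRY_AGAIN;

    assert(fn != NULL);
    for (int i = 0; i < max_tries && err == TRY_AGAIN; i++)
        err = fn(host, ctx);
    return err;
}

/* Storage for one lookup; the buffer size is an arbitrary assumption. */
struct lookup_result {
    struct hostent he;
    char buf[8192];
};

/* One blocking lookup through gethostbyname_r() (glibc signature assumed). */
int resolve_once(const char *host, void *ctx)
{
    struct lookup_result *r = ctx;
    struct hostent *res = NULL;
    int herr = 0;

    if (gethostbyname_r(host, &r->he, r->buf, sizeof(r->buf), &res, &herr) == 0
        && res != NULL)
        return 0;
    return herr != 0 ? herr : NO_RECOVERY;
}

/* Stand-in resolver for demonstration: fails *ctx times, then succeeds. */
int flaky_lookup(const char *host, void *ctx)
{
    int *failures_left = ctx;

    (void)host;
    if (*failures_left > 0) {
        (*failures_left)--;
        return TRY_AGAIN;
    }
    return 0;
}
```

A caller would use something like lookup_with_retry(resolve_once, hostname, &result, 2) to allow one retry. Every attempt still blocks for the full resolver timeout, so the cap bounds the worst-case delay rather than removing it.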

        Olle Johansson added a comment -

        A side note on our DNS support:

        What we really need is an asynchronous DNS, like http://www.chiark.greenend.org.uk/~ian/adns/ or http://daniel.haxx.se/projects/c-ares/

        Seems like Squid handles Async DNS by calling external software through pipes.

        -----------
        Sorry for the interruption...

        orei added a comment -

        I read from
        http://www.acm.uiuc.edu/bug/Be%20Book/The%20Network%20Kit/NetworkDB.html
        that the error code means that some server inside the DNS chain didn't answer. So it isn't clear whether the name really exists or not. A following request may be successful. The system has (obviously) no clue when the next request may be successful. So a direct re-request is perfectly fine, though admittedly a bit simplistic.
        So I propose to

        • keep DNS out of anything having to do with realtime (which I hope is anyway the case with asterisk)
        • retry the request a fixed maximum number of times (5?)
        • cache previous responses and return immediately to subsequent requests with the cached value, triggering an asynchronous request to the DNS to update the cached value
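
The caching idea in the last bullet could be sketched as below. This is a simplified illustration, not Asterisk code: it only shows falling back to the last good answer when a lookup reports TRY_AGAIN, and leaves out the asynchronous refresh. All names (host_cache, cached_lookup, resolve_fn, demo_resolver, demo_dns_up), the table size, and the IPv4-only address format are assumptions made for the example.

```c
#define _DEFAULT_SOURCE          /* exposes TRY_AGAIN in <netdb.h> on glibc */
#include <assert.h>
#include <netdb.h>
#include <string.h>

#define CACHE_SLOTS  16          /* arbitrary table size */
#define HOSTNAME_LEN 256

struct cache_entry {
    char name[HOSTNAME_LEN];
    unsigned char addr[4];       /* IPv4 only, for brevity */
    int used;
};

struct host_cache {
    struct cache_entry slots[CACHE_SLOTS];
};

/* Hypothetical resolver callback: 0 on success, else an h_errno-style code. */
typedef int (*resolve_fn)(const char *host, unsigned char addr[4]);

static struct cache_entry *cache_find(struct host_cache *c, const char *host)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slots[i].used && strcmp(c->slots[i].name, host) == 0)
            return &c->slots[i];
    return NULL;
}

static void cache_store(struct host_cache *c, const char *host,
                        const unsigned char addr[4])
{
    struct cache_entry *e = cache_find(c, host);

    if (strlen(host) >= HOSTNAME_LEN)
        return;                          /* name too long to cache */
    for (int i = 0; e == NULL && i < CACHE_SLOTS; i++)
        if (!c->slots[i].used)
            e = &c->slots[i];
    if (e == NULL)
        e = &c->slots[0];                /* table full: overwrite first slot */
    strcpy(e->name, host);
    memcpy(e->addr, addr, 4);
    e->used = 1;
}

/*
 * Resolve host through fn. A fresh answer is cached; on TRY_AGAIN the last
 * cached answer, if any, is returned instead. Other errors pass through.
 */
int cached_lookup(struct host_cache *c, resolve_fn fn, const char *host,
                  unsigned char addr[4])
{
    struct cache_entry *e;
    int err;

    assert(c != NULL && fn != NULL);
    err = fn(host, addr);
    if (err == 0) {
        cache_store(c, host, addr);
        return 0;
    }
    if (err == TRY_AGAIN && (e = cache_find(c, host)) != NULL) {
        memcpy(addr, e->addr, 4);
        return 0;
    }
    return err;
}

/* Stand-in resolver for demonstration: answers 192.0.2.1 while demo_dns_up. */
int demo_dns_up = 1;

int demo_resolver(const char *host, unsigned char addr[4])
{
    static const unsigned char fixed[4] = { 192, 0, 2, 1 };

    (void)host;
    if (!demo_dns_up)
        return TRY_AGAIN;
    memcpy(addr, fixed, 4);
    return 0;
}
```

A real version would also need entry expiry and locking, and would have to decide how long a stale address may be served.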

        OTOH the callers of ast_gethostbyname should be reviewed to check whether they adequately handle the error(s) returned.
        What do you think?
        Ole

        BTW: I'm away until the 1st of November. I guess I cannot be of much help until then.

        pfn added a comment -

        I may also be having a similar problem. Randomly, about 15-20 minutes into random calls, inbound audio to me will drop. I don't have the details of how this happens for Ole, so I don't know if my circumstances are similar.

        My symptoms: SIP call, through broadvoice or direct to X-lite, randomly after 10-20 minutes, inbound audio to me will disappear, preceded by some sort of "clicking" sound. RTP is still up, audio is gone, the remote side can still hear me. I haven't been able to rule my 7960 out of the equation, as I am loath to use my FXS-plugged phone (echo).

        I will try this patch and see if this rectifies the problem, or I get the warning message added in the patch.

        pfn added a comment -

        OK, never mind, this is not the problem I am experiencing. This patch doesn't change anything for me (similar symptoms, wrong problem).

        Mark Spencer added a comment -

        Would be nice to see the network dump during the period of loss of audio. That can tell whether Asterisk is getting the packets (and getting them in a timely fashion) or not.

        Russell Bryant added a comment -

        The author seems to have lost interest in this bug.
