
Summary: ASTERISK-25911: chan_iax2: IAX Max Retries - hung IAX channels in Ring state - cannot clear channels until Asterisk restart
Reporter: Andreas Krüger (woopstar)
Labels:
Date Opened: 2016-04-11 08:04:41
Date Closed:
Priority: Major
Regression?
Status: Open/New
Components: Channels/chan_iax2
Versions: 13.7.2 13.9.0 13.9.1
Frequency of Occurrence:
Related Issues: is related to ASTERISK-28395 Asterisk occasionally fails to hangup channels
Environment: Ubuntu server
Attachments:
( 0) 01-08-2016-backtrace-threads.txt
( 1) 01-08-2016-core-show-channels.txt
( 2) 01-08-2016-core-show-channels-infos.txt
( 3) 01-08-2016-core-show-locks.txt
( 4) 01-08-2016-iax2-show-channels.txt
( 5) 01-08-2016-iax2-show-netstats.txt
( 6) 2016-06-15-backtrace-threads.txt
( 7) 2016-06-15-core-show-channels.txt
( 8) 2016-06-15-core-show-channels-infos.txt
( 9) 2016-06-15-core-show-locks.txt
(10) 2016-06-15-iax2-show-channels.txt
(11) 2016-06-15-iax2-show-netstats.txt
(12) 2016-06-16-backtrace-threads.txt
(13) 2016-06-16-core-show-channels.txt
(14) 2016-06-16-core-show-channels-infos.txt
(15) 2016-06-16-core-show-locks.txt
(16) 2016-06-16-iax2-show-channels.txt
(17) 2016-06-16-iax2-show-netstats.txt
(18) 2016-06-17-backtrace-threads.txt
(19) 2016-06-17-core-show-channels.txt
(20) 2016-06-17-core-show-channels-infos.txt
(21) 2016-06-17-core-show-locks.txt
(22) 2016-06-17-iax2-show-channels.txt
(23) 2016-06-17-iax2-show-netstats.txt
(24) 2016-08-26-backtrace-threads.txt
(25) 2016-08-26-core-show-channels.txt
(26) 2016-08-26-core-show-channels-infos.txt
(27) 2016-08-26-core-show-locks.txt
(28) 2016-08-26-iax2-show-channels.txt
(29) 2016-08-26-iax2-show-netstats.txt
(30) 23-08-2016-backtrace-threads.txt
(31) 23-08-2016-cli-output-full.txt
(32) 23-08-2016-core-show-channels.txt
(33) 23-08-2016-core-show-channels-infos.txt
(34) 23-08-2016-core-show-locks.txt
(35) 23-08-2016-iax2-show-channels.txt
(36) 23-08-2016-iax2-show-netstats.txt
(37) backtrace-threads.txt
(38) core-show-channels.txt
(39) core-show-channels-infos.txt
(40) debug_log_25911_odn1-voip-cluster02-asterisk01
(41) debug_log_25911_odn1-voip-cluster02-upstream01
(42) iax.conf
(43) iax2-show-channels.txt
(44) iax2-show-netstats.txt
(45) upload_1.png
Description:
Hi there,

We ran into a problem: when there is some, but not high, load on some of our Asterisk servers, we suddenly see an IAX max retries error in the console.
When this happens, everything stops working and we cannot get Asterisk to work again unless we restart the service (not the server).

I tried to start Asterisk through GDB, but since Asterisk never crashes, there is nothing to show in gdb about the problem.
I've also set up a monitoring tool to check for network glitches, and none have occurred.

I've also tried to increase the max retries in chan_iax2.c and recompile Asterisk, as I've read on some forums that it should resolve the issue, but it did not.

{code}
sed -i "s/static int max_retries = 4;/static int max_retries = 12;/" channels/chan_iax2.c
{code}

I've attached the output we see in the console. These messages just keep popping up and never seem to end. To me this looks like some cleanup in chan_iax2.c is not working when the max retries limit is hit. The error we're facing happens on this line:

https://github.com/asterisk/asterisk/blob/13.7/channels/chan_iax2.c#L3572

I could use some advice on how to debug this problem further and resolve it, because when this error happens, Asterisk does not work at all until it is restarted.

The problem is not persistent and I have a hard time reproducing it. But we see it when the load increases. Doing 10k calls within 7 hours seems to make it happen.
I looked into the code and see that it uses a reference to a callno, which to me looks like a counter that increases. Could we be seeing some sort of race condition, or maybe the callno going out of scope?
Comments:

By: Asterisk Team (asteriskteam) 2016-04-11 08:04:42.391-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-04-11 15:03:11.727-0500

Let's start by getting additional debug to see what is happening at the time the problem occurs.

Please record a debug log as follows:

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Attach the log as .txt after verifying the log contains the "DEBUG" and "VERBOSE" messages.

If you can also gather a packet capture of the issue, that would help immensely. It would be great to see the last few dozen calls before the issue occurs that correlate to the debug log. You can probably remove the RTP if that helps with privacy.

By: Andreas Krüger (woopstar) 2016-04-12 03:59:19.985-0500

I've enabled full logging on all servers and will try to reproduce the problem as fast as possible.

I've attached the iax.conf file too. We mapped iaxfriends through ODBC to have them in our database. Do you want the entry from the sql db too?

By: Rusty Newton (rnewton) 2016-04-12 07:33:57.663-0500

Thanks.

bq. Do you want the entry from the sql db too?

Yeah everything helps.

By: Andreas Krüger (woopstar) 2016-04-13 07:34:58.584-0500

Okay, it happened today.
I've attached the debug logs from both servers.

In debug_log_25911_odn1-voip-cluster02-upstream01, from the server on which the error occurs:
{code}
[2016-04-13 14:25:46] WARNING[30039] chan_iax2.c: Max retries exceeded to host 185.60.160.134 on IAX2/odn1-voip-cluster02-asterisk01-5638 (type = 6, subclass = 2, ts=83999, seqno=24)
{code}

And the server on .134 has the following debug log: debug_log_25911_odn1-voip-cluster02-asterisk01

By: Andreas Krüger (woopstar) 2016-04-13 08:00:06.733-0500

SQL Entries from iaxfriends:

{code}
*************************** 2. row ***************************
                    id: 22
                  name: odn1-voip-cluster02-upstream01
                  type: friend
              username: odn1-voip-cluster02-upstream01
               mailbox: NULL
                secret: ***********
              dbsecret: NULL
               context: fromupstream
            regcontext: NULL
             regserver: odn1-voip-cluster02-upstream01
                  host: 185.60.160.132
              hostname: 4a72066682feb9cc1c14daf3800d70ba-voip-aws-eu.publicdns.zone
              local_ip: 185.60.160.132
                ipaddr: 185.60.160.132
                  port: 4569
             defaultip: NULL
         sourceaddress: NULL
                  mask: 255.255.255.255
              regexten: NULL
            regseconds: 0
           accountcode: NULL
          mohinterpret: NULL
            mohsuggest: NULL
                inkeys: IAXTrunk
               outkeys: IAXTrunk
              language: NULL
              callerid: NULL
            cid_number: NULL
               sendani: NULL
              fullname: NULL
                 trunk: yes
                  auth: NULL
            maxauthreq: NULL
      requirecalltoken: NULL
            encryption: yes
              transfer: mediaonly
          jitterbuffer: yes
     forcejitterbuffer: NULL
              disallow: all
                 allow: ulaw,alaw
         codecpriority: NULL
               qualify: yes
      qualifysmoothing: NULL
         qualifyfreqok: 60000
      qualifyfreqnotok: 10000
              timezone: Europe/Oslo
                  adsi: NULL
              amaflags: NULL
                setvar: NULL
                 login: NULL
                permit: NULL
                  deny: NULL
        provision_date: 2016-02-10 15:17:10
provision_last_response: 2016-04-13 14:58:07
          manager_user: manager
      manager_password: ***********
              ari_user: manager
          ari_password: ***********
             available: 1
   available_last_seen: 2016-04-13 14:58:14
  available_last_check: 2016-04-13 14:58:14
           is_upstream: 1
              restarts: 10
*************************** 3. row ***************************
                    id: 25
                  name: odn1-voip-cluster02-asterisk01
                  type: friend
              username: odn1-voip-cluster02-asterisk01
               mailbox: NULL
                secret: ***********
              dbsecret: NULL
               context: fromasterisk
            regcontext: NULL
             regserver: odn1-voip-cluster02-asterisk01
                  host: 185.60.160.134
              hostname: 2b37a1ede9657d1a1adac9a420d493d3-voip-aws-eu.publicdns.zone
              local_ip: 185.60.160.134
                ipaddr: 185.60.160.134
                  port: 4569
             defaultip: NULL
         sourceaddress: NULL
                  mask: 255.255.255.255
              regexten: NULL
            regseconds: 0
           accountcode: NULL
          mohinterpret: NULL
            mohsuggest: NULL
                inkeys: IAXTrunk
               outkeys: IAXTrunk
              language: NULL
              callerid: NULL
            cid_number: NULL
               sendani: NULL
              fullname: NULL
                 trunk: yes
                  auth: NULL
            maxauthreq: NULL
      requirecalltoken: NULL
            encryption: yes
              transfer: mediaonly
          jitterbuffer: yes
     forcejitterbuffer: NULL
              disallow: all
                 allow: ulaw,alaw
         codecpriority: NULL
               qualify: yes
      qualifysmoothing: NULL
         qualifyfreqok: 60000
      qualifyfreqnotok: 10000
              timezone: Europe/Oslo
                  adsi: NULL
              amaflags: NULL
                setvar: NULL
                 login: NULL
                permit: NULL
                  deny: NULL
        provision_date: 2016-02-10 15:45:55
provision_last_response: 2016-04-13 14:58:06
          manager_user: manager
      manager_password: ***********
              ari_user: manager
          ari_password: ***********
             available: 1
   available_last_seen: 2016-04-13 14:58:15
  available_last_check: 2016-04-13 14:58:15
           is_upstream: 0
              restarts: 11
{code}

By: Andreas Krüger (woopstar) 2016-04-19 07:04:30.038-0500

Hi Rusty;

Did you find time to look into this issue?

By: Rusty Newton (rnewton) 2016-04-25 17:53:29.593-0500

Sorry for the delay. I had other issues to look into first.

I've looked into it a bit, but I'm not sure yet how to reproduce it or what the root issue could be. I'll probably have to find someone with more IAX expertise to take a look.

By: Rusty Newton (rnewton) 2016-04-26 18:06:07.438-0500

{quote}
When this happens, everything stops working and we cannot get Asterisk to work again unless we restart the service (not the server).
{quote}

Can you prepare for, and gather a backtrace along with locks output the next time Asterisk locks up?

Here are the instructions:
https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock

By: Andreas Krüger (woopstar) 2016-05-11 08:50:17.970-0500

Hi,

We'll try to gather the backtrace, but it is very difficult to do so, as it only happens occasionally.

What we see, though, is that it always happens on type = 6 and subclass = 11, and also on type = 6, subclass = 2. It seems an earlier issue was filed about it too: ASTERISK-21193, which could be related.

Looking into this file: https://github.com/asterisk/asterisk/blob/master/include/asterisk/frame.h#L259 it seems those are the classes related to RING and OPTIONS.

We just updated to 13.9.0 and have seen the error there too.
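
For reading the `type` and `subclass` fields in these warnings, a minimal sketch follows. It assumes the numbering in RFC 5456, where frame type 6 is an IAX control frame and its subclass is an IAX command (2 = PING, 11 = LAGRQ) rather than an AST_CONTROL_* value from frame.h; that reading should be checked against the chan_iax2 source. The function names `iax2_frame_name` and `decode_max_retries` are hypothetical helpers, not part of Asterisk.

```shell
# Map an IAX2 full-frame (type, subclass) pair to a readable name.
# Numbering assumed from RFC 5456; only a few subclasses are listed.
iax2_frame_name() {
    frame_type=$1
    subclass=$2
    if [ "$frame_type" != "6" ]; then
        echo "type=$frame_type subclass=$subclass (not an IAX control frame)"
        return 0
    fi
    case "$subclass" in
        2)  echo "PING" ;;
        3)  echo "PONG" ;;
        11) echo "LAGRQ" ;;
        12) echo "LAGRP" ;;
        *)  echo "IAX subclass $subclass (not mapped in this sketch)" ;;
    esac
}

# Read Asterisk log lines on stdin and print "host channel frame-name"
# for every "Max retries exceeded" warning; other lines are ignored.
decode_max_retries() {
    sed -n -E 's/.*Max retries exceeded to host ([^ ]+) on ([^ ]+) \(type = ([0-9]+), subclass = ([0-9]+),.*/\1 \2 \3 \4/p' |
    while read -r host channel frame_type subclass; do
        echo "$host $channel $(iax2_frame_name "$frame_type" "$subclass")"
    done
}
```

Feeding the full log to `decode_max_retries` on stdin prints one line per warning, which makes it easier to see whether the retries are all keepalive-style frames or something else.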

By: Andreas Krüger (woopstar) 2016-05-13 06:41:16.399-0500

It may be worth defining "stop working" better.

When the error happens, Asterisk is still responsive. You can log into the console, and such. But channels are not working. The console is simply "spamming" the max retries error every second.

By: Rusty Newton (rnewton) 2016-05-17 18:25:02.585-0500

Are any channels hung when this occurs? Can you provide "core show channel" or "iax2 show channel" output for the hung channels if they exist?

By: Andreas Krüger (woopstar) 2016-05-18 06:33:21.530-0500

We've set up a monitoring script on the servers now. When a max retries error happens, the following output is gathered:

{code}
asterisk -rx "core show channels" > /tmp/core-show-channels.txt
asterisk -rx "iax2 show channels" > /tmp/iax2-show-channels.txt
asterisk -rx "iax2 show netstats" > /tmp/iax2-show-netstats.txt
gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk `pidof asterisk` > /tmp/backtrace-threads.txt
asterisk -rx "core show locks" > /tmp/core-show-locks.txt
{code}

By: Rusty Newton (rnewton) 2016-05-18 17:57:08.331-0500

You want to get the "core show channel <channel name>" and "iax2 show channel <channel name>" output for each of the channels that is up to see what they are doing individually.

By: Jeppe Ryskov Larsen (ryskov) 2016-05-19 03:49:52.305-0500

Hey there, working with Andreas on this one. We have added 'core show channel <channel_name>' to the monitoring script, run for every channel listed by 'core show channels concise'.

An 'iax2 show channel <channel_name>' command does not seem to exist, but we have the netstats and the general iax2 channel information logged when it happens.

Please let us know if any more information would be useful. The issue happens so rarely that when it does, we might as well get everything we can.
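
A minimal sketch of that per-channel step, assuming the concise output is '!'-separated with the channel name in the first field. The helper names and the default output directory are hypothetical, not taken from the actual monitoring script.

```shell
# Print the channel name (first '!'-separated field) of each line of
# "core show channels concise" output read from stdin.
channel_names() {
    cut -d '!' -f 1
}

# Hypothetical helper: append "core show channel <name>" output for
# every active channel to one file. out_dir is an assumed location.
dump_all_channels() {
    out_dir=${1:-/tmp}
    asterisk -rx "core show channels concise" | channel_names |
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        asterisk -rx "core show channel $name" >> "$out_dir/core-show-channels-infos.txt"
    done
}
```

Calling `dump_all_channels /tmp` from the monitoring script would collect the per-channel details alongside the other captures.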

By: Jeppe Ryskov Larsen (ryskov) 2016-05-23 06:31:53.736-0500

We had it happen just now, and I have attached all the relevant files. The only thing is that 'core show locks' gave an error because the command does not exist. I hope this is useful for finding the root issue.

By: Andreas Krüger (woopstar) 2016-05-23 06:49:13.236-0500

FYI, this happened on 13.9.1 and also happened on 13.9.0.

By: Jeppe Ryskov Larsen (ryskov) 2016-05-25 08:13:25.702-0500

It happened two times today as well. Just let me know if you want the log files for these crashes as well and I will upload them.

By: Rusty Newton (rnewton) 2016-05-25 16:35:59.674-0500

Thanks for the additional information.

Do any of the channels respond to a CLI hangup request after the issue occurs?

{noformat}
CLI> hangup request
Usage: channel request hangup <channel>|<all>
      Request that a channel be hung up. The hangup takes effect
      the next time the driver reads or writes from the channel.
      If 'all' is specified instead of a channel name, all channels
      will see the hangup request.
{noformat}

By: Jeppe Ryskov Larsen (ryskov) 2016-05-26 01:32:43.555-0500

I'll try that next time it happens and get back to you. We have a script set up to monitor, collect data and restart the service, so we will have to modify it so it doesn't automatically restart the service, so we can try it out.

By: Jeppe Ryskov Larsen (ryskov) 2016-05-31 04:22:38.106-0500

So now it happened again, with 3 'active' IAX channels on the server (all incoming from other servers).

It seems they were all stuck in 'Ring' state, while trying to set a local variable at dialplan level.
{code}
IAX2/odn1-voip-clust ~~s~~@parkCall:1     Ring    MSet(LOCAL(parkingspace)=1000)
IAX2/odn1-voip-clust ~~s~~@parkCall:1     Ring    MSet(LOCAL(parkingspace)=1000)
IAX2/odn1-voip-clust ~~s~~@parkCall:1     Ring    MSet(LOCAL(parkingspace)=1000)
{code}

I tried doing '{{channel request hangup <channel>}}' on all of the channels (using full channel names from '{{core show channels concise}}'). The command went through and gave me a proper response (e.g. '{{Requested Hangup on channel 'IAX2/odn1-voip-cluster02-upstream01-7119'}}'), but none of the channels were actually hung up. As I write this, the channels still exist, with no sign of the hangup going through.

Hope it helps!
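
The manual check above could be scripted roughly as follows. This is only a sketch: it assumes the channel state is the fifth '!'-separated field of the concise output, and both function names are hypothetical.

```shell
# Print IAX2 channels whose state field reads "Ring".
# Input: "core show channels concise" output on stdin (field 5 assumed
# to hold the channel state).
stuck_ring_channels() {
    awk -F '!' '$1 ~ /^IAX2\// && $5 == "Ring" { print $1 }'
}

# Hypothetical helper: request a hangup on every such channel.
request_hangup_for_stuck() {
    asterisk -rx "core show channels concise" | stuck_ring_channels |
    while IFS= read -r name; do
        asterisk -rx "channel request hangup $name"
    done
}
```

Running `stuck_ring_channels` again a few seconds after `request_hangup_for_stuck` would show whether any of the requests actually took effect.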

By: Jeppe Ryskov Larsen (ryskov) 2016-05-31 04:33:12.356-0500

After inspecting the logs from the previous times it happened, I can see that the last executed dialplan application, before it hangs and prints 'max retries...', is {{MSet(LOCAL(...))}}.

Here are a few examples from my logs:

23/05/2016
{code}
[2016-05-23 13:23:04] VERBOSE[44062][C-0000006a] pbx.c: Executing [~~s~~@enterQueue:1] MSet("IAX2/odn1-voip-cluster02-upstream01-7669", "LOCAL(queueid)=80") in new stack
{code}

25/05/2016
{code}
[2016-05-25 10:15:07] VERBOSE[15591][C-00000010] pbx.c: Executing [~~s~~@parkCall:1] MSet("IAX2/odn1-voip-cluster02-upstream01-12266", "LOCAL(parkingspace)=1000") in new stack
{code}

31/05/2016
{code}
[2016-05-31 11:12:57] VERBOSE[21572][C-00000304] pbx.c: Executing [~~s~~@parkCall:1] MSet("IAX2/odn1-voip-cluster02-upstream01-7119", "LOCAL(parkingspace)=1000") in new stack
{code}

That seems pretty interesting.

Edit: All of those 'MSets' are argument assignments that happen automatically when using AEL2. The macro signatures are like this:

{code}
macro enterQueue(queueid)
{code}

{code}
macro parkCall(parkingspace)
{code}
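
To check this pattern across a whole log rather than by eye, a sketch along these lines could list the last application executed on each channel. It assumes the VERBOSE line format shown in the examples above; the function names are hypothetical.

```shell
# From Asterisk log lines on stdin, print "channel application" for
# every "pbx.c: Executing [...] App("channel", ...)" line.
executed_apps() {
    sed -n -E 's/.*pbx\.c: Executing \[[^]]*\] ([A-Za-z0-9_]+)\("([^"]+)".*/\2 \1/p'
}

# Keep only the last application seen per channel, sorted by channel.
last_app_per_channel() {
    executed_apps |
    awk '{ last[$1] = $2 } END { for (c in last) print c, last[c] }' |
    sort
}
```

Comparing the channels reported by `last_app_per_channel` with the hung channels from 'core show channels' would show whether the stall always follows the same application.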


By: Andreas Krüger (woopstar) 2016-06-06 04:01:51.104-0500

Hi Rusty. Did you find time to look into this issue?

By: Rusty Newton (rnewton) 2016-06-14 15:27:02.688-0500

I did - unfortunately the issue is still not clear. That is, there is certainly a problem, but it will require a developer to look further into it. I'm going to open it up, but I wouldn't expect a resolution on this issue too quickly as we don't have a way to reproduce the issue reliably. However all the information you have provided will be of help.

I did miss that you were not able to provide the "core show locks" output which could be critical.

bq. The only thing is that 'core show locks' gave an error because the command does not exist.

Please check out the [wiki page which explains why|https://wiki.asterisk.org/wiki/display/AST/CLI+commands+useful+for+debugging#CLIcommandsusefulfordebugging-coreshowlocks].

I'm going to go ahead and open this issue up. Hopefully a developer will get time to look into it deeper or else we'll get other reports of the same issue with a reliable reproduction method.

By: Rusty Newton (rnewton) 2016-06-14 15:27:45.829-0500

I didn't say it specifically in the previous comment; please go ahead and provide the "core show locks" output the next time the problem occurs.

By: Jeppe Ryskov Larsen (ryskov) 2016-06-15 06:26:57.984-0500

Everything attached

By: Jeppe Ryskov Larsen (ryskov) 2016-06-15 06:27:05.763-0500

We just had it happen again, "luckily" just after we recompiled Asterisk with the correct flags to ensure the relevant debug information. I'm attaching everything I've got, which is:

output from:

- core show locks
- iax2 show channels
- core show channels
- core show channel [channelname] for each channel
- iax2 show netstats
- backtrace threads (gdb)

Hopefully this will help :)


By: Jeppe Ryskov Larsen (ryskov) 2016-06-17 07:13:44.433-0500

Uploading a whole new batch of logs, as we had it happen both yesterday and today as well.

By: Jeppe Ryskov Larsen (ryskov) 2016-06-24 09:05:34.608-0500

As I wrote earlier, we have a script set up to automatically detect whenever this happens, gather log data, send it via email and then restart the Asterisk service.

If it could be of any help, we could add one of your email addresses to the list, so you get the newest log data whenever it happens.

By: Jeppe Ryskov Larsen (ryskov) 2016-08-01 08:05:00.460-0500

Uploading a new fresh batch of files

By: Andreas Krüger (woopstar) 2016-08-17 06:40:25.327-0500

To maybe have this issue fixed faster or prioritized, we're willing to offer a cash bounty of $2000 for resolving it, if that is allowed.

By: Joshua C. Colp (jcolp) 2016-08-17 06:42:52.156-0500

The process for bug bounties is documented on the wiki[1] if you would like to pursue it.

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

By: Matt Jordan (mjordan) 2016-08-17 09:12:04.684-0500

I took a peek at the backtraces, and you may be getting hit by some problems with connection pooling in UnixODBC and its various connector clients (in your case, MySQL). [~jcolp] had an e-mail out to the {{asterisk-users}} list on this back in [June|http://lists.digium.com/pipermail/asterisk-users/2016-June/289291.html] that explained some of the issues, with some workarounds at the system level that could be done until we put a workaround in Asterisk to bypass the buggy behavior.

All of the work to bypass the odd behavior of UnixODBC and its connectors should be done in 13.10.0 - have you tried that version yet?



By: Andreas Krüger (woopstar) 2016-08-17 13:33:12.738-0500

Hi Matt,

Thanks for posting. We will look into installing the unixODBC 2.3.2 asap. Will get back to you if it makes any difference.

Regarding 13.10, we did try to upgrade, but a strange error happened. The issue is located here: ASTERISK-26263. It only happens in 13.10 and such; 13.9.2, which we currently run, does not have the problem. The same setup routine is used, only the version of the tar file that gets downloaded is different. We use Ansible to install and configure an entire Asterisk server.

By: Andreas Krüger (woopstar) 2016-08-18 07:11:03.528-0500

@jcolp I've read your post from the mailing list. Do you suggest it has to be UnixODBC 2.3.2? I see 2.3.3 and 2.3.4 have been released.

By: Joshua C. Colp (jcolp) 2016-08-18 07:14:43.117-0500

If using Asterisk 13.10 then the version of UnixODBC doesn't really matter, as the behavior will match the previous versions of Asterisk, which did not have any problems with UnixODBC. It's still a good idea, if you can, to use the latest UnixODBC as well as database connector; it's just not critical.

By: Jeppe Ryskov Larsen (ryskov) 2016-08-23 08:00:02.506-0500

Neither upgrading (to 13.11-rc1) nor updating UnixODBC to the newest version did the trick, as just today we saw instances of this issue. One of the times it was running 13.9 with UnixODBC 2.3.4, the other time we were running 13.11-rc1 with UnixODBC 2.3.4. Our MySQL connector was updated along with UnixODBC.

I will attach backtraces + logs from the incident on 13.11-rc1 + UnixODBC 2.3.4

By: Jeppe Ryskov Larsen (ryskov) 2016-08-23 08:03:42.421-0500

I have uploaded the usual logs, but I also included one called cli-output.txt this time, showing the last ~100 lines before the 'Max retries' message. I happened to have iax2 set debug on by coincidence when this happened.

Files are prefixed with 23-08-2016

By: Jeppe Ryskov Larsen (ryskov) 2016-08-23 08:25:01.371-0500

I went digging in the logs, and I have uploaded the full iax2 debug log output, instead of just the last 100 lines, which seemed insufficient.

23-08-2016-cli-output-full.txt is the name of the file

By: Andreas Krüger (woopstar) 2016-08-23 08:27:57.536-0500

I was looking into the source code of chan_iax2.c, and what we're seeing from the issue here is that the following log call (https://github.com/asterisk/asterisk/blob/master/channels/chan_iax2.c#L3577-#L3583) is invoked in some sort of infinite loop, while the server stops responding.

Looking at the next lines (https://github.com/asterisk/asterisk/blob/master/channels/chan_iax2.c#L3586-#L3600), I'm led to believe that the IAX channel is never destroyed. The destroy in https://github.com/asterisk/asterisk/blob/master/channels/chan_iax2.c#L3599 is never called, as execution never reaches that part of the if clause.

But this is strictly an assumption based on what we're seeing.

We tried putting out a bounty on the dev mailing list, but that list seems a bit idle. The bounty of $2000 for resolving this issue still stands.

By: Jeppe Ryskov Larsen (ryskov) 2016-08-25 07:58:17.015-0500

Today we have been testing out some changes, both on our 13.9.0 and our 13.11.0-rc1 environments.

We noticed from our logs that a lot of the times when we have gotten the "max retries"-stall, some or all of the active channels would be stuck in the NoOp application. Another thing that we read somewhere was to turn off/minimize logging through logger.conf. I am starting to think something not related to iax2 is causing the issue, causing the iax2 to behave oddly.

So we removed all of our NoOps from our dialplan and now only log verbose,error,warning instead of verbose,error,warning,debug,fax,dtmf.

On a normal day at this point (14:53, clients start using the system at 08:00) we would have gotten the error at least once or twice, but we haven't seen it today, so far.

I will report back if this is bogus and we see it again. After all, these are just random and desperate attempts at fixing this.

By: Jeppe Ryskov Larsen (ryskov) 2016-08-26 02:15:01.240-0500

Discard my previous comment. It happened again. I have attached the debug info.

By: Andreas Krüger (woopstar) 2017-04-21 05:01:40.982-0500

Okay. So we are still seeing this issue. Today we got it again:
{noformat}
[2017-04-21 11:52:41] WARNING[3509] chan_iax2.c: Max retries exceeded to host 185.161.127.73 on IAX2/osl1-voip-cluster01-upstream04-9829 (type = 6, subclass = 11, ts=387427, seqno=104)
[2017-04-21 11:52:41] WARNING[3504] chan_iax2.c: Max retries exceeded to host 185.161.127.72 on IAX2/osl1-voip-cluster01-upstream03-15532 (type = 6, subclass = 11, ts=180441, seqno=49)
[2017-04-21 11:52:41] WARNING[3500] chan_iax2.c: Max retries exceeded to host 185.161.127.73 on IAX2/osl1-voip-cluster01-upstream04-16195 (type = 6, subclass = 11, ts=1246488, seqno=71)
[2017-04-21 11:52:41] WARNING[29600] chan_iax2.c: Max retries exceeded to host 185.161.127.72 on IAX2/osl1-voip-cluster01-upstream03-7394 (type = 6, subclass = 2, ts=1959345, seqno=49)
[2017-04-21 11:52:41] WARNING[29358] chan_iax2.c: Max retries exceeded to host 185.161.127.72 on IAX2/osl1-voip-cluster01-upstream03-2294 (type = 6, subclass = 11, ts=230881, seqno=34)
[2017-04-21 11:52:41] WARNING[28199] chan_iax2.c: Max retries exceeded to host 185.161.127.73 on IAX2/osl1-voip-cluster01-upstream04-16195 (type = 6, subclass = 2, ts=1247489, seqno=72)
[2017-04-21 11:52:41] WARNING[26616] chan_iax2.c: Max retries exceeded to host 185.161.127.72 on IAX2/osl1-voip-cluster01-upstream03-7394 (type = 6, subclass = 11, ts=1958345, seqno=48)
{noformat}

What could cause this to happen suddenly? All servers are connected to one switch. The network is dead simple and just "works".

When it happens, the asterisk server becomes unstable and still needs a restart.


Is it really such a bad idea to interconnect Asterisk servers with IAX?