[Home]

Summary:ASTERISK-17143: [patch] Asterisk hangs with no calls being able to go in or out, and the console does not respond to commands
Reporter:Jonas (jonascph)Labels:
Date Opened:2010-12-21 09:42:56.000-0600Date Closed:2011-09-22 07:37:41
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_iax2
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk_1.8_hang_other_log.txt
( 1) asterisk_1.8_hang.txt
( 2) asterisk-restart-2011-01-08-1414.txt
( 3) asterisk-restart-2011-01-08-1534.txt
( 4) asterisk-restart-2011-01-11-1520.txt
( 5) asterisk-restart-2011-01-11-1748.txt
( 6) asterisk-restart-2011-01-12-0830.txt
( 7) asterisk-restart-2011-01-12-1824.txt
( 8) asterisk-restart-2011-01-13-1206.txt
( 9) asterisk-restart-2011-01-14-0102.txt
(10) asterisk-restart-2011-01-14-1434.txt
(11) asterisk-restart-2011-01-14-1822.txt
(12) asterisk-restart-2011-01-15-1658.txt
(13) asterisk-restart-2011-01-15-2212.txt
(14) asterisk-restart-2011-01-21-1410.txt
(15) asterisk-restart-2011-01-27-2332.txt
(16) asterisk-restart-2011-01-28-2204.txt
(17) asterisk-restart-2011-01-29-0528.txt
(18) asterisk-restart-2011-01-31-1122.txt
(19) asterisk-restart-2011-01-31-1330.txt
(20) asterisk-restart-2011-01-31-1402.txt
(21) asterisk-restart-2011-01-31-1458.txt
(22) asterisk-restart-2011-01-31-1640.txt
(23) asterisk-restart-2011-02-01-0806.txt
(24) asterisk-restart-2011-02-02-1424.txt
(25) asterisk-restart-2011-02-03-1220.txt
(26) asterisk-restart-2011-02-03-1240.txt
(27) asterisk-restart-2011-02-18-0916.txt
(28) asterwatch.sh
(29) backtrace-jrc-2010-01-04.txt
(30) backtrace-jrc-2010-01-06.txt
(31) backtrace-jrc-2010-01-06-2.txt
(32) backtrace-threads-jonas1.txt
(33) backtrace-threads-jonas2.txt
(34) core-show-locks-jrc-2010-01-04.txt
(35) core-show-locks-jrc-2010-01-06.txt
(36) core-show-locks-jrc-2010-01-06-2.txt
(37) iax-debug-jrc-2010-01-04.txt
(38) locks-jonas1.txt
(39) locks-jonas2.txt
Description:I'm not sure if we are seeing one or two different issues.

We've had stability issues with all versions of 1.8 we've tried. Those include 1.8rc2, 1.8 final,a trunk checkout some weeks after 1.8 final and latest 1.8.2rc1 all had the same issue or similar issues (They all froze/crashed)

So far i do not have any type of debug information to show, but im counting on being able to get some next time it happens. It's so far happened about once every 6-9 days, and strangely enough it typically doesnt happen for over a week and then happens 2-3 times on the same day.
Comments:By: Jonas (jonascph) 2010-12-21 09:52:54.000-0600

Added full debug log showing how it looks once asterisk stops working. There are two different errors/warnings in the logs. Hence why I'm not sure if its one or two issues.

There is no pattern (that i've noticed) to whats going on call flow wise when it happens.

Also this is a physical server running Debian lenny. Its an older Xeon 2.4 Ghz with 4GB ram.

There are 49 sip peers, 48 endpoints and 1 sip provider.
There a single sip registration again the 1 sip provider.

I have 4 iax peers, all to providers.
There are 12 IAX2 registrations (each provider has 3 ip's i need to register against)



By: Paul Belanger (pabelanger) 2010-12-21 11:12:58.000-0600

Debugging deadlocks:

Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install)

This will then give you the console command:

core show locks

When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt

# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt

gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!



By: John Covert (jcovert) 2010-12-28 13:55:19.000-0600

This looks like issue ASTERISK-1728521 again.



By: John Covert (jcovert) 2010-12-29 21:04:23.000-0600

Please see the reproducer at

https://issues.asterisk.org/view.php?id=17521#130036

By: John Covert (jcovert) 2011-01-04 16:22:06.000-0600

I have captured data from a hang, and have attached two files.

These are the console messages leading up to the hang:

[Jan  4 16:00:38] NOTICE[15828]: chan_iax2.c:10512 socket_process: Rejected connect attempt from 63.210.44.150, requested/capability '0x400000000000000 (nothing)'/'0x7fff000000000000 (nothing)' incompatible with our capability '0x80e (gsm|ulaw|alaw|g726)'.

   -- Accepting UNAUTHENTICATED call from 66.54.140.46:
      > requested format = ulaw,
      > requested prefs = (ulaw|alaw|gsm|g729),
      > actual format = ulaw,
      > host prefs = (ulaw|alaw|g726|gsm),
      > priority = caller

I'm not sure how long we were hung before I noticed it; at approximately 17:45 I picked up a SIP extension and discovered that asterisk was not responding to it.

I gathered the data, as requested.  When gdb exited and released asterisk, there was some further activity in the IAX2 channel:

[Jan  4 16:58:11] WARNING[15828]: chan_iax2.c:1123 iax_error_output: Information element length exceeds message size
[Jan  4 16:58:11] WARNING[15828]: chan_iax2.c:9901 socket_process: Undecodable frame received from '192.168.0.4'
[Jan  4 16:58:11] WARNING[15828]: chan_iax2.c:1123 iax_error_output: Information element length exceeds message size
[Jan  4 16:58:11] WARNING[15828]: chan_iax2.c:9901 socket_process: Undecodable frame received from '192.168.0.4'
   -- Registered IAX2 'jrclaptop' (AUTHENTICATED) at 192.168.0.4:47569
   -- Registered IAX2 'jsolathome' (AUTHENTICATED) at 69.183.230.139:4569
   -- Registered IAX2 'x29' (AUTHENTICATED) at 192.168.0.17:4569
   -- Registered IAX2 'x38' (AUTHENTICATED) at 79.240.220.121:4569

But the SIP phones were not able to register.  I turned on SIP debugging, and got no response at all, other than the "sip debugging enabled."  I have dozens of SIP peers, so there was certainly activity at the interface.

I turned on IAX2 debugging, and there was quite a bit of activity.  I attempted a call from and iaxy station, and the call was not accepted properly.  I dialled 953, but only part of that was received.  The iax debug output is also supplied.

I issued a core restart now, and the system is up again.  At least one of my consulting clients was unable to get through to me during the outage.  Lost $$$.

I hope that this helps resolve whatever is ailing the IAX2 channel, which has become nearly unusable since somewhere during the 1.6 timeframe.

/john

By: Russell Bryant (russell) 2011-01-05 13:11:34.000-0600

I see in the attached files that res_timing_pthread is in use.  This is the timing module of last resort and has known poor performance (high CPU consumption).  Please run "make menuselect" and enable res_timing_timerfd if possible.  If not, install DAHDI and try res_timing_dahdi.

By: John Covert (jcovert) 2011-01-05 13:24:00.000-0600

Not possible on MacOS X.

By: John Covert (jcovert) 2011-01-06 07:12:25.000-0600

Hung again, probably for over 12 hours.  The following console log output has appeared since the last successful call.  The failed attempted call at the beginning occurred around 19:00 yesterday.  There was no attempt to use the phones yesterday evening, and this was noticed this morning.

   -- Attempting call on local/cancel@forward for s@nocdr:1 (Retry 1)
   -- Attempting call on local/cancel@forward for s@nocdr:1 (Retry 1)
   -- Attempting call on local/cancel@forward for s@nocdr:1 (Retry 1)
   -- Attempting call on local/cancel@forward for s@nocdr:1 (Retry 1)
   -- Attempting call on local/cancel@forward for s@nocdr:1 (Retry 1)
   -- Accepting UNAUTHENTICATED call from 66.54.140.46:
      > requested format = ulaw,
      > requested prefs = (ulaw|alaw|gsm|g729),
      > actual format = ulaw,
      > host prefs = (ulaw|alaw|g726|gsm),
      > priority = caller
[Jan  5 19:38:29] WARNING[17879]: res_jabber.c:2753 aji_recv_loop: JABBER: socket read error
[Jan  5 20:35:15] NOTICE[17879]: chan_iax2.c:10512 socket_process: Rejected connect attempt from 63.210.44.150, requested/capability '0x400000000000000 (nothing)'/'0x7fff000000000000 (nothing)' incompatible with our capability '0x80e (gsm|ulaw|alaw|g726)'.
[Jan  5 20:39:23] NOTICE[17879]: chan_iax2.c:10512 socket_process: Rejected connect attempt from 63.210.44.150, requested/capability '0x400000000000000 (nothing)'/'0x7fff000000000000 (nothing)' incompatible with our capability '0x80e (gsm|ulaw|alaw|g726)'.
[Jan  6 04:51:24] NOTICE[17879]: chan_iax2.c:11867 __iax2_poke_noanswer: Peer 'x38' is now UNREACHABLE! Time: 139
   -- Registered IAX2 'x38' (AUTHENTICATED) at 79.240.212.141:4569
[Jan  6 04:51:33] NOTICE[17879]: chan_iax2.c:11924 iax2_poke_peer: Still have a callno...
[Jan  6 04:51:33] NOTICE[17879]: chan_iax2.c:10802 socket_process: Peer 'x38' is now REACHABLE! Time: 136
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Attempting call on local/dtk@dectalk-orig/n for 77@dectalk-orig:1 (Retry 1)
   -- Accepting UNAUTHENTICATED call from 66.54.140.46:
      > requested format = ulaw,
      > requested prefs = (ulaw|alaw|gsm|g729),
      > actual format = ulaw,
      > host prefs = (ulaw|alaw|g726|gsm),
      > priority = caller

asterisk has used a lot of cpu time (on multiple cpus, since the total is greater than the elapsed time) since it was restarted, and is still using time, but not a lot:

asterisk*CLI> core show uptime
System uptime: 1 day, 14 hours, 52 minutes, 25 seconds
Last reload: 1 day, 14 hours, 52 minutes, 25 seconds

 PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
17879 asterisk     0.0% 94:34.80  45   706  1086  77.1M  14.8M  73.4M   246M

By: Jonas (jonascph) 2011-01-06 07:18:18.000-0600

I'm not dead so I thought i'd post an update.

Luckily our installation have been crash free for 2 weeks now. I do suspect the fact that there have been a ton of holidays etc, meaning alot less activity is a factor. So if everything goes as expected, im guessing asterisk mgiht crash late this week early next week. But we will see.

I will try to switch to the dahdi timer if it does crash.

Unfortunately i can not get res_timing_timerfd to work with any debian kernels, and i only use debian kernels, i do not compile them myself, so i dont see how I can do anything about that unfortunately.

But anyway i will report back as soon as i see the crash again, if i see it again ;-)

By: John Covert (jcovert) 2011-01-06 16:13:09.000-0600

Another one today.  I'm not sure how much longer I can stay on 1.8.1.1; I arrived at a meeting to complaints from my clients that they had been trying to call me for twenty minutes.

Nothing interesting on the console other than an incoming IAX call.

New files uploaded.

By: John Covert (jcovert) 2011-01-08 13:31:10.000-0600

Another hang.  Shortly before:

[Jan  8 14:09:54] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:09:55] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:09] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:09] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:09] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:09] WARNING[3977]: chan_iax2.c:1123 iax_error_output: Information element length exceeds message size
[Jan  8 14:10:09] WARNING[3977]: chan_iax2.c:9901 socket_process: Undecodable frame received from '79.240.211.162'
[Jan  8 14:10:09] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:13] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:16] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:18] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:24] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:24] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:24] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:29] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:29] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:32] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
   -- Hungup 'IAX2/x29-5710'
 == Spawn extension (inbound-ringall, 990, 20) exited non-zero on 'SIP/von-g-00000045'
[Jan  8 14:10:38] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979
[Jan  8 14:10:38] WARNING[3977]: channel.c:1399 __ast_queue_frame: Exceptionally long voice queue length queuing to IAX2/x29-12979

Then some processing, but not correct, and finally

[Jan  8 14:14:36] WARNING[3977]: chan_iax2.c:1123 iax_error_output: Information element length exceeds message size
[Jan  8 14:14:36] WARNING[3977]: chan_iax2.c:9901 socket_process: Undecodable frame received from '79.240.211.162'


And general failure to respond to SIP messages, which I now have set to automatically initiate collection of the requested data, and a restart.

See asterisk-restart-2011-01-08-1414.txt

By: John Covert (jcovert) 2011-01-08 14:38:09.000-0600

Another hang.  SIP calls not accepted.  Nothing obvious on console.

See asterisk-restart-2011-01-08-1534.txt

By: John Covert (jcovert) 2011-01-11 20:28:51.000-0600

Two more hangs uploaded.

asterisk-restart-2011-01-11-1520.txt

asterisk-restart-2011-01-11-1748.txt

By: John Covert (jcovert) 2011-01-13 11:25:50.000-0600

Three more:

asterisk-restart-2011-01-12-0830.txt

asterisk-restart-2011-01-12-1824.txt

asterisk-restart-2011-01-13-1206.txt

---

With failures this often, the only way I was able to remain on 1.8.1.1 and not revert to something more stable was to run a cron job every two minutes to check asterisk's health and restart it forcibly after gathering the core show locks and gdb info requested by the developers.

Others may find it useful, so I've uploaded asterwatch.sh.

You'll need sipsak, which is readily available.

Note that the ref to asterisk-1.6 is only for the "cli connector", because the one built with all the debug stuff for 1.8.1.1 is noisy as reported elsewhere.  The output is all from the real 1.8.1.1 asterisk being talked to with "-rx".

/john



By: John Covert (jcovert) 2011-01-14 00:04:14.000-0600

Another:

asterisk-restart-2011-01-14-0102.txt

By: John Covert (jcovert) 2011-01-14 15:54:08.000-0600

[Jan 14 14:33:28] ERROR[12688]: lock.c:129 __ast_pthread_mutex_destroy: astobj2.c line 265 (internal_ao2_ref): Error destroying mutex &obj->priv_data.lock: Resource busy
[Jan 14 14:33:28] ERROR[12688]: lock.c:384 __ast_pthread_mutex_unlock: chan_sip.c line 3638 (__sip_autodestruct): mutex 'p' freed more times than we've locked!
[Jan 14 14:33:28] ERROR[12688]: astobj2.c:258 internal_ao2_ref: refcount -1 on object 0x1855310

then hung

asterisk-restart-2011-01-14-1434.txt

By: John Covert (jcovert) 2011-01-14 17:27:04.000-0600

nothing on the console of interest

asterisk-restart-2011-01-14-1822.txt

By: John Covert (jcovert) 2011-01-15 22:25:36.000-0600

two more.  Console info in file.

asterisk-restart-2011-01-15-1658.txt

asterisk-restart-2011-01-15-2212.txt



By: John Covert (jcovert) 2011-01-18 09:28:09.000-0600

Again.  Console info in file.

asterisk-restart-2011-01-18-0920.txt

By: Denis Smirnov (mithraen) 2011-01-20 17:47:12.000-0600

jcovert, thanks for your script. This bug reproduced on my installation of 1.8.1 up to 3 crashes in a day.

Now I install your script and post logs tomorrow.

By: Jonas (jonascph) 2011-01-21 05:44:13.000-0600

Hey again.

I woudl just like to report that we have not seen this crash in over 3 weeks now. Ad we've even added more SIP and IAX peers and activity have increased.

Yet Asterisk now apear to run stable.

I'm currently on Asterisk 1.8.2-rc1

jcovert did u try to upgrade?

I'm not sure what i did that helped (since i did various things trying to fix this issue)

But the dialplan and the hardware is the same. I think i saw an improvement since 1.8.1 but not quite sure when it became stable.

Anyway i think everyone should upgrade to 1.8.2.x and see if that helps?

By: John Covert (jcovert) 2011-01-21 13:16:35.000-0600

I am on 1.8.2.1 and have just had another hang.  Nothing at all useful on the console.

asterisk-restart-2011-01-21-1410.txt

By: John Covert (jcovert) 2011-01-27 23:29:15.000-0600

Again.  Nothing particularly interesting on the console:

 == ast_get_enum(num='+555555555', tech='iax2', suffix='std.ckts.info', options='', record=1
 == ENUM options(): pos=1, options='0'
 == ast_get_enum() profiling: FAIL, 5.5.5.5.5.5.5.5.5.std.ckts.info, 158 ms
   -- Executing [s@macro-cnetlookup:5] Set("SIP/freedigits-00000064", "result=") in new stack

asterisk-restart-2011-01-27-2332.txt

By: John Covert (jcovert) 2011-01-29 08:39:43.000-0600

Two more.  Was on the phone when the first one occurred, and the following had happened a few minutes before the failure:

   -- Executing [990@inbound-ringall:20] Dial("SIP/von-g-000000ea", "SIP/x21&SIP/x22&SIP/x23&SIP/x28&IAX2/x29&SIP/x33,120,m(5xbring)tM(whoanswered)") in new stack
   -- Called x21
   -- Called x22
   -- Called x23
   -- Called x28
   -- Call accepted by 192.168.0.17 (format ulaw)
   -- Format for call is ulaw
   -- Called x29
   -- Called x33
   -- Started music on hold, class '5xbring', on SIP/von-g-000000ea
   -- SIP/x21-000000eb is ringing
   -- SIP/x22-000000ec is ringing
   -- SIP/x23-000000ed is ringing
   -- IAX2/x29-9501 is ringing
   -- SIP/x33-000000ef is ringing
   -- SIP/x28-000000ee is ringing
   -- SIP/x28-000000ee is ringing
   -- SIP/x28-000000ee connected line has changed. Saving it until answer for SIP/von-g-000000ea
   -- SIP/x28-000000ee answered SIP/von-g-000000ea
   -- Hungup 'IAX2/x29-9501'
   -- Executing [s@macro-whoanswered:1] NoOp("SIP/x28-000000ee", "SIP/x28-000000ee") in new stack
   -- Executing [s@macro-whoanswered:2] MacroExit("SIP/x28-000000ee", "") in new stack
   -- Stopped music on hold on SIP/von-g-000000ea


asterisk-restart-2011-01-28-2204.txt


The second one -- nothing at all had happened for two hours.  I woke up, reached for the phone to dial time recording, and things were already dead (no evidence of that call on the console).  My watcher caught it at the next two-minute interval.

asterisk-restart-2011-01-29-0528.txt

By: John Covert (jcovert) 2011-01-31 13:29:31.000-0600

Three more.  Nothing informative on the console.

asterisk-restart-2011-01-31-1122.txt

asterisk-restart-2011-01-31-1330.txt

asterisk-restart-2011-01-31-1402.txt

By: John Covert (jcovert) 2011-01-31 15:59:34.000-0600

This message:

[Jan 31 14:50:59] ERROR[21576]: chan_sip.c:27996 setup_srtp: No SRTP module loaded, can't setup SRTP session.

And then this hang (but several minutes later):

asterisk-restart-2011-01-31-1458.txt

[Jan 31 16:38:44] ERROR[22061]: chan_sip.c:27996 setup_srtp: No SRTP module loaded, can't setup SRTP session.

And then this hang (rather soon):

asterisk-restart-2011-01-31-1640.txt

By: John Covert (jcovert) 2011-02-03 11:41:40.000-0600

Four more, two in close succession.

Before the first one:

   -- Executing [77@dectalk-orig:6] Hangup("Local/dtk@dectalk-orig-fc34;1", "") in new stack
 == Spawn extension (dectalk-orig, 77, 6) exited non-zero on 'Local/dtk@dectalk-orig-fc34;1'
 == Spawn extension (dectalk-orig, dtk, 3) exited non-zero on 'Local/dtk@dectalk-orig-fc34;2'
[Feb  1 06:45:14] NOTICE[22651]: pbx_spool.c:366 attempt_thread: Call completed to local/dtk@dectalk-orig/n
[Feb  1 07:05:04] NOTICE[22651]: chan_sip.c:19588 handle_response_peerpoke: Peer 'acat_x23' is now Lagged. (3235ms / 2000ms)
[Feb  1 07:05:14] NOTICE[22651]: chan_sip.c:19588 handle_response_peerpoke: Peer 'acat_x23' is now Reachable. (53ms / 2000ms)
   -- Registered SIP 'acat_x23' at 173.162.202.218:2072
[Feb  1 07:05:49] NOTICE[22651]: chan_sip.c:19588 handle_response_peerpoke: Peer 'acat_x23' is now Reachable. (55ms / 2000ms)
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 5ac49bb2575455913980c57f44ab2822@sip.broadvoice.com for seqno 2394 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 37065ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 2da601e57fd3093d35cd20992e81812d@voipuser.org for seqno 802 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 34825ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 026e145206fa0f175fb333094207b60b@cgate.covert.org for seqno 3545 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 34303ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 5e22924664d1a94b756e8d17312560e1@cgate.covert.org for seqno 1153 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 33828ms with no response
[Feb  1 08:05:12] NOTICE[22651]: chan_sip.c:24809 sip_poke_noanswer: Peer 'acat_x23' is now UNREACHABLE!  Last qualify: 51
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 13d51bf33b63affd0c92d3062f1a3569@sipgate.com for seqno 802 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 32649ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 62c050b05661730f5f2a71ff0bed8a38@freeswitch.covert.org for seqno 1154 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 32591ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 3c2dd69a320aaf0936dcbc6d751cea86@sipgate.de for seqno 802 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 32127ms with no response
[Feb  1 08:05:12] WARNING[22651]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 58ccc3e4445500e9181700be3771b6ca@sipgate.de for seqno 803 (Critical Request) -- See doc/sip-retransmit.txt.
Packet timed out after 32036ms with no response
[Feb  1 08:05:12] NOTICE[22651]: chan_sip.c:24809 sip_poke_noanswer: Peer 'acat_x24' is now UNREACHABLE!  Last qualify: 51
[Feb  1 08:05:35] NOTICE[22651]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for '9784290097@broadvoice' timed out, trying again (Attempt #2)
[Feb  1 08:05:35] NOTICE[22651]: chan_sip.c:24809 sip_poke_noanswer: Peer 'x37' is now UNREACHABLE!  Last qualify: 9
[Feb  1 08:05:38] NOTICE[22651]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for 'jcovert@voipuseruk' timed out, trying again (Attempt #2)
[Feb  1 08:05:39] NOTICE[22651]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for '16789357030@sphone.vopr.vonage.net' timed out, trying again (Attempt #2)
[Feb  1 08:05:39] NOTICE[22651]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for 'salmat399@jrc-salmat399' timed out, trying again (Attempt #2)

Then -- asterisk-restart-2011-02-01-0806.txt

Before the next one:

[Feb  2 14:17:13] ERROR[26881]: chan_sip.c:27996 setup_srtp: No SRTP module loaded, can't setup SRTP session.
   -- Stopped music on hold on SIP/88.87.32.27-000000e8
   -- Stopped music on hold on SIP/pstn5433-000000ef


Then -- asterisk-restart-2011-02-02-1424.txt

Next one: Nothing before, but at close to the same time
[Feb  3 12:20:35] ERROR[4959]: netsock2.c:249 ast_sockaddr_resolve: getaddrinfo("free1.voipgateway.org", "(null)", ...): No address associated with nodename
[Feb  3 12:20:36] WARNING[4959]: acl.c:569 resolve_first: Unable to lookup 'free1.voipgateway.org'

asterisk-restart-2011-02-03-1220.txt

and continuing afterwards:

[Feb  3 12:21:27] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:28] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:30] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:34] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:38] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:42] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:46] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19f3400 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:48] ERROR[10480]: netsock2.c:249 ast_sockaddr_resolve: getaddrinfo("free1.voipgateway.org", "(null)", ...): No address associated with nodename
[Feb  3 12:21:48] WARNING[10480]: acl.c:569 resolve_first: Unable to lookup 'free1.voipgateway.org'
[Feb  3 12:21:48] WARNING[10480]: acl.c:698 ast_ouraddrfor: Cannot connect
[Feb  3 12:21:48] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bb000 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:48] NOTICE[10480]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for '41325114995@free1.voipgateway.org' timed out, trying again (Attempt #3)
[Feb  3 12:21:48] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bb000 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:49] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bb000 (len 437) to (null) returned -1: Invalid argument
[Feb  3 12:21:51] WARNING[10480]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bb000 (len 437) to (null) returned -1: Invalid argument

until the DNS problem was resolved.

Just a few minutes later:

[Feb  3 12:33:21] ERROR[10480]: chan_sip.c:27996 setup_srtp: No SRTP module loaded, can't setup SRTP session.
   -- Stopped music on hold on SIP/von-g-00000001
 == Spawn extension (inbound-ringall, 990, 20) exited non-zero on 'SIP/von-g-00000001'
[Feb  3 12:33:48] WARNING[10480]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 2a12c99c3d3b25c00c3943a6328e5853@cgate.covert.org for seqno 1 (Critical Response) -- See doc/sip-retransmit.txt.
Packet timed out after 32001ms with no response
[Feb  3 12:33:51] WARNING[10480]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 2a12c99c3d3b25c00c3943a6328e5853@cgate.covert.org for seqno 2 (Critical Response) -- See doc/sip-retransmit.txt.
Packet timed out after 32000ms with no response
[Feb  3 12:33:52] WARNING[10480]: chan_sip.c:3386 retrans_pkt: Retransmission timeout reached on transmission 2a12c99c3d3b25c00c3943a6328e5853@cgate.covert.org for seqno 2 (Critical Response) -- See doc/sip-retransmit.txt.
Packet timed out after 32000ms with no response
asterisk*CLI>
asterisk*CLI>
asterisk*CLI>
[Feb  3 12:39:11] ERROR[10480]: chan_sip.c:27996 setup_srtp: No SRTP module loaded, can't setup SRTP session.

and then -- asterisk-restart-2011-02-03-1240.txt

By: Leif Madsen (lmadsen) 2011-02-08 13:10:12.000-0600

I appreciate you're doing your due diligence in testing and providing debugging information, but is it really necessary to post multiple debug files for each time this happens? What new information is to be gleaned with each subsequent file attached?

If there is new information in each of the files, then great, but posting the same thing over and over again probably isn't terribly useful :)

Is there any chance this could be reproduced on Linux as well?

By: John Covert (jcovert) 2011-02-10 13:14:38.000-0600

I don't know for sure that there isn't new information that might actually make it possible for the person who is assigned to look at this bug to see the problem in one of these that for whatever reason didn't show up in another.

I think there is a relationship between all of these problems that are plaguing my system.  I think part of it may be that it is Mac OS X, but others are reporting similar problems on Linux.  I think that there is some major error in queue handling and timing that shows up much more on slower machines than on blindingly fast machines.

/john

By: John Covert (jcovert) 2011-02-12 16:31:18.000-0600

Continues to fail.  I'm not uploading the following:

asterisk-restart-2011-02-11-1728.txt
asterisk-restart-2011-02-12-1032.txt

By: John Covert (jcovert) 2011-02-14 15:58:36.000-0600

Bad day.

One of these many hangs was after a long string of this sort of output:

[Feb 14 12:26:53] NOTICE[197]: chan_sip.c:12237 sip_reg_timeout:    -- Registration for '41325114995@free1.voipgateway.org' timed out, trying again (Attempt ASTERISK-3)
[Feb 14 12:26:54] WARNING[197]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19ffe00 (len 437) to (null) returned -1: Invalid argument

asterisk-restart-2011-02-14-1206.txt
asterisk-restart-2011-02-14-1208.txt
asterisk-restart-2011-02-14-1224.txt
asterisk-restart-2011-02-14-1230.txt
asterisk-restart-2011-02-14-1638.txt

not uploaded

By: John Covert (jcovert) 2011-02-18 08:51:03.000-0600

Running 1.8.3-rc3 + sip SVN 302314 and am still having SIP hangs.

asterisk-restart-2011-02-18-0916.txt

new release, maybe something new in the output.

Most recent one: nothing noteworthy on the console prior to the hang.
Immediately after the automatic restart due to SIP unresponsiveness, (coincidence?) this temporary situation started up with a provider:

[Feb 18 09:16:47] ERROR[11939]: netsock2.c:249 ast_sockaddr_resolve: getaddrinfo("free1.voipgateway.org", "(null)", ...): No address associated with nodename
[Feb 18 09:16:47] WARNING[11939]: acl.c:569 resolve_first: Unable to lookup 'free1.voipgateway.org'
[Feb 18 09:16:59] NOTICE[11939]: chan_sip.c:19588 handle_response_peerpoke: Peer 'x37' is now Reachable. (5554ms / 10000ms)
[Feb 18 09:16:59] NOTICE[11939]: chan_sip.c:19588 handle_response_peerpoke: Peer 'x34' is now Reachable. (5572ms / 10000ms)
[Feb 18 09:16:59] NOTICE[11939]: chan_sip.c:19588 handle_response_peerpoke: Peer 'x36' is now Reachable. (5573ms / 10000ms)
[Feb 18 09:17:04] ERROR[11939]: netsock2.c:249 ast_sockaddr_resolve: getaddrinfo("free1.voipgateway.org", "(null)", ...): No address associated with nodename
[Feb 18 09:17:04] WARNING[11939]: acl.c:569 resolve_first: Unable to lookup 'free1.voipgateway.org'
[Feb 18 09:17:04] WARNING[11939]: acl.c:698 ast_ouraddrfor: Cannot connect
[Feb 18 09:17:04] WARNING[11939]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bc800 (len 437) to (null) returned -1: Invalid argument
[Feb 18 09:17:04] WARNING[11939]: chan_sip.c:3115 __sip_xmit: sip_xmit of 0x19bc800 (len 437) to (null) returned -1: Invalid argument


that cleared up without another restart.

By: John Covert (jcovert) 2011-02-21 05:50:38.000-0600

This looks potentially troublesome:

   -- Executing [990@inbound-ringall:20] Dial("Local/77@dialstation-87a1;2", "SIP/x21&SIP/x22&SIP/x23&SIP/x28&IAX2/x29&SIP/x33,120,m(5xbring)tM(whoanswered)") in new stack
   -- Local/77@dialstation-87a1;1 is making progress passing it to Local/dtk@dectalk-orig-26cc;1
   -- Called x21
   -- Called x22
   -- Local/dtk@dectalk-orig-26cc;1 requested special control 20, passing it to Local/77@dialstation-87a1;1
   -- Local/dtk@dectalk-orig-26cc;1 requested special control 20, passing it to Local/77@dialstation-87a1;1
   -- Called x23
[Feb 21 06:45:08] ERROR[804]: lock.c:384 __ast_pthread_mutex_unlock: chan_sip.c line 23639 (handle_incoming): mutex 'owner' freed more times than we've locked!
[Feb 21 06:45:08] ERROR[804]: lock.c:416 __ast_pthread_mutex_unlock: chan_sip.c line 23639 (handle_incoming): Error releasing mutex: Operation not permitted
   -- Called x28
   -- Called x29

By: Jonas (jonascph) 2011-03-10 09:49:59.000-0600

I finally managed to catch to hangs/deadlocks.

I REALLY hope they mean something to somebody...

Please find 2 pair of files, a core show locks and a backtrace for each incident.

jonas1 and jonas2 accordingly.

By: Alejandro J Donato (coyotenq) 2011-04-12 14:43:44

Same issue here, no debug info, I notice Asterisk go 100% resources consumption and simply go dead, without any crash.

Can be a timing problem?

i unload and force to noload res_timing_timerfd.so, and looks like issue is gone. (now uses dahdi timing interface)

Can anyone confirm this?

By: John Covert (jcovert) 2011-04-12 14:53:07

I'm wondering if this is really IAX-related as initially reported.

I'm beginning to suspect blind transfer.  I do a lot of that, and notice that a lot of the hangs are shortly after a blind transfer.

This may be related to issue 18749

Note that I have the blind transfer patch from 18749 installed.

/john

By: Alejandro J Donato (coyotenq) 2011-04-16 14:34:40

Well, at least I can confirm issue is gone simply using dahdi timing source.

4 days without any incident in 4 production PBX, using SIP, IAX2 and DAHDI channels. (before change I have 2 or 3 deadlocks in a day)

As John say, transfers was affected to. Now, all is working fine.

Maybe this is a combined issue between res_timing_timerfd.so and IAX2 channels.

Hope this info helps to find it.



By: John Covert (jcovert) 2011-04-24 08:50:06

I have stopped doing blind transfers (which I normally do very often, putting calls into the parking lot, moving between Snom 300/Snom 820 on my desk, etc) and now have five days uptime.

So something in blind transfers is leaving doots around which causes this problem to occur sometime later.

By: John Covert (jcovert) 2011-04-27 10:51:44

Now that we know that it's blind transfers which causes this, is there enough information in all the uploads I have provided here for someone to start working on figuring this out, is is there some other info I should collect?

By: John Covert (jcovert) 2011-06-08 15:30:41.013-0500

I have just updated from 1.8.3.2 to 1.8.4.2 and believe that it is possible that the

class->srcfd = -1;

inserted into res_musiconhold.c at revision 305131 might have fixed the problem with the sip channel hanging.  I only have a few hours experience so far, but it's looking good.

/john

By: Bobby Hakimi (bobbymc) 2015-10-14 14:26:09.436-0500

ive read other posts that also claim its the timing device, i switched to dahdi to see if it makes a difference and will report of my findings