
Summary: ASTERISK-23719: Asterisk locks, UDP buffer overflow, 1000+ spawns of 'chan_iax2.c find_idle_thread()'
Reporter: SteelPivot (steelpivot)
Labels:
Date Opened: 2014-05-05 16:43:39
Date Closed: 2014-07-14 08:47:02
Priority: Critical
Regression?:
Status: Closed/Complete
Components: Channels/chan_iax2
Versions: 11.6.1
Frequency of Occurrence: Frequent
Related Issues:
Environment: CentOS 6.4min
Attachments: ( 0) 1399319401-core-show-locks.txt
( 1) 1399324201-backtrace-threads.txt
( 2) 1399324201-core-show-threads.txt
( 3) 1399324201-netstat.txt
( 4) 1399750801-backtrace-threads.txt
( 5) 1399750801-core-show-taskprocessors.txt
( 6) 1399750801-core-show-threads.txt
( 7) 1399750801-netstat.txt
( 8) 1400606323-backtrace-threads.txt
( 9) 1400606323-client-iax2.pcap
(10) 1400606323-client-iax2-debug.txt
(11) 1400606323-core-show-taskprocessors.txt
(12) 1400606323-core-show-threads.txt
(13) 1400606323-host-iax2.pcap
(14) 1400606323-host-iax2-debug.txt
(15) 1400606323-netstat.txt
Description: We've been experiencing an issue for a few months concerning IAX2 peers, one that has recently gotten more severe after upgrading from 11.2 to 11.6cert2.

The initial symptom was all (100+) IAX2 peers going UNREACHABLE. However, on further inspection, what happens is that the UDP queues sharply increase (visible via netstat -antup), the number of Asterisk threads climbs (to over 1000 in some cases), and Asterisk, of course, stops responding to inbound/outbound calls on any channel (SIP or IAX2).

After recompiling with DEBUG_THREADS and BETTER_BACKTRACES, I discovered that issuing a gdb -ex "thread apply all bt" command (etc.) to grab a backtrace will free up the UDP queues, and Asterisk will then become responsive again. Currently I have a script running every 5 minutes that pulls the UDP queues for asterisk processes; upon seeing a queue above 300,000 packets, it issues a "netstat -antup", "core show locks", "core show threads", and "gdb -ex 'thread apply all bt' --batch asterisk `pidof asterisk` > $debugdir/$date-backtrace-threads.txt".
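
For reference, a minimal sketch of the kind of watchdog described above; the paths, threshold handling, and field parsing are illustrative rather than the reporter's exact script:

{noformat}
#!/bin/bash
# Watchdog sketch: if the Recv-Q for any asterisk socket exceeds the
# threshold, capture diagnostics. Assumes a single asterisk PID.
debugdir=/var/log/asterisk/debug
threshold=300000
date=$(date +%s)

# Column 2 of netstat output is Recv-Q (the kernel reports it in bytes).
maxq=$(netstat -antup 2>/dev/null | awk '/asterisk/ {if ($2 > m) m = $2} END {print m+0}')

if [ "$maxq" -gt "$threshold" ]; then
    netstat -antup > "$debugdir/$date-netstat.txt"
    asterisk -rx 'core show locks'   > "$debugdir/$date-core-show-locks.txt"
    asterisk -rx 'core show threads' > "$debugdir/$date-core-show-threads.txt"
    gdb -ex 'thread apply all bt' --batch asterisk "$(pidof asterisk)" \
        > "$debugdir/$date-backtrace-threads.txt"
fi
{noformat}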

I have previously increased the kernel UDP maximums in sysctl.conf, and added options for iaxthreadcount/iaxmaxthreadcount in iax.conf.
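
For anyone following along, those tunables look roughly like this; the values shown are illustrative, not the reporter's actual settings:

{noformat}
# /etc/sysctl.conf -- raise the kernel's UDP receive buffer ceilings
# (apply with "sysctl -p"; values illustrative):
net.core.rmem_max = 16777216
net.core.rmem_default = 262144

; iax.conf, [general] section -- chan_iax2 thread pool sizing
; (sample defaults: iaxthreadcount=10, iaxmaxthreadcount=100):
iaxthreadcount = 50
iaxmaxthreadcount = 300
{noformat}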

I cannot repeat this issue at will, but it happens every hour or so (sometimes every few minutes). I have debug logs and backtraces for each occurrence.
Comments:

By: SteelPivot (steelpivot) 2014-05-05 16:44:56.137-0500

Backtrace, threads, netstat

By: Rusty Newton (rnewton) 2014-05-05 17:45:25.832-0500

If this is happening every hour or so, please test with the latest of the Subversion 11 branch to make sure a fix does not already exist, while we look at the debug here.
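
(For reference, checking out and building the then-current 11 branch looked roughly like this; the URL reflects the project's SVN layout at the time.)

{noformat}
svn checkout http://svn.asterisk.org/svn/asterisk/branches/11 asterisk-11
cd asterisk-11
./configure && make && make install
{noformat}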

By: Matt Jordan (mjordan) 2014-05-06 10:40:25.169-0500

The output of {{core show locks}} shows that everything is piled up waiting for this channel lock to be released:

{noformat}
=== Thread ID: 0x7fb6261d9700 (pbx_thread           started at [ 6865] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 3853 __ast_read chan 0x7fb60401f0d0 (1)
main/logger.c:1599 ast_bt_get_addresses() (0x50a283+1D)
main/lock.c:218 __ast_pthread_mutex_lock() (0x502454+C9)
main/astobj2.c:192 __ao2_lock() (0x44c6fa+96)
main/channel.c:3856 __ast_read()
main/channel.c:4365 ast_read() (0x47e09a+1D)
main/channel.c:7576 ast_generic_bridge()
main/channel.c:8046 ast_channel_bridge() (0x4898a4+19D3)
main/features.c:4483 ast_bridge_call() (0x4d3ea0+F80)
apps/app_dial.c:3045 dial_exec_full()
apps/app_dial.c:3129 dial_exec()
main/pbx.c:1622 pbx_exec() (0x52bc19+214)
main/pbx.c:4918 pbx_extension_helper()
main/pbx.c:6035 ast_spawn_extension() (0x539a1f+65)
main/pbx.c:6509 __ast_pbx_run()
main/pbx.c:6840 pbx_thread()
main/utils.c:1162 dummy_start()
pthread_create.c:0 start_thread()
:0 __clone() (0x7fb6982cab00+6D)
{noformat}

There isn't a condition in {{ast_read}} that would cause the lock to go unreleased. This appears to be an issue with {{DEBUG_THREADS}} causing locks not to be released.

Does this issue persist if you disable {{DEBUG_THREADS}}?
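
(For reference, a sketch of disabling the flag non-interactively with menuselect; the same thing can be done by unticking DEBUG_THREADS under Compiler Flags in {{make menuselect}}.)

{noformat}
# From the Asterisk source tree: clear the flag, rebuild, reinstall.
menuselect/menuselect --disable DEBUG_THREADS menuselect.makeopts
make && make install
# Restart so the new binary is the one actually running.
service asterisk restart
{noformat}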

By: SteelPivot (steelpivot) 2014-05-06 11:12:03.654-0500

I enabled DEBUG_THREADS because my previous issue was that all IAX2 peers would drop within the same minute (11.2cert, no response to QUALIFY; more specifically, not processing the packets it was receiving). The only fix for this was a restart of Asterisk (I've seen similar bugs reported in the past). I wanted to be able to provide debug info in the event that I needed to open an issue, or in case I was missing something.

I first upgraded to 11.6cert2 (no DEBUG_THREADS) since there were many updates to chan_iax2 from 11.2 to 11.6, but the unreachable IAX2 peers issue persisted (note, I hadn't noticed but also wasn't looking for the large UDP queue issue).

I then recompiled with DEBUG_THREADS and began noticing the UDP queuing issue. In this case I decided that the unreachable peers were probably more an effect of other issues rather than the issue itself, thus my post here.

I will try disabling DEBUG_THREADS and see if the UDP/threading issue persists. Since I won't be able to post locks in that case, I'll do a log_rotate and post the debug log should the issue appear.

By: SteelPivot (steelpivot) 2014-05-12 12:09:24.067-0500

Okay, so a bit of backstory here: we have two Asterisk servers running in active/active fail-over between two datacenters. That setup is not terribly important; however, last week I recompiled without DEBUG_THREADS on both systems. Both were running on VMs until last week, when I moved the primary system to a bare metal server (but left the secondary on a VM). Since doing this, the primary has not had any issues.

However, the secondary has again shown the same symptoms (appropriate files attached, sans core-show-locks since no DEBUG_THREADS). The only fix was a "service asterisk restart" or "asterisk -rx 'core restart now'", otherwise the IAX2 trunks would never come back online.

The VM is running on KVM, using the virtio network driver, with normal type-1 bonding on the host NICs, etc. for KVM.

The only port showing an overflow in this case is port 4569, maxed at the system maximum, 230400. There were no active calls on this system at the time. Again, all IAX2 trunks went UNREACHABLE within the same minute (which is expected, given the default qualify time of 60s).
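
(Two quick shell checks that show this kind of overflow; output formats vary slightly by distro.)

{noformat}
# Receive queue depth on the IAX2 socket (Recv-Q column, in bytes):
netstat -anu | grep 4569

# Kernel-wide UDP counters; "packet receive errors" / RcvbufErrors
# climb when a socket's receive buffer overflows:
netstat -su
{noformat}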

By: SteelPivot (steelpivot) 2014-05-12 12:10:17.330-0500

Uploaded May 12. Error occurred May 10, 15:37.

By: Matt Jordan (mjordan) 2014-05-12 14:41:48.657-0500

The backtrace of threads from the faulty system doesn't appear to show any odd state, nor does the taskprocessor dump. I'm curious what an IAX debug would show when this occurs. Right now, this doesn't look so much like a deadlock as it does some odd behaviour in {{chan_iax2}}.

By: SteelPivot (steelpivot) 2014-05-12 15:10:12.435-0500

Typically, an IAX2 debug on the faulty system will show it sending out POKE messages, but not receiving PONGs. However, on the client machine, I'll see the POKEs being received and PONGs being sent. A tcpdump on the faulty system shows the PONGs coming in, though Asterisk doesn't appear to acknowledge them. This would hold true if the UDP buffer were indeed overflowed, since the incoming packets would never be processed by Asterisk. I do not believe the opposite is true, since I've always seen the buffer overflow BEFORE the IAX2 trunks go unreachable, and not the other way round.
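
(For anyone reproducing this, a capture limited to IAX2 traffic can be taken on each end with something like the following; the interface name is illustrative.)

{noformat}
tcpdump -i eth0 -n -w host-iax2.pcap udp port 4569
{noformat}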

Additionally curious is that when I have DEBUG_THREADS enabled, running the gdb command will clear the locks and allow everything to go back to normal. Without DEBUG_THREADS, running gdb operates as expected (it only grabs the backtrace and frees up nothing).

I had hoped this was caused by the issue that was fixed recently: "prevent network thread starting before all helper threads are ready. On startup, it's possible for a frame to arrive before the processing threads were ready. In iax2_process_thread() the first pass through falls into ast_cond_wait; should a frame arrive before we are at ast_cond_wait, the signal will be ignored. The result: iax2_process_thread stays at ast_cond_wait forever, with deferred frames being queued. Fix: when creating initial idle iax2_process_threads, wait for init_cond to be signalled after each thread is started." But that fix doesn't appear to have fixed my issue.

I'll grab a debug log should this occur again.

By: SteelPivot (steelpivot) 2014-05-20 12:46:53.092-0500

It's happened again this morning. Same symptoms: all IAX2 trunks drop within one minute, the port 4569 UDP buffer is full, and Asterisk is neither listening nor responding to incoming IAX2 messages.

I'll attach pcap files showing traffic on the problem host to a specific client, and the pcap traffic from that client. Here, you'll see that the problem host is sending out POKE messages AND getting PONG messages back, and likewise that the specified client is receiving POKEs and responding with PONGs (and sending its own POKEs but getting no responses). Of course, there are no ACKs anywhere.

An IAX2 debug inside Asterisk on the problem host shows only POKE messages being sent, while a debug on the client shows those POKEs received and PONG responses (while also sending POKEs of its own and receiving no response).

The debug logger messages on the problem host show only recurring messages of "chan_iax2.c: ip callno count decremented to X for X.X.X.X", "chan_iax2.c: schedule decrement of callno used for X.X.X.X in 60 seconds", and "I was supposed to send a LAGRQ with callno XXXX, but no such call exists." These, I believe, are expected with a downed IAX2 trunk.

By: SteelPivot (steelpivot) 2014-05-20 12:47:43.049-0500

May 20, 2014

By: Rusty Newton (rnewton) 2014-06-11 15:25:09.546-0500

I'm out of ideas. However, this does look buggy, and we could use a bit more debug before opening it up.

Please provide an Asterisk full log with verbose and debug turned up to 5, showing when the issue occurs. Be sure that IAX2 debugging is also enabled so that we have a trace meshed with the full log (so that we can correlate what's happening between the two more easily).

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
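
(A sketch of the relevant knobs; the logger channel name depends on what is configured in logger.conf.)

{noformat}
# In logger.conf, make sure the "full" channel includes debug and verbose:
#   full => notice,warning,error,debug,verbose

# Then, from the system shell (or the equivalent commands at the Asterisk CLI):
asterisk -rx 'core set verbose 5'
asterisk -rx 'core set debug 5'
asterisk -rx 'iax2 set debug on'
asterisk -rx 'logger rotate'
{noformat}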

By: Rusty Newton (rnewton) 2014-06-25 19:22:57.452-0500

[~steelpivot] Is the issue still occurring?

By: Rusty Newton (rnewton) 2014-07-14 08:46:55.004-0500

Suspended due to lack of activity. Please request a bug marshal in #asterisk-bugs on the IRC network irc.freenode.net to reopen the issue should you have the additional information requested. Further information can be found at http://www.asterisk.org/developers/bug-guidelines