ASTERISK-18988: Confbridge ghost channels, segmentation fault under high load

[Home]

Summary: ASTERISK-18988: Confbridge ghost channels, segmentation fault under high load

Reporter: Alexander Akimov (alexander) Labels:

Date Opened: 2011-12-08 01:14:12.000-0600 Date Closed: 2012-02-01 08:50:06.000-0600

Priority: Blocker Regression? No

Status: Closed/Complete Components: Applications/app_confbridge

Versions: SVN Frequency of
Occurrence Frequent

Related
Issues:
must be completed before resolving ASTERISK-19129 Asterisk 10.2.0 Blockers

is related to ASTERISK-19100 ConfBridge crashes on closing timer when destroying conference bridge

Environment: Asterisk SVN-trunk-r346857 OS: Ubuntu 10.04.3 LTS Attachments: ( 0) asterisk-log.txt
( 1) backtrace.txt
( 2) backtrace2.txt
( 3) backtrace2012-01-02.txt
( 4) backtrace23_30_01.txt
( 5) chan_bridge_cleanup_v2.diff
( 6) chan_bridge_cleanup_v3.diff
( 7) confbrige_list.txt
( 8) core-locks.23_28_01.txt
( 9) core-locks.23_30_01.txt
(10) core-locks.23.29.01.txt
(11) ghosts-18-15--22-30.jpg
(12) ghosts-18-15--22-30.txt

Description: We have conferencing service running on confbridge application. When we have high load some of conferences stop working and they contain ghost sip channels, which are not closed properly.
We have the following repeatable errors in /var/log/asterisk/full file:
[Dec 6 22:34:47] ERROR[27638] utils.c: write() returned error: Broken pipe
[Dec 6 22:34:49] WARNING[22675] channel.c: Exceptionally long voice queue length queuing to Bridge/0x1aed600-input

[Dec 9 01:35:20] WARNING[24382] channel.c: Unable to write to alert pipe on Bridge/0x7f1014001b50-input (qlen = 35): Resource temporarily unavailable!

Comments: By: Alexander Akimov (alexander) 2011-12-08 01:17:07.322-0600

backtrace
By: Dmitry Melekhov (slesru) 2011-12-08 04:50:33.807-0600

Looks very similar to issue I have
ASTERISK-18885
no solution yet :-(

By: Matt Jordan (mjordan) 2011-12-08 11:20:44.978-0600

Can you reproduce the issue with a backtrace taken from a system with DONT_OPTIMIZE turned on?

Also: how many conference users were involved, with what codecs?
By: Alexander Akimov (alexander) 2011-12-08 11:30:23.369-0600

We are using ulaw only. We are not monitoring conference number. I think it was about 50. Asterisk was already compiled with DONT_OPTIMIZE option when I collected traceback
By: Matt Jordan (mjordan) 2011-12-09 16:40:39.740-0600

Unfortunately, the backtrace doesn't contain any useable information. You may need to try and reproduce this using the instructions below. Be sure you have DONT_OPTIMIZE enabled in menuselect within the Compiler Flags section, then:

make install

After enabling, reproduce the crash, and then execute the backtrace[1] instructions. When complete, attach that file to this issue report.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Alexander Akimov (alexander) 2011-12-11 03:03:06.422-0600

new backtrace
By: Alexander Akimov (alexander) 2011-12-11 03:12:49.812-0600

We reproduced the crash:
Problem started at 17:30 [Dec 10 17:30:35] WARNING[1261] channel.c: Exceptionally long voice queue length queuing to Bridge/0x7f90702fa4f0-input
At 18:53 another message appeared: [Dec 10 18:53:47] WARNING[17152] channel.c: Unable to write to alert pipe on Bridge/0x7f90702fa4f0-input (qlen = 51) : Resource temporarily unavailable!

Backtrace was generated at 19:45:36

We had about 30 conf rooms and 60 SIP ulaw channels when problem occured
By: David Vossel (dvossel) 2011-12-16 11:26:10.574-0600

We need 'core show locks' output.
By: Leif Madsen (lmadsen) 2011-12-19 12:24:13.165-0600

Requesting additional information from the reporter.
By: Alexander Akimov (alexander) 2011-12-20 10:20:36.957-0600

After enabling DEBUG_THREADS option Asterisk became very unstable, so we couldn't make tests for longer time. Asterisk started to crash more often and it was unacceptable in our production environment. We collected few core locks and a backtrace.
By: Alexander Akimov (alexander) 2011-12-20 10:22:18.250-0600

backtrace and core-locks after enabling DEBUG_THREADS
Core dump was generated at 23:30:01
By: Matt Jordan (mjordan) 2011-12-27 16:30:31.104-0600

Alexander:

A few things to note from your backtraces:
1) If you don't have DONT_OPTIMIZE turned on, the backtrace, unfortunately, isn't going to be of any use. DEBUG_THREADS is not as necessary in this particular case.
2) The backtrace loaded on 12/11/11 appears to have died in calloc. Seg faulting when merely attempting to allocate memory (in this case, for a new bridging channel) is going to be impossible for Asterisk to work around - in general, if we can't allocate memory for a new channel, things are going to go poorly at some point. What are the specs for the machine you are running Asterisk on? You mentioned you have 30 conference rooms going simultaneously - how many channels are joined to each conference room?
By: Alexander Akimov (alexander) 2011-12-28 01:45:29.149-0600

1) We have turned on DEBUG_THREADS only for generating core lock infromation. Backtrace2 and backtrace23_30_01 are generated with DONT_OPTIMISE option disabled.

2) Problem usually starts before asterisk dies with the following message:

channel.c: Exceptionally long voice queue length queuing to Bridge/....

Then another message appears
channel.c: Unable to write to alert pipe on Bridge/... (qlen = 35): Resource temporarily unavailable!

At the end asterisk usually dies and generates a coredump. So it seems that real problem starts earlier and then leads to crash. When the problem happens one room usually has a lot of ghost channels, maybe it causes calloc problem and asterisk crashes then?

We have about 10 channels per room. So far max number of channels was 92 and max number of rooms was 42 on this server. The server has 8 cores Intel(R) Xeon(R) CPU E31280 @ 3.50GHz. RAM size is 16G, only 630MB are used
By: David Vossel (dvossel) 2011-12-29 17:45:14.182-0600

I don't know if this will solve your issue or not, but it may be worth trying. I've attached a patch that cleans up several issues in the chan_bridge.c.
By: David Vossel (dvossel) 2011-12-30 10:51:25.053-0600

I've attached another patch. Test version 2 instead of version 1 of that patch. I'm removing version 1.
By: Alexander Akimov (alexander) 2012-01-02 02:29:03.471-0600

After applying path v2 confbridge doesn't crash but problem with ghost channels in confrooom still exists. The problem is that suddenly a number of channels in one conference room starts to grow up to some value (during the last two problems this number was 45). But these channels don't exist. When this problem begins the following type of message appears in /var/log/asterisk/full file:
NOTICE[22141] chan_sip.c: Disconnecting call 'SIP/172.16.1.2-00000863' for lack of RTP activity in 1
81 seconds

I've uploaded a graph showing max number of calls in a conference room and cat /var/log/asterisk/full | grep RTP file. Graph time is +5 from log time.

By: Alexander Akimov (alexander) 2012-01-03 03:50:00.608-0600

new backtrace backtrace2012-01-02. This time asterisk restarted with only two messages generated in log:

[Jan 2 20:19:25] ERROR[7982] astobj2.c: refcount -1 on object 0x7f6cbc191788
[Jan 2 20:19:25] ERROR[7982] res_timing_timerfd.c: Couldn't find timer with handle -1138934816

By: Alexander Akimov (alexander) 2012-01-04 01:30:45.834-0600

We have another type of crash. Asterisk restarts, but no crashdump is generated. When this problem occurs we have enormous number of "Too many open files" messages in asterisk log file. We monitor number of files opened by asterisk with lsof -p `ps -C asterisk -o pid=` | wc -l command. And two last times when this problem occurred number of opened files was about 1200 and 3200.
By: David Vossel (dvossel) 2012-01-04 16:49:39.436-0600

Try version 3 of the patch. I believe I found the issue causing the crash.
By: Alexander Akimov (alexander) 2012-01-06 02:38:11.749-0600

After applying this patch we still have another problem (reported 04/Jan/12 1:30 AM). Asterisk restarts without generating crashdump. This is detailed description of this problem:

Problem started with the following messages:

[Jan 5 22:49:37] WARNING[13140] manager.c: Failed to set manager tcp connection to TCP_NODELAY, getprotobyname("tcp")
failed
Some manager actions may be slow to respond.
[Jan 5 22:49:37] WARNING[13141] manager.c: Failed to set manager tcp connection to TCP_NODELAY, getprotobyname("tcp")
failed
Some manager actions may be slow to respond.
[Jan 5 22:49:37] WARNING[12959] res_agi.c: Unable to create toast pipe: Too many open files
[Jan 5 22:49:37] WARNING[19606] asterisk.c: Accept returned -1: Too many open files

Then it generated enourmous number of WARNING[19606] asterisk.c: Accept returned -1: Too many open files logs. Then another log entry repeated about 1000000 times:

WARNING[19616] tcptls.c: Accept failed: Too many open files

At the end asterisk generates:

[Jan 5 22:50:18] ERROR[12704] res_timing_timerfd.c: Failed to create timerfd timer: Too many open files
[Jan 5 22:50:18] WARNING[19616] tcptls.c: Accept failed: Too many open files

/var/log/messages message: Jan 5 22:50:18 ubuntu kernel: [822285.653642] asterisk[12704]: segfault at 8 ip 0000000000573f34 sp 00007ff0b19abb90 error 4 in asterisk[400000+207000]

and restarts.

As I mentioned in previous post we monitor number of files opened by asterisk with lsof -p `ps -C asterisk -o pid=` | wc -l command. And during this problem it was 1200.

By: Matt Jordan (mjordan) 2012-01-06 09:59:29.524-0600

That actually is not a bug in ConfBridge or Asterisk, but a performance limitation that you're hitting. Each channel created consumes some number of system resources, including file descriptors. When you hit the system limit on the number of concurrent open file descriptors (which you can determine using ulimit -n), Asterisk will be unable to create any more channels. This is generally a situation that Asterisk can handle only so far - because a wide variety of services need available file descriptors.

As far as "ghost" SIP channels are concerned, with respect to ConfBridge, Asterisk would not be creating temporary SIP channels. If that is still occurring post-v3 patch, can you provide a DEBUG Asterisk log indicating where the channels are being created?
By: David Vossel (dvossel) 2012-01-06 10:12:44.304-0600

There are several issues going on here. I'm going break this all down for everyone for historical reasons.

1. The "Exceptionally long queue" issue is a result of frames (coming from the bridging core) not being read of the chan_bridge channel in the confbridge instance. This was fixed by having those frames never get queued up to begin with on the bridge channel since that channel tech is only used to play audio into the bridge. Reading frames from the bridge is useless for that channel tech.

2. The crash reflected in the "backtrace2012-01-02.txt" backtrace file and others. This was a result of a channel being hung up twice which ended up causing a kind of double free sort of error on the timerfd structure (since it is the first item freed in hangup). The astobj2 ref -1 error indicates that the channel was being hung up twice for anyone who is interested in that. This issue was fixed by tracking down how the channel could ever be hungup twice and just not doing that :)

3. The third problem you have encountered is a file descriptor limit. By default you'll find that the fd limit is probably around 1024. What do you have it set to? If you are doing load tests you absolutely have to increase the file descriptor limit. Issue a ulimit -n 30000 (or some huge number of allowed file descriptors that makes sense for you) command and see if the problem persists. I do not expect that this is something we can fix unless we can determine a file descriptor leak is taking place some where.

At the moment, chan_bridge_cleanup_v3.diff, should resolve items 1 and 2. Item 3 may not be an Asterisk issue.

By: Alexander Akimov (alexander) 2012-01-06 17:13:46.375-0600

Ghost channels problem still exists after applying patch 3. During this problem we get messages like chan_sip.c: Disconnecting call 'SIP/172.16.1.2-000001d3' for lack of RTP activity in 181
seconds

Please, explain how to provide you with DEBUG Asterisk log indicating where the channels are being created. Which command should I execute to collect this information?

About issue number 3. Yes, we have ulimit -n set to 1024. I'm chaning it to 30000 now.
By: Alexander Akimov (alexander) 2012-01-08 05:02:13.048-0600

I'm attaching

Confbrige list command executed on the room containing ghost channels confbridge_list.txt
And asterisk log file filtered with grep "Disconnecting" asterisk-log.txt.

In both cases the number of entries is 75. Ghost channels number stopped growing after reaching this value. Why these channels are not disconnected if log showed that they were being disconnected?
By: Matt Jordan (mjordan) 2012-01-09 12:54:37.035-0600

The files you attached don't provide any information as to why the channels were created - merely that there is no RTP activity on those SIP channels. In order to determine that, we would need a DEBUG log. That being said, as David and I previously stated, this is almost certainly not a bug, and not related to the segmentation fault reported on this issue. If you'd like to provide a DEBUG log we can look at it, but the patch committed (r350550) should have resolved this issue.

I'll keep this issue open for another week or so in case you'd like to provide the DEBUG log.

You can find information on obtaining a DEBUG log here:
https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Alexander Akimov (alexander) 2012-01-22 10:25:37.827-0600

We have collected DEBUG. But it contains about 50G of data. I have no idea how to filter it and send you necessary info. This debug is filled with myriad of
bridging.c: Going into a multithreaded waitfor for bridge channel 0x7f42e897b8d8 of bridge 0x7f42e891c4a8
messages.

The problem is that we have other servers where we don't have ghost channels and when we run debug with the same options we also have myriads of this bridging.c messages. Do you have any recommendations how to collect needed info from this extra-large DEBUG file.

By: Matt Jordan (mjordan) 2012-01-23 08:40:00.555-0600

Alex: that particular message (which is sent a lot) only occurs when the DEBUG level is set to a high level - in this case, 10. With respect to determining when a channel is created and when it is scheduled for destruction, we should only need the DEBUG level set to 5. Also, we wouldn't need a large timeframe of data - simply the portion where someone creates a SIP channel, calls into a ConfBridge, leaves the conference, and the corresponding SIP channel persists longer then 32 seconds after the user hangs up.

By: Alexander Akimov (alexander) 2012-02-01 01:18:34.065-0600

We migrated our platfrom to CentOS based server and now we don't have this problem on this new server. On old server we have only few calls which don't produce such problem anymore. So I think ghost channels were related to OS. And other problem with Asterisk crashing was resolved by patch number 3. So I think it's ok now.
By: Matt Jordan (mjordan) 2012-02-01 08:49:26.171-0600

Awesome - thanks for all the testing on this one. I'll go ahead and close out this issue - the patch for this issue should be in Asterisk 10.2.0.