
Summary: ASTERISK-25513: Crash: malloc failed with high load of subscriptions.
Reporter: John Bigelow (jbigelow)
Labels:
Date Opened: 2015-11-01 20:04:39.000-0600
Date Closed: 2015-11-02 19:48:31.000-0600
Priority: Major
Regression?: No
Status: Closed/Complete
Components: pjproject/pjsip, Resources/res_pjsip_pubsub
Versions: 13.6.0
Frequency of Occurrence: Frequent
Related Issues:
Environment: Asterisk GIT-13-9a021a4, pjproject 2.4, testsuite
Attachments: (0) backtrace_14507.txt
Description: Asterisk crashed with a large load of subscriptions. Backtrace is attached.
Comments:
By: Mark Michelson (mmichelson) 2015-11-02 13:21:26.066-0600

The most immediate thing that jumps out when looking at the attached backtrace is that it has 9066 threads in it. Searching for "worker_start" shows that 8990 of these threads are worker threads in threadpools. Of those, 138 are active and 8852 are idle.

Next, the thread that actually crashed did so because an exception was thrown while allocating a block for a PJLib pool. In the PJLib code, this specific exception is thrown when the block allocation returns NULL. The {{pool_policy_malloc}} block allocation policy can return NULL under two conditions:
# The pool factory's {{on_block_alloc}} function returns {{PJ_FALSE}}
# {{malloc}} returns NULL
The first condition is not possible because the caching pool we use always returns {{PJ_TRUE}}. Therefore, the exception must have been thrown because {{malloc}} returned NULL.
In other words, the crash occurred because we could not successfully allocate memory.
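
The reasoning above can be illustrated with a small standalone model. This is only a simplified sketch, not the PJLib source: the type and function names ({{block_alloc_cb}}, {{raw_alloc_fn}}, {{policy_block_alloc}}, and the helper callbacks) are hypothetical, and the allocator is passed in as a parameter so that a {{malloc}} failure can be simulated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are not PJLib types. */
typedef bool (*block_alloc_cb)(size_t size);
typedef void *(*raw_alloc_fn)(size_t size);

/* Mimics a factory callback that always approves the allocation. */
static bool always_approve(size_t size)
{
    (void)size;
    return true;
}

/* Mimics a factory callback that refuses every allocation. */
static bool always_veto(size_t size)
{
    (void)size;
    return false;
}

/* Simulates memory exhaustion in the underlying allocator. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/*
 * Returns NULL in exactly two cases:
 *   1. the factory callback vetoes the allocation, or
 *   2. the underlying allocator returns NULL.
 */
static void *policy_block_alloc(block_alloc_cb on_block_alloc,
                                raw_alloc_fn alloc, size_t size)
{
    if (on_block_alloc != NULL && !on_block_alloc(size)) {
        return NULL;        /* condition 1 */
    }
    return alloc(size);     /* condition 2 if this is NULL */
}
```

With a callback that always approves, as described for the caching pool, the only remaining way for this model to return NULL is for the allocator itself to fail.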

In this particular case, John was performing a stress test of SIP subscriptions. In such a case, there likely was no buggy behavior that led to the number of threads created for the threadpool. Likely, a strong influx of traffic resulted in many threads being created to handle the SUBSCRIBE dialogs.

This doesn't mean that some sort of action could not be taken to help with this problem. Here are a few ideas:
# From a configuration point of view, a maximum threadpool size and a shorter idle timeout would result in fewer threads being allocated and faster reclamation of idle threads' memory. These options are already available in {{pjsip.conf}}, as well as in other configuration files that use threadpools, and would require no code changes.
# Within the source code, we could change the algorithm for determining when to expand the threadpool. Currently, if all threads are active and a new task is added, a new batch of threads is created. It may be more prudent to expand the threadpool using some function based on the threadpool's current size and its growth delta. This could lead to a growth rate that is closer to logarithmic than linear in the presence of heavy spikes of traffic.
# Also within the code, we could potentially recover from the out-of-memory exception that PJLib throws. Since it uses an exception-throwing system, we could surround pool allocations in PJLib's try-catch macros and fail without crashing.
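
As a rough illustration of the second idea, here is one possible damping rule. It is only a sketch under stated assumptions: {{damped_growth}} is a hypothetical helper, not an Asterisk function, the formula is just one of many that shrink the increment as the pool grows, and treating a {{max_size}} of 0 as "no cap" is an assumption of this sketch.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Asterisk: how many threads to add
 * when every worker is busy and a new task arrives.
 *
 * current_size: threads currently in the pool
 * delta:        configured growth increment
 * max_size:     upper bound on pool size; 0 means "no cap" in this sketch
 */
static size_t damped_growth(size_t current_size, size_t delta, size_t max_size)
{
    size_t increment;

    if (delta == 0) {
        return 0;   /* growth disabled */
    }

    /* Full delta for an empty pool, shrinking as the pool grows. */
    increment = (delta * delta) / (delta + current_size);
    if (increment == 0) {
        increment = 1;  /* always make some progress */
    }

    if (max_size != 0) {
        if (current_size >= max_size) {
            return 0;   /* already at the cap */
        }
        if (increment > max_size - current_size) {
            increment = max_size - current_size;
        }
    }
    return increment;
}
```

With a delta of 5 and no cap, a pool that starts empty would grow by 5, then 2, then 2, then 1 thread per expansion, rather than by 5 every time. Once the increment bottoms out at one thread the growth is linear again, so a rule like this only softens spikes; it does not replace a maximum size.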