[Home]

Summary:ASTERISK-18708: func_curl hangs channel under load
Reporter:Dave Cabot (dcabot)Labels:
Date Opened:2011-10-12 08:12:17Date Closed:2015-03-13 21:01:28
Priority:MajorRegression?
Status:Closed/CompleteComponents:Functions/func_curl
Versions:1.8.5.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:CentOS 5.2 Attachments:( 0) backtrace.txt
( 1) core-show-locks.txt
( 2) func_curl.patch
Description:Developing an IVR using pbx_lua.  Using func_curl to solicit info from a http server.  Load testing using sipp up to 100 concurrent calls.  Channels are getting stuck while waiting for curl to return.  http app is a simple dummy app that returns a response in 2 secs.

stack trace from gdb shows func_curl is stuck at a mutex.

#0  0x0078f402 in __kernel_vsyscall ()
#1  0x00a36a4e in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x00a328c4 in _L_mutex_lock_760 () from /lib/libpthread.so.0
#3  0x00a32768 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x009325d6 in pthread_mutex_lock () from /lib/libc.so.6
#5  0x00b95c7c in acf_curl_helper (chan=0x801504b8, cmd=<value optimized out>, info=<value optimized out>, buf=0x93454b90 "", input_str=0x0, len=4096) at func_curl.c:486
#6  0x0812d7bf in ast_func_read (chan=0x801504b8,
   function=0x93454a90 "CURL(http://192.168.35.43:8200,<?xml version=\"1.0\" encoding=\"UTF-8\"?><OTA><AuthenticationToken>ODod923i5kLk9s01Lkd0hnczii</AuthenticationToken><Action method=\"AttachRequest\"><TRN>9070000995</TRN><CSR>"..., workspace=0x93454b90 "", len=4096) at pbx.c:3515
#7  0x00371cc2 in lua_get_variable_value (L=0xd337518) at pbx_lua.c:303
#8  0x0072a4e3 in luaD_precall (L=0xd337518, func=0xe63fe58, nresults=1) at ldo.c:319
#9  0x00734cc0 in luaV_execute (L=0xd337518, nexeccalls=3) at lvm.c:587
#10 0x0072a960 in luaD_call (L=0xd337518, func=0xe63fdf8, nResults=0) at ldo.c:377
#11 0x00725fd1 in f_call (L=0xd337518, ud=0x93455e74) at lapi.c:800
#12 0x0072a023 in luaD_rawrunprotected (L=0xd337518, f=0x725fb0 <f_call>, ud=0x93455e74) at ldo.c:116
#13 0x0072a088 in luaD_pcall (L=0xd337518, func=0x725fb0 <f_call>, u=0x93455e74, old_top=24, ef=12) at ldo.c:463
#14 0x00725e24 in lua_pcall (L=0xd337518, nargs=2, nresults=0, errfunc=1) at lapi.c:821
#15 0x003707b3 in exec (chan=0x801504b8, context=0x80150824 "incoming", exten=0x80150874 "400", priority=1, callerid=0x807f4f40 "9070000995", data=0x9406968 "") at pbx_lua.c:1307
#16 0x081339d9 in pbx_extension_helper (c=0x801504b8, con=0x0, context=0x0, exten=0x80150874 "400", priority=1, label=0x0, callerid=0x807f4f40 "9070000995", action=E_SPAWN, found=0x9345a264,
   combined_find_spawn=1) at pbx.c:4115
#17 0x081364ac in __ast_pbx_run (c=0x801504b8, args=0x0) at pbx.c:4723
#18 0x08138420 in pbx_thread (data=0x801504b8) at pbx.c:5058
#19 0x0817455a in dummy_start (data=0x80d3b578) at utils.c:1004
#20 0x00a3046b in start_thread () from /lib/libpthread.so.0
#21 0x00925dbe in clone () from /lib/libc.so.6
Comments:By: Richard Mudgett (rmudgett) 2011-10-12 10:27:41.019-0500

Debugging deadlocks: Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install).  This will then give you the console command "core show locks." When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt
# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt
gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!



By: Dave Cabot (dcabot) 2011-10-12 11:43:31.844-0500

Requested files provided.

Based on how the call flow is behaving, I'd say that the http requests are being handled serially, ie one has to complete before the next one can be done.

By: Richard Mudgett (rmudgett) 2011-10-12 13:38:00.966-0500

The core show locks does not indicate a deadlock.

Does the system recover if you take the load off?

By: Dave Cabot (dcabot) 2011-10-12 13:41:04.571-0500

yes it does.

Looking at the code I see:
{code}
       AST_LIST_LOCK(&global_curl_info);
       AST_LIST_TRAVERSE(&global_curl_info, cur, list) {
               if (cur->key == CURLOPT_SPECIAL_HASHCOMPAT) {
                       hashcompat = (cur->value != NULL) ? 1 : 0;
               } else {
                       curl_easy_setopt(*curl, cur->key, cur->value);
               }
       }

       if (chan && (store = ast_channel_datastore_find(chan, &curl_info, NULL))) {
               list = store->data;
               AST_LIST_LOCK(list);
               AST_LIST_TRAVERSE(list, cur, list) {
                       if (cur->key == CURLOPT_SPECIAL_HASHCOMPAT) {
                               hashcompat = (cur->value != NULL) ? 1 : 0;
                       } else {
                               curl_easy_setopt(*curl, cur->key, cur->value);
                       }
               }
       }

       curl_easy_setopt(*curl, CURLOPT_URL, args.url);
       curl_easy_setopt(*curl, CURLOPT_FILE, (void *) &str);

       if (args.postdata) {
               curl_easy_setopt(*curl, CURLOPT_POST, 1);
               curl_easy_setopt(*curl, CURLOPT_POSTFIELDS, args.postdata);
       }

       curl_easy_perform(*curl);

       if (store) {
               AST_LIST_UNLOCK(list);
       }
       AST_LIST_UNLOCK(&global_curl_info);
{code}

So curl_easy_perform gets called only by the thread that has the global_curl_info lock at the moment, causing it to be serial in access.

By: Dave Cabot (dcabot) 2011-10-14 12:58:40.732-0500

This seems to fix it

--
inline code removed
Code can only be submitted as an attachment and marked as code for licensing reasons.
--

But it would be a better fix to chg these to RWLOCKs.

Can you review the diff and let me know if there's an issue I'm missing?

Thanks,
Dave

By: Dave Cabot (dcabot) 2011-10-14 12:59:05.201-0500

I posted a possible fix.

Also, I was confused by this line in func_curl.c.  Copy and paste error?

{code}

                       pbx_builtin_setvar_helper(chan, "~ODBCFIELDS~", ast_str_buffer(fields));

{code}

By: Richard Mudgett (rmudgett) 2011-10-14 14:28:33.968-0500

Please attach patches and code submissions to the issue after signing the licensing agreement.

By: Dave Cabot (dcabot) 2011-10-17 08:41:16.659-0500

as requested.

Pls review and let me know if you see any design issues.  Thanks.

By: Dave Cabot (dcabot) 2011-10-17 08:42:14.933-0500

code submitted.

By: Richard Mudgett (rmudgett) 2011-10-17 11:12:47.615-0500

Converting normal locks to rwlocks poses a hazard because rwlocks are not reentrant.  A function call analysis needs to be done to determine if there is a reentrant path.

By: Dave Cabot (dcabot) 2011-10-17 11:24:05.729-0500

Who does that?  AFAICT it doesn't.

By: Alex Po (seik0) 2013-03-22 06:59:55.324-0500

I've got issue easily reproducted. On asterisk 1.8.11.0


{code}
1235,1,Set(CURLOPT(userpwd)=<name>:<pass>)
1235,2,NoOp(SleepTest: ${CURL(http://<name>:<pass>@<host>:<port>/asterisk/curl_wrapping.sleep_test?p$sec=30)})
1235,3,Hangup
{code}

curl_wrapping.sleep_test?p$sec=30 just waits for p$sec seconds before return string "p$sec=<p$sec>".

Made this plan on Local channel to simply test simulatiously working threads. Then made several call-files to call this local channel.

In CLI it's easy seen that each next call finishes after previous call done + 30 seconds.

Then, Walter Doekes on #asterisk channel at freenode.net proposed me simple patch:

{code}
Index: funcs/func_curl.c
===================================================================
--- funcs/func_curl.c   (revision 383575)
+++ funcs/func_curl.c   (working copy)
@@ -609,6 +609,7 @@
                       curl_easy_setopt(*curl, cur->key, cur->value);
               }
       }
+       AST_LIST_UNLOCK(&global_curl_info);

       if (chan && (store = ast_channel_datastore_find(chan, &curl_info, NULL))) {
               list = store->data;
@@ -647,7 +648,6 @@
       if (store) {
               AST_LIST_UNLOCK(list);
       }
-       AST_LIST_UNLOCK(&global_curl_info);

       if (args.postdata) {
               curl_easy_setopt(*curl, CURLOPT_POST, 0);
{code}

(it's also should be here http://pastebin.com/dqB7ezbz#)

This patch works. Each next call do it's work regardless previous call flow. No more locking.
This patch simply makes unlock on global_curl_info a bit earlier.
The question now is how it's eligible to be, why we need to unlock originally so late? What we should expect this patch is wrecking?