[Home]

Summary:ASTERISK-20237: Assert failure in res_rtp_asterisk calling pj_mutex_unlock after a few calls
Reporter:Jeremy Pepper (jpepper)Labels:
Date Opened:2012-08-15 15:27:03Date Closed:2012-08-20 15:12:07
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/General Resources/res_rtp_asterisk
Versions:11.0.0-beta1 Frequency of
Occurrence
Constant
Related
Issues:
Environment:Ubuntu 10.04.1 LTSAttachments:( 0) ASTERISK-20237.patch
( 1) valgrind.log
Description:I have a single SIP account configured to register a softphone to a test server. After making a few calls, I invariably get this error:

{noformat}
asterisk: ../src/pj/os_core_unix.c:1276: pj_mutex_unlock: Assertion `mutex->owner == pj_thread_this()' failed.
Aborted (core dumped)
{noformat}

GDB provides this backtrace:

{noformat}
#0  0xb7857430 in __kernel_vsyscall ()
#1  0xb7718651 in *__GI_raise (sig=6)
   at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xb771ba82 in *__GI_abort () at abort.c:92
#3  0xb7711718 in *__GI___assert_fail (
   assertion=0xb5facce4 "mutex->owner == pj_thread_this()",
   file=0xb5faca78 "../src/pj/os_core_unix.c", line=1276,
   function=0xb5face18 "pj_mutex_unlock") at assert.c:81
#4  0xb5f9788e in pj_mutex_unlock (mutex=0xb78443ec)
   at ../src/pj/os_core_unix.c:1276
#5  0xb5f9710d in pj_atomic_get (atomic_var=0x93ba440)
   at ../src/pj/os_core_unix.c:888
#6  0xb5f7fbb8 in pj_stun_session_destroy (sess=0x93b9e4c)
   at ../src/pjnath/stun_session.c:542
#7  0xb5f75c9a in destroy_ice (ice=0x93cf29c, reason=0)
   at ../src/pjnath/ice_session.c:474
#8  0xb5f75db3 in pj_ice_sess_destroy (ice=0x93cf29c)
   at ../src/pjnath/ice_session.c:503
#9  0xb5f687f4 in ast_rtp_destroy (instance=0x93bb0bc)
   at res_rtp_asterisk.c:1236
#10 0x081908a9 in instance_destructor (obj=0x93bb0bc) at rtp_engine.c:212
#11 0x0808f486 in internal_ao2_ref (user_data=<value optimized out>,
   delta=<value optimized out>, file=<value optimized out>, line=539,
   func=0x81f496e "__ao2_ref") at astobj2.c:461
#12 0x0818eeb9 in ast_rtp_instance_destroy (instance=0x93bb0bc)
   at rtp_engine.c:231
#13 0xb654dc98 in __sip_destroy (p=0x93b23cc, lockowner=1, lockdialoglist=1)
   at chan_sip.c:6152
#14 0xb654e74d in sip_destroy (p=0x93b23cc) at chan_sip.c:6430
#15 0xb654fdbe in sip_destroy_fn (p=0x93b23cc) at chan_sip.c:6419
#16 0x0808f486 in internal_ao2_ref (user_data=<value optimized out>,
   delta=<value optimized out>, file=<value optimized out>, line=539,
   func=0x81f496e "__ao2_ref") at astobj2.c:461
#17 0xb652ce4a in dialog_unref_debug (p=0x0,
   tag=0xb65d61e0 "Let's unbump the count in the unlink so the poor pvt can disappear if it is time", file=0xb65cac48 "chan_sip.c", line=3171,
   func=0xb65d3e89 "dialog_unlink_all") at chan_sip.c:2329
#18 0xb6540437 in dialog_unlink_all (dialog=0x93b23cc) at chan_sip.c:3171
#19 0xb6540d1f in dialog_needdestroy (dialogobj=0x93b23cc, arg=0x0, flags=6)
   at chan_sip.c:18591
#20 0x08091334 in internal_ao2_callback (c=<value optimized out>,
   flags=<value optimized out>, cb_fn=<value optimized out>, arg=0x0,
   data=0x0, type=DEFAULT, tag=0x0, file=0x0, line=0, func=0x0)
   at astobj2.c:1083
#21 0x08091a1a in __ao2_callback (c=0x8f196ec, flags=6,
   cb_fn=0xb6540ab0 <dialog_needdestroy>, arg=0x0) at astobj2.c:1188
#22 0xb6594f96 in do_monitor (data=0x0) at chan_sip.c:28033
#23 0x081d2cab in dummy_start (data=0x8f26fa8) at utils.c:1026
#24 0xb738b96e in start_thread (arg=0xb6653b70) at pthread_create.c:300
#25 0xb77bb98e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
{noformat}

I can't reproduce the issue with thread debugging enabled and optimization disabled.
Comments:By: Matt Jordan (mjordan) 2012-08-15 17:28:34.046-0500

Just to double check, this occurs when DEBUG_THREADS is enabled, or disabled?

By: Jeremy Pepper (jpepper) 2012-08-16 13:22:58.291-0500

This occurs with DEBUG_THREADS disabled.

Interestingly enough, turning on either DONT_OPTIMIZE or DEBUG_THREADS seems to cure it—at least, if it still crashes with one of those set, it takes longer than I've given it.

I was also unable to crash it in valgrind, but I did get some potentially interesting output:

{noformat}
==31194== Thread 12:
==31194== Conditional jump or move depends on uninitialised value(s)
==31194==    at 0x73FE8BC: pj_thread_register (os_core_unix.c:421)
==31194==    by 0x73D9D80: ast_rtp_new (res_rtp_asterisk.c:426)
==31194==    by 0x8191AC3: ast_rtp_instance_new (rtp_engine.c:286)
==31194==    by 0x6DE3E51: dialog_initialize_rtp (chan_sip.c:5514)
==31194==    by 0x6E55EC7: check_peer_ok (chan_sip.c:17460)
==31194==    by 0x6E57575: check_user_full (chan_sip.c:17577)
==31194==    by 0x6E6C5B1: handle_request_invite (chan_sip.c:24442)
==31194==    by 0x6E6F283: handle_incoming (chan_sip.c:27313)
==31194==    by 0x6E70D57: handle_request_do (chan_sip.c:27517)
==31194==    by 0x6E72824: sipsock_read (chan_sip.c:27449)
==31194==    by 0x814033E: ast_io_wait (io.c:292)
==31194==    by 0x6E43FC6: do_monitor (chan_sip.c:28048)
==31194==
==31194== Conditional jump or move depends on uninitialised value(s)
==31194==    at 0x40260D6: strcpy (mc_replace_strmem.c:311)
==31194==    by 0x73FF7B8: pj_mutex_lock (os_core_unix.c:1238)
==31194==    by 0x73DE2D9: pj_ice_sess_add_cand (ice_session.c:712)
==31194==    by 0x73D1D9C: T.441 (res_rtp_asterisk.c:609)
==31194==    by 0x73D2009: rtp_add_candidates_to_ice (res_rtp_asterisk.c:1057)
==31194==    by 0x73D284E: ast_rtp_prop_set (res_rtp_asterisk.c:3258)
==31194==    by 0x6DE3EF5: dialog_initialize_rtp (chan_sip.c:5558)
==31194==    by 0x6E55EC7: check_peer_ok (chan_sip.c:17460)
==31194==    by 0x6E57575: check_user_full (chan_sip.c:17577)
==31194==    by 0x6E6C5B1: handle_request_invite (chan_sip.c:24442)
==31194==    by 0x6E6F283: handle_incoming (chan_sip.c:27313)
==31194==    by 0x6E70D57: handle_request_do (chan_sip.c:27517)
==31194==
==31194== Conditional jump or move depends on uninitialised value(s)
==31194==    at 0x40260F5: strcpy (mc_replace_strmem.c:311)
==31194==    by 0x73FF7B8: pj_mutex_lock (os_core_unix.c:1238)
==31194==    by 0x73DE2D9: pj_ice_sess_add_cand (ice_session.c:712)
==31194==    by 0x73D1D9C: T.441 (res_rtp_asterisk.c:609)
==31194==    by 0x73D2009: rtp_add_candidates_to_ice (res_rtp_asterisk.c:1057)
==31194==    by 0x73D284E: ast_rtp_prop_set (res_rtp_asterisk.c:3258)
==31194==    by 0x6DE3EF5: dialog_initialize_rtp (chan_sip.c:5558)
==31194==    by 0x6E55EC7: check_peer_ok (chan_sip.c:17460)
==31194==    by 0x6E57575: check_user_full (chan_sip.c:17577)
==31194==    by 0x6E6C5B1: handle_request_invite (chan_sip.c:24442)
==31194==    by 0x6E6F283: handle_incoming (chan_sip.c:27313)
==31194==    by 0x6E70D57: handle_request_do (chan_sip.c:27517)
==31194==
==31194== Thread 35:
==31194== Conditional jump or move depends on uninitialised value(s)
==31194==    at 0x73FE8BC: pj_thread_register (os_core_unix.c:421)
==31194==    by 0x73D039E: pj_thread_register_check (res_rtp_asterisk.c:426)
==31194==    by 0x73D177D: T.443 (res_rtp_asterisk.c:916)
==31194==    by 0x73DAECC: ast_rtp_raw_write (res_rtp_asterisk.c:934)
==31194==    by 0x73DBD89: ast_rtp_write (res_rtp_asterisk.c:2088)
==31194==    by 0x6E5BFEB: sip_write (chan_sip.c:6913)
==31194==    by 0x80CF017: ast_write (channel.c:5147)
==31194==    by 0x6D9BB5E: moh_files_generator (res_musiconhold.c:427)
==31194==    by 0x80BD089: generator_force (channel.c:3076)
==31194==    by 0x80D3087: __ast_read (channel.c:3876)
==31194==    by 0x80D7998: T.996 (channel.c:4322)
==31194==    by 0x6D9C5BF: play_moh_exec (res_musiconhold.c:806)
==31194==
==31194== Invalid read of size 1
==31194==    at 0x40260CA: strcpy (mc_replace_strmem.c:311)
==31194==    by 0x73FF7B8: pj_mutex_lock (os_core_unix.c:1238)
==31194==    by 0x73E294C: pj_ice_sess_send_data (ice_session.c:2879)
==31194==    by 0x73D17A6: T.443 (res_rtp_asterisk.c:918)
==31194==    by 0x73DAECC: ast_rtp_raw_write (res_rtp_asterisk.c:934)
==31194==    by 0x73DBD89: ast_rtp_write (res_rtp_asterisk.c:2088)
==31194==    by 0x6E5BFEB: sip_write (chan_sip.c:6913)
==31194==    by 0x80CF017: ast_write (channel.c:5147)
==31194==    by 0x6D9BB5E: moh_files_generator (res_musiconhold.c:427)
==31194==    by 0x80BD089: generator_force (channel.c:3076)
==31194==    by 0x80D3087: __ast_read (channel.c:3876)
==31194==    by 0x80D7998: T.996 (channel.c:4322)
==31194==  Address 0x918fe9c is just below the stack ptr.  To suppress, use: --workaround-gcc296-bugs=yes
{noformat}

Looking at the core file from a crash, there appears to be some memory corruption:

{noformat}
(gdb) frame 4
#4  0xb5fee88e in pj_mutex_unlock (mutex=0xb77973ec)
   at ../src/pj/os_core_unix.c:1276
1276    pj_assert(mutex->owner == pj_thread_this());
(gdb) print *mutex
$1 = {mutex = {__data = {__lock = 0, __count = 159555984, __owner = 159122496,
     __kind = 159116656, __nusers = 159122496, {__spins = 159008360,
       __list = {__next = 0x97a4668}}},
   __size = "\000\000\000\000\220\241\202\t@\004|\tp\355{\t@\004|\thFz\t",
   __align = 0},
 obj_name = "hFz\t\200G{\t\230o{\t\370$z\t\240Jz\t@\017z\t\260G{\t\270\321y\t", nesting_level = 159064160, owner = 0x97a5488,
 owner_name = "\030\346{\t(ty\267(ty\267\070\357{\t0K{\t8ty\267\070ty\267\060lw\t"}
(gdb) print *mutex->owner
$2 = {
 obj_name = "\000\000\000\000\071\000\000\000\360>{\t ty\267\377\377\377\377", '\000' <repeats 11 times>, thread = 0, proc = 0, arg = 0x1, signature1 = 0,
 signature2 = 1, suspended_mutex = 0x0}
{noformat}

By: Matt Jordan (mjordan) 2012-08-17 10:12:51.602-0500

What version of glibc are you using?

By: Jeremy Pepper (jpepper) 2012-08-17 11:47:41.693-0500

{{GNU C Library (Ubuntu EGLIBC 2.11.1-0ubuntu7.10) stable release version 2.11.1, by Roland McGrath et al.}}

I'm not sure this fully explains what's happening, but I've found something that could be contributing:

{code:title=res_rtp_asterisk.c}
static void pj_thread_register_check(void)
{
pj_thread_desc desc;
pj_thread_t *thread;

if (pj_thread_is_registered() == PJ_TRUE) {
return;
}

pj_thread_register("Asterisk Thread", desc, &thread);
}
{code}

{{pj_thread_desc}} is defined like this:
{code}
typedef long pj_thread_desc[PJ_THREAD_DESC_SIZE];
{code}

{{pj_thread_register}} stores a pointer to this structure in thread-local storage, but the memory's almost immediately re-used. This is also causing PJSIP to copy arbitrary garbage to mutexes' {{owner_name}} fields, which IIRC is causing another, rarer crash.

What I still can't explain is exactly how a mutex's {{owner}} field (which is just a pointer to this trashed memory) seems to change between locking and unlocking it, sometimes.

By: Joshua C. Colp (jcolp) 2012-08-17 12:01:28.928-0500

Great analysis on that. On the machine that runs the testsuite for a period of time it was crashing on the lock, not the unlock, around that same atomic. There is definitely some sort of memory corruption going on with a race condition intermingled.

By: Mark Michelson (mmichelson) 2012-08-17 15:33:06.150-0500

As discussed on IRC, a likely cause for the corruption is that pjlib ends up referencing data that went out of scope long ago. The patch I'm uploading (ASTERISK-20237.patch) changes the pj_thread_desc structure to be stored in thread-local storage. In my local tests, valgrind's numerous errors disappeared. My hope is that this also means the corruption that led to the crash is also cleared up. Please test and let me know how things go.

By: Jeremy Pepper (jpepper) 2012-08-17 16:49:23.206-0500

Seems to be working well for me with the patch! I haven't been able to crash it.