[Home]

Summary:ASTERISK-24774: Segfault in ast_context_destroy with extensions.ael and extensions.conf
Reporter:Corey Farrell (coreyfarrell)Labels:
Date Opened:2015-02-09 16:34:32.000-0600Date Closed:2015-04-20 06:31:04
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/PBX
Versions:11.16.0 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) backtrace_11054.txt
( 1) backtrace_noload-pbx_lua.txt
( 2) extensions.conf
( 3) testsuite-pbx-callerid_match.patch
Description:While attempting to resolve open channels in testsuite/tests/pbx/callerid_match I am experiencing a segfault every time.  I do not know AGI enough to understand why, but running 'agi.finish()' on the calls in this test seems to cause a segfault on shutdown (somehow contexts become corrupted).
Comments:By: Corey Farrell (coreyfarrell) 2015-02-09 16:36:12.906-0600

Patch to testsuite allows this issue to be reproduced every time.

By: Corey Farrell (coreyfarrell) 2015-02-09 16:37:19.673-0600

Backtrace showing segfault during 'core stop gracefully'.

By: Corey Farrell (coreyfarrell) 2015-02-09 16:44:49.940-0600

Making the same change to testsuite: tests/pbx/call-files does not cause a segfault on that test.

By: Matt Jordan (mjordan) 2015-02-09 20:21:16.631-0600

Well, this is weird:

{noformat}
#0  0x00000000004fdd33 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711
#1  0x0000000000554b67 in __ast_context_destroy (list=0x158e3e0, contexttab=0x1575ce0, con=0x0, registrar=0x7f491739342a "pbx_lua") at pbx.c:10889
#2  0x00000000005551a5 in ast_context_destroy (con=0x0, registrar=0x7f491739342a "pbx_lua") at pbx.c:10968
#3  0x00007f49173932d6 in unload_module () at pbx_lua.c:1649
{noformat}

What is in your {{extensions.lua}}?

By: Corey Farrell (coreyfarrell) 2015-02-09 20:30:09.814-0600

It's the default from 'make samples'.  This was done using 'run-local' in the testsuite.

By: Corey Farrell (coreyfarrell) 2015-02-11 12:40:06.000-0600

New backtrace attached with noload pbx_lua.so.  In this case {{features_shutdown}} crashes the same way pbx_lua had.

By: Corey Farrell (coreyfarrell) 2015-02-11 13:10:03.846-0600

Minimum extensions.conf to cause this failure

By: Corey Farrell (coreyfarrell) 2015-02-12 11:29:37.690-0600

I've reproduced this with a simpler test - using the following modules.conf:
{noformat}
[modules]
autoload=no
load => pbx_config.so
load => res_ael_share.so
load => pbx_ael.so
{noformat}

Start Asterisk with the attached extensions.conf.  extensions.ael can have anything or nothing more than "globals {};".

From Asterisk CLI:
{noformat}
*CLI> module unload pbx_config.so
   -- Remove test/agi4/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil)
   -- Remove test/agi4/2, registrar=pbx_config; con=<nil>((nil)); con->root=(nil)
[Feb 12 12:20:42] ERROR[19590]: pbx.c:7192 ast_context_remove_extension_callerid2: Could not find priority 2 of exten agi4 in context test!
   -- Remove test/agi4/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil)
Unloaded pbx_config.so
*CLI> module load pbx_config.so
 == Parsing '/home/cfarrell/svn/origsvn/11-2/live/etc/asterisk/extensions.conf': Found
   -- Registered extension context 'test'; registrar: pbx_config
   -- Added extension 'agi4' priority 1 to test
   -- Added extension 'agi4' priority 2 to test
   -- Added extension 'agi4' priority 1 (CID match '') to test
 == Parsing '/home/cfarrell/svn/origsvn/11-2/live/etc/asterisk/users.conf': Found

Program received signal SIGSEGV, Segmentation fault.
#0  0x00000000004fdda8 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711
#1  0x000000000054e1d6 in context_merge (extcontexts=0x7ffff5606d18 <local_contexts>, exttable=0xe860e0, context=0xe79af0, registrar=0x7ffff5402d50 <registrar> "pbx_config") at pbx.c:8846
#2  0x000000000054e62d in ast_merge_contexts_and_delete (extcontexts=0x7ffff5606d18 <local_contexts>, exttable=0xe860e0, registrar=0x7ffff5402d50 <registrar> "pbx_config") at pbx.c:8962
#3  0x00007ffff5402b9c in pbx_load_module () at pbx_config.c:1872
#4  0x00007ffff5402c6f in load_module () at pbx_config.c:1894
#5  0x000000000050b695 in start_resource (mod=0xe81cc0) at loader.c:897
#6  0x000000000050ba77 in load_resource (resource_name=0x8a73ec "pbx_config.so", global_symbols_only=0, resource_heap=0x0, required=0) at loader.c:985
#7  0x000000000050bcfe in ast_load_resource (resource_name=0x8a73ec "pbx_config.so") at loader.c:1001
#8  0x000000000049bfc7 in handle_load (e=0x850db0 <cli_cli+2800>, cmd=-4, a=0x7fffffffbb50) at cli.c:253
#9  0x00000000004a46fe in ast_cli_command_full (uid=-1, gid=-1, fd=1, s=0x8a79f0 "module load pbx_config.so ") at cli.c:2691
#10 0x00000000004460dc in consolehandler (s=0x8a79f0 "module load pbx_config.so ") at asterisk.c:2094
#11 0x000000000044d28d in main (argc=2, argv=0x7fffffffe138) at asterisk.c:4412
{noformat}

After unloading pbx_config other dialplan related operations (like dialplan show) will cause crashes as well, but the backtrace is less useful.

By: Matt Jordan (mjordan) 2015-04-19 14:53:37.806-0500

Using the latest tip of 11, I'm not able to reproduce the crash you have here using the provided dialplan (extensions.conf) and a blank extensions.ael.

I am still able to reproduce a similar crash using the {{pbx/callerid_merge}} test, although the backtrace does look a bit different:

{noformat}
16 [Thread debugging using libthread_db enabled]
17 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
18 Core was generated by `/usr/sbin/asterisk -f -g -q -m -n -C /tmp/asterisk-testsuite/483c7dcc41a42228c9'.
19 Program terminated with signal SIGSEGV, Segmentation fault.
20 #0  0x00000000004f7336 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711
21 711             it->next = tab->tlist;
22 #0  0x00000000004f7336 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711
23         it = 0x7fa9ac000930
24         __PRETTY_FUNCTION__ = "ast_hashtab_start_traversal"
25 #1  0x000000000054cafe in __ast_context_destroy (list=0x255c810, contexttab=0x2554c20, con=0x0, registrar=0x5caa1e "features") at pbx.c:10889
26         end_traversal = 1
27         prio_iter = 0x7fa9ac000930
28         ipl = 0x0
29         ipn = 0x0
30         i = 0x0
31         sw = 0x0
32         exten_iter = 0x7fa9ac000b40
33         ip = 0x0
34         pi = 0x0
35         ni = 0x0
36         next = 0x0
37         tmp = 0x255ad40
38         tmpl = 0x0
39         exten_item = 0x255f720
40         prio_item = 0x0
41         __PRETTY_FUNCTION__ = "__ast_context_destroy"
42 #2  0x000000000054d13c in ast_context_destroy (con=0x0, registrar=0x5caa1e "features") at pbx.c:10968
43 No locals.
44 #3  0x00000000004e5628 in features_shutdown () at features.c:9072
45 No locals.
46 #4  0x0000000000442c63 in ast_run_atexits (run_cleanups=1) at asterisk.c:973    
{noformat}

In this case, we're crashing on shutdown, with what appears to be a NULL {{peer_table}}. It doesn't look like the PBX core is especially clear on whether or not this can be NULL. Plenty of places assume that it can't be NULL, but in other locations we clearly set the pointer to NULL. Lovely.

By: Matt Jordan (mjordan) 2015-04-19 15:39:00.449-0500

What a freaking mess. The handling of extensions in the PBX core is sometimes so screwy it boggles the mind. Thank god we have code reviews today.

Here is what is most likely happening.

# {{pbx_config}} and {{pbx_ael}} are unloaded prior to {{res_agi}}. As a result, they are unable to fully unload their extensions, as an application is still using it:
{code}
[Apr 19 15:24:59] VERBOSE[1269] pbx.c:     -- Remove test/agi1/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil)
[Apr 19 15:24:59] ERROR[1269] pbx.c: Did not remove this exten (agi1) from the context root_table (test) (priority 1)
{code}
# However, we now have a problem. Throwing a bit more debug into the process, we can see that {{end_traversal}} returns 0, which causes the priority object to get destroyed. The comment above this explains why:
{code}
[Apr 19 15:24:59] VERBOSE[1269] pbx.c:     -- Remove test/agi1/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil)
[Apr 19 15:24:59] ERROR[1269] pbx.c: Destroy 0x175f220
[Apr 19 15:24:59] ERROR[1269] pbx.c: End traversal is 0
[Apr 19 15:24:59] WARNING[1269] pbx.c: Freeing prio_iter on  0x175f220, 0x1764740
...
0910                                         /* Explanation:
10911                                          * ast_context_remove_extension_callerid2 will destroy the extension that it comes across. This
10912                                          * destruction includes destroying the exten's peer_table, which we are currently traversing. If
10913                                          * ast_context_remove_extension_callerid2 ever should return '0' then this means we have destroyed
10914                                          * the hashtable which we are traversing, and thus calling ast_hashtab_end_traversal will result
10915                                          * in reading invalid memory. Thus, if we detect that we destroyed the hashtable, then we will simply
10916                                          * free the iterator
10917                                          */
10918                                         if (end_traversal) {
10919                                                 ast_hashtab_end_traversal(prio_iter);
10920                                         } else {
10921                                                 ast_log(LOG_WARNING, "Freeing prio_iter on %s %p, %p\n",
10922                                                                 exten_item->exten,
10923                                                                 exten_item, exten_item->peer_table);
10924                                                 ast_free(prio_iter);
10925                                         }
{code}
That is, *we destroyed peer_table* while iterating over it. However, we also failed to remove the extension from the context's {{root_table}}, as AGI was still hanging around.
# Following that, AGI unloads:
{code}
[Apr 19 15:24:59] VERBOSE[1269] res_agi.c:   == AGI Command 'answer' unregistered
[Apr 19 15:24:59] VERBOSE[1269] res_agi.c:   == AGI Command 'asyncagi break' unregistered
[Apr 19 15:24:59] VERBOSE[1269] res_agi.c:   == AGI Command 'channel status' unregistered
[Apr 19 15:24:59] VERBOSE[1269] res_agi.c:   == AGI Command 'database del' unregistered
[Apr 19 15:24:59] VERBOSE[1269] res_agi.c:   == AGI Command 'database deltree' unregistered
{code}
# And then core {{features}} unloads. However, because we still have the AGI-dependent extensions lurking in the {{root_table}}, we try removing and destroying their {{peer_table}} as well:
{code}
[Apr 19 15:24:59] VERBOSE[1269] pbx.c:     -- Remove parkedcalls/700/1, registrar=features; con=<nil>((nil)); con->root=(nil)
[Apr 19 15:24:59] ERROR[1269] pbx.c: Destroy 0x1761b90
[Apr 19 15:24:59] ERROR[1269] pbx.c: End traversal is 0
[Apr 19 15:24:59] WARNING[1269] pbx.c: Freeing prio_iter on  0x1761b90, 0x1761db0
[Apr 19 15:24:59] ERROR[1269] pbx.c: Starting traversal on agi1 0x1764de0, (nil)
{code}
The important bit in the above is that we are ostensibly only wanting to destroy context/extension/priority tuples registered by {{features}}, but our lurking {{agi1}} extension - which has already had its {{peer_table}} destroyed - it still waiting.
# At which point, we pass a NULL pointer into {{ast_hashtab_start_traversal}}, and blow up:
{code}
10894                                         prio_iter = ast_hashtab_start_traversal(exten_item->peer_table);
{code}

The solution here probably is as simple as simply checking that someone else didn't destroy {{peer_table}} and leave the extension floating around.