Summary: | ASTERISK-24774: Segfault in ast_context_destroy with extensions.ael and extensions.conf | ||
Reporter: | Corey Farrell (coreyfarrell) | Labels: | |
Date Opened: | 2015-02-09 16:34:32.000-0600 | Date Closed: | 2015-04-20 06:31:04 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Core/PBX |
Versions: | 11.16.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) backtrace_11054.txt ( 1) backtrace_noload-pbx_lua.txt ( 2) extensions.conf ( 3) testsuite-pbx-callerid_match.patch | |
Description: | While attempting to resolve open channels in testsuite/tests/pbx/callerid_match I am experiencing a segfault every time. I do not know AGI enough to understand why, but running 'agi.finish()' on the calls in this test seems to cause a segfault on shutdown (somehow contexts become corrupted). | ||
Comments: | By: Corey Farrell (coreyfarrell) 2015-02-09 16:36:12.906-0600 Patch to testsuite allows this issue to be reproduced every time. By: Corey Farrell (coreyfarrell) 2015-02-09 16:37:19.673-0600 Backtrace showing segfault during 'core stop gracefully'. By: Corey Farrell (coreyfarrell) 2015-02-09 16:44:49.940-0600 Making the same change to testsuite: tests/pbx/call-files does not cause a segfault on that test. By: Matt Jordan (mjordan) 2015-02-09 20:21:16.631-0600 Well, this is weird: {noformat} #0 0x00000000004fdd33 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711 #1 0x0000000000554b67 in __ast_context_destroy (list=0x158e3e0, contexttab=0x1575ce0, con=0x0, registrar=0x7f491739342a "pbx_lua") at pbx.c:10889 #2 0x00000000005551a5 in ast_context_destroy (con=0x0, registrar=0x7f491739342a "pbx_lua") at pbx.c:10968 #3 0x00007f49173932d6 in unload_module () at pbx_lua.c:1649 {noformat} What is in your {{extensions.lua}}? By: Corey Farrell (coreyfarrell) 2015-02-09 20:30:09.814-0600 It's the default from 'make samples'. This was done using 'run-local' in the testsuite. By: Corey Farrell (coreyfarrell) 2015-02-11 12:40:06.000-0600 New backtrace attached with noload pbx_lua.so. In this case {{features_shutdown}} crashes the same way pbx_lua had. By: Corey Farrell (coreyfarrell) 2015-02-11 13:10:03.846-0600 Minimum extensions.conf to cause this failure By: Corey Farrell (coreyfarrell) 2015-02-12 11:29:37.690-0600 I've reproduced this with a simpler test - using the following modules.conf: {noformat} [modules] autoload=no load => pbx_config.so load => res_ael_share.so load => pbx_ael.so {noformat} Start Asterisk with the attached extensions.conf. extensions.ael can have anything or nothing more than "globals {};". From Asterisk CLI: {noformat} *CLI> module unload pbx_config.so -- Remove test/agi4/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil) -- Remove test/agi4/2, registrar=pbx_config; con=<nil>((nil)); con->root=(nil) [Feb 12 12:20:42] ERROR[19590]: pbx.c:7192 ast_context_remove_extension_callerid2: Could not find priority 2 of exten agi4 in context test! -- Remove test/agi4/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil) Unloaded pbx_config.so *CLI> module load pbx_config.so == Parsing '/home/cfarrell/svn/origsvn/11-2/live/etc/asterisk/extensions.conf': Found -- Registered extension context 'test'; registrar: pbx_config -- Added extension 'agi4' priority 1 to test -- Added extension 'agi4' priority 2 to test -- Added extension 'agi4' priority 1 (CID match '') to test == Parsing '/home/cfarrell/svn/origsvn/11-2/live/etc/asterisk/users.conf': Found Program received signal SIGSEGV, Segmentation fault. #0 0x00000000004fdda8 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711 #1 0x000000000054e1d6 in context_merge (extcontexts=0x7ffff5606d18 <local_contexts>, exttable=0xe860e0, context=0xe79af0, registrar=0x7ffff5402d50 <registrar> "pbx_config") at pbx.c:8846 #2 0x000000000054e62d in ast_merge_contexts_and_delete (extcontexts=0x7ffff5606d18 <local_contexts>, exttable=0xe860e0, registrar=0x7ffff5402d50 <registrar> "pbx_config") at pbx.c:8962 #3 0x00007ffff5402b9c in pbx_load_module () at pbx_config.c:1872 #4 0x00007ffff5402c6f in load_module () at pbx_config.c:1894 #5 0x000000000050b695 in start_resource (mod=0xe81cc0) at loader.c:897 #6 0x000000000050ba77 in load_resource (resource_name=0x8a73ec "pbx_config.so", global_symbols_only=0, resource_heap=0x0, required=0) at loader.c:985 #7 0x000000000050bcfe in ast_load_resource (resource_name=0x8a73ec "pbx_config.so") at loader.c:1001 #8 0x000000000049bfc7 in handle_load (e=0x850db0 <cli_cli+2800>, cmd=-4, a=0x7fffffffbb50) at cli.c:253 #9 0x00000000004a46fe in ast_cli_command_full (uid=-1, gid=-1, fd=1, s=0x8a79f0 "module load pbx_config.so ") at cli.c:2691 #10 0x00000000004460dc in consolehandler (s=0x8a79f0 "module load pbx_config.so ") at asterisk.c:2094 #11 0x000000000044d28d in main (argc=2, argv=0x7fffffffe138) at asterisk.c:4412 {noformat} After unloading pbx_config other dialplan related operations (like dialplan show) will cause crashes as well, but the backtrace is less useful. By: Matt Jordan (mjordan) 2015-04-19 14:53:37.806-0500 Using the latest tip of 11, I'm not able to reproduce the crash you have here using the provided dialplan (extensions.conf) and a blank extensions.ael. I am still able to reproduce a similar crash using the {{pbx/callerid_merge}} test, although the backtrace does look a bit different: {noformat} 16 [Thread debugging using libthread_db enabled] 17 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 18 Core was generated by `/usr/sbin/asterisk -f -g -q -m -n -C /tmp/asterisk-testsuite/483c7dcc41a42228c9'. 19 Program terminated with signal SIGSEGV, Segmentation fault. 20 #0 0x00000000004f7336 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711 21 711 it->next = tab->tlist; 22 #0 0x00000000004f7336 in ast_hashtab_start_traversal (tab=0x0) at hashtab.c:711 23 it = 0x7fa9ac000930 24 __PRETTY_FUNCTION__ = "ast_hashtab_start_traversal" 25 #1 0x000000000054cafe in __ast_context_destroy (list=0x255c810, contexttab=0x2554c20, con=0x0, registrar=0x5caa1e "features") at pbx.c:10889 26 end_traversal = 1 27 prio_iter = 0x7fa9ac000930 28 ipl = 0x0 29 ipn = 0x0 30 i = 0x0 31 sw = 0x0 32 exten_iter = 0x7fa9ac000b40 33 ip = 0x0 34 pi = 0x0 35 ni = 0x0 36 next = 0x0 37 tmp = 0x255ad40 38 tmpl = 0x0 39 exten_item = 0x255f720 40 prio_item = 0x0 41 __PRETTY_FUNCTION__ = "__ast_context_destroy" 42 #2 0x000000000054d13c in ast_context_destroy (con=0x0, registrar=0x5caa1e "features") at pbx.c:10968 43 No locals. 44 #3 0x00000000004e5628 in features_shutdown () at features.c:9072 45 No locals. 46 #4 0x0000000000442c63 in ast_run_atexits (run_cleanups=1) at asterisk.c:973 {noformat} In this case, we're crashing on shutdown, with what appears to be a NULL {{peer_table}}. It doesn't look like the PBX core is especially clear on whether or not this can be NULL. Plenty of places assume that it can't be NULL, but in other locations we clearly set the pointer to NULL. Lovely. By: Matt Jordan (mjordan) 2015-04-19 15:39:00.449-0500 What a freaking mess. The handling of extensions in the PBX core is sometimes so screwy it boggles the mind. Thank god we have code reviews today. Here is what is most likely happening. # {{pbx_config}} and {{pbx_ael}} are unloaded prior to {{res_agi}}. As a result, they are unable to fully unload their extensions, as an application is still using it: {code} [Apr 19 15:24:59] VERBOSE[1269] pbx.c: -- Remove test/agi1/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil) [Apr 19 15:24:59] ERROR[1269] pbx.c: Did not remove this exten (agi1) from the context root_table (test) (priority 1) {code} # However, we now have a problem. Throwing a bit more debug into the process, we can see that {{end_traversal}} returns 0, which causes the priority object to get destroyed. The comment above this explains why: {code} [Apr 19 15:24:59] VERBOSE[1269] pbx.c: -- Remove test/agi1/1, registrar=pbx_config; con=<nil>((nil)); con->root=(nil) [Apr 19 15:24:59] ERROR[1269] pbx.c: Destroy 0x175f220 [Apr 19 15:24:59] ERROR[1269] pbx.c: End traversal is 0 [Apr 19 15:24:59] WARNING[1269] pbx.c: Freeing prio_iter on 0x175f220, 0x1764740 ... 0910 /* Explanation: 10911 * ast_context_remove_extension_callerid2 will destroy the extension that it comes across. This 10912 * destruction includes destroying the exten's peer_table, which we are currently traversing. If 10913 * ast_context_remove_extension_callerid2 ever should return '0' then this means we have destroyed 10914 * the hashtable which we are traversing, and thus calling ast_hashtab_end_traversal will result 10915 * in reading invalid memory. Thus, if we detect that we destroyed the hashtable, then we will simply 10916 * free the iterator 10917 */ 10918 if (end_traversal) { 10919 ast_hashtab_end_traversal(prio_iter); 10920 } else { 10921 ast_log(LOG_WARNING, "Freeing prio_iter on %s %p, %p\n", 10922 exten_item->exten, 10923 exten_item, exten_item->peer_table); 10924 ast_free(prio_iter); 10925 } {code} That is, *we destroyed peer_table* while iterating over it. However, we also failed to remove the extension from the context's {{root_table}}, as AGI was still hanging around. # Following that, AGI unloads: {code} [Apr 19 15:24:59] VERBOSE[1269] res_agi.c: == AGI Command 'answer' unregistered [Apr 19 15:24:59] VERBOSE[1269] res_agi.c: == AGI Command 'asyncagi break' unregistered [Apr 19 15:24:59] VERBOSE[1269] res_agi.c: == AGI Command 'channel status' unregistered [Apr 19 15:24:59] VERBOSE[1269] res_agi.c: == AGI Command 'database del' unregistered [Apr 19 15:24:59] VERBOSE[1269] res_agi.c: == AGI Command 'database deltree' unregistered {code} # And then core {{features}} unloads. However, because we still have the AGI-dependent extensions lurking in the {{root_table}}, we try removing and destroying their {{peer_table}} as well: {code} [Apr 19 15:24:59] VERBOSE[1269] pbx.c: -- Remove parkedcalls/700/1, registrar=features; con=<nil>((nil)); con->root=(nil) [Apr 19 15:24:59] ERROR[1269] pbx.c: Destroy 0x1761b90 [Apr 19 15:24:59] ERROR[1269] pbx.c: End traversal is 0 [Apr 19 15:24:59] WARNING[1269] pbx.c: Freeing prio_iter on 0x1761b90, 0x1761db0 [Apr 19 15:24:59] ERROR[1269] pbx.c: Starting traversal on agi1 0x1764de0, (nil) {code} The important bit in the above is that we are ostensibly only wanting to destroy context/extension/priority tuples registered by {{features}}, but our lurking {{agi1}} extension - which has already had its {{peer_table}} destroyed - it still waiting. # At which point, we pass a NULL pointer into {{ast_hashtab_start_traversal}}, and blow up: {code} 10894 prio_iter = ast_hashtab_start_traversal(exten_item->peer_table); {code} The solution here probably is as simple as simply checking that someone else didn't destroy {{peer_table}} and leave the extension floating around. |