
Summary: ASTERISK-26344: Asterisk 13.11.0 + PJSIP crash
Reporter: Ian Gilmour (tuxian)
Labels:
Date Opened: 2016-09-07 09:19:16
Date Closed: 2016-10-31 11:38:20
Priority: Major
Regression?:
Status: Closed/Complete
Components: pjproject/pjsip
Versions: 13.11.0
Frequency of Occurrence:
Related Issues:
    is related to ASTERISK-26387 Asterisk segfaults shortly after starting even with no active calls.
    is related to ASTERISK-26310 Crash occurs every 24 - 48 hours with backtrace log showing fault related to pjsip hash
    is related to ASTERISK-26516 pjsip: Memory corruption with possible memory leak.
    is related to ASTERISK-26517 pjsip: Memory corruption with possible memory leak.
Environment: Centos 6.8 (64-bit) + Asterisk-13.11.0 (with bundled 2.5.5 pjsip) + libsrtp-1.5.4
Attachments:
( 0) 0001-r5400-pjsip_tx_data_dec_ref.patch
( 1) 0006_jira_asterisk_26344_v13_fix_cache.patch
( 2) 0006_jira_asterisk_26344_v13_fix.patch
( 3) 2016-10-28.tgz
( 4) cli-and-gdb.tgz
( 5) cli-and-gdb.tgz
( 6) cli-and-gdb-2016-10-24.tgz
( 7) cli-and-gdb-2016-10-24-crash2.tgz
( 8) cli-and-gdb-2016-10-25-crash1.tgz
( 9) cli-and-gdb-2016-10-26-crash1-after-fix.tgz
(10) cli-and-gdb-2016-10-27.tgz
(11) cli-and-gdb-3-crashes.tgz
(12) cli-and-gdb-bt-on-destroying_tx_data.tgz
(13) cli-and-gdb-inc-dec-ref-logging.tgz
(14) jira_asterisk_26344_v13_debuging.patch
Description: Hi,

I have a development Asterisk 13.11.0 test setup (uses the bundled pjsip-2.5.5).

Environment is Centos 6.8 (64-bit) + Asterisk-13.11.0 + libsrtp-1.5.4.

On startup Asterisk registers 5 Asterisk users with a remote OpenSIPS server, over TLS, using PJSIP. As part of the test, all 5 Asterisk PJSIP users are reregistered with the OpenSIPS server every couple of minutes.
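
For illustration, each of the 5 registrations is a standard res_pjsip outbound registration. A minimal sketch of one (section names, host, and credentials are hypothetical - the real config is not attached; the 120s expiration matches the ~2 minute reregistration interval):

{noformat}
; pjsip.conf - illustrative sketch only, not the actual test config
[transport-tls]
type = transport
protocol = tls
bind = 0.0.0.0:5061

[reg-user1]
type = registration
transport = transport-tls
server_uri = sip:opensips.example.com
client_uri = sip:user1@opensips.example.com
outbound_auth = auth-user1
expiration = 120

[auth-user1]
type = auth
username = user1
password = secret
{noformat}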

All outgoing/incoming PJSIP call media is encrypted using SRTP and routed via an external RTPPROXY running alongside the external OpenSIPS server.

Asterisk is also configured to use chan_sip on 127.0.0.1:5060 to allow calls from a locally run SIPp process. All SIPp calls are TCP+RTP.

I use SIPp to run multiple concurrent loopback calls (calls vary in duration) through Asterisk to the OpenSIPS server and back to an echo() service running on the same Asterisk.
i.e.
{noformat}
 SIPp <-TCP/RTP-> chan_sip <-> chan_pjsip <-TLS/SRTP->
     OpenSIPS server (+ rtpproxy) <-TLS/SRTP-> chan_pjsip (echo service).
{noformat}
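
The exact SIPp scenario isn't attached; a hypothetical invocation using SIPp's built-in UAC scenario over TCP (call counts/rates made up for illustration) would look something like:

{noformat}
# illustrative only - not the actual test command
sipp -sn uac -t t1 127.0.0.1:5060 -s 100 -r 2 -l 20 -d 30000 -m 500
{noformat}

(-t t1 selects TCP with a single socket, -l caps the number of concurrent calls, and -d sets the call hold time in ms.)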

Initially I see all chan_pjsip registrations and reregistrations for all 5 PJSIP users go out through a single TCP port. I then start a SIPp test running multiple concurrent calls. At some point into the test the Asterisk PJSIP TCP port gets closed and reopened - when it does so I see Asterisk crash shortly afterwards. Possibly significantly(?), the crash happened around the time one of the PJSIP users should have reregistered after the outgoing TCP port change (the log shows all 5 PJSIP users reregistering after the PJSIP TCP port change, but only 4 of the 5 reregistering twice before the crash).

{noformat}
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa2814700 (LWP 7166)]
__pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50
50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
(gdb) bt
#0  __pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50
#1  0x00007ffff78e9d9b in pj_mutex_lock (mutex=0x44492d6c6c6143) at ../src/pj/os_core_unix.c:1265
#2  0x00007ffff78e9e39 in pj_atomic_dec_and_get (atomic_var=0x7fffd8074630) at ../src/pj/os_core_unix.c:962
#3  0x00007ffff787d7e0 in pjsip_tx_data_dec_ref (tdata=0x7fff8c3bfab8) at ../src/pjsip/sip_transport.c:495
#4  0x00007ffff788a087 in tsx_shutdown (tsx=0x7fff94060a98) at ../src/pjsip/sip_transaction.c:1062
#5  0x00007ffff788b4bc in tsx_set_state (tsx=0x7fff94060a98, state=PJSIP_TSX_STATE_DESTROYED, event_src_type=PJSIP_EVENT_TIMER, event_src=0x7fff94060c50, flag=0) at ../src/pjsip/sip_transaction.c:1271
#6  0x00007ffff788b88e in tsx_on_state_terminated (tsx=<value optimized out>, event=<value optimized out>) at ../src/pjsip/sip_transaction.c:3337
#7  0x00007ffff788bcd5 in tsx_timer_callback (theap=<value optimized out>, entry=0x7fff94060c50) at ../src/pjsip/sip_transaction.c:1171
#8  0x00007ffff78fc449 in pj_timer_heap_poll (ht=0x1137950, next_delay=0x7fffa2813d30) at ../src/pj/timer.c:643
#9  0x00007ffff7875b19 in pjsip_endpt_handle_events2 (endpt=0x1137668, max_timeout=0x7fffa2813d70, p_count=0x0) at ../src/pjsip/sip_endpoint.c:712
#10 0x00007ffff1320b00 in monitor_thread_exec (endpt=<value optimized out>) at res_pjsip.c:3889
#11 0x00007ffff78ea5d6 in thread_main (param=0x114dee8) at ../src/pj/os_core_unix.c:541
#12 0x00007ffff5a8faa1 in start_thread (arg=0x7fffa2814700) at pthread_create.c:301
#13 0x00007ffff509baad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) bt full
#0  __pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50
       type = <value optimized out>
       id = <value optimized out>
#1  0x00007ffff78e9d9b in pj_mutex_lock (mutex=0x44492d6c6c6143) at ../src/pj/os_core_unix.c:1265
       status = <value optimized out>
#2  0x00007ffff78e9e39 in pj_atomic_dec_and_get (atomic_var=0x7fffd8074630) at ../src/pj/os_core_unix.c:962
       new_value = <value optimized out>
#3  0x00007ffff787d7e0 in pjsip_tx_data_dec_ref (tdata=0x7fff8c3bfab8) at ../src/pjsip/sip_transport.c:495
No locals.
#4  0x00007ffff788a087 in tsx_shutdown (tsx=0x7fff94060a98) at ../src/pjsip/sip_transaction.c:1062
No locals.
#5  0x00007ffff788b4bc in tsx_set_state (tsx=0x7fff94060a98, state=PJSIP_TSX_STATE_DESTROYED, event_src_type=PJSIP_EVENT_TIMER, event_src=0x7fff94060c50, flag=0) at ../src/pjsip/sip_transaction.c:1271
       prev_state = PJSIP_TSX_STATE_TERMINATED
#6  0x00007ffff788b88e in tsx_on_state_terminated (tsx=<value optimized out>, event=<value optimized out>) at ../src/pjsip/sip_transaction.c:3337
No locals.
#7  0x00007ffff788bcd5 in tsx_timer_callback (theap=<value optimized out>, entry=0x7fff94060c50) at ../src/pjsip/sip_transaction.c:1171
       event = {prev = 0x7fff8c5f4908, next = 0x1bfe, type = PJSIP_EVENT_TIMER, body = {timer = {entry = 0x7fff94060c50}, tsx_state = {src = {rdata = 0x7fff94060c50, tdata = 0x7fff94060c50, timer = 0x7fff94060c50, status = -1811542960, data = 0x7fff94060c50},
             tsx = 0x7fffa2813c90, prev_state = -1568588592, type = 32767}, tx_msg = {tdata = 0x7fff94060c50}, tx_error = {tdata = 0x7fff94060c50, tsx = 0x7fffa2813c90}, rx_msg = {rdata = 0x7fff94060c50}, user = {user1 = 0x7fff94060c50, user2 = 0x7fffa2813c90,
             user3 = 0x7fffa2813cd0, user4 = 0x0}}}
       tsx = 0x7fff94060a98
#8  0x00007ffff78fc449 in pj_timer_heap_poll (ht=0x1137950, next_delay=0x7fffa2813d30) at ../src/pj/timer.c:643
       node = 0x7fff94060c50
       grp_lock = 0x7fffd8000ab8
       now = {sec = 613363, msec = 925}
       count = 2
#9  0x00007ffff7875b19 in pjsip_endpt_handle_events2 (endpt=0x1137668, max_timeout=0x7fffa2813d70, p_count=0x0) at ../src/pjsip/sip_endpoint.c:712
       timeout = {sec = 0, msec = 0}
       count = 0
       net_event_count = 0
       c = <value optimized out>
#10 0x00007ffff1320b00 in monitor_thread_exec (endpt=<value optimized out>) at res_pjsip.c:3889
       delay = {sec = 0, msec = 10}
#11 0x00007ffff78ea5d6 in thread_main (param=0x114dee8) at ../src/pj/os_core_unix.c:541
       rec = 0x114dee8
       result = <value optimized out>
#12 0x00007ffff5a8faa1 in start_thread (arg=0x7fffa2814700) at pthread_create.c:301
       __res = <value optimized out>
       pd = 0x7fffa2814700
       now = <value optimized out>
       unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735919769344, -4896504223120570676, 140737488337344, 140735919770048, 0, 3, 4896356555646224076, 4896525551845689036}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
             canceltype = 0}}}
       not_first_call = <value optimized out>
       pagesize_m1 = <value optimized out>
       sp = <value optimized out>
       freesize = <value optimized out>
#13 0x00007ffff509baad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
{noformat}
Comments:

By: Asterisk Team (asteriskteam) 2016-09-07 09:19:17.415-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2016-09-07 20:46:53.734-0500

Were you able to reproduce this? Or did it only happen the one time?

If you can reproduce it, can you recompile with DONT_OPTIMIZE and BETTER_BACKTRACES and provide a new trace? Along with that, a full log showing verbose and debug (both at level 5 or above) up to the time of the crash?

By: Ian Gilmour (tuxian) 2016-09-08 04:00:40.558-0500

I repeated the test (with DONT_OPTIMIZE and BETTER_BACKTRACES) this morning - got another (different) crash after ~50 mins.

During the test it closed and reopened the PJSIP TCP port a few times and appeared to recover ok each time.

n.b.
* The full log file is > 1GB in size - so I'm only attaching the tail end of it (let me know if you need more).
* I inadvertently left PJ_LOG_MAX_LEVEL set to 5 during the run (I can rerun the test with it set to the default 3 if need be).

{noformat}
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff420c1700 (LWP 21901)]
0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177
177    for (p_entry = &ht->table[hash & ht->rows], entry=*p_entry;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.5-7.el6_0.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-20.el6_7.x86_64 elfutils-libelf-0.164-2.el6.x86_64 gsm-1.0.13-4.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-57.el6.x86_64 libacl-2.2.49-6.el6.x86_64 libattr-2.4.44-7.el6.x86_64 libcap-2.16-5.5.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64 libuuid-2.17.2-12.24.el6.x86_64 libxml2-2.7.6-21.el6_8.1.x86_64 libxslt-1.1.26-2.el6_3.1.x86_64 lm_sensors-libs-3.1.1-17.el6.x86_64 lua-5.1.4-4.1.el6.x86_64 mysql-libs-5.1.73-7.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64 net-snmp-libs-5.5-57.el6.x86_64 nspr-4.11.0-1.el6.x86_64 nss-3.21.0-8.el6.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.21.0-2.el6.x86_64 openldap-2.4.40-12.el6.x86_64 perl-libs-5.10.1-141.el6_7.1.x86_64 popt-1.13-7.el6.x86_64 rpm-libs-4.8.0-55.el6.x86_64 speex-1.2-0.12.rc1.1.el6.x86_64 sqlite-3.6.20-1.el6_7.2.x86_64 tcp_wrappers-libs-7.6-58.el6.x86_64 unixODBC-2.2.14-14.el6.x86_64 xz-libs-4.999.9-0.5.beta.20091007git.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177
#1  0x00007ffff78e3890 in pj_hash_get (ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=4294967295, hval=0x7fff420c071c) at ../src/pj/hash.c:230
#2  0x00007ffff12c168f in ast_sip_dict_get (ht=0x7fff88277558, key=0x7fffa2bb7b05 "on_response") at res_pjsip.c:3954
#3  0x00007fffa2bb5bb2 in session_inv_on_tsx_state_changed (inv=0x7fff94049678, tsx=0x7fff8976af48, e=0x7fff420c08d0) at res_pjsip_session.c:2623
#4  0x00007ffff7818594 in mod_inv_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip-ua/sip_inv.c:699
#5  0x00007ffff7864460 in pjsip_dlg_on_tsx_state (dlg=0x7fff887891d8, tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_dialog.c:2055
#6  0x00007ffff7864cd0 in mod_ua_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_ua_layer.c:178
#7  0x00007ffff785cdf8 in tsx_set_state (tsx=0x7fff8976af48, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffd8257bc8, flag=0) at ../src/pjsip/sip_transaction.c:1233
#8  0x00007ffff785fd39 in tsx_on_state_proceeding_uac (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2957
#9  0x00007ffff785f357 in tsx_on_state_calling (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2540
#10 0x00007ffff785dcef in pjsip_tsx_recv_msg (tsx=0x7fff8976af48, rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:1787
#11 0x00007ffff785c28a in mod_tsx_layer_on_rx_response (rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:875
#12 0x00007ffff7842b8d in pjsip_endpt_process_rx_data (endpt=0x123d1a8, rdata=0x7fffd8257bc8, p=0x7ffff14f7e40, p_handled=0x7fff420c0b64) at ../src/pjsip/sip_endpoint.c:894
#13 0x00007ffff12dce59 in distribute (data=0x7fffd8257bc8) at res_pjsip/pjsip_distributor.c:765
#14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x7fff94094648) at taskprocessor.c:938
#15 0x00000000005f6aca in execute_tasks (data=0x7fff94094648) at threadpool.c:1322
#16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x1236e88) at taskprocessor.c:938
#17 0x00000000005f4dd5 in threadpool_execute (pool=0x123ab28) at threadpool.c:351
#18 0x00000000005f63e8 in worker_active (worker=0x7fff9006c868) at threadpool.c:1105
#19 0x00000000005f6194 in worker_start (arg=0x7fff9006c868) at threadpool.c:1024
#20 0x00000000006023ee in dummy_start (data=0x7fff900696d0) at utils.c:1235
#21 0x00007ffff5a4baa1 in start_thread (arg=0x7fff420c1700) at pthread_create.c:301
#22 0x00007ffff5057aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

(gdb) bt full
#0  0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177
       hash = 1918918923
       p_entry = 0x3085c
       entry = 0x57d11483
#1  0x00007ffff78e3890 in pj_hash_get (ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=4294967295, hval=0x7fff420c071c) at ../src/pj/hash.c:230
       entry = 0x7fff94049678
#2  0x00007ffff12c168f in ast_sip_dict_get (ht=0x7fff88277558, key=0x7fffa2bb7b05 "on_response") at res_pjsip.c:3954
       hval = 1918918923
#3  0x00007fffa2bb5bb2 in session_inv_on_tsx_state_changed (inv=0x7fff94049678, tsx=0x7fff8976af48, e=0x7fff420c08d0) at res_pjsip_session.c:2623
       cb = 0x4
       id = 15
       session = 0x7fff9404cab8
       tdata = 0x7ffff78dd302
       __PRETTY_FUNCTION__ = "session_inv_on_tsx_state_changed"
#4  0x00007ffff7818594 in mod_inv_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip-ua/sip_inv.c:699
       dlg = 0x7fff887891d8
       inv = 0x7fff94049678
#5  0x00007ffff7864460 in pjsip_dlg_on_tsx_state (dlg=0x7fff887891d8, tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_dialog.c:2055
       i = 2
#6  0x00007ffff7864cd0 in mod_ua_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_ua_layer.c:178
       dlg = 0x7fff887891d8
#7  0x00007ffff785cdf8 in tsx_set_state (tsx=0x7fff8976af48, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffd8257bc8, flag=0) at ../src/pjsip/sip_transaction.c:1233
       e = {prev = 0x7fff420c0910, next = 0x7ffff78f33cb, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7fffd8257bc8}, tsx_state = {src = {rdata = 0x7fffd8257bc8, tdata = 0x7fffd8257bc8, timer = 0x7fffd8257bc8, status = -668632120, data = 0x7fffd8257bc8}, tsx = 0x7fff8976af48, prev_state = 1, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7fffd8257bc8}, tx_error = {tdata = 0x7fffd8257bc8, tsx = 0x7fff8976af48}, rx_msg = {rdata = 0x7fffd8257bc8}, user = {user1 = 0x7fffd8257bc8, user2 = 0x7fff8976af48, user3 = 0x300000001, user4 = 0x28976aff0}}}
       prev_state = PJSIP_TSX_STATE_CALLING
#8  0x00007ffff785fd39 in tsx_on_state_proceeding_uac (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2957
       timeout = {sec = 0, msec = 0}
#9  0x00007ffff785f357 in tsx_on_state_calling (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2540
       msg = 0x7fffd8259690
       code = 200
#10 0x00007ffff785dcef in pjsip_tsx_recv_msg (tsx=0x7fff8976af48, rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:1787
       event = {prev = 0x0, next = 0x7fff8816e528, type = PJSIP_EVENT_RX_MSG, body = {timer = {entry = 0x7fffd8257bc8}, tsx_state = {src = {rdata = 0x7fffd8257bc8, tdata = 0x7fffd8257bc8, timer = 0x7fffd8257bc8, status = -668632120, data = 0x7fffd8257bc8}, tsx = 0x7fff420c0a70, prev_state = -141699390, type = 32767}, tx_msg = {tdata = 0x7fffd8257bc8}, tx_error = {tdata = 0x7fffd8257bc8, tsx = 0x7fff420c0a70}, rx_msg = {rdata = 0x7fffd8257bc8}, user = {user1 = 0x7fffd8257bc8, user2 = 0x7fff420c0a70, user3 = 0x7ffff78dd6c2, user4 = 0x7fff420c0a70}}}
#11 0x00007ffff785c28a in mod_tsx_layer_on_rx_response (rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:875
       key = {ptr = 0x7fff89121f80 "c$BYE$z9hG4bKPj4ce37655-c6fe-467e-9834-0cd83e7839c6", slen = 51}
       hval = 452279487
       tsx = 0x7fff8976af48
#12 0x00007ffff7842b8d in pjsip_endpt_process_rx_data (endpt=0x123d1a8, rdata=0x7fffd8257bc8, p=0x7ffff14f7e40, p_handled=0x7fff420c0b64) at ../src/pjsip/sip_endpoint.c:894
       msg = 0x7fffd8259690
       def_prm = {start_prio = 1108085632, start_mod = 0x6f3e0b, idx_after_start = 7291691, silent = 0}
       mod = 0x7ffff7b54220
       handled = 0
       i = 1
       status = 0
#13 0x00007ffff12dce59 in distribute (data=0x7fffd8257bc8) at res_pjsip/pjsip_distributor.c:765
       param = {start_prio = 0, start_mod = 0x7ffff14f7ca0, idx_after_start = 1, silent = 0}
       handled = 0
       rdata = 0x7fffd8257bc8
       is_request = 0
       is_ack = 0
       endpoint = 0x7fff00000000
#14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x7fff94094648) at taskprocessor.c:938
       local = {local_data = 0x7fff420c19c0, data = 0x600b26}
       t = 0x7fffd817e970
       size = 9989040
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#15 0x00000000005f6aca in execute_tasks (data=0x7fff94094648) at threadpool.c:1322
       tps = 0x7fff94094648
#16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x1236e88) at taskprocessor.c:938
       local = {local_data = 0x57d11483, data = 0x0}
       t = 0x7fffd80a2b30
       size = 140734301476608
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#17 0x00000000005f4dd5 in threadpool_execute (pool=0x123ab28) at threadpool.c:351
       __PRETTY_FUNCTION__ = "threadpool_execute"
#18 0x00000000005f63e8 in worker_active (worker=0x7fff9006c868) at threadpool.c:1105
       alive = 1
#19 0x00000000005f6194 in worker_start (arg=0x7fff9006c868) at threadpool.c:1024
       worker = 0x7fff9006c868
       saved_state = 32767
       __PRETTY_FUNCTION__ = "worker_start"
#20 0x00000000006023ee in dummy_start (data=0x7fff900696d0) at utils.c:1235
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140734301476608, -8273484635704556838, 140737221511808, 140734301477312, 140735609810640, 3, -8273484635729722662, 8273157970701812442}, __mask_was_saved = 0}}, __pad = {0x7fff420c0e30, 0x0, 0x0, 0x20}}
       __cancel_routine = 0x451815 <ast_unregister_thread>
       __cancel_arg = 0x7fff420c1700
       not_first_call = 0
       ret = 0x7ffff52f9858
       a = {start_routine = 0x5f610d <worker_start>, data = 0x7fff9006c868, name = 0x7fff900ce6e0 "worker_start         started at [ 1079] threadpool.c worker_thread_start()"}
#21 0x00007ffff5a4baa1 in start_thread (arg=0x7fff420c1700) at pthread_create.c:301
       __res = <value optimized out>
       pd = 0x7fff420c1700
       now = <value optimized out>
       unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140734301476608, 8273158244416717530, 140737221511808, 140734301477312, 140735609810640, 3, -8273484635702459686, -8273179729874550054}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
       not_first_call = <value optimized out>
       pagesize_m1 = <value optimized out>
       sp = <value optimized out>
       freesize = <value optimized out>
#22 0x00007ffff5057aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
{noformat}



By: Ian Gilmour (tuxian) 2016-09-08 04:11:32.476-0500

cli prior to crash + gdb output.

By: Ian Gilmour (tuxian) 2016-09-09 06:42:04.940-0500

This morning (after a rebuild), using the same config, I got another crash - this time it occurred immediately on Asterisk startup (i.e. no calls running)!!

Could it be caused by some sort of race condition on multiple PJSIP registrations to the same SIP server? I can't think of anything else that's odd about this setup.

{noformat}
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa18ed700 (LWP 16459)]
0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935
935         pj_mutex_lock( atomic_var->mutex );
(gdb) bt
#0  0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935
#1  0x00007ffff741b20c in pj_atomic_inc (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:949
#2  0x00007ffff7387836 in pjsip_tx_data_add_ref (tdata=0x7fff8800d958) at ../src/pjsip/sip_transport.c:464
#3  0x00007fff6410e929 in sip_outbound_registration_response_cb (param=0x7fffa18ec5a0) at res_pjsip_outbound_registration.c:939
#4  0x00007ffff7360202 in call_callback (regc=0x7fff88008338, status=0, st_code=200, reason=0x7fffcc006ae0, rdata=0x7fffcc005008, expiration=120, contact_cnt=1, contact=0x7fffa18ec680, is_unreg=0) at ../src/pjsip-ua/sip_reg.c:775
#5  0x00007ffff73614c1 in regc_tsx_callback (token=0x7fff88008338, event=0x7fffa18ec8d0) at ../src/pjsip-ua/sip_reg.c:1345
#6  0x00007ffff739e59f in mod_util_on_tsx_state (tsx=0x7fff88015b98, event=0x7fffa18ec8d0) at ../src/pjsip/sip_util_statefull.c:81
#7  0x00007ffff739ae90 in tsx_set_state (tsx=0x7fff88015b98, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffcc005008, flag=0) at ../src/pjsip/sip_transaction.c:1233
#8  0x00007ffff739ddd1 in tsx_on_state_proceeding_uac (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2957
#9  0x00007ffff739d3ef in tsx_on_state_calling (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2540
#10 0x00007ffff739bd87 in pjsip_tsx_recv_msg (tsx=0x7fff88015b98, rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:1787
#11 0x00007ffff739a322 in mod_tsx_layer_on_rx_response (rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:875
#12 0x00007ffff7380c25 in pjsip_endpt_process_rx_data (endpt=0x123f798, rdata=0x7fffcc005008, p=0x7ffff1035e40, p_handled=0x7fffa18ecb64) at ../src/pjsip/sip_endpoint.c:894
#13 0x00007ffff0e1ae59 in distribute (data=0x7fffcc005008) at res_pjsip/pjsip_distributor.c:765
#14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x11a7cb8) at taskprocessor.c:938
#15 0x00000000005f6aca in execute_tasks (data=0x11a7cb8) at threadpool.c:1322
#16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x123ce38) at taskprocessor.c:938
#17 0x00000000005f4dd5 in threadpool_execute (pool=0x123d088) at threadpool.c:351
#18 0x00000000005f63e8 in worker_active (worker=0x7fff90000ee8) at threadpool.c:1105
#19 0x00000000005f6194 in worker_start (arg=0x7fff90000ee8) at threadpool.c:1024
#20 0x00000000006023ee in dummy_start (data=0x7fff90000e70) at utils.c:1235
#21 0x00007ffff5589aa1 in start_thread (arg=0x7fffa18ed700) at pthread_create.c:301
#22 0x00007ffff4b95aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) bt full
#0  0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935
       new_value = 140737354108928
#1  0x00007ffff741b20c in pj_atomic_inc (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:949
No locals.
#2  0x00007ffff7387836 in pjsip_tx_data_add_ref (tdata=0x7fff8800d958) at ../src/pjsip/sip_transport.c:464
No locals.
#3  0x00007fff6410e929 in sip_outbound_registration_response_cb (param=0x7fffa18ec5a0) at res_pjsip_outbound_registration.c:939
       retry_after = 0x0
       tsx = 0x7fff88015b98
       client_state = 0x11a7b78
       response = 0x7fff88018b88
       callback_invoked = 0x7fff88010060
       __PRETTY_FUNCTION__ = "sip_outbound_registration_response_cb"
#4  0x00007ffff7360202 in call_callback (regc=0x7fff88008338, status=0, st_code=200, reason=0x7fffcc006ae0, rdata=0x7fffcc005008, expiration=120, contact_cnt=1, contact=0x7fffa18ec680, is_unreg=0) at ../src/pjsip-ua/sip_reg.c:775
       cbparam = {regc = 0x7fff88008338, token = 0x11a7b78, status = 0, code = 200, reason = {ptr = 0x7fffcc006b40 "OK.5", slen = 2}, rdata = 0x7fffcc005008, expiration = 120, contact_cnt = 1, contact = {0x7fff8800db98, 0xa18ec680, 0x0, 0x7fffa18ec620, 0x7ffff741b75a, 0x7fff8800dc40, 0x7fff88007fa8, 0x7835015fb8, 0x88008338, 0x7fffa18ec640}, is_unreg = 0}
#5  0x00007ffff73614c1 in regc_tsx_callback (token=0x7fff88008338, event=0x7fffa18ec8d0) at ../src/pjsip-ua/sip_reg.c:1345
       is_unreg = 0
       rdata = 0x7fffcc005008
       expiration = 120
       contact_cnt = 1
       contact = {0x7fff8800db98, 0x45e035, 0xffffffffa18ec6b0, 0x7fff880119e8, 0x7fff8800e388, 0x11a7bf0, 0x7fffa18ec830, 0x7fffa18ec820, 0x7fffa18ec710, 0x7ffff741cc2c}
       status = 48
       regc = 0x7fff88008338
       tsx = 0x7fff88015b98
       handled = 0
       update_contact = 0
#6  0x00007ffff739e59f in mod_util_on_tsx_state (tsx=0x7fff88015b98, event=0x7fffa18ec8d0) at ../src/pjsip/sip_util_statefull.c:81
       tsx_data = 0x7fff88015fb8
#7  0x00007ffff739ae90 in tsx_set_state (tsx=0x7fff88015b98, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffcc005008, flag=0) at ../src/pjsip/sip_transaction.c:1233
       e = {prev = 0x7fffa18ec910, next = 0x7ffff7431463, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7fffcc005008}, tsx_state = {src = {rdata = 0x7fffcc005008, tdata = 0x7fffcc005008, timer = 0x7fffcc005008, status = -872394744, data = 0x7fffcc005008}, tsx = 0x7fff88015b98, prev_state = 1, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7fffcc005008}, tx_error = {tdata = 0x7fffcc005008, tsx = 0x7fff88015b98}, rx_msg = {rdata = 0x7fffcc005008}, user = {user1 = 0x7fffcc005008, user2 = 0x7fff88015b98, user3 = 0x300000001, user4 = 0x288015c40}}}
       prev_state = PJSIP_TSX_STATE_CALLING
#8  0x00007ffff739ddd1 in tsx_on_state_proceeding_uac (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2957
       timeout = {sec = 0, msec = 0}
#9  0x00007ffff739d3ef in tsx_on_state_calling (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2540
       msg = 0x7fffcc006ad0
       code = 200
#10 0x00007ffff739bd87 in pjsip_tsx_recv_msg (tsx=0x7fff88015b98, rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:1787
       event = {prev = 0x0, next = 0x7fff88016438, type = PJSIP_EVENT_RX_MSG, body = {timer = {entry = 0x7fffcc005008}, tsx_state = {src = {rdata = 0x7fffcc005008, tdata = 0x7fffcc005008, timer = 0x7fffcc005008, status = -872394744, data = 0x7fffcc005008}, tsx = 0x7fffa18eca70, prev_state = -146688166, type = 32767}, tx_msg = {tdata = 0x7fffcc005008}, tx_error = {tdata = 0x7fffcc005008, tsx = 0x7fffa18eca70}, rx_msg = {rdata = 0x7fffcc005008}, user = {user1 = 0x7fffcc005008, user2 = 0x7fffa18eca70, user3 = 0x7ffff741b75a, user4 = 0x7fffa18eca70}}}
#11 0x00007ffff739a322 in mod_tsx_layer_on_rx_response (rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:875
       key = {ptr = 0x7fff8800de50 "c$REGISTER$z9hG4bKPjc25c7b82-ee5b-4ac0-9679-dfe63f1cfdef", slen = 56}
       hval = 4161045600
       tsx = 0x7fff88015b98
#12 0x00007ffff7380c25 in pjsip_endpt_process_rx_data (endpt=0x123f798, rdata=0x7fffcc005008, p=0x7ffff1035e40, p_handled=0x7fffa18ecb64) at ../src/pjsip/sip_endpoint.c:894
       msg = 0x7fffcc006ad0
       def_prm = {start_prio = 2710489984, start_mod = 0x6f3e0b, idx_after_start = 2710490000, silent = 32767}
       mod = 0x7ffff76922a0
       handled = 0
       i = 1
       status = 0
#13 0x00007ffff0e1ae59 in distribute (data=0x7fffcc005008) at res_pjsip/pjsip_distributor.c:765
       param = {start_prio = 0, start_mod = 0x7ffff1035ca0, idx_after_start = 1, silent = 0}
       handled = 0
       rdata = 0x7fffcc005008
       is_request = 0
       is_ack = 0
       endpoint = 0x0
#14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x11a7cb8) at taskprocessor.c:938
       local = {local_data = 0x7fffa18ed9c0, data = 0x600b26}
       t = 0x7fffcc02a000
       size = 9989040
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#15 0x00000000005f6aca in execute_tasks (data=0x11a7cb8) at threadpool.c:1322
       tps = 0x11a7cb8
#16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x123ce38) at taskprocessor.c:938
       local = {local_data = 0x57d28c4d, data = 0x0}
       t = 0x7fffcc029d10
       size = 140735903880960
       __PRETTY_FUNCTION__ = "ast_taskprocessor_execute"
#17 0x00000000005f4dd5 in threadpool_execute (pool=0x123d088) at threadpool.c:351
       __PRETTY_FUNCTION__ = "threadpool_execute"
#18 0x00000000005f63e8 in worker_active (worker=0x7fff90000ee8) at threadpool.c:1105
       alive = 0
#19 0x00000000005f6194 in worker_start (arg=0x7fff90000ee8) at threadpool.c:1024
       worker = 0x7fff90000ee8
       saved_state = 32767
       __PRETTY_FUNCTION__ = "worker_start"
#20 0x00000000006023ee in dummy_start (data=0x7fff90000e70) at utils.c:1235
       __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140735903880960, -6756505628295498910, 140737220745856, 140735903881664, 0, 3, -6756505628337441950, 6756334233892352866}, __mask_was_saved = 0}}, __pad = {0x7fffa18ece30, 0x0, 0x0, 0x0}}
       __cancel_routine = 0x451815 <ast_unregister_thread>
       __cancel_arg = 0x7fffa18ed700
       not_first_call = 0
       ret = 0x0
       a = {start_routine = 0x5f610d <worker_start>, data = 0x7fff90000ee8, name = 0x7fff90001000 "worker_start         started at [ 1079] threadpool.c worker_thread_start()"}
#21 0x00007ffff5589aa1 in start_thread (arg=0x7fffa18ed700) at pthread_create.c:301
       __res = <value optimized out>
       pd = 0x7fffa18ed700
       now = <value optimized out>
       unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735903880960, 6756335057497289570, 140737220745856, 140735903881664, 0, 3, -6756505628289207454, -6756322751223106718}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
       not_first_call = <value optimized out>
       pagesize_m1 = <value optimized out>
       sp = <value optimized out>
       freesize = <value optimized out>
#22 0x00007ffff4b95aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
{noformat}

By: Ian Gilmour (tuxian) 2016-09-09 06:54:22.492-0500

cli + gdb output from system startup for new crash.

By: Ian Gilmour (tuxian) 2016-09-09 08:06:39.416-0500

Observation - changing res_pjsip_outbound_registration.c:1267 from:

{noformat}
   schedule_registration(state->client_state, (ast_random() % 10) + 1);
{noformat}

to:

{noformat}
   schedule_registration(state->client_state, 3 + 1);
{noformat}

seems to give me a system that crashes on PJSIP user registration during startup (the only time I see it work is when a SIP 503 is reported for all the initial registration attempts).


By: Richard Mudgett (rmudgett) 2016-10-11 15:49:23.647-0500

I have been unable to find a cause for the crash.  A patch was recently committed to the v13 branch to help gather more information.  The patch extends MALLOC_DEBUG to cover PJPROJECT memory pools when you use the --with-pjproject-bundled Asterisk configure script flag.

{noformat}
./configure --with-pjproject-bundled --enable-dev-mode
{noformat}

https://gerrit.asterisk.org/#/c/4030/

By: Ian Gilmour (tuxian) 2016-10-13 04:15:19.720-0500

Hi Richard,

I built the current Asterisk 13 trunk code (commit 2971c1b4ebf8c4fb5ee2053c5812ea3896d6d17d) to generate the cli-and-gdb-3-crashes.txt.

The system was configured with:

{noformat}
./configure --prefix=$PREFIX --with-crypto=/usr --with-ssl=/usr \
           --with-srtp=$PREFIX --with-libcurl=$PREFIX --with-jansson=$PREFIX \
           --with-pjproject-bundled --enable-dev-mode
{noformat}
and
{noformat}
menuselect/menuselect \
           --enable DONT_OPTIMIZE --enable BETTER_BACKTRACES --enable MALLOC_DEBUG menuselect.makeopts
{noformat}
I changed 1 line of code - the schedule_registration() call as per above (replacing (ast_random() % 10) with 3) - in order to force the issue to occur more quickly, as I am pretty sure it's related to overlapping registrations all going out on the same pjsip transport.

With this 1 line change crashes occur almost immediately on startup. Without the change crashes occur more randomly.

The cli-and-gdb-3-crashes.txt file contains 3 crashes.

I've dumped out some additional data structures with gdb when the crashes occur. At least some of this data looks to be corrupt, e.g. in a print *tsx->last_tx there is an is_pending value of 36. A quick glance through the code suggests this field should only ever be 0 or 1. So it looks like it's potentially accessing data structures that have already been freed and had their memory reallocated - but that's just a guess.

The var/log/asterisk/mmlog file only contains the "New session" text for each run.

By: Ian Gilmour (tuxian) 2016-10-13 09:31:34.956-0500

For cli-and-gdb-inc-dec-ref-logging.txt I modified third-party/pjproject/patches/0001-r5400-pjsip_tx_data_dec_ref.patch to log calls to pjsip_tx_data_add_ref() and pjsip_tx_data_dec_ref().

At the time of the crash it's attempting to use a tdata of 0x7fff80000eb8 but this pjsip_tx_data was destroyed earlier ~[2016-10-13 13:49:34.858].
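
(The added logging was just a PJ_LOG line inside each of the two ref-count functions, roughly along these lines - a sketch, not the actual modified patch:)

{noformat}
/* sketch only - not the actual patch */
PJ_DEF(void) pjsip_tx_data_add_ref( pjsip_tx_data *tdata )
{
    pj_atomic_inc(tdata->ref_cnt);
    PJ_LOG(5, ("tx_data", "add_ref tdata=%p ref_cnt=%ld", tdata,
               (long)pj_atomic_get(tdata->ref_cnt)));
}
{noformat}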


By: Ian Gilmour (tuxian) 2016-10-13 10:50:43.554-0500

cli-and-gdb-bt-on-destroying_tx_data.tgz shows a slightly different crash with the same root cause. In frame 1 tsx->last_tx is pointing to data that has already been freed.

On this gdb trace I set a breakpoint when the pjsip_tx_data is about to be destroyed and got gdb to do a "bt full; cont" on hitting the breakpoint, so you can identify where the pjsip_tx_data destruction is originating.
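
(Roughly, the breakpoint setup looked like the following - reconstructed for illustration, not the exact commands; the placeholder marks wherever the tx_data is actually freed in sip_transport.c:)

{noformat}
(gdb) break sip_transport.c:<line where the tx_data is destroyed>
(gdb) commands
> bt full
> continue
> end
{noformat}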



By: Ross Beer (rossbeer) 2016-10-13 10:56:09.813-0500

Not sure if this is of any help, however could this be fixed by using the SVN version of PJSIP? A patch has just been put up to help resolve TLS issues:

https://trac.pjsip.org/repos/changeset/5459

By: Ian Gilmour (tuxian) 2016-10-13 11:27:43.660-0500

In the case of the last crash, in frame 1 there is an assumption in the code that if tsx->last_tx is non-NULL then it points to valid pjsip_tx_data. But this clearly isn't the case.

I suspect the crashes occur because tsx->last_tx is not being set to NULL everywhere it should be (i.e. whenever pjsip_tx_data_dec_ref() returns PJSIP_EBUFDESTROYED).

There are places in the asterisk code where it sets:

{noformat}
response->old_request = tsx->last_tx;
[sip_outbound_registration_response_cb()]
{noformat}

and others where:

{noformat}
pjsip_tx_data_dec_ref(response->old_request);
[registration_response_destroy()]
{noformat}

with no check for the PJSIP_EBUFDESTROYED return value and no setting to NULL of response->old_request or tsx->last_tx as a result.

I suspect the error lies in these bits of code rather than the PJSIP TLS code.
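
(In other words, a sketch of the defensive pattern being suggested - not actual pjproject/Asterisk source:)

{noformat}
/* sketch of the suggested pattern, not actual source */
if (tsx->last_tx) {
    if (pjsip_tx_data_dec_ref(tsx->last_tx) == PJSIP_EBUFDESTROYED) {
        /* the buffer is gone - don't leave a dangling pointer behind */
        tsx->last_tx = NULL;
    }
}
{noformat}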



By: Richard Mudgett (rmudgett) 2016-10-17 18:09:37.503-0500

Both ASTERISK-26387 and ASTERISK-26344 are likely the same issue.  I have been studying the logs from both issues.  I have noticed that on the ASTERISK-26387 logs there are OPTIONS ping responses for endpoint qualification that are being processed by a different serializer than the request.  This can cause reentrancy problems (e.g. crashes).  The outgoing OPTIONS requests go out on a pjsip/default serializer while the response is processed by a pjsip/distributor serializer because the distributor cannot find the original serializer that sent the request.  I also noticed that when this happened the updated contact status was reporting for an endpoint that needed DNS resolution (sip:sbc.anveno.com was one).  On the ASTERISK-26344 logs a similar thing is happening but I see it for outbound registration requests needing DNS resolution.  The REGISTER response is being processed by a pjsip/distributor serializer while the request went out on a pjsip/outreg serializer.

By: Ian Gilmour (tuxian) 2016-10-21 08:17:42.226-0500

I spent a bit more time looking over the code but I still cannot see where the pjsip_tx_data ref_cnt isn't been incremented correctly.

But, to try and confirm that this is the problem, as a test I modified pjsip_tx_data_create() to call pjsip_tx_data_add_ref() just prior to returning PJ_SUCCESS.

i.e.

{noformat}
#if defined(PJ_DEBUG) && PJ_DEBUG!=0
   pj_atomic_inc( tdata->mgr->tdata_counter );
#endif

   *p_tdata = tdata;
   pjsip_tx_data_add_ref(tdata);  /* added line: take an extra ref to mask the suspected underflow */
   return PJ_SUCCESS;
}
{noformat}

and I now have a system that no longer crashes. But (as expected) it does now leak memory. :-)


By: Richard Mudgett (rmudgett) 2016-10-21 22:05:08.315-0500

[~tuxian] I still haven't found why the tdata ref count is off.  There is some subtle race condition in play here between the serializer and the monitor thread.  I am attaching a patch to gather more debug information concerning the tdata ref counts, the transaction, and the transaction's group locking.  If you can apply the patch and gather the same kind of information you gathered for [cli-and-gdb.tgz], we might be able to see what is happening.  Use debug level 5.

[^jira_asterisk_26344_v13_debuging.patch] - This patch replaces the patch you attached.

# On a clean workspace apply the patch:
{{patch -p1 -i jira_asterisk_26344_v13_debuging.patch}}
# Modify the {{schedule_registration(state->client_state, 3 + 1);}} line to cause the crash so we can see which tdata to look for in the debug log.
# Configure asterisk to use the bundled pjproject.

FYI: With MALLOC_DEBUG enabled, when you see something reference the 0xdeaddeaddeaddead address you know someone is attempting to use already freed memory.

By: Ian Gilmour (tuxian) 2016-10-24 08:55:52.438-0500

Hi Richard,
The cli-and-gdb-2016-10-24.tgz log was generated after:
# applying your patch.
# using:
{noformat}schedule_registration(state->client_state, 3 + 1);{noformat}
# configured using:
{noformat}
--with-pjproject-bundled --enable-dev-mode
{noformat}
# menuselect used:
{noformat}
menuselect/menuselect \
   --enable DONT_OPTIMIZE --enable BETTER_BACKTRACES --enable MALLOC_DEBUG menuselect.makeopts
{noformat}

At the time of the crash frame 0 has entry=0xdeaddeaddeaddead.


By: Ian Gilmour (tuxian) 2016-10-24 09:34:02.792-0500

cli-and-gdb-2016-10-24-crash2.txt is from the same code build, but shows a slightly different crash. This time hdr=0xdeaddeaddeaddead.

By: Ian Gilmour (tuxian) 2016-10-25 04:37:42.304-0500

cli-and-gdb-2016-10-25-crash1.txt - another crash, similar to the last one, but with a bit more debug as to when hdr gets set to 0xdeaddeaddeaddead.

By: Richard Mudgett (rmudgett) 2016-10-25 17:44:23.679-0500

[^0006_jira_asterisk_26344_v13_fix.patch] - This is an initial attempt to fix the crashes resulting from tdata being unreffed too many times.  Copy this file as is into the third-party/pjproject/patches directory.

Let me know how well it works.

By: Ian Gilmour (tuxian) 2016-10-26 05:19:40.679-0500

Still crashing, I'm afraid - see [^cli-and-gdb-2016-10-26-crash1-after-fix.tgz] for details.

By: Richard Mudgett (rmudgett) 2016-10-26 13:08:21.767-0500

[^0006_jira_asterisk_26344_v13_fix_cache.patch] - This patch should fix the DNS cache crash you are now seeing.  Place this file also in the third-party/pjproject/patches directory.  You will need both fix patches because the DNS cache crash is actually a different bug.

Let me know how well these fixes are working.

By: Ian Gilmour (tuxian) 2016-10-27 04:40:57.730-0500

With the latest patch in place I got no immediate crash so I left it running overnight. Still no crashes by morning. So good progress. :-)

But I suspect the following indicate we now have a memory leak.

{noformat}
grep -e ":40:0[012]" -B 3 screenlog.2 | grep -e ":40:0" -e 31339 | sed 's/[[:space:]]*$//'
31339 iang      20   0 2778m  63m  13m S  0.3  0.8   0:47.22 asterisk
top - 21:40:01 up 14 days, 13:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  63m  13m S  0.7  0.8   1:03.18 asterisk
top - 22:40:02 up 14 days, 14:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  64m  13m S  1.7  0.8   1:19.21 asterisk
top - 23:40:01 up 14 days, 15:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  65m  13m S  0.3  0.8   1:36.02 asterisk
top - 00:40:02 up 14 days, 16:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  67m  13m S  0.0  0.9   1:52.29 asterisk
top - 01:40:00 up 14 days, 17:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  67m  13m S  0.7  0.9   2:08.39 asterisk
top - 02:40:01 up 14 days, 18:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  68m  13m S  0.7  0.9   2:24.50 asterisk
top - 03:40:02 up 14 days, 19:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  68m  13m S  0.3  0.9   2:41.09 asterisk
top - 04:40:01 up 14 days, 20:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  69m  13m S  0.3  0.9   2:59.74 asterisk
top - 05:40:02 up 14 days, 21:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  70m  13m S  0.3  0.9   3:16.87 asterisk
top - 06:40:00 up 14 days, 22:39,  5 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  72m  13m S  0.7  0.9   3:33.20 asterisk
top - 07:40:01 up 14 days, 23:39,  6 users,  load average: 0.00, 0.00, 0.00
31339 iang      20   0 2778m  73m  13m S  0.3  0.9   3:49.29 asterisk
top - 08:40:02 up 15 days, 39 min,  6 users,  load average: 0.00, 0.00, 0.00
{noformat}

and,

{noformat}
grep -nH -B 5 -e "bytes in all allocations and deferred free allocations" screenlog.0
screenlog.0-56326-
screenlog.0-56327-  12590115 bytes in all allocations
screenlog.0-56328-   1409194 bytes in deferred free large allocations
screenlog.0-56329-     35735 bytes in deferred free small allocations
screenlog.0-56330-   1444929 bytes in deferred free allocations
screenlog.0:56331:  14035044 bytes in all allocations and deferred free allocations
--
screenlog.0-56594-
screenlog.0-56595-  12591138 bytes in all allocations
screenlog.0-56596-   1405537 bytes in deferred free large allocations
screenlog.0-56597-     35642 bytes in deferred free small allocations
screenlog.0-56598-   1441179 bytes in deferred free allocations
screenlog.0:56599:  14032317 bytes in all allocations and deferred free allocations
--
screenlog.0-59106-
screenlog.0-59107-  12638906 bytes in all allocations
screenlog.0-59108-   1416980 bytes in deferred free large allocations
screenlog.0-59109-     36774 bytes in deferred free small allocations
screenlog.0-59110-   1453754 bytes in deferred free allocations
screenlog.0:59111:  14092660 bytes in all allocations and deferred free allocations
--
screenlog.0-66080-
screenlog.0-66081-  12690106 bytes in all allocations
screenlog.0-66082-   1441081 bytes in deferred free large allocations
screenlog.0-66083-     36137 bytes in deferred free small allocations
screenlog.0-66084-   1477218 bytes in deferred free allocations
screenlog.0:66085:  14167324 bytes in all allocations and deferred free allocations
--
screenlog.0-67593-
screenlog.0-67594-  12700346 bytes in all allocations
screenlog.0-67595-   1439375 bytes in deferred free large allocations
screenlog.0-67596-     36300 bytes in deferred free small allocations
screenlog.0-67597-   1475675 bytes in deferred free allocations
screenlog.0:67598:  14176021 bytes in all allocations and deferred free allocations
--
screenlog.0-285757-
screenlog.0-285758-  15318479 bytes in all allocations
screenlog.0-285759-   1413605 bytes in deferred free large allocations
screenlog.0-285760-     36032 bytes in deferred free small allocations
screenlog.0-285761-   1449637 bytes in deferred free allocations
screenlog.0:285762:  16768116 bytes in all allocations and deferred free allocations
--
screenlog.0-325786-
screenlog.0-325787-  15715874 bytes in all allocations
screenlog.0-325788-   1381982 bytes in deferred free large allocations
screenlog.0-325789-     36186 bytes in deferred free small allocations
screenlog.0-325790-   1418168 bytes in deferred free allocations
screenlog.0:325791:  17134042 bytes in all allocations and deferred free allocations
{noformat}

[^cli-and-gdb-2016-10-27.tgz] is last night's log.

By: Richard Mudgett (rmudgett) 2016-10-27 14:07:40.715-0500

We have more than a potential memory leak.  MALLOC_DEBUG is reporting memory corruption in the [^cli-and-gdb-2016-10-27.tgz] log.  Unfortunately, MALLOC_DEBUG cannot be of any more help in hunting down the corruption.  You'll need to switch to using valgrind after disabling MALLOC_DEBUG to get more information on who is corrupting the memory.

https://wiki.asterisk.org/wiki/display/AST/Valgrind
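
(Something along these lines, with MALLOC_DEBUG disabled first - an illustrative sketch only; see the wiki page for the recommended options and suppressions:)

{noformat}
# illustrative sketch - consult the wiki for the recommended invocation
valgrind --leak-check=full --log-file=valgrind.log asterisk -f -g -c -vvvvv
{noformat}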

By: Ian Gilmour (tuxian) 2016-10-28 10:22:47.001-0500

[^2016-10-28.tgz] has the results of 5 valgrind runs with associated console/gdb outputs.
Some of these I ^C'ed as they didn't want to crash on their own.
For others I changed the Valgrind debug switches used.
Hopefully there is something in here that proves useful.


By: Friendly Automation (friendly-automation) 2016-10-28 17:15:16.306-0500

Change 4228 had a related patch set uploaded by Richard Mudgett:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4228|https://gerrit.asterisk.org/4228]

By: Friendly Automation (friendly-automation) 2016-10-28 17:15:37.648-0500

Change 4229 had a related patch set uploaded by Richard Mudgett:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4229|https://gerrit.asterisk.org/4229]

By: Friendly Automation (friendly-automation) 2016-10-28 17:15:54.880-0500

Change 4230 had a related patch set uploaded by Richard Mudgett:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4230|https://gerrit.asterisk.org/4230]

By: Richard Mudgett (rmudgett) 2016-10-28 17:22:30.435-0500

[~tuxian] I'm going to declare this issue resolved when the above reviews are merged because the original crashing issue and the cache crash are fixed with the patches.  Teluu has already merged those changes into their code.  The memory corruption and possible memory leak issue will be continued in the ASTERISK-26516 issue.

By: Friendly Automation (friendly-automation) 2016-10-31 11:38:21.138-0500

Change 4228 merged by Joshua Colp:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4228|https://gerrit.asterisk.org/4228]

By: Friendly Automation (friendly-automation) 2016-10-31 12:45:05.731-0500

Change 4229 merged by zuul:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4229|https://gerrit.asterisk.org/4229]

By: Friendly Automation (friendly-automation) 2016-10-31 12:45:08.793-0500

Change 4230 merged by zuul:
bundled pjproject: Crashes while resolving DNS names.

[https://gerrit.asterisk.org/4230|https://gerrit.asterisk.org/4230]