[Home]

Summary:ASTERISK-28140: repeated segmentation faults
Reporter:Eyal Hasson (eyal@kolhl.com)Labels:pjsip
Date Opened:2018-10-30 09:52:30Date Closed:2018-11-19 08:01:32.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/PBX
Versions:13.22.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-28486 Randomly generated segfault in asterisk process
Environment:CentOS 7, FreeBPX 14.0.4.1, Dell PowerEdge R430Attachments:( 0) ast_coredumper
( 1) core.phonelinuxsrv.kolhl.com-2018-10-29T17-01-31+0200-thread1.txt
( 2) core.phonelinuxsrv.kolhl.com-2018-10-30T03-44-03+0200-thread1.txt
( 3) core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200-brief.txt
( 4) core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200-full.txt
( 5) core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200-locks.txt
( 6) core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200-thread1.txt
Description:System is running a large ARI application with 1000 lines. It ran smoothly for 4 months. Suddenly yesterday night it started getting segmentation faults, and crashes often - sometimes after 10 minutes, and sometimes after a 3 hours.
Comments:By: Asterisk Team (asteriskteam) 2018-10-30 09:52:31.748-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Eyal Hasson (eyal@kolhl.com) 2018-10-30 09:54:25.347-0500

These are the backtrace of the last crash

By: George Joseph (gjoseph) 2018-10-30 15:15:56.266-0500

The crashes are a result of an object being reference counted over 100000 times.  When that happens AND you specified --enable-dev-mode when you ./configured asterisk, asterisk will abort.

We've fixed the refcount issues in the latest versions of Asterisk but you should never run Asterisk in production with the --enable-dev-mode flag turned on.  If you recompile with that flag turned off, you'll see the messages but Asterisk will continue to run.



By: Eyal Hasson (eyal@kolhl.com) 2018-10-30 15:46:05.275-0500

I am running Asterisk with FreePBX, so I have no control on the build configuration. Any invocation option that will turn this off?
Also, is there anyway to circumvent the problem meanwhile? Will changing to PJSIP instead of SIP will help?

By: Corey Farrell (coreyfarrell) 2018-10-30 16:01:10.236-0500

An interesting thing is that the actual segmentation fault is happening in libbfd which is part of binutils.  This really shouldn't happen.  Have you installed system updates recently without rebooting?  Any possibility you have a faulty RAM chip?

If chan_sip is leaking a references then switching to PJSIP might help, but it's still concerning that you are having a crash inside binutils.

[~gjoseph]: ao2_ref produces an assertion failure message and backtrace when a reference count exceeds 100k (with or without dev-mode).  Even with dev-mode enabled this wouldn't cause an abort unless DO_CRASH is enabled (I assume FreePBX would not enable that).  In this case asterisk is definitely not aborting.

By: George Joseph (gjoseph) 2018-10-30 16:07:07.578-0500

Yeah, I saw the "excessive refcount" thing and stopped looking at the backtrace.  My bad.  I thought this was the stasis message type issue we just fixed.
I'm going to stop shooting my mouth off and look at this fresh in the morning.




By: Eyal Hasson (eyal@kolhl.com) 2018-10-30 16:18:11.566-0500

This system has 96GB of RAM (because of a memory leak problem we had: https://community.asterisk.org/t/memory-leak-consumes-16gb-in-less-then-24-hours/71916/18), so it usually only around 6% of usage.
Can you give more details where in ARI the problem is? Maybe some change in my application triggered it.

By: George Joseph (gjoseph) 2018-10-31 08:38:00.617-0500

[~eyal@kolhl.com] I messed up.  Although this issue was triggered by ARI, it's not the issue I was thinking of.  The one I was thinking of involved a stasis message type whose reference count increased over time until it started spitting out the refcount exceeded messages.  That was the one that was just fixed.  Your's actually happened during the process of an audio frame.  Here was the sequence...

* Something happened during the processing of the frame to cause the refcount to exceed 100000.  That's a problem in itself.
* Since that should never happen, asterisk attempted to print a message and call assert().
* The assert() call then attempted to print a backtrace showing the call stack.
* During the generation of the backtrace (which occurs in the libbfd system library) something got corrupted and generated a segfault which crashed asterisk.

If the segfault hadn't happened, asterisk probably would have simply printed the message and backtrace and continued.

So now we have 2 issues...  The refcount > 100000 issue and the segfault during the backtrace generation.

Let me dig into this a bit.

Can you post a few more backtraces though?  I want to see if they're all in the same place.  Just the *thread1.txt file is enough.


By: Eyal Hasson (eyal@kolhl.com) 2018-10-31 11:13:36.936-0500

Two more backtraces

By: Eyal Hasson (eyal@kolhl.com) 2018-10-31 11:18:22.996-0500

Also, looking at the 'full' log file, I see that it had refcount error 1996 in one day. This must mean that it occurs many times, but only a few of them actually crashes the system. Here are some examples, with one that cause a seg fault:

[2018-10-30 06:33:49] ERROR[20378][C-000003e8] astobj2.c: Excessive refcount 100000 reached on ao2 object 0x23130b8
[2018-10-30 06:33:49] ERROR[20378][C-000003e8] astobj2.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x23130b8 (0)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: Got 28 backtrace records
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #0: [0x5fcc55] asterisk __ast_assert_failed() (0x5fcbd1+84)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #1: [0x45d4bd] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #2: [0x45d816] asterisk __ao2_ref() (0x45d7e8+2E)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #3: [0x52169c] asterisk ast_frdup() (0x521430+26C)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #4: [0x521093] asterisk ast_frisolate() (0x521070+23)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #5: [0x5ef7bd] asterisk ast_trans_frameout() (0x5ef6da+E3)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #6: [0x5ef7e1] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #7: [0x5efeb5] asterisk ast_translate() (0x5efbdd+2D8)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #8: [0x4b99f9] asterisk ast_write() (0x4b89db+101E)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #9: [0x485df0] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #10: [0x4863ef] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #11: [0x486ab7] asterisk bridge_channel_internal_join() (0x48656f+548)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #12: [0x46d1c1] asterisk ast_bridge_join() (0x46cf59+268)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #13: [0x50b07d] asterisk ast_bridge_call_with_flags() (0x50aed6+1A7)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #14: [0x50b14f] asterisk ast_bridge_call() (0x50b122+2D)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #15: [0x7f4f7a3a1667] app_dial.so <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #16: [0x7f4f7a3a19fe] app_dial.so <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #17: [0x582ef6] asterisk pbx_exec() (0x582dda+11C)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #18: [0x570787] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #19: [0x573c6a] asterisk ast_spawn_extension() (0x573c06+64)
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #20: [0x574945] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #21: [0x576115] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[21848][C-00000486] logger.c: #22: [0x5f9c08] asterisk <unknown>()
[2018-10-30 06:33:49] ERROR[17367][C-000002c4] astobj2.c: Excessive refcount 100000 reached on ao2 object 0x23130b8
[2018-10-30 06:33:49] ERROR[17367][C-000002c4] astobj2.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x23130b8 (0)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: Got 28 backtrace records
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #0: [0x5fcc55] asterisk __ast_assert_failed() (0x5fcbd1+84)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #1: [0x45d4bd] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #2: [0x45d816] asterisk __ao2_ref() (0x45d7e8+2E)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #3: [0x52169c] asterisk ast_frdup() (0x521430+26C)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #4: [0x521093] asterisk ast_frisolate() (0x521070+23)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #5: [0x5ef7bd] asterisk ast_trans_frameout() (0x5ef6da+E3)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #6: [0x5ef7e1] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #7: [0x5efeb5] asterisk ast_translate() (0x5efbdd+2D8)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #8: [0x4b99f9] asterisk ast_write() (0x4b89db+101E)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #9: [0x485df0] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #10: [0x4863ef] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #11: [0x486ab7] asterisk bridge_channel_internal_join() (0x48656f+548)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #12: [0x46d1c1] asterisk ast_bridge_join() (0x46cf59+268)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #13: [0x50b07d] asterisk ast_bridge_call_with_flags() (0x50aed6+1A7)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #14: [0x50b14f] asterisk ast_bridge_call() (0x50b122+2D)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #15: [0x7f4f7a3a1667] app_dial.so <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #16: [0x7f4f7a3a19fe] app_dial.so <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #17: [0x582ef6] asterisk pbx_exec() (0x582dda+11C)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #18: [0x570787] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #19: [0x573c6a] asterisk ast_spawn_extension() (0x573c06+64)
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #20: [0x574945] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #21: [0x576115] asterisk <unknown>()
[2018-10-30 06:33:49] VERBOSE[20378][C-000003e8] logger.c: #22: [0x5f9c08] asterisk <unknown>()
[2018-10-30 06:33:49] ERROR[17701][C-000002e1] astobj2.c: Excessive refcount 100000 reached on ao2 object 0x23130b8
[2018-10-30 06:33:49] ERROR[17701][C-000002e1] astobj2.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x23130b8 (0)
[2018-10-30 06:33:54] Asterisk 13.22.0 built by mockbuild @ jenkins7 on a x86_64 running Linux on 2018-07-25 22:30:39 UTC



By: Corey Farrell (coreyfarrell) 2018-10-31 11:38:57.193-0500

From https://issues.asterisk.org/jira/secure/attachment/57970/core.phonelinuxsrv.kolhl.com-2018-10-30T03-44-03%2B0200-thread1.txt:
{noformat}
Thread 1 (Thread 0x7f16a7253700 (LWP 189313)):
#0  0x00007f19ad4f7277 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f19ad4f8968 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f19ad539d37 in __libc_message () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007f19ad543c86 in _int_malloc () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007f19ad54684c in malloc () from /lib64/libc.so.6
{noformat}

IMO this indicates memory corruption.

By: George Joseph (gjoseph) 2018-11-01 08:26:44.966-0500

Yeah I agree with [~coreyfarrell].  If __libc_message() is explicitly calling abort() then there's something deeply wrong on that system.  Whether it's hardware or software is impossible to tell.  Any further troubleshooting steps would require you to configure and build asterisk from source.  

By: Eyal Hasson (eyal@kolhl.com) 2018-11-01 08:29:34.525-0500

So, what do you suggest? Do you thing reinstalling the system might help?

By: George Joseph (gjoseph) 2018-11-01 11:01:09.607-0500

It's really hard to say.   Does {{dmesg}} give you any hint on failing hardware?

Reinstalling would be the first thing to try.  Next would be running a memory test or simply moving memory sticks around.

In the long run, you should change to chan_pjsip since chan_sip is only community supported now.  I hesitate to recommend changing over now though because the change might make troubleshooting a little more complicated BUT the change will certainly alter Asterisk's memory utilization profile.  That may resolve the crashes but if it works it may only be masking an underlying issue.  



By: Eyal Hasson (eyal@kolhl.com) 2018-11-01 11:44:04.999-0500

dmesg doesn't show anything except Asterisk's seg faults. Also the Dell server manager shows all memory (and all other hardware components) as good.

Before I do anything drastic, any idea what specific operation in Asterisk triggers the problem? It might be easier to circumvent it.

By: George Joseph (gjoseph) 2018-11-05 07:08:32.542-0600

We've just had another report of this exact same behavior so I wouldn't do anything at this point.  Based on what we're seeing, its the signed linear codec (slin) who's refcount is running away.  Do you do a lot of transcoding?   Do you have "transcode_via_slin" turned on in asterisk.conf?  Could you turn "transcode_via_slin" off and see if that helps stabilize things?


By: Eyal Hasson (eyal@kolhl.com) 2018-11-05 07:23:58.764-0600

transcode_via_slin is NOT turned on. How do I know if I make many transcodings? We playback many vox files, while the calls are usually in alaw. Does this implies transcoding?
I was about to convert to PJSIP today. Should I go on, or stay with SIP?

By: George Joseph (gjoseph) 2018-11-05 07:47:49.060-0600

Simple playback does not invoke the transcoding path so that's not the issue.
Changing over to chan_pjsip will probably not help in this case so it's up to you as to whether you want to move over now.  I'd suggest that you make ready to do it but don't feel you have to rush because of this particular issue.

To help us debug, could you run
{noformat}
$ /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps --no-default-search "/tmp/core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200"
{noformat}

Replace the coredump path with whatever the real path is.
This will produce a tarball that captures everything we need to dig into the this further.
Upload the file to the file hosting service of your choice (it'll be big) and email the link to asteriskteam@digium.com.
Mention ASTERISK-28140 in the subject line.
Don't try to attach the file here or attach it to the email. :)



By: Eyal Hasson (eyal@kolhl.com) 2018-11-05 07:59:16.823-0600

Trying to create the tarball file, I get: "tar: ./tmp/core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03+0200: File removed before we read it", and the tar.gz file is only 7.7M. Is this what it suppose to be?

By: Eyal Hasson (eyal@kolhl.com) 2018-11-05 10:12:08.137-0600

Is it safe for me to upgrade my Asterisk to 13.23. Will it help?

By: George Joseph (gjoseph) 2018-11-05 14:12:20.212-0600

Sorry, I'm not getting notifications for some reason...
Try using the attached ast_coredumper script.

Yes it's safe to upgrade to 13.23.  I was going to suggest that.


By: George Joseph (gjoseph) 2018-11-05 14:23:16.843-0600

Also if you still have the coredump for
core.phonelinuxsrv.kolhl.com-2018-10-30T03-44-03+0200
that one would be interesting as well.  it's slightly different than the others.


By: Eyal Hasson (eyal@kolhl.com) 2018-11-05 15:04:06.770-0600

Your script worked. I created the tarball file from the dump you asked and uploaded it to the jumbomail web site (you should receive a mail with the link). it's a 49MB file.

By: Eyal Hasson (eyal@kolhl.com) 2018-11-07 00:48:17.443-0600

Did you get the tarball file?

By: Eyal Hasson (eyal@kolhl.com) 2018-11-10 15:41:29.929-0600

I tried now to move to PJSIP. It made the system completely unresponsive. Asterisk would not crash but after receiving a few calls simply stop responding. I reverted to SIP. Is this related?

By: Friendly Automation (friendly-automation) 2018-11-19 08:01:33.193-0600

Change 10602 merged by George Joseph:
backtrace:  Refactor ast_bt_get_symbols so it doesn't crash

[https://gerrit.asterisk.org/10602|https://gerrit.asterisk.org/10602]

By: Friendly Automation (friendly-automation) 2018-11-19 08:02:03.381-0600

Change 10606 merged by George Joseph:
backtrace:  Refactor ast_bt_get_symbols so it doesn't crash

[https://gerrit.asterisk.org/10606|https://gerrit.asterisk.org/10606]

By: Friendly Automation (friendly-automation) 2018-11-19 08:02:28.724-0600

Change 10607 merged by George Joseph:
backtrace:  Refactor ast_bt_get_symbols so it doesn't crash

[https://gerrit.asterisk.org/10607|https://gerrit.asterisk.org/10607]

By: Friendly Automation (friendly-automation) 2018-11-26 13:55:26.219-0600

Change 10702 merged by Jenkins2:
backtrace:  Refactor ast_bt_get_symbols so it doesn't crash

[https://gerrit.asterisk.org/10702|https://gerrit.asterisk.org/10702]