[Home]

Summary:ASTERISK-27859: Crash when freeing after bridge
Reporter:Olivier Krief (okrief)Labels:pjsip
Date Opened:2018-05-16 06:56:40Date Closed:2020-01-14 11:13:27.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/Bridging
Versions:13.20.0 Frequency of
Occurrence
Occasional
Related
Issues:
is duplicated byASTERISK-27898 Crash at the and of a call
Environment:Debian Stretch on VMWare. Asterisk compiled from sourceAttachments:( 0) attach.zip
( 1) attach2.zip
( 2) core-brief.txt
( 3) core-full.txt
( 4) core-locks.txt
( 5) core-thread1.txt
( 6) debug_extract
( 7) syslog_extract
Description:Asterisk stops with the following message (from /var/log/syslog):
May 16 12:11:07 localhost asterisk[3216]: *** Error in `/usr/sbin/asterisk': free(): invalid size: 0x00007f1a540db0a0 ***
May 16 12:11:07 localhost asterisk[3216]: ======= Backtrace: =========
May 16 12:11:07 localhost asterisk[3216]: /lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f1a6800dbfb]
May 16 12:11:07 localhost asterisk[3216]: /lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7f1a68013fc6]
May 16 12:11:07 localhost asterisk[3216]: /lib/x86_64-linux-gnu/libc.so.6(+0x7780e)[0x7f1a6801480e]
May 16 12:11:07 localhost asterisk[3216]: /usr/sbin/asterisk(ast_bridge_run_after_callback+0x11c)[0x56445a48b5cb]
May 16 12:11:07 localhost asterisk[3216]: /usr/sbin/asterisk(+0x80032)[0x56445a481032]
May 16 12:11:07 localhost asterisk[3216]: /usr/sbin/asterisk(+0x226d4e)[0x56445a627d4e]
May 16 12:11:07 localhost asterisk[3216]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7494)[0x7f1a69153494]
May 16 12:11:07 localhost asterisk[3216]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f1a68085acf]
Comments:By: Asterisk Team (asteriskteam) 2018-05-16 06:56:41.649-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Joshua C. Colp (jcolp) 2018-05-16 07:04:30.462-0500

Thank you for taking the time to report this bug and helping to make Asterisk better. Unfortunately, we cannot work on this bug because your description did not include enough information. Please read over the Asterisk Issue Guidelines [1] which discusses the information necessary for your issue to be resolved and the format that information needs to be in. We would be grateful if you would then provide a more complete description of the problem. At a minimum, we need:

1. The specific steps or actions you took that caused you to encounter the problem.
2. The behavior you expected and the location of documentation that led you to that expectation.
3. The behavior you actually encountered.

To demonstrate the issue in detail, please include Asterisk log files generated per the instructions on the wiki [2]. If applicable, please ensure that protocol-level trace debugging is enabled, e.g., 'sip set debug on' if the issue involves chan_sip, and configuration information such as dialplan and channel configuration.

Thanks!

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

[2] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information



By: Olivier Krief (okrief) 2018-05-16 07:19:27.931-0500

Hello,
Latest comment appeared while I was still gathering files needed to describe the issue.
Please, do not hesitate to ask any further information.
Cheers

By: Joshua C. Colp (jcolp) 2018-05-16 07:24:22.269-0500

We also need information about how Asterisk is specifically being used - what are you doing? any attended transfers or other things? Your debug log does not include verbose information showing this.

By: Olivier Krief (okrief) 2018-05-16 09:01:51.194-0500

Asterisk is used only for SIP trunking passing calls to/from carriers from/to call centers IPBX.
I can't exclude one such IPBX is sending un-supported signals (REFER, ...).
Can I privately send config files (sip.conf, extensions.conf) ?

By: Richard Mudgett (rmudgett) 2018-05-22 15:53:44.480-0500

Your backtrace appears to contain a memory corruption. We need one or both of the following items to continue investigation of the issue:
1. Valgrind output. See https://wiki.asterisk.org/wiki/display/AST/Valgrind for instructions on how to use Valgrind with Asterisk.
2. MALLOC_DEBUG output. See https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag for instructions on how to use the MALLOC_DEBUG option.

Note that MALLOC_DEBUG and Valgrind are mutually exclusive options. Valgrind output is preferable, but will be more system resource intensive and may be difficult to get on a production system. In such a case, you may have better luck getting the necessary output from MALLOC_DEBUG.



By: Olivier Krief (okrief) 2018-05-23 02:23:32.385-0500

Hello Richard,

1. Shall I understand a whole new set of captured files (debug, core-*, ...) is needed ? If positive, mean time between issue occurences is 4 to 8 weeks.
2. While setting MALLOC_DEBUG option, I can also upgrade from 13.20.0 to 13.21.0. Would you prefer me to stay with current 13.20.0 version or upgrade to 13.21.0 ?


By: Richard Mudgett (rmudgett) 2018-05-23 06:57:29.451-0500

Memory corruption crashes in malloc/free rarely provide clues about what is corrupting memory and thus the need for new information under MALLOC_DEBUG/valgrind.  You could use either version we just need to know which was used.

P.S. ASTERISK-27862 is another memory corruption issue.  I mention this only because they may be related.

By: Olivier Krief (okrief) 2018-05-23 07:24:17.778-0500

So I think we will then go with 13.21.0 with MALLOC_DEBUG set, as this box handles up to 150 simultaneous calls, and from previous message, I'm afraid Valgring may unfortunately create issues, consumming too much resources.

By: Olivier Krief (okrief) 2018-06-01 07:36:05.452-0500

A new crash occurred two days ago.
Unfortunately, our system was still configured with Asterisk 13.20.0 and without MALLOC_DEBUG.
Is worth or useful posting last crash data (core files, ...) ?

By: Richard Mudgett (rmudgett) 2018-06-01 07:56:36.575-0500

No.  When memory corruption is involved we need the MALLOC_DEBUG information or backtraces when it is enabled to advance this issue.

Awaiting MALLOC_DEBUG information.

By: Olivier Krief (okrief) 2018-06-01 09:18:17.753-0500

If you had to try to avoid the whole issue (not trying to collect data necessary to fin the root cause), would you try to :

0. jump from VMWare to bare metal platform ?
1. jump to another OS (from Debian to CentOS) ?
2. use Certified-Asterisk instead of standard Asterisk (ie picking a precise Asterisk LTS version, not the lastest one) ?
3. use binary Asterisk package instead of compiling from source ?
4. favor PJSIP over chansip (we tried both) ?
5. use old tricks such as nightly or weekly reboots

I'm not asking here any moral commitment, just some advice that could have for others, before.


By: Richard Mudgett (rmudgett) 2018-06-01 11:20:21.818-0500

{quote}
0. jump from VMWare to bare metal platform ?
{quote}
Asterisk is a more real time type application and thus requires more responsiveness from the machine.  It depends upon what is happening on other virtual machines on the physical machine.
{quote}
1. jump to another OS (from Debian to CentOS) ?
{quote}
Asterisk doesn't really care which supported OS you use.
{quote}
2. use Certified-Asterisk instead of standard Asterisk (ie picking a precise Asterisk LTS version, not the lastest one) ?
{quote}
It is up to you to pick the version you want to use.  The supplied link describes certified asterisk and has a FAQ about certified asterisk.  Basically you get *no* special benefit to using certified asterisk over other versions unless you have a service level agreement with Digium.  Without a service level agreement any bug fixes resulting from an issue you have will not be put into the certified version.
https://www.digium.com/products/asterisk/certified-asterisk
{quote}
3. use binary Asterisk package instead of compiling from source ?
{quote}
I think compiling Asterisk from source is always better.  Precompiled distributions are either way behind or you cannot provide backtrace debug information if you have an issue.
{quote}
4. favor PJSIP over chansip (we tried both) ?
{quote}
chan_sip has no maintainer and is entirely community supported.  chan_pjsip is actively maintained and gets new features.
{quote}
5. use old tricks such as nightly or weekly reboots
{quote}

Basically you cannot avoid something if you don't know what triggers the problem.  We currently don't have enough information to know what triggers the problem let alone the root cause.

By: Olivier Krief (okrief) 2018-06-05 09:45:40.594-0500

Strangely, just after upgrading to 13.21.0 and setting MALLOC_DEBUG, we had two crashes within a 7-hours production timeframe, while it used to occur once every 5 or 6 weeks.
As we can't afford such frequency, I think settings we be changed very soon.

As we used a new Asterisk version, I filed ASTERISK-27898 (forgive me if it was the right thing to do).


By: Olivier Krief (okrief) 2018-06-05 10:00:21.944-0500

I could not find a way te replace current attached files with latest ones attached to  ASTERISK-27898.
If I can help on this, please, let me know.

By: Richard Mudgett (rmudgett) 2018-06-05 10:08:42.562-0500

Just use new names that indicate which files are which.  You could even put them in a {{.zip}} file as JIRA knows how to deal with zip files.

By: Olivier Krief (okrief) 2018-06-05 11:04:02.553-0500

File attach.zip includes new log files, each starting with a new_ prefix.

By: Joshua C. Colp (jcolp) 2018-06-12 08:17:33.669-0500

I think this is the result of an attended transfer scenario, potentially with some off nominal code occurring or a race condition.

By: Joshua C. Colp (jcolp) 2018-06-12 08:18:35.285-0500

Looking through the debug log does show an attended transfer happening - trying to piece together the specific call flow and scenario is difficult, do you know what would have happened at that time?

By: Olivier Krief (okrief) 2018-06-12 10:07:22.660-0500

I can see a suspicious line bellow:
[2018-06-05 15:05:26] DEBUG[18079][C-0000440d] chan_sip.c: **** Received REFER (9) - Command in SIP REFER
[2018-06-05 15:05:26] DEBUG[18079][C-0000440d] chan_sip.c: Attended transfer: Will use Replace-Call-ID : 43ccdab68f6069ee10d5f9af6f80dc00 F-tag: fa73310e6256f809 T-tag: as0a89b783

Is the one you are suspecting ?
If positive, I'll try to grab some log from the IPBX logs (an Avaya IPO, I think) to get a clearer view of what was done at the moment.


By: Joshua C. Colp (jcolp) 2018-06-12 10:18:20.694-0500

Yes, following the channels involved and the channel that caused the crash is what lead me to that.

By: Olivier Krief (okrief) 2018-06-19 12:21:04.444-0500

Today, I add a new issue on the same box.
I've attached attach2.zip file with relevant logs.
Unfortunately, as reported earlier, we could not leave the box with MALLOC_DEBUG turned as the only time we left this option on, we had two issues in the same day as opposed once every 2 to 8 weeks.

Previously we had in /var/log/syslog, lines such as :
Jun  5 15:05:26 localhost kernel: [56636.313611] traps: asterisk[19465] general protection ip:563441700f41 sp:7f95b5acbc20 error:0

Today's case is:
Jun 19 16:12:26 localhost kernel: [1178276.251548] asterisk[27065]: segfault at 7f7484111e50 ip 00007f7484111e50 sp 00007f742141cbf8 error 15

Latest case has a very specific pattern : it is self-aimed (don't ask why such call occurs) as this call is from 0320290292 to 0320290292  (See channel SIP/Webdeal-IPO-000a697b in file syslog_extract inside attach2.zip file).

This type of call is supposed to follow the following path
Customer IPBX1  -----> Our asterisk -----> ProviderA ---- ? ---> ProviderA  ----->  Our asterisk  ----> Customer IPBX1

Is is very possible our provider (called ProviderA in the above drawaing) simply "forwards us back" our outbound calls aimed at one of our incoming numbers, without passing through other boxes.
This provider is using Asterisk 11.25 (I won't blame him for that, shall I ? ;-))))






By: Joshua C. Colp (jcolp) 2018-06-20 06:48:28.003-0500

The second zip file does not include the Asterisk log. Can you also provide that? If it confirms a transfer happened, then that narrows things down enough.

By: Olivier Krief (okrief) 2018-06-20 09:51:46.859-0500

Unfortunately, when removing MALLOC_DEBUG option (see previous comment), an unfortunate "make logrotate-init" spoiled our log settings so that Asterisk's debug file is not available for this latter issue.
Enclosed syslog file still have some detailed data.


By: Joshua C. Colp (jcolp) 2018-06-20 09:59:06.837-0500

Before accepting this I'd really like to see a log to ensure transfers is indeed where the problem is, so that it's not as much needle in a haystack.

By: Asterisk Team (asteriskteam) 2018-07-04 12:00:02.493-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines