[Home]

Summary:ASTERISK-25498: Asterisk crashes when negotiating g729 without that module installed
Reporter:Ben Langfeld (benlangfeld)Labels:
Date Opened:2015-10-27 09:44:34Date Closed:2015-12-14 17:03:00.000-0600
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Codecs/General
Versions:11.20.0 13.4.0 13.6.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 2 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Stepping: 2 CPU MHz: 2300.000 BogoMIPS: 4600.00 Hypervisor vendor: VMware Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 20480K NUMA node0 CPU(s): 0-7 $ free -m total used free shared buffers cached Mem: 12010 11670 339 0 174 8262 -/+ buffers/cache: 3233 8777 Swap: 4092 26 4066 $ uname -mrs Linux 3.13.0-62-generic x86_64 $ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04.3 LTS Release: 14.04 Codename: trustyAttachments:( 0) backtrace.txt
( 1) backtrace.txt
( 2) backtrace-asterisk-13.6.0.txt
( 3) better_translate_sans_codec.patch
( 4) minute.log
( 5) newcrash.log
( 6) sip.conf
( 7) translation_sans_trans_pvt.patch
Description:Regularly the following log lines are followed by Asterisk crashing (backtrace attached):

```
[Oct 19 11:17:43] WARNING[15022][C-0001f68d] translate.c: No translator path: (starting codec is not valid)
[Oct 19 11:17:43] WARNING[15051][C-0001f68d] channel.c: Unable to find a codec translation path: (g729) -> (ulaw)
```

I do not see any relevant changes between 13.4.0 and 13.6.0, but I am in the process of upgrading anyway to confirm this issue is still present.
Comments:By: Asterisk Team (asteriskteam) 2015-10-27 09:44:36.086-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2015-10-27 19:10:43.289-0500

Let us know how the test in the latest 13 goes! We definitely shouldn't be crashing here.

By: Ben Klang (bklang) 2015-11-02 10:06:41.466-0600

Unfortunately, it looks like the issue persists with 13.6.0.  

Final few lines in the Asterisk log:

{code}
[Nov  2 10:56:47] VERBOSE[14913][C-0000dc1e] app_amd.c: AMD: SIP/lsi-00012dd7 8329407074 (N/A) (Fmt: ulaw)
[Nov  2 10:56:47] VERBOSE[14913][C-0000dc1e] app_amd.c: AMD: initialSilence [2500] greeting [1500] afterGreetingSilence [600] totalAnalysisTime [3500] minimumWordLength [100] betweenWordsSilence [50] maximumNumberOfWords [3] silenceThreshold [256] maximumWordLength [5000]
[Nov  2 10:56:47] VERBOSE[14913][C-0000dc1e] app_amd.c: AMD: Channel [SIP/lsi-00012dd7]. Changed state to STATE_IN_SILENCE
[Nov  2 10:56:47] WARNING[14545][C-0000dbe6] translate.c: No translator path: (starting codec is not valid)
[Nov  2 10:56:47] WARNING[14785][C-0000dbe6] channel.c: Unable to find a codec translation path: (g729) -> (ulaw)
[Nov  2 10:56:47] VERBOSE[14785][C-0000dbe6] bridge_channel.c: Channel SIP/331-00012d86 left 'simple_bridge' basic-bridge <f138706b-58df-4c1d-91db-28c070e9e352>
[Nov  2 10:56:47] VERBOSE[14545][C-0000dbe6] bridge_channel.c: Channel SIP/lsi-00012d92 left 'simple_bridge' basic-bridge <f138706b-58df-4c1d-91db-28c070e9e352>
{code}

{code}
(gdb) bt
#0  0x00007fca1601f181 in ast_translate (path=0x0, f=f@entry=0x7fc8e80a60a0, consume=consume@entry=0) at translate.c:582
#1  0x00007fca15ee3430 in audiohook_list_translate_to_slin (frame=0x7fc8e80a60a0, direction=AST_AUDIOHOOK_DIRECTION_READ, audiohook_list=0x7fc8e83e4db0) at audiohook.c:837
#2  audio_audiohook_write_list (frame=0x7fc8e80a60a0, direction=AST_AUDIOHOOK_DIRECTION_READ, audiohook_list=0x7fc8e83e4db0, chan=0x7fc8e82f8448) at audiohook.c:931
#3  ast_audiohook_write_list (chan=chan@entry=0x7fc8e82f8448, audiohook_list=0x7fc8e83e4db0, direction=direction@entry=AST_AUDIOHOOK_DIRECTION_READ,
   frame=frame@entry=0x7fc8e80a60a0) at audiohook.c:1086
#4  0x00007fca15f2bfe1 in __ast_read (chan=chan@entry=0x7fc8e82f8448, dropaudio=dropaudio@entry=0) at channel.c:4170
#5  0x00007fca15f2d507 in ast_read (chan=chan@entry=0x7fc8e82f8448) at channel.c:4277
#6  0x00007fc95b9015e8 in async_agi_read_frame (chan=0x7fc8e82f8448) at res_agi.c:1379
#7  launch_asyncagi (efd=0x0, argv=<optimized out>, argc=<optimized out>, chan=0x7fc8e82f8448) at res_agi.c:1576
#8  launch_script (opid=<synthetic pointer>, efd=0x0, fds=0x7fc8da3e3570, argv=<optimized out>, argc=<optimized out>, script=<optimized out>, chan=0x7fc8e82f8448)
   at res_agi.c:1841
#9  agi_exec_full (chan=0x7fc8e82f8448, data=<optimized out>, enhanced=<optimized out>, dead=0) at res_agi.c:4145
#10 0x00007fca15fbcf45 in pbx_exec (c=c@entry=0x7fc8e82f8448, app=app@entry=0x7fca17271960, data=data@entry=0x7fc8da3e4bd0 "agi:async") at pbx.c:1720
#11 0x00007fca15fc9de8 in pbx_extension_helper (c=c@entry=0x7fc8e82f8448, con=con@entry=0x0, context=0x7fc8e82f8e00 "adhearsion-redirect", exten=exten@entry=0x7fc8e82f8e50 "1",
   priority=priority@entry=1, label=label@entry=0x0, callerid=callerid@entry=0x7fc8e853f990 "8479165363", action=action@entry=E_SPAWN, found=found@entry=0x7fc8da3e6c7c,
   combined_find_spawn=combined_find_spawn@entry=1) at pbx.c:4992
#12 0x00007fca15fcbe8f in ast_spawn_extension (combined_find_spawn=1, found=0x7fc8da3e6c7c, callerid=0x7fc8e853f990 "8479165363", priority=1, exten=0x7fc8e82f8e50 "1",
   context=<optimized out>, c=0x7fc8e82f8448) at pbx.c:6162
#13 __ast_pbx_run (c=c@entry=0x7fc8e82f8448, args=args@entry=0x0) at pbx.c:6579
#14 0x00007fca15fcd3d3 in ast_pbx_run_args (c=c@entry=0x7fc8e82f8448, args=args@entry=0x0) at pbx.c:6947
#15 0x00007fca15fcd89d in ast_pbx_run (c=0x7fc8e82f8448) at pbx.c:6956
#16 pbx_outgoing_exec (data=data@entry=0x7fc8e8020618) at pbx.c:10491
#17 0x00007fca16023bfe in dummy_start (data=<optimized out>) at utils.c:1237
#18 0x00007fca13f0f182 in start_thread (arg=0x7fc8da3e7700) at pthread_create.c:312
#19 0x00007fca134eb47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{code}

By: Ben Langfeld (benlangfeld) 2015-11-02 13:21:50.610-0600

Full backtrace from Asterisk 13.6.0 reproduction.

By: Ben Langfeld (benlangfeld) 2015-11-04 10:59:44.645-0600

Note that in packet captures I see no evidence of G.729 being offered by any calls.

By: Rusty Newton (rnewton) 2015-11-05 15:25:42.780-0600

Can you post additional traces without optimizations? i.e. recompile with the DONT_OPTIMIZE and BETTER_BACKTRACES flags.

As well as a debug log captured leading up to the crash?

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

By: Ben Klang (bklang) 2015-11-05 15:29:51.949-0600

Thanks for the feedback. Yes, we will work on collecting that information. It may be a couple of days before we can get it because we need to rebuild the package and schedule a deploy, but will do so as quickly as possible.

By: Rusty Newton (rnewton) 2015-11-05 15:37:21.358-0600

Actually turns out [~jcolp] was able to take a quick look and it appears that what we have here is good enough to move forward with. I'll quote him with "audiohooks doesn't have a check and assumes that a translation path will exist" as the cause of the issue.

So, if someone gets free to work on this issue, then there looks to be a path forward.

By: Jonathan Rose (jrose) 2015-11-23 17:45:11.565-0600

I've got a patch for this issue that I want to try out, but I've been having a lot of trouble reproducing the issue. I usually end up in a situation where an audiohook will simply not execute on frames when the codec has no translation path to ulaw, but here the audiohook attempts to translate even after this check is (or would normally be) made.

I'd like to get some more information if you could... the following to be specific:

1. all log messages from the start of the call to the end of the call that triggers this crash

2. All extensions involved with the call. If the call is to stasis, all Stasis commands ran in order.

3. Configuration for the endpoints involved with the call. I think you are using chan_sip, so the endpoints used in sip.conf... not including sensitive information like passwords and IP addresses. Codec order would be a good one to have in particular. If the endpoints are defined via realtime, the CLI output for 'sip show peer <peername> load' works fine too.

By: Ben Langfeld (benlangfeld) 2015-11-23 17:49:38.701-0600

Thank you Jonathan, your effort on this is very much appreciated. I will collect the requested info tomorrow and drop it here. I will note that it is difficult for us to reproduce also; we can only do so in our heavily loaded production environment so far, so that's where I'll be extracting logs from. It will also make testing a patch...exciting, but that's why we do this, right? ;)

By: Ben Langfeld (benlangfeld) 2015-11-24 09:08:47.897-0600

SIP endpoint configuration attached. Logs to come after some reproduction and filtering.

By: Ben Langfeld (benlangfeld) 2015-11-24 10:53:17.634-0600

In these logs, the crash occurs at 09:59:24:

{noformat}
[Nov  4 09:59:24] VERBOSE[37797][C-0000ac72] res_agi.c: AGI Script Executing Application: (MixMonitor) Options: (/var/spool/asterisk/monitor/69266520_813131.wav)
[Nov  4 09:59:24] VERBOSE[37871][C-0000ac72] app_mixmonitor.c: Begin MixMonitor Recording SIP/lsi-0000da92
[Nov  4 09:59:24] VERBOSE[37797][C-0000ac72] res_agi.c: AGI Script Executing Application: (AMD) Options: ()
[Nov  4 09:59:24] VERBOSE[37797][C-0000ac72] app_amd.c: AMD: SIP/lsi-0000da92 7702129411 (N/A) (Fmt: ulaw)
[Nov  4 09:59:24] VERBOSE[37797][C-0000ac72] app_amd.c: AMD: initialSilence [2500] greeting [1500] afterGreetingSilence [600] totalAnalysisTime [3500] minimumWordLength [100] betweenWordsSilence [50] maximumNumberOfWords [3] silenceThreshold [256] maximumWordLength [5000]
[Nov  4 09:59:24] VERBOSE[37765][C-0000ac6c] app_amd.c: AMD: Channel [SIP/lsi-0000da86]. HUMAN: silenceDuration:600 afterGreetingSilence:600
[Nov  4 09:59:24] VERBOSE[37760][C-0000ac71] app_amd.c: AMD: Channel [SIP/lsi-0000da84]. Changed state to STATE_IN_SILENCE
[Nov  4 09:59:24] VERBOSE[37797][C-0000ac72] app_amd.c: AMD: Channel [SIP/lsi-0000da92]. Changed state to STATE_IN_SILENCE
[Nov  4 09:59:24] WARNING[37704][C-0000ac66] translate.c: No translator path: (starting codec is not valid)
[Nov  4 09:59:24] WARNING[37855][C-0000ac66] channel.c: Unable to find a codec translation path: (g729) -> (ulaw)
[Nov  4 09:59:24] VERBOSE[37855][C-0000ac66] bridge_channel.c: Channel SIP/381-0000c16b left 'simple_bridge' basic-bridge <3448de28-864d-48b5-a9d8-eb4c250e4413>
[Nov  4 09:59:24] VERBOSE[37704][C-0000ac66] bridge_channel.c: Channel SIP/lsi-0000da6c left 'simple_bridge' basic-bridge <3448de28-864d-48b5-a9d8-eb4c250e4413>
[Nov  4 09:59:30] Asterisk 13.6.0~dfsg-0~ppa1 built by buildd @ lgw01-18 on a x86_64 running Linux on 2015-10-10 16:07:38 UTC
{noformat}

The channel triggering this is thus `SIP/381-0000c16b` or `SIP/lsi-0000da6c` joining bridge `3448de28-864d-48b5-a9d8-eb4c250e4413`. The full Asterisk log of the 90s leading up to the crash is attached as `minute.log`.

These calls are driven by AsyncAGI and AMI (Adhearsion). I believe the executed actions logged by Asterisk should be sufficient to meet your request, but if they're not, let me know.

I'm going to continue looking through the log for any further clues I can highlight, but I wanted to get it to you first incase what you need is obvious.

By: Ben Langfeld (benlangfeld) 2015-11-24 16:01:57.257-0600

The SDP for the call with channel `SIP/lsi-0000da6c` was:

{noformat}
v=0
o=- 44909901933898 44909901933899 IN IP4 10.0.6.10
s=-
c=IN IP4 10.0.6.10
t=0 0
m=audio 63632 RTP/AVP 18 0 101
a=rtpmap:101 telephone-event/8000
a=ptime:20
{noformat}

This indicates G.729 and G.711 PCMU. The AMD log line states `(Fmt: ulaw)`. I'm not sure why something attempts to use G.729 here.

By: Ben Langfeld (benlangfeld) 2015-11-25 09:56:38.535-0600

We have disabled G.729 in all incoming SDPs and will be monitoring for further crashes. I'll report back once we're confident either way that this helped or not.

By: Ben Langfeld (benlangfeld) 2015-11-25 14:44:20.010-0600

Since removing G.729 from SDPs, we instead crash with:

[Nov 25 15:20:02] WARNING[23414][C-000138ba] translate.c: No translator path: (starting codec is not valid)

I have attached a new backtrace.

By: Jonathan Rose (jrose) 2015-11-30 11:15:00.101-0600

Do you have logs for that backtrace as well? I'm resuming on this today (just got back from Thanksgiving).

By: Ben Langfeld (benlangfeld) 2015-11-30 12:46:12.712-0600

The logs came out the same for that case, Jonathan Rose, but I'll see if I can dig them up anyway.

By: Jonathan Rose (jrose) 2015-11-30 15:16:49.540-0600

I have a patch for you to try (translation_sans_trans_pvt.patch). What I think happened here is that an assumption was made that the format would either be equal to signed linear or not equal to signed linear, but the possibility that signed linear would be included as a subset of the format used was discounted... and this is why we were able to get to ast_translate without having a translation path set. It wasn't being managed by the part of the code intended to handle when the formats were equal or when they were unequal, so it attempts to translate with nothing which causes problems.

Let me know if the patch doesn't work and I'll dig a little deeper. Also let me know if the backtrace changes as a result of the patch.

By: Jonathan Rose (jrose) 2015-11-30 17:19:30.316-0600

better_translate_sans_codec.patch

Slightly better version of the patch... leaves less room for ambiguity.

By: Krandon Bruse (krandonbruse) 2015-12-01 04:17:18.111-0600

Experiencing this in Asterisk 13.5.0 now, just getting the message

translate.c: No translator path: (starting codec is not valid) followed by a crash

We force ulaw on the sip peer, but it still seems to happen. Will try to get a BT, but lurking on the side and will try any patches. It only seems to happen under load.

By: Ben Langfeld (benlangfeld) 2015-12-01 04:19:54.814-0600

We will be trying out this patch as soon as we can (though it may take us a few days to get it through our ops people). If you get to it any faster than us, Krandon, I'd appreciate any notes you have :)

By: Krandon Bruse (krandonbruse) 2015-12-01 04:30:33.190-0600

@Ben Langfeld - we will put this into production today, on a few boxes, as it's currently causing us a lot of issues. We have about 100 boxes in the wild and on any given day, about 10 of them will crash with this issue. I have a feeling our issue is identical, in that our provider is sometimes responding with 729 as an option. All of our playback files are in ulaw only. I'll be monitoring for this error as well from all the boxes and will report back asap. Thanks!

By: Krandon Bruse (krandonbruse) 2015-12-01 08:16:23.025-0600

Just pushed to production, we have multiple crashes per day, so I'll report back tonight. Thanks!

By: Krandon Bruse (krandonbruse) 2015-12-01 12:45:58.684-0600

100,000 calls in, still looking good, no crashes. Would have fully expected one by now.

By: Jonathan Rose (jrose) 2015-12-01 17:57:29.958-0600

Any substantial log spam? I would have anticipated that where it was previously crashing, you might get some heavy log messages.

Not that this is actionable mind... it's just something I would have anticipated.

By: Krandon Bruse (krandonbruse) 2015-12-02 00:48:02.131-0600

One of the boxes died. Getting log info now.

The logging on the other boxes is much less. I think that much logging could have been killing the system too.

By: Krandon Bruse (krandonbruse) 2015-12-02 00:53:07.800-0600

Log file for patch of better_translate_sans_codec.patch

It at least wasn't the same error (uploaded newcrash.log - manually changed numbers for anonymity)

This is still much better. This crash may be totally unrelated.

By: Jonathan Rose (jrose) 2015-12-02 10:10:37.817-0600

The log isn't super-telling unfortunately. Is there any chance you have a way to get the backtrace?

By: Krandon Bruse (krandonbruse) 2015-12-02 12:52:56.229-0600

A back trace with DONT_OPTIMIZE?

May take a few days as we will put some asterisk servers into recompiled DONT_OPTIMIZE and wait for the specific boxes to crash

By: Jonathan Rose (jrose) 2015-12-02 14:04:15.775-0600

Yes please. I'm going to go ahead and put this patch up for review since I don't *think* the crash is related... but if it crops back up again, we may need to revisit this or possibly file a new issue.

By: Krandon Bruse (krandonbruse) 2015-12-03 12:49:55.432-0600

Jonathan - no crashes since. I have enabled DONT_OPTIMIZE on 10/100 servers. We'll wait and see.

Otherwise, I think you can mark this as fixed. Nice work!

By: Jonathan Rose (jrose) 2015-12-03 15:37:32.923-0600

Well, there actually is a fairly significant problem with the patch I sent you which was uncovered in review which we speculate will result in a large CPU burden under some cases.

The new patch on gerrit should fix the behavior without doing that though.

https://gerrit.asterisk.org/#/c/1754/

https://gerrit.asterisk.org/changes/1754/revisions/69457b8d61ddc62e018dd4037fb42a231f73c3b5/patch?zip

If you can test the reviewed version of the patch, that would be helpful.

EDIT:

On second thought, maybe better to use the 13 version of the patch.
https://gerrit.asterisk.org/#/c/1759/1
https://gerrit.asterisk.org/changes/1759/revisions/072d94183ce2b6d2272543732dd5d47390099bb3/patch?zip

By: Krandon Bruse (krandonbruse) 2015-12-03 17:29:01.715-0600

Jonathan - just to clarify, this patch?

https://gerrit.asterisk.org/#/c/1759/1/main/audiohook.c

Because https://gerrit.asterisk.org/#/c/1759/1/main/audiohook.c and https://gerrit.asterisk.org/#/c/1754/3/main/audiohook.c look identical to me

By: Jonathan Rose (jrose) 2015-12-04 09:31:47.354-0600

The patches are in fact identical, the 13 version is just a cherry pick of the master... and *may* apply slightly more evenly to 13 than master. I haven't checked to see if there are any differences between existing 13 branches and master for audiohook.c just yet.

By: Jonathan Rose (jrose) 2015-12-10 13:55:35.233-0600

Note: While this issue is marked as closed since the patch has already went in, I'm still waiting on results from Ben and/or Krandon before I take the issue off of my sprint chart.

By: Ben Langfeld (benlangfeld) 2015-12-10 13:59:17.395-0600

We've had this in production for around 24 hours and have not seen a crash since we deployed it. I'll give it another 48 hours before considering it definitely resolved, but it looks great! Thank you for your help Johnathan.

By: Asterisk Team (asteriskteam) 2015-12-10 13:59:18.170-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Jonathan Rose (jrose) 2015-12-14 10:32:12.138-0600

Last call for mentioning problems related to this issue. I'll be closing it at ~5:00PM U.S. central time if I don't hear about any show-stoppers today.

By: Ben Langfeld (benlangfeld) 2015-12-14 10:50:01.617-0600

We havn't seen an Asterisk crash since we deployed this. My vote is to close this and ship it :)

By: Jonathan Rose (jrose) 2015-12-14 17:03:01.049-0600

Thanks for all the feedback everyone.