ASTERISK-26831: res_rtp_asterisk: Race condition when RTCP and WebRTC is used

[Home]

Summary: ASTERISK-26831: res_rtp_asterisk: Race condition when RTCP and WebRTC is used

Reporter: Badalian Vyacheslav (slavon) Labels:

Date Opened: 2017-03-01 17:17:23.000-0600 Date Closed: 2020-01-14 11:13:49.000-0600

Priority: Minor Regression?

Status: Closed/Complete Components: Resources/res_rtp_asterisk

Versions: 13.14.0 13.15.0 Frequency of
Occurrence

Related
Issues:
is related to ASTERISK-26692 res_rtp_asterisk: Crash in dtls_srtp_handle_timeout at res_rtp_asterisk (using chan_sip)

Environment: Attachments: ( 0) backtrace.txt
( 1) locks_x.txt
( 2) locks_x2.txt
( 3) locks_x3.txt
( 4) locks_x4.txt
( 5) locks1.txt
( 6) locks2.txt

Description: Race condition. With this story, I am fighting for more than a month. I tried different options but the idea of an end. There is only a heavy load (more than 200 simultaneous calls WebRTC) after 50 000-100 000 calls.

Please help. It is difficult to disentangle this tangle. The error for a long time (many versions).

It may have an idea why this might happen? This is because rtp_stop or because rtp_destroy?

{code}
=ERROR: AddressSanitizer: heap-use-after-free on address 0x6290016c326c at pc 0x7f007c288b1b bp 0x7f00404275b0 sp 0x7f0040426d58
WRITE of size 64 at 0x6290016c326c thread T467
#0 0x7f007c288b1a (/lib64/libasan.so.3+0x5eb1a)
#1 0x7f0074fdea26 in memcpy /usr/include/bits/string3.h:51
#2 0x7f0074fdea26 in ast_srtp_protect /home/pbs.vbadalyan/asterisk-13.14.0/res/res_srtp.c:438
#3 0x7f005c6e65de in __rtp_sendto /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:2305
#4 0x7f005c6f5e8d in rtcp_sendto /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:2330
#5 0x7f005c6f5e8d in ast_rtcp_write_report /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3286
#6 0x7f005c6f6ecc in ast_rtcp_write /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3357
#7 0x740c64 in ast_sched_runq /home/pbs.vbadalyan/asterisk-13.14.0/main/sched.c:783
#8 0x7f00469ad02d in do_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29514
#9 0x7b217f in dummy_start /home/pbs.vbadalyan/asterisk-13.14.0/main/utils.c:1235
#10 0x7f007a4c2dc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
#11 0x7f00797a273c in clone (/lib64/libc.so.6+0xf773c)
{code}

{code}
=ERROR: AddressSanitizer: heap-use-after-free on address 0x603002c6d820 at pc 0x7f55310f2ff3 bp 0x7f552b6f2240 sp 0x7f552b6f19e8
READ of size 3 at 0x603002c6d820 thread T88
#0 0x7f55310f2ff2 (/lib64/libasan.so.3+0x3cff2)
#1 0x7f552fbae4a4 (/lib64/libjansson.so.4+0x74a4)
#2 0x7f552fbad3f7 (/lib64/libjansson.so.4+0x63f7)
#3 0x7f552fbad8f3 (/lib64/libjansson.so.4+0x68f3)
#4 0x7f552fbad723 (/lib64/libjansson.so.4+0x6723)
#5 0x7f552fbadb8c in json_vpack_ex (/lib64/libjansson.so.4+0x6b8c)
#6 0x64fd24 in ast_json_vpack /home/pbs.vbadalyan/asterisk-13.14.0/main/json.c:823
#7 0x64fef8 in ast_json_pack /home/pbs.vbadalyan/asterisk-13.14.0/main/json.c:814
#8 0x7f5511e81f26 in ast_rtcp_write_report /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3295
#9 0x7f5511e82ea4 in ast_rtcp_write /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3360
#10 0x740c64 in ast_sched_runq /home/pbs.vbadalyan/asterisk-13.14.0/main/sched.c:783
#11 0x7f54fc11a02d in do_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29514
#12 0x7b217f in dummy_start /home/pbs.vbadalyan/asterisk-13.14.0/main/utils.c:1235
#13 0x7f552f34edc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
#14 0x7f552e62e73c in clone (/lib64/libc.so.6+0xf773c)
{code}

{code}
ERROR: AddressSanitizer: heap-use-after-free on address 0x6030003739c0 at pc 0x7f30ccac6e05 bp 0x7f30bec2e2b0 sp 0x7f30bec2da58
READ of size 19 at 0x6030003739c0 thread T738
#0 0x7f30ccac6e04 in __asan_memcpy (/lib64/libasan.so.2+0x8ce04)
#1 0x7f30cb530571 (/lib64/libjansson.so.4+0x5571)
#2 0x7f30cb532887 in json_string_nocheck (/lib64/libjansson.so.4+0x7887)
#3 0x7f30cb531907 (/lib64/libjansson.so.4+0x6907)
#4 0x7f30cb531723 (/lib64/libjansson.so.4+0x6723)
#5 0x7f30cb531b8c in json_vpack_ex (/lib64/libjansson.so.4+0x6b8c)
#6 0x64c2e4 in ast_json_vpack /home/pbs.vbadalyan/asterisk-13.14.0/main/json.c:823
#7 0x64c4b8 in ast_json_pack /home/pbs.vbadalyan/asterisk-13.14.0/main/json.c:814
#8 0x7f30addd8d77 in ast_sockaddr_copy /home/pbs.vbadalyan/asterisk-13.14.0/include/asterisk/netsock2.h:145
#9 0x7f30addd8d77 in ast_rtcp_read /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:4217
#10 0x7f30addd93cc (/usr/lib/asterisk/modules/res_rtp_asterisk.so+0x2a3cc)
#11 0x73d1e3 in ast_sched_runq /home/pbs.vbadalyan/asterisk-13.14.0/main/sched.c:783
#12 0x7f30972c78d5 in do_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29514
#13 0x7ae78f in dummy_start /home/pbs.vbadalyan/asterisk-13.14.0/main/utils.c:1235
#14 0x7f30cacd2dc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
#15 0x7f30c9fb373c in clone (/lib64/libc.so.6+0xf773c)
{code}

Comments: By: Asterisk Team (asteriskteam) 2017-03-01 17:17:24.372-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Joshua C. Colp (jcolp) 2017-03-02 08:41:00.606-0600

There is no immediate reason to explain why this is happening with the information provided. An actual full backtrace (if it were a crash) would show the state of everything, and console log with debug could show the progression. As you've said it does not happen very often it's likely a very very narrow race condition and one that is hard to isolate.
By: Badalian Vyacheslav (slavon) 2017-03-02 08:51:13.137-0600

Give me at least a tip-off on the idea. Because of what can be deleted heap ahead of time? ao2_ref (instance, -1)? Or it may be beyond the rtp engine occur?
By: Joshua C. Colp (jcolp) 2017-03-02 09:02:03.253-0600

Unreferencing the instance could cause it to be destroyed, it depends on the reference count. It could be that a scheduled item is not getting canceled when it should be or loses a reference when it shouldn't. There is a lot of stuff involved so and the information you've provided doesn't help narrow down how exactly it got there. That's the best I can do without doing an audit of the complete usage.
By: Joshua C. Colp (jcolp) 2017-03-02 09:04:04.096-0600

I'm assigning this back to you as there just isn't enough information here to try to isolate what is going on, without sinking a lot of time into looking over all of the involved code.
By: Badalian Vyacheslav (slavon) 2017-03-14 06:28:53.677-0500

i have 52GB refs logs.... now i try use refcounter.py and will post results here...
also i look to full debug log but it's not have any error or warnings. debug level 0 becouse it's porduction server and i can't increment level.

full log:
{code}
[2017-03-14 10:58:37] VERBOSE[6154][C-0000156b] netsock2.c: Using SIP RTP TOS bits 184
[2017-03-14 10:58:37] VERBOSE[6154][C-0000156b] netsock2.c: Using SIP RTP CoS mark 5
[2017-03-14 10:58:37] VERBOSE[6157][C-0000156c] pbx.c: Executing [s@macro-dialout-trunk:21] ExecIf("SIP/yuvchepizhnaya_wof-00002a28", "0?Set(CONNECTEDLINE(name,i)=CID:78007552771)") in new stack
[2017-03-14 10:58:37] VERBOSE[6157][C-0000156c] pbx.c: Executing [s@macro-dialout-trunk:22] GotoIf("SIP/yuvchepizhnaya_wof-00002a28", "0?customtrunk") in new stack
[2017-03-14 10:58:37] VERBOSE[6157][C-0000156c] pbx.c: Executing [s@macro-dialout-trunk:23] Dial("SIP/yuvchepizhnaya_wof-00002a28", "SIP/sbc/79064924469,300,") in new stack
[2017-03-14 10:58:37] VERBOSE[6157][C-0000156c] netsock2.c: Using SIP RTP TOS bits 184
[2017-03-14 10:58:37] VERBOSE[6157][C-0000156c] netsock2.c: Using SIP RTP CoS mark 5
[2017-03-14 10:58:38] VERBOSE[6125][C-00001566] res_rtp_asterisk.c: 0x62600094e100 -- Probation passed - setting RTP source address to 10.216.88.62:58296
[2017-03-14 10:58:38] VERBOSE[6125][C-00001566] res_rtp_asterisk.c: 0x62600094e100 -- Probation passed - setting RTP source address to 10.216.88.62:58296
[2017-03-14 10:58:56] Asterisk 13.14.0 built by root @ ds-asterisk05.tcsbank.ru on a x86_64 running Linux on 2017-03-13 13:54:32 UTC
{code}

On our load (200-300 calls) we get 1-3 crashes in day. Its critical bug for us . Also we have premium digium support.

Crush log:
{code}
==4251==ERROR: AddressSanitizer: SEGV on unknown address 0x60e0610000e4 (pc 0x7f442d1f1818 bp 0x6290003b826c sp 0x7f442e77e478 T81)
#0 0x7f442d1f1817 in srtp_get_stream (/lib64/libsrtp.so.1+0x6817)
#1 0x7f442d1f2c78 in srtp_protect_rtcp (/lib64/libsrtp.so.1+0x7c78)
#2 0x7f442d4212ad in ast_srtp_protect /home/pbs.vbadalyan/asterisk-13.14.0/res/res_srtp.c:440
#3 0x7f4416f35a6e in __rtp_sendto /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:2305
#4 0x7f4416f42d14 in rtcp_sendto /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:2330
#5 0x7f4416f42d14 in ast_rtcp_write_report /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3282
#6 0x7f4416f4405c in ast_rtcp_write /home/pbs.vbadalyan/asterisk-13.14.0/res/res_rtp_asterisk.c:3349
#7 0x74c214 in ast_sched_runq /home/pbs.vbadalyan/asterisk-13.14.0/main/sched.c:783
#8 0x7f440119b4bd in do_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29514
#9 0x7c22af in dummy_start /home/pbs.vbadalyan/asterisk-13.14.0/main/utils.c:1235
#10 0x7f44323e5dc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
#11 0x7f44316c573c in clone (/lib64/libc.so.6+0xf773c)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib64/libsrtp.so.1+0x6817) in srtp_get_stream
Thread T81 created by T0 here:
#0 0x7f44345624a8 in pthread_create (/lib64/libasan.so.3+0x314a8)
#1 0x7cc4b4 in ast_pthread_create_stack /home/pbs.vbadalyan/asterisk-13.14.0/main/utils.c:1288
#2 0x7f44010cf9cc in restart_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29542
#3 0x7f440119c3af in restart_monitor /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:29529
#4 0x7f440119c3af in load_module /home/pbs.vbadalyan/asterisk-13.14.0/channels/chan_sip.c:35256
#5 0x65ad3f in start_resource /home/pbs.vbadalyan/asterisk-13.14.0/main/loader.c:1029
#6 0x65c2c7 in start_resource /home/pbs.vbadalyan/asterisk-13.14.0/main/loader.c:1022
#7 0x65c2c7 in load_resource_list /home/pbs.vbadalyan/asterisk-13.14.0/main/loader.c:1293
#8 0x65fd77 in load_modules /home/pbs.vbadalyan/asterisk-13.14.0/main/loader.c:1454
#9 0x438663 in asterisk_daemon /home/pbs.vbadalyan/asterisk-13.14.0/main/asterisk.c:4690
#10 0x438663 in main /home/pbs.vbadalyan/asterisk-13.14.0/main/asterisk.c:4446
#11 0x7f44315efb34 in __libc_start_main (/lib64/libc.so.6+0x21b34)

==4251==ABORTING
{code}
By: Badalian Vyacheslav (slavon) 2017-03-14 07:12:26.303-0500

Are there any ways to do a deferred purification of ao2 objects? We will tell in 30 seconds after the counter became 0. Simply in the form of a hack, that now it was possible to work?
By: Malcolm Davenport (mdavenport) 2017-03-14 08:18:03.726-0500

re: Support.

Digium offers several different levels of Support. The only level of support that includes Engineering level work, specifically regarding bug-fixes of a non-architectural nature, is the Support offered under our L5 category. If you do have an Support agreement with Digium, the proper channel for utilizing that is directly through Digium's Support department - www.digium.com/support - not through the open source issue tracker here.
By: Badalian Vyacheslav (slavon) 2017-03-14 15:47:33.901-0500

In my opinion, I understood what was happening. We had 15k records in sip.conf. This created 25k taskprocessors ("subp:" famaly). As a result, ref_cleanup is happening faster than taskprocessor will send the message through statis.

We switched to RealTime. I hope this helps, because We do not have more than 50-100 peers online on the server.
By: Mark Michelson (mmichelson) 2017-03-20 14:47:22.360-0500

In Josh's first comment, he asked for a full backtrace. So far, you have provided several stack traces showing the thread that crashed, but this isn't enough to help us. If there is a race condition, then we need to see what the other threads are doing at the time the crash happened. Perhaps we can see an RTP instance being destroyed by another thread. Plus a gdb "bt full" will show the values of stack variables at the time of the crash and can offer some indication of what the problem may be.

If you need instructions on how to get a full backtrace, please refer to the guide here: https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace . Once you have that information, please upload it as an attachment on this issue.
By: Asterisk Team (asteriskteam) 2017-04-04 12:00:01.526-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines
By: Badalian Vyacheslav (slavon) 2017-04-11 16:55:32.766-0500

Sorry for the long answer, was on vacation. I am sending you the requested information. Asterisk is collected with optimizations. I hope this does not hurt.
By: Badalian Vyacheslav (slavon) 2017-04-12 03:45:20.062-0500

ReOpen Please

By: Asterisk Team (asteriskteam) 2017-04-12 03:45:21.980-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.
By: Rusty Newton (rnewton) 2017-04-12 09:40:35.719-0500

Re-attaching reporter's backtrace.txt uncompressed..
By: Rusty Newton (rnewton) 2017-04-12 10:35:07.532-0500

https://gerrit.asterisk.org/#/c/5343/

Turns out this fix could resolve the issue. Please wait until it is merged, then pull from git and test with that fix. Let us know if the problem still occurs and try to provide a new backtrace without optimizations.
By: Richard Mudgett (rmudgett) 2017-04-12 10:43:03.292-0500

The patch for ASTERISK-26835 and ASTERISK-26853 is likely to fix this as well.
By: Badalian Vyacheslav (slavon) 2017-04-12 14:38:00.623-0500

We try last patch version in Sunday
By: Rusty Newton (rnewton) 2017-04-18 09:38:50.865-0500

Badalian did you get a chance to try the patches Richard mentioned?
By: Badalian Vyacheslav (slavon) 2017-04-18 10:48:37.563-0500

now in test...
By: Badalian Vyacheslav (slavon) 2017-04-18 12:14:20.306-0500

bug in res_ari.c in branch with patch. Issued on module load..

{code}
=================================================================
==3243==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000048 (pc 0x7ffa27cf3569 bp 0x7ffd1c9370e0 sp 0x7ffd1c9370b0 T0)
#0 0x7ffa27cf3568 in ast_ari_add_handler /root/asterisk-git/res/res_ari.c:185
#1 0x667bef in start_resource /root/asterisk-git/main/loader.c:986
#2 0x669107 in start_resource /root/asterisk-git/main/loader.c:979
#3 0x669107 in load_resource_list /root/asterisk-git/main/loader.c:1238
#4 0x66ca37 in load_modules /root/asterisk-git/main/loader.c:1373
#5 0x43b4c7 in asterisk_daemon /root/asterisk-git/main/asterisk.c:4601
#6 0x43b4c7 in main /root/asterisk-git/main/asterisk.c:4350
#7 0x7ffa37232b34 in __libc_start_main (/usr/lib64/libc.so.6+0x21b34)
#8 0x43e9a6 (/usr/sbin/asterisk+0x43e9a6)
{code}
By: Richard Mudgett (rmudgett) 2017-04-18 14:43:22.016-0500

That backtrace has nothing to do with the rtp reentrancy patch. You have optimization enabled. It is difficult to see what is going on because the line numbers do not match with the source and start_resource() does not directly call ast_ari_add_handler().
By: Badalian Vyacheslav (slavon) 2017-04-20 08:39:27.653-0500

Compiled patch 7
{code}
git fetch https://gerrit.asterisk.org/asterisk refs/changes/43/5343/7 && git checkout FETCH_HEAD
{code}

1. Have deadlocked channels
{code}
ds-asterisk05*CLI> core show channels
Channel Location State Application(Data)
SIP/ndmargaryan_wof- (None) Down (None)
SIP/ivmakevit23393_w (None) Down (None)
SIP/ivmakevit23393_w (None) Down (None)
SIP/ndmargaryan_wof- (None) Down (None)
SIP/ndmargaryan_wof- (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/altumanyan_wof-0 (None) Down (None)
SIP/raburenkov_wof-0 (None) Down (None)
SIP/raburenkov_wof-0 (None) Down (None)
SIP/vyumex_wof-00003 (None) Down (None)
SIP/yuadomracheva_wo (None) Down (None)
SIP/1_stub-000035df (None) Down (None)
SIP/1_stub-000035de (None) Down (None)
SIP/1_stub-000035dd (None) Down (None)
SIP/1_stub-000035dc (None) Down (None)
SIP/1_stub-000035db (None) Down (None)
SIP/1_stub-000035da (None) Down (None)
SIP/1_stub-000035cd (None) Down (None)
SIP/1_stub-000035cc (None) Down (None)
SIP/1_stub-000035cb (None) Down (None)
SIP/1_stub-000035e0 (None) Down (None)
SIP/1_stub-000035d9 (None) Down (None)
SIP/1_stub-000035d8 (None) Down (None)
SIP/1_stub-000035d7 (None) Down (None)
SIP/1_stub-000035d6 (None) Down (None)
SIP/1_stub-000035d5 (None) Down (None)
SIP/1_stub-000035d4 (None) Down (None)
SIP/1_stub-000035d2 (None) Down (None)
SIP/1_stub-000035c9 (None) Down (None)
SIP/iadavidenko_wof- (None) Down AppDial2((Outgoing Line))
SIP/ndmargaryan_wof- (None) Down (None)
48 active channels
0 active calls
6798 calls processed
ds-asterisk05*CLI>

{code}

2. Have errors in logs
{code}
Apr 19 14:12:59 ds-asterisk05.tcsbank.ru asterisk[11220]: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x60b00005b588 (0) at line 122 in stasis_message_create_full of stasis_message.c
Apr 19 14:12:59 ds-asterisk05.tcsbank.ru asterisk[11220]: [2017-04-19 14:12:59] ERROR[2257][C-00000518]: stasis_message.c:122 stasis_message_create_full: Excessive refcount 100000 reached on ao2 object 0x60b00005b588
Apr 19 14:12:59 ds-asterisk05.tcsbank.ru asterisk[11220]: [2017-04-19 14:12:59] ERROR[2257][C-00000518]: stasis_message.c:122 stasis_message_create_full: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x60b00005b588 (0)

{code}

and asterisk restart randomly
{code}
Apr 20 16:50:52 ds-asterisk05.tcsbank.ru asterisk[32080]: [2017-04-20 16:50:52] ERROR[3649]: manager.c:6310 process_message: Unable to process manager action 'Originate'. Asterisk is shutting down.

{code}
By: Richard Mudgett (rmudgett) 2017-04-20 11:21:01.712-0500

# The deadlock is likely because of the patch for ASTERISK-26923 which has since been reverted. Patch set 8 is a rebase of the patch that is after the ASTERISK-26923 patch was reverted.
# Where is the short backtrace that is associated with that FRACK message? The short backtrace can help determine what ao2 object has the excessive ref count. It will also help if Asterisk is compiled with DONT_OPTIMIZE and BETTER_BACKTRACES enabled in menuselect.
# Asterisk is not randomly shutting down. It is shutting down because _you_ told it to restart or shutdown. You will get that message if you continue to try doing things while Asterisk is in the process of shutting down.
By: Badalian Vyacheslav (slavon) 2017-04-20 12:48:49.428-0500

Can I get a patch that will be applied to clean to 13.15.0?
By: Richard Mudgett (rmudgett) 2017-04-20 13:16:05.130-0500

Patch set 6 will apply to v13.15.0. Patch set 7 was just a merge conflict resolution from a change after v13.15.0. Patch set 8 was a rebase of patch set 7.
By: Badalian Vyacheslav (slavon) 2017-04-24 17:24:20.226-0500

How about https://gerrit.asterisk.org/#/c/5503/ ? I still have to apply it?
By: Richard Mudgett (rmudgett) 2017-04-24 17:51:33.180-0500

You could apply that patch after the RTP reentrancy patch set 6. It is for ASTERISK-26692 which is an unrelated crash with DTLS in the RTP code.
By: Badalian Vyacheslav (slavon) 2017-04-24 18:10:05.437-0500

Thanks, we just use WebRTC + DTLS.

The question is not on the topic, can tell me if it's not difficult? What are the ways to package the certificate in DTLS so that it goes without fragmentation (always get 2 packets)? (The certificate must be valid, therefore only public validators with intermediate certification centers and a long way/chain for the certificate).

The fact is that a lot of small home providers drop a fragmented UDP package, as in most cases it is considered DDOS (strange logic). Because of this, a large number of clients do not work communication. :( Users are "random and public", so there is no possibility to ask for their own certificates, but there is no possibility to use TCP transport. RTP and RTPC go through UDP and it is impossible to override this for some clients (use TCP) :( Stun servers also return only UDP ports and local TCP ports is ignored (we use port-to-port nat).

How would you solve such a problem?
By: Richard Mudgett (rmudgett) 2017-04-24 18:44:43.103-0500

See my last comment [1] on ASTERISK-26544 about fragmentation and OpenSSL BIO_s_mem needing to fragment the DTLS packet so a lower layer could send the packet without fragmentation.

[1] https://issues.asterisk.org/jira/browse/ASTERISK-26544?focusedCommentId=234312&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-234312
By: Badalian Vyacheslav (slavon) 2017-04-24 18:57:30.490-0500

Forgive me, but I do not understand from the commentary how it is possible to solve the problem at the moment (once again I apologize for the offtopic). Also it is not very clear, if even an asterisk will repack DTLS, then can Chrome understand this packet? As far as I remember RFC there it is stated that the "hello" packet should go in one packet (if necessary with fragmentation)...
By: Richard Mudgett (rmudgett) 2017-04-24 19:45:27.428-0500

It is not possible at this time because of OpenSSL. It is a limitation with the BIO_s_mem driver which is part of OpenSSL. The driver will need support for a MTU parameter added so the driver could generate smaller packets to go across the internet without fragmentation. The driver currently generates one large packet that gets fragmented when it goes across the internet. The driver is required by Asterisk. Only the smallest certificates can get sent without fragmentation.
By: Badalian Vyacheslav (slavon) 2017-04-24 19:50:19.210-0500

Thank you!
By: Badalian Vyacheslav (slavon) 2017-05-05 17:38:59.075-0500

News. On load testing we get new crash....

{code}
=================================================================
==164640==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x616005967c09 at pc 0x7fac8e08c31c bp 0x7fac47cfd490 sp 0x7fac47cfcc38
READ of size 1 at 0x616005967c09 thread T86
#0 0x7fac8e08c31b in strncasecmp (/usr/lib64/libasan.so.3+0x5331b)
#1 0x7fac72752ef6 in __get_header /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:8489
#2 0x7fac7275310a in sip_get_header /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:8517
#3 0x7fac72774224 in reqprep /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:12283
#4 0x7fac7279bd79 in transmit_request /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:16355
#5 0x7fac727f7523 in handle_response /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:24781
#6 0x7fac72816235 in handle_incoming /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:28734
#7 0x7fac728191eb in handle_request_do /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:29044
#8 0x7fac728186c3 in sipsock_read /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:28975
#9 0x65ce40 in ast_io_wait /home/pbs.vbadalyan/asterisk-13.15.0/main/io.c:295
#10 0x7fac7281d044 in do_monitor /home/pbs.vbadalyan/asterisk-13.15.0/channels/chan_sip.c:29610
#11 0x7f54d6 in dummy_start /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1235
#12 0x7fac8b984dc4 in start_thread (/usr/lib64/libpthread.so.0+0x7dc4)
#13 0x7fac8ad2973c in clone (/usr/lib64/libc.so.6+0xf773c)
{code}
By: Joshua C. Colp (jcolp) 2017-05-09 05:49:07.695-0500

That is an issue in chan_sip, and is not in regards to RTCP. Instead of lumping everything into this issue please create separate issues. If you are still having problems with RTCP after the changes Richard has done then please attach the information for that.
By: Badalian Vyacheslav (slavon) 2017-05-09 07:01:30.227-0500

After tests on robots, we went into production and got a deadlock immediately after 40 calls. Now we think how to collect information.

Now compiled with DEBUG_THREADS. Will try to get deadlock again.
By: Badalian Vyacheslav (slavon) 2017-05-09 08:14:58.043-0500

after add DEBUG_THREADS gets ASAN errors:

{code}
==27083==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61b00148b95a at pc 0x7ff54c959a91 bp 0x7ff5070fd360 sp 0x7ff5070fcad8
READ of size 1 at 0x61b00148b95a thread T106
#0 0x7ff54c959a90 (/usr/lib64/libasan.so.3+0x8ea90)
#1 0x7ff54c95bcfd in __interceptor_vsnprintf (/usr/lib64/libasan.so.3+0x90cfd)
#2 0x7c4518 in __ast_str_helper /home/pbs.vbadalyan/asterisk-13.15.0/main/strings.c:76
#3 0x7fc98f in ast_str_append_va /home/pbs.vbadalyan/asterisk-13.15.0/include/asterisk/strings.h:1040
#4 0x7fcd19 in ast_str_append /home/pbs.vbadalyan/asterisk-13.15.0/include/asterisk/strings.h:1122
#5 0x801564 in append_backtrace_information /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:979
#6 0x801a8d in append_lock_information /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1006
#7 0x802448 in ast_dump_locks /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1121
#8 0x80268c in handle_show_locks /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1165
#9 0x5a6862 in ast_cli_command_full /home/pbs.vbadalyan/asterisk-13.15.0/main/cli.c:2762
#10 0x5a6c67 in ast_cli_command_multiple_full /home/pbs.vbadalyan/asterisk-13.15.0/main/cli.c:2788
#11 0x484ffe in netconsole /home/pbs.vbadalyan/asterisk-13.15.0/main/asterisk.c:1569
#12 0x802b23 in dummy_start /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1235
#13 0x7ff54a216dc4 in start_thread (/usr/lib64/libpthread.so.0+0x7dc4)
#14 0x7ff5495bb73c in clone (/usr/lib64/libc.so.6+0xf773c)
{code}

{code}
==26923==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fc2dcf1dae0 at pc 0x7fc2e31d5691 bp 0x7fc2dcf1da40 sp 0x7fc2dcf1d1e8
WRITE of size 56 at 0x7fc2dcf1dae0 thread T7
#0 0x7fc2e31d5690 in __interceptor_backtrace (/usr/lib64/libasan.so.3+0x50690)
#1 0x4acf8a in __ast_bt_get_addresses /home/pbs.vbadalyan/asterisk-13.15.0/main/backtrace.c:59
#2 0x679066 in __ast_rwlock_wrlock /home/pbs.vbadalyan/asterisk-13.15.0/main/lock.c:973
#3 0x47ecf2 in ast_unregister_thread /home/pbs.vbadalyan/asterisk-13.15.0/main/asterisk.c:593
#4 0x802adb in dummy_start /home/pbs.vbadalyan/asterisk-13.15.0/main/utils.c:1233
#5 0x7fc2e0ad0dc4 in start_thread (/usr/lib64/libpthread.so.0+0x7dc4)
#6 0x7fc2dfe7573c in clone (/usr/lib64/libc.so.6+0xf773c)
{code}

:(((((((
By: Joshua C. Colp (jcolp) 2017-05-09 08:20:50.579-0500

I understand why you run with ASAN but in this case it's really not helping trying to figure out what is going on with your RTCP deadlock. We need to see a backtrace if we have any hope of understanding what is going on.
By: Badalian Vyacheslav (slavon) 2017-05-09 08:40:27.064-0500

Locks 1
By: Badalian Vyacheslav (slavon) 2017-05-09 08:44:45.393-0500

added locks 1 and lock2 txt. after calls go out and wait 5 mins all locks go out.
System have about 30 concturent calls. after 100 call we have many channels on DOWN state, many transmit errors and can't do any new calls
By: Badalian Vyacheslav (slavon) 2017-05-09 09:12:15.816-0500

If we do only 10 concurent calls no lock in "core show locks". System before patch was can do 200-300 concurent calls normaly.
By: Badalian Vyacheslav (slavon) 2017-05-09 09:43:55.597-0500

locks_x*.txt - 20-40 concurent calls.

We do sipmle webrtc call to MOH

2853% CPU!!
{code}
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
72545 asterisk 20 0 18.558g 1.812g 21448 S 2853 1.4 117:04.45 asterisk
{code}

{code}
m1-asterisk01t*CLI> core show channels
Channel Location State Application(Data)
SIP/user36_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user16_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user20_stub-0000 s@macro-hangupcall:3 Up GotoIf(1?theend)
SIP/user40_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user3_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user22_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user1_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user19_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user14_stub-0000 s@macro-hangupcall:7 Up Hangup()
SIP/user12_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user5_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user41_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user38_stub-0000 (None) Ringing AppDial2((Outgoing Line))
SIP/user34_stub-0000 (None) Down (None)
SIP/user11_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user8_stub-00000 (None) Down AppDial2((Outgoing Line))
SIP/user31_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user1000_stub-00 766@from-internal:2 Up MusicOnHold(,5)
SIP/user15_stub-0000 (None) Down AppDial2((Outgoing Line))
SIP/user28_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user44_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user32_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user37_stub-0000 s@macro-hangupcall:5 Up ExecIf(0?Set(CDR(recordingfile
SIP/user18_stub-0000 (None) Ringing AppDial2((Outgoing Line))
SIP/user26_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user21_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user35_stub-0000 (None) Down AppDial2((Outgoing Line))
SIP/user9_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user4_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user2_stub-00000 s@from-internal:1 Down (None)
SIP/user43_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user30_stub-0000 (None) Down AppDial2((Outgoing Line))
SIP/user13_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user39_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user33_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user17_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user42_stub-0000 s@macro-hangupcall:7 Up Hangup()
SIP/user6_stub-00000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user29_stub-0000 s@macro-hangupcall:5 Up ExecIf(0?Set(CDR(recordingfile
SIP/user27_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
SIP/user10_stub-0000 766@from-internal:2 Up MusicOnHold(,5)
42 active channels
34 active calls
{code}
By: Joshua C. Colp (jcolp) 2017-05-09 09:52:20.298-0500

If you built with DEBUG_THREADS then there is a large performance penalty. You can still build without it and acquire a backtrace of the running Asterisk process.
By: Badalian Vyacheslav (slavon) 2017-05-09 10:39:33.455-0500

Everything looks also. It just happens not on 20 parallel calls but on 100.
1. Writes that the manager taskprocessor is overflowing. Then, too, write about the aggregated topic, then about cel.
2. Start transmitting xmit errors.
3. Ceases to accept new calls at all.
By: Richard Mudgett (rmudgett) 2017-05-09 12:28:14.156-0500

* The taskprocessor queue reached X scheduled tasks indicate that the system is temporarily overloaded. The queues allow the system to handle traffic bursts. However, a sustained high activity level will eventually push the system over the edge. To deal with this overload res_pjsip stops accepting new work (e.g., new calls) when any taskprocessor queue reaches its high water mark. New work is accepted once all taskprocessor queues that went over their high water mark go back down to their low water mark. (See the output of "core show taskprocessors".)

* The "core show locks" output files are not showing a deadlock. They are showing strange and sometimes corrupt backtraces. The address sanitizer may actually be causing problems by interfering with the stack probes Asterisk uses for its debug tools. You are the only one we are aware of that is using the address sanitizer.
By: Badalian Vyacheslav (slavon) 2017-05-09 12:31:17.712-0500

If we switch to pjsip (now all test in chan_sip), then rtp logic will be used from pjsip or from rtp_asterisk? We have been fighting for half a year with this bug. We need at last to work stably :( We only have 200-300 parallel calls, the equipment is very powerful: Asterisk used to die from RTCP bug, now it can barely pull 100 parallel calls for all tests FreeSwithc processes 300+ without problems.

> ... "core show locks" output...
We compile asterisk without ASAN becouse it's cause break and can't get result. Output is without ASAN compiled.
By: Richard Mudgett (rmudgett) 2017-05-09 12:43:16.998-0500

Both chan_pjsip and chan_sip use the same RTP packet handling code rtp_engine.c and res_rtp_asterisk.c.
By: Joshua C. Colp (jcolp) 2017-05-22 07:13:44.815-0500

If the problem right now is that performance is not what you need but no deadlock occurs then profiling has to be done to understand for your scenario where time is being spent and what can be done to improve that. We're always trying to improve performance but such a thing can take a substantial amount of time to track down, and may only be applicable to your use case.
By: Asterisk Team (asteriskteam) 2017-06-05 12:00:01.222-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines