ASTERISK-24440: Call leak in Confbridge

[Home]

Summary: ASTERISK-24440: Call leak in Confbridge

Reporter: Ben Klang (bklang) Labels: Security

Date Opened: 2014-10-18 23:30:52 Date Closed: 2014-11-20 09:47:54.000-0600

Priority: Major Regression?

Status: Closed/Complete Components: Applications/app_confbridge

Versions: 11.13.0 Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) confbridge.conf
( 1) extensions.conf
( 2) jira_asterisk_24440_v11_partial_fix.patch
( 3) running_bt.txt
( 4) simple_call.xml
( 5) sip.conf

Description: I've been testing various Asterisk call behaviors with SIPp and with a no-media timeout set to 10 seconds. In every case, if I kill SIPp, Asterisk will eventually terminate the call for lack of media and eventually reach zero calls. However, if I have SIPp call into ConfBridge and then overload Asterisk, those calls are not cleaned up, even 30 minutes after SIPp has been killed.

There is corresponding memory growth which makes me think that the channels aren't getting cleaned up like they should. In "core show channels" output I see many thousands of calls, even though SIPp is limited to 1,000 concurrent calls.

Some additional notes:

* I do see messages indicating that calls should be getting disconnected, like this: [Oct 19 00:10:09] NOTICE[5874]: chan_sip.c:28942 check_rtp_timeout: Disconnecting call 'SIP/10.3.18.108-00000bda' for lack of RTP activity in 11 seconds
* Many minutes after the test has ended and SIP traffic has stopped, the console is still flooded with messages like this: [Oct 19 00:12:26] WARNING[5874]: chan_sip.c:4258 __sip_autodestruct: Autodestruct on dialog '2472-31726@10.3.18.108' with owner SIP/10.3.18.108-000009a7 in place (Method: BYE). Rescheduling destruction for 10000 ms

I've attached the SIPp XML that I'm using for this test.

Comments: By: Ben Klang (bklang) 2014-10-18 23:32:41.801-0500

One more thing I wanted to mention: when I run this at low volumes (5cc/1cps) I don't see the issue. At the end of a small test, all calls are hung up. It apparently only happens under load, or perhaps when something in the SIP dialog breaks down.
By: Ben Klang (bklang) 2014-10-18 23:48:37.727-0500

"core show channels" output after a test run of 20,000 calls (limited to 30cps and 1,000cc):

19467 active channels
19465 active calls
20000 calls processed
By: Ben Klang (bklang) 2014-10-22 11:49:02.730-0500

I've reproduced this issue: it also exists in Asterisk 13
By: Richard Mudgett (rmudgett) 2014-11-03 14:41:53.632-0600

I'm trying to replicate the issue you are seeing on my setup. Can you attach the sip.conf, extenthion.conf for the confbridge exten, and confbridge.conf that you are using please. Also what is the sipp command you are using?
By: Ben Klang (bklang) 2014-11-03 20:52:02.519-0600

Files attached. SIPp command line:

sudo $(which sipp) -p 8836 -sf simple_call.xml -l 1000 -m 20000 -r 5 -s conference -i 10.3.18.108 -mp 17862 -no_rate_quit -rate_max 35 -rate_increase 1 -rate_interval 30 -trace_stat -stf stats_file.txt -fd 1 -trace_screen -screen_file screen_file.txt -trace_err -error_file errors_file.txt -t u1 10.3.18.134
By: Ben Klang (bklang) 2014-11-03 20:53:04.591-0600

The rate_increase option is from this SIPp PR: https://github.com/SIPp/sipp/pull/107

If you don't want to do that, just change the -rate_increase option to -fd for the same effect.
By: Richard Mudgett (rmudgett) 2014-11-06 16:20:24.691-0600

[^jira_asterisk_24440_v11_partial_fix.patch] - Partial fix of the problem for v11 as v12+ already protects the bridge_channel->state from reentrancy and prevents inappropriate state changes.

With the above patch on -r427464 of the v11 branch, I made my test box load the asterisk instance with a lot of calls into {{ConfBridge}} using sipp: {{sudo sipp -sf simple_call.xml -i 127.0.0.2 -r 10 -l 1000 -s 735 127.0.0.1}}. The patch does prevent the announcer channel {{Bridge/0xXXXXXXX-output}} from getting stuck in the bridge, but one of the threads got stuck trying to send a packet to a socket as seen in thread 67 of the [^running_bt.txt] file. The stuck thread has the bridge lock which deadlocks any other thread from leaving the bridge.

* The dialplan exten 735 just throws the channel into a ConfBridge room.
* The asterisk instance and sipp were run on the same box with the result of the system getting sluggish.

{noformat}
Thread 67 (Thread 0xaac0eb70 (LWP 16567)):
#0 0x00391422 in __kernel_vsyscall ()
#1 0x00ba4ac8 in sendto () at ../sysdeps/unix/sysv/linux/i386/socket.S:97
#2 0x0817cd17 in ast_sendto (sockfd=1717, buf=0xaadb0dfc, len=172, flags=0, dest_addr=0xaac05a18) at netsock2.c:494
#3 0x00e5e646 in __rtp_sendto (instance=0xaad4efcc, buf=0xaadb0dfc, size=172, flags=0, sa=0xaac05a18, rtcp=0, ice=0xaac059f8, use_srtp=1) at res_rtp_asterisk.c:2121
#4 0x00e5e6d7 in rtp_sendto (instance=0xaad4efcc, buf=0xaadb0dfc, size=172, flags=0, sa=0xaac05a18, ice=0xaac059f8) at res_rtp_asterisk.c:2131
#5 0x00e622a8 in ast_rtp_raw_write (instance=0xaad4efcc, frame=0xaadaed38, codec=0) at res_rtp_asterisk.c:3188
#6 0x00e630db in ast_rtp_write (instance=0xaad4efcc, frame=0xa7c24e20) at res_rtp_asterisk.c:3360
#7 0x081a6dfa in ast_rtp_instance_write (instance=0xaad4efcc, frame=0xa7c24e20) at rtp_engine.c:309
#8 0x0118391c in sip_write (ast=0xaad58394, frame=0xa7c24e20) at chan_sip.c:7182
#9 0x080cc8ee in ast_write (chan=0xaad58394, fr=0xa7c24e20) at channel.c:5254
#10 0x01f18980 in softmix_bridge_poke (bridge=0xb6a4a1bc, bridge_channel=0xacab697c) at bridge_softmix.c:606
#11 0x0809ec8d in ast_bridge_handle_trip (bridge=0xb6a4a1bc, bridge_channel=0xacab697c, chan=0x0, outfd=-99999) at bridging.c:345
#12 0x080a0d81 in bridge_channel_join_multithreaded (bridge_channel=0xacab697c) at bridging.c:780
#13 0x080a1d00 in bridge_channel_join (bridge_channel=0xacab697c) at bridging.c:1013
#14 0x080a2632 in ast_bridge_join (bridge=0xb6a4a1bc, chan=0xaad58394, swap=0x0, features=0xaac09b00, tech_args=0xaac09b1c) at bridging.c:1135
#15 0x00a2b354 in confbridge_exec (chan=0xaad58394, data=0xaac0bedc "rpm_testing,,,sample_user_menu") at app_confbridge.c:1743
#16 0x0817dcb1 in pbx_exec (c=0xaad58394, app=0xb6d8e8c8, data=0xaac0bedc "rpm_testing,,,sample_user_menu") at pbx.c:1622
#17 0x081887e5 in pbx_extension_helper (c=0xaad58394, con=0x0, context=0xaad58f54 "sip_site_b", exten=0xaad58fa4 "735", priority=2, label=0x0, callerid=0xadd2dfe8 "sipp",
action=E_SPAWN, found=0xaac0e1d8, combined_find_spawn=1) at pbx.c:4915
#18 0x0818c1a5 in ast_spawn_extension (c=0xaad58394, context=0xaad58f54 "sip_site_b", exten=0xaad58fa4 "735", priority=2, callerid=0xadd2dfe8 "sipp", found=0xaac0e1d8,
combined_find_spawn=1) at pbx.c:6037
#19 0x0818dc3e in __ast_pbx_run (c=0xaad58394, args=0x0) at pbx.c:6512
#20 0x0818febd in pbx_thread (data=0xaad58394) at pbx.c:6842
#21 0x081eea83 in dummy_start (data=0xaccf9c70) at utils.c:1223
#22 0x002c796e in start_thread (arg=0xaac0eb70) at pthread_create.c:300
#23 0x00ba359e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
{noformat}

With the {{--enable-dev-mode}} build I was also seeing a lot of {{ast_read() on chan '%s' called with no recorded file descriptor.}} errors. I suspect these may be because the channel is awakened by {{ast_bridge_change_state()}} for talk detection events on the bridge and the code thinks it needs to read the channel for a frame.
By: Joshua C. Colp (jcolp) 2014-11-12 13:43:33.921-0600

I've put Richard's patch up for review and will talk to him tomorrow about it. I think really there's two issues here:

1. What Richard has found
2. Under higher call levels signaling handling can be delayed resulting in retransmissions and delayed responses - this confuses sipp and causes it to mark some calls as dead when in reality they are actually up and running fine. As a result it doesn't send the BYE as part of the scenario, so those calls remain up.

With Richard's patch I still saw those but the count was consistent on both sides. Dead calls in sipp, active in Asterisk.
By: Ben Klang (bklang) 2014-11-12 15:39:56.535-0600

> 2. Under higher call levels signaling handling can be delayed resulting in retransmissions and delayed responses - this confuses sipp and causes it to mark some calls as dead when in reality they are actually up and running fine. As a result it doesn't send the BYE as part of the scenario, so those calls remain up.

I have seen this behavior as well. As long as the ultimate lack of audio causes Asterisk to eventually clean up the call, such that when SIPp stops the call count eventually reaches zero, then this behavior doesn't necessarily bother me (though I can't help but wonder whether the signaling timing delay is a bug in SIPp or Asterisk?)
By: Joshua C. Colp (jcolp) 2014-11-12 15:43:01.926-0600

It's a design limitation of chan_sip - it's single threaded for UDP, so the more traffic and the longer each thing takes... the more delay. In the case of chan_pjsip it suffers from this much less because it dispatches.