ASTERISK-28342: Ast-to-Ast setup using the same rtcpinterval crashes RTCP and audio stream

Summary: ASTERISK-28342: Ast-to-Ast setup using the same rtcpinterval crashes RTCP and audio stream

Reporter: Speed Dial Dave (speeddialdave)

Labels: pjsip

Date Opened: 2019-03-21 15:59:37

Date Closed: 2019-03-29 13:33:56

Priority: Minor

Regression?:

Status: Closed/Complete

Components: Resources/res_rtp_asterisk

Versions: 13.21.1, 16.2.1

Frequency of Occurrence: Occasional

Related Issues:

Environment: CentOS 7 (7.6-1810 x86_64)

Attachments:

Description: When stress-testing PJSIP and Asterisk 16.2.1 by doing direct calls between two identically configured Asterisk setups (over both WAN and LAN) I noticed a specific error message that kept showing up in the logs:

res_rtp_asterisk.c: RTCP SR transmission error to 1.2.3.4:12345, rtcp halted Operation not permitted

This started showing up when my tests reached about 200 simultaneously connected calls between the two Asterisk setups. At this level, after a total of 15k calls, the error showed up about 70-80 times on the Asterisk doing the dialing, and about 10-15 times on the one receiving the calls; at ~350 simultaneous calls the number of errors was much higher. When this error popped up the outgoing audio stream ("us") also dropped out from the affected calls, but the channels didn't crash and the calls kept going and finalized normally without any other apparent error.
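To get comparable error counts between test runs, the log can be tallied with a short script. The following is a minimal sketch, assuming the message always has the shape of the single line quoted above; the function name {{count_rtcp_errors}} and the regular expression are illustrative and are not part of Asterisk.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this report (an assumption):
# other Asterisk versions may word the message differently.
RTCP_ERROR = re.compile(
    r"RTCP SR transmission error to (?P<host>\S+):(?P<port>\d+), "
    r"rtcp halted (?P<reason>.+)$"
)


def count_rtcp_errors(lines):
    """Return a Counter keyed by (destination host, error text)."""
    hits = Counter()
    for line in lines:
        match = RTCP_ERROR.search(line.rstrip())
        if match:
            hits[(match["host"], match["reason"])] += 1
    return hits


if __name__ == "__main__":
    sample = [
        "res_rtp_asterisk.c: RTCP SR transmission error to 1.2.3.4:12345, "
        "rtcp halted Operation not permitted",
        "some unrelated log line",
    ]
    for (host, reason), total in count_rtcp_errors(sample).items():
        print(host, reason, total)
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.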

Because "Operation not permitted" (EPERM) is normally a permission error, I first tried raising various ulimits, suspecting some kind of resource starvation, but without luck. After experimenting with various config changes for Asterisk, I at one point had different rtcpinterval (rtp.conf) values on the two machines, one set to 3000 and the other to 6000, and this completely cleared the problem. Running my test several more times with an rtcpinterval difference of only 500 between the two machines, I still couldn't get the error to show up again, even when stepping up to 600 simultaneous calls.

Comments:

By: Asterisk Team (asteriskteam) 2019-03-21 15:59:38.467-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.
By: Sean Bright (seanbright) 2019-03-25 14:56:47.142-0500

So there aren't any documented reasons that {{sendto}} should be returning {{EPERM}}. [I have seen anecdotally through a bit of googling|https://stackoverflow.com/questions/23859164/linux-udp-socket-sendto-operation-not-permitted] that this can be caused when {{iptables}} decides to drop a packet for one reason or another, but I don't _think_ that Asterisk is doing anything wrong here.
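For anyone trying to reproduce this outside Asterisk, here is a minimal sketch of how the same errno surfaces from a plain UDP socket. It is not how {{res_rtp_asterisk.c}} is written; the function name {{send_datagram}}, the return labels, and the loopback address and port in the usage part are all made up for illustration. The code only separates {{EPERM}} from other send failures so that a test harness can count them separately.

```python
import errno
import socket


def send_datagram(sock, payload, addr):
    """Send one UDP datagram and report how the attempt ended.

    Returns "sent", "blocked" (the call failed with EPERM) or
    "failed" (any other OSError).
    """
    try:
        sock.sendto(payload, addr)
    except OSError as exc:
        if exc.errno == errno.EPERM:
            # Anecdotally tied to the local packet filter dropping the
            # outgoing packet; treated here as its own category.
            return "blocked"
        return "failed"
    return "sent"


if __name__ == "__main__":
    # Placeholder destination: loopback and an arbitrary port.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        print(send_datagram(sock, b"ping", ("127.0.0.1", 12345)))
```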
By: Corey Farrell (coreyfarrell) 2019-03-25 15:03:21.200-0500

Possibly SELinux? I'd suggest installing {{policycoreutils-python-utils}} and running {{audit2allow -a}} or {{audit2why -a}} to check this.
By: Joshua C. Colp (jcolp) 2019-03-26 04:36:57.298-0500

Per the comments from [~seanbright] and [~coreyfarrell] have you looked into those?
By: Speed Dial Dave (speeddialdave) 2019-03-26 18:41:28.807-0500

[~jcolp], SELinux is disabled on the two machines used for this test, but iptables is running with a rather permissive and simple ruleset that passes all UDP traffic to the relevant ports between the machines. I'm a bit confused as to how an identical rtcpinterval configured on the two machines would somehow cause iptables to hiccup, because, as stated, I can solve the problem by just increasing or decreasing the rtcpinterval on *one* of the Asterisks by, for example, 500 ms. I will perform a few more tests in the coming days, with iptables out of the picture, and let you know how that goes.
By: Sean Bright (seanbright) 2019-03-26 18:59:51.232-0500

[~speeddialdave] - I don't disagree that it seems odd. However, because {{sendto}} isn't documented to set {{EPERM}}, it appears to be something outside of Asterisk that is causing the issue.
By: Speed Dial Dave (speeddialdave) 2019-03-29 12:55:49.473-0500

I found some time to do more tests, and it indeed seems to be iptables causing the problem. Specifically its connection tracking for UDP traffic appears to be the culprit. Disabling connection tracking for UDP on the relevant ports makes the errors go away:

(This might not be the correct or most suitable way to disable conntrack for this purpose; I don't really know, as I avoid iptables as much as I can.)

{noformat}
*raw
-A PREROUTING -p udp -m multiport --dports 5060,10000:20000 -j NOTRACK
-A OUTPUT -p udp -m multiport --sports 5060,10000:20000 -j NOTRACK
COMMIT
{noformat}

{noformat}
[root@ast_a blah]# iptables -L -n -t raw
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
CT         udp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 5060,10000:20000 NOTRACK

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
CT         udp  --  0.0.0.0/0            0.0.0.0/0            multiport sports 5060,10000:20000 NOTRACK
{noformat}
By: Sean Bright (seanbright) 2019-03-29 13:05:57.192-0500

[~speeddialdave], are these memory constrained systems by any chance?
By: Speed Dial Dave (speeddialdave) 2019-03-29 13:26:50.190-0500

[~seanbright] Not really. They're equipped with 8GB of RAM, and the current test and setup never reaches over 400MB of total RAM use. There might be some global system limit configured too narrowly, but I wouldn't know which or where to look.
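One system-wide limit that could be checked is the size of the connection tracking table. Whether it plays any role here was never established in this thread, so the following is only a sketch of where to look. It assumes a Linux kernel with the {{nf_conntrack}} module loaded, which exposes {{nf_conntrack_count}} and {{nf_conntrack_max}} under {{/proc/sys/net/netfilter}}; the function name {{conntrack_usage}} is illustrative.

```python
from pathlib import Path

# Where Linux exposes the conntrack counters when nf_conntrack is loaded.
DEFAULT_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(proc_dir=DEFAULT_DIR):
    """Return (tracked entries, table limit, fill ratio)."""
    base = Path(proc_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    ratio = count / limit if limit else 0.0
    return count, limit, ratio


if __name__ == "__main__":
    try:
        count, limit, ratio = conntrack_usage()
    except FileNotFoundError:
        print("conntrack counters not available on this system")
    else:
        print(f"{count} of {limit} entries in use ({ratio:.1%})")
```

Sampling this during a test run, while the call count ramps up, would show whether the table gets close to its limit around the time the errors start.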
By: Sean Bright (seanbright) 2019-03-29 13:33:56.515-0500

Thanks [~speeddialdave]. It would be nice to know why connection tracking is causing this problem and if Asterisk is doing something to trigger it, but because we have a workaround (disabling connection tracking for ports Asterisk uses) I think we are safe to close this. We can re-open if someone stumbles on additional information in the future or this becomes a larger problem.

Thanks for following up!