ASTERISK-28962: Asterisk Memory Leak after SIP Reply Flood

[Home]

Summary: ASTERISK-28962: Asterisk Memory Leak after SIP Reply Flood

Reporter: Jonas Swiatek (jonasswiatek) Labels:

Date Opened: 2020-06-24 06:46:29 Date Closed: 2020-10-19 12:00:01

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_pjsip

Versions: 16.6.1 Frequency of
Occurrence Frequent

Related
Issues:

Environment: Amazon Linux 2 (CentOS based) Attachments: ( 0) invite_recv.xml

Description: I've got a user with a SIP Client (MicroSIP) that some times goes haywire, and responds to the initial INVITE from Asterisk by retransmitting it's 180 Ringing on a loop, seemingly unbounded. We're talking thousands of the same reply. I haven't been able to retrieve all of them from Homer as there are simply too many for their UI to cooperate. But the 100 or so I did manage to pull was all sent within a handful of milliseconds.

When Asterisk received these, it's memory usage shots up significantly, and the log file started printing hundreds of these lines:

WARNING[13337] channel.c: Exceptionally long queue length queuing to PJSIP/registrar-0000d96a

The channel noted is the outgoing call to the user with the defective SIP Client.

There was nothing else in the messages log file related to this at all.

Asterisk CPU spiked (thought not to 100%), which is not a surprise as it does need to process all these SIP Replies, however the memory usage did not go down afterwards, even 12 hours after.

Normally it'll sit around 5 mb or something, but after this it had allocated a good 6GB of ram and there was no sign it was about to give any of it up 12 hours later when I restarted it.

I've got a PCAP file, though it contains sensitive information so I'd prefer not to share it publicly

I've also moved this customer to a separate Asterisk instance running 16.11.1, and I can probably enable whatever other logging might be helpful on it, in case this client goes haywire again and takes out that instance.

Comments: By: Asterisk Team (asteriskteam) 2020-06-24 06:46:34.814-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.
By: Jonas Swiatek (jonasswiatek) 2020-06-24 06:53:53.860-0500

PCAP has been emailed to asteriskteam.
By: Kevin Harwell (kharwell) 2020-06-24 15:09:59.743-0500

{quote}
PCAP has been emailed to asteriskteam.
{quote}
Thanks we got it.

I've tried throwing a lot of 180's at Asterisk, but was unable to replicate any memory issues so far.

I don't have confidence the Asterisk log will turn up much but you could try sending it to us and maybe we'll get lucky. You could try enabling debug and/or verbose during the incident and it might turn up more.

Can you also post, or send the endpoint's configuration and the relevant executing dialplan? Also, are you using Asterisk realtime?
By: Kevin Harwell (kharwell) 2020-06-24 15:19:15.109-0500

Other things to try if you are able. One of the better ways to track a possible memory issue in Asterisk is by enabling MALLOC_DEBUG [1], or running Asterisk with Valgrind [2]. Both have drawbacks though. Valgrind comes with a performance cost, and MALLOC_DEBUG requires a recompile.

If you are able to run with MALLOC_DEBUG enabled please be sure to enable the "atexit summary" via the Asterisk CLI command. See [1] for more information. Also during an occurrence it'd be worthwhile to execute the "memory show summary" or "memory show allocations" CLI commands several times over time and record the results. This would give a memory allocations snapshot over time that might point to where memory is "rising".

[1] https://wiki.asterisk.org/wiki/display/AST/MALLOC_DEBUG+Compiler+Flag
[2] https://wiki.asterisk.org/wiki/display/AST/Valgrind
By: Jonas Swiatek (jonasswiatek) 2020-06-24 15:22:44.195-0500

The logs literally contains nothing else than the "WARNING[23382] channel.c: Exceptionally long queue length queuing to PJSIP/registrar-0000ab85" entries

The only thing prior to it in /var/log/asterisk/messages is a few hours before:
NOTICE[8467][C-00004431] app_dial.c: Not accepting call completion offers from call-forward recipient Local/0544445@dial-00003ee2;1

I'll see if I can crank up the log verbosity on the server. It's only serving that specific customer right now because I don't want his crazy SIP client to nuke my server again ;)

Endpoint config:
{code}
[registrar]
type=identify
endpoint=registrar
match=<redacted>
match=<redacted>

[registrar]
type=aor
contact=<redacted>

[registrar]
type=endpoint
transport=transport-udp
context=dial
disallow=all
allow=ulaw,alaw,opus
aors=registrar
force_rport=yes
direct_media=no
rtp_symmetric=yes
refer_blind_progress=no
media_address=<redacted>
media_encryption=sdes
media_encryption_optimistic=yes
rtp_timeout=120
{code}

Dialplan:
{code}
[dial]
exten => _.,1,NoOP(AGI: ${agiUrl})
same => n,GosubIf($[${LEN(${agiUrl})} = 0]?init-agi,SetAgi,1)
same => n,AGI(agi://${agiUrl}/dial)
same => n,Hangup()
{code}
We use Fast AGI for everything, so the call would've been handled by that at the time it crashed out. Maybe that's a relevant factor?

I am not using realtime.
By: Kevin Harwell (kharwell) 2020-06-24 15:24:39.374-0500

another question! Have you had an occurrence of the problem running on 16.11.1 yet?
By: Jonas Swiatek (jonasswiatek) 2020-06-24 15:29:34.888-0500

It's no problem, it's already compiled from code, so it's easy enough to change the compiler flags and recompile. Do you prefer either Valgrind or MALLOC_DEBUG, or is it all the same to you?

Haven't seen anything on version 16.11.1 yet. Only just moved this offending customer to that version yesterday - but so far we've seen this flood happen three times total, roughly with 5-7 days in between. I think it depends on something blowing up on the clients side that provokes the issue.

Perhaps I should see if I can replicate the issue with SIPP.

By: Jonas Swiatek (jonasswiatek) 2020-06-24 15:45:00.452-0500

A quick thought:

The PCAP is incomplete, likely because homer simply wasn't designed to shove that much data into one. One thing that I could've happened, since the user had like 4 or 5 devices registered at the time is that one of the other devices answered the call. When that happened, Kamailio would've sent out a CANCEL to the other devices, but the defective SIP device would've kept just spamming 180 Ringing like it was the end of the world. Perhaps Asterisk receiving a metric f...ton of those to a call that has already been answered would be provoking it?
By: Kevin Harwell (kharwell) 2020-06-24 16:11:33.767-0500

Regarding AGI it's possible of course if a leak is there, or possibly in the AGI script itself.

If you can I'd start with MALLOC_DEBUG as it should have less effect on system performance.

If you can, and are SIPp savvy then yeah attempting a scenario against that system to replicate the problem would be good. I thought Asterisk receiving all the 180's might be the cause but using SIPp I tried doing that and the memory on my system didn't go up really at all for Asterisk.

I'll attach the modified scenario I used if you want to use that as a starting point.
By: Kevin Harwell (kharwell) 2020-06-24 16:16:08.024-0500

Attached SIPp scenario [^invite_recv.xml]. I used the following to run it:
{noformat}
$ sipp 127.0.0.1 -p 5061 -m 1 -sf invite_recv.xml -s alice
{noformat}
I then originated a call from Asterisk to "alice" and the scenario receives the invite, and then basically loops forever on sending the 180.

Asterisk did however eventually timeout the scenario after 32 seconds or so (whatever the timer b is set to).
By: Jonas Swiatek (jonasswiatek) 2020-06-24 16:19:39.893-0500

I don't think the memory leak can be in the AGI script itself. It's Fast AGI (the TCP Connection thing) to another server entirely, and it didn't register a blip on the radar at all.

As for the quantity of 180 Replies. Judging from the AWS graps, we're probably talking about several hundred megabytes of 180 Ringing replies over the span of maybe 10 seconds.
By: Jonas Swiatek (jonasswiatek) 2020-06-24 16:21:52.255-0500

Cool! I'll see if I can make anything happen with SIPP.
By: Asterisk Team (asteriskteam) 2020-07-10 12:00:00.570-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines
By: Jonas Swiatek (jonasswiatek) 2020-09-30 15:04:39.596-0500

We've had this happen again a few times (latest just 15 minutes ago from now).

One thing that I've been thinking about last, is that our calls are being managed through Fast AGI - could that be a possible factor?

Running 16.11.1
By: Asterisk Team (asteriskteam) 2020-09-30 15:04:39.992-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.
By: Joshua C. Colp (jcolp) 2020-10-05 05:56:26.760-0500

I wouldn't expect FastAGI to contribute, but without logs and such it's purely a guess. For example a debug level log would likely show why things are doing what they're doing.
By: Asterisk Team (asteriskteam) 2020-10-19 12:00:00.909-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines