Summary: | ASTERISK-27616: chan_sip locks up during reload under Asterisk 13 / 15 (but not 11) | ||
Reporter: | Gregory Massel (gmza) | Labels: | |
Date Opened: | 2018-01-23 18:44:28.000-0600 | Date Closed: | |
Priority: | Major | Regression? | Yes |
Status: | Open/New | Components: | Channels/chan_sip/General Core/Stasis |
Versions: | 13.19.0, 15.2.0 | Frequency of Occurrence: | Frequent |
Related Issues: | |||
Environment: | Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64) Asterisk versions 11.25.3, 13.19.0, 15.2.0 chan_sip - 14783 sip peers [Monitored: 8181 online, 6602 offline Unmonitored: 0 online, 0 offline] 4567 active SIP subscriptions | Attachments: | |
Description: | Since upgrading from Asterisk 11 to 13, when issuing a "sip reload" (either via the CLI or asterisk -rx "sip reload"), chan_sip locks up for a period of time ranging from 3 to 70 seconds (depending on various factors), during which it is completely unresponsive to any SIP packets received. Aggravating the issue, if qualify=yes is set, sip peers become LAGGED/UNREACHABLE as a result of chan_sip becoming unresponsive, which triggers a flood of device state updates that further backlogs chan_sip.
This issue does not exist on Asterisk 11.25.3. Irrespective of the system load, volume of peers (we have >8,000 online) or volume of SIP subscriptions (we have anywhere from 4,000 to 8,000), Asterisk 11 can do a "sip reload" cleanly and instantaneously with zero disruption. It remains responsive throughout. However, Asterisk 13.19.0 and 15.2.0 both exhibit the module slow-down / lock-up. On Asterisk 13/15, with qualify=no and allowsubscribe=no, a "sip reload" can typically take 3 to 5 seconds. A debug log shows that, despite there being a considerable number of peers, the process of reloading the configuration files is almost instantaneous, but things slow down afterwards, e.g.:

[Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
[Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
[Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
[Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done

In the above example, the poke/prune took 5 seconds; however, this can vary dramatically based on the following factors:
- If a global sip config setting is changed (e.g. qualify=yes -> qualify=no), acting on all peers, the poke/prune can take up to a minute.
- With no config changes at all, it can take 3-20 seconds, depending on call volume and on whether qualify and allowsubscribe are enabled.
- Even in the best case, with qualify=no, keepalive=no and allowsubscribe=no, it will take at least 3 seconds. With zero active calls and qualify=yes, typically 5-6 seconds. With zero active calls, qualify=yes and allowsubscribe=yes, typically 8-10 seconds.

I've verified using VoIPmonitor and packet sniffing that the Asterisk server is completely unresponsive to SIP during the reload. Active calls do, however, remain active, as RTP continues to flow. Other Asterisk threads unrelated to chan_sip continue to function.
Looking at the chan_sip source code in sip_do_reload(), the code appears similar, if not identical, from Asterisk 11 to 13. We have no outbound registrations nor outbound MWI subscriptions. Although the SIP reload is slow and unresponsive, it would appear that another section of code is causing it to become unresponsive... perhaps an issue with locking? It's not clear to me why this happens with Asterisk 13/15 but not with 11. It's also worth pointing out that we don't see this behavior on other Asterisk 13 servers that have zero hints / device state subscriptions but still have thousands of peers with qualify=yes. It is possible that the issue relates, at least in part, to device state. N.B. We have ruled out DNS failure by replacing all references to hostnames with IP addresses. I am also confident that, if there were a DNS issue here, Asterisk 11.25.3 would be affected, yet it is not. | ||
Comments: | By: Asterisk Team (asteriskteam) 2018-01-23 18:44:28.986-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: George Joseph (gjoseph) 2018-01-24 07:41:43.331-0600

chan_sip is no longer supported as a core module but your issue is in the queue. Your patience is appreciated as a community developer may work the issue when time and resources become available. Asterisk is an open source project and community members work the issues on a voluntary basis. You are welcome to develop your own patches and submit them to the project.[1] If you are not a programmer and you are in a hurry to see a patch provided then you might try rallying support on the Asterisk users mailing list or forums.[2] Another alternative is offering a bug bounty on the asterisk-dev mailing list.[3] Often a little incentive can go a long way.

[1]: https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process
[2]: http://www.asterisk.org/community/discuss
[3]: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

By: Gregory Massel (gmza) 2018-02-02 08:16:07.429-0600

It appears that this is a memory leak of sorts. Essentially, the longer Asterisk 13+ is running and the more times "sip reload" is run, the more memory is consumed and the longer the reload process takes.
This can grow until the system's memory is exhausted (and it starts swapping), after which reloads take exponentially longer, locking up chan_sip for minutes. If I run "pmap `cat /var/run/asterisk/asterisk.pid` | sort -n -k 2", I get lots of entries such as:

00007fdac0000000 130884K rw--- [ anon ]
00007fdaa8000000 130920K rw--- [ anon ]
00007fdb18000000 130964K rw--- [ anon ]
00007fdaf0000000 131036K rw--- [ anon ]

Even removing most of the sip peers from sip.conf and reloading doesn't reduce active memory usage.

By: Gregory Massel (gmza) 2018-02-02 08:42:47.942-0600

ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
  System Uptime: 516 hours
  Total RAM: 4032804 KiB
  Free RAM: 212324 KiB
  Buffer RAM: 30672 KiB
  Total Swap Space: 11717628 KiB
  Free Swap Space: 10385896 KiB
  Number of Processes: 290

ctnsip1*CLI> module unload chan_sip.so
Unloaded chan_sip.so
ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
  System Uptime: 516 hours
  Total RAM: 4032804 KiB
  Free RAM: 213020 KiB
  Buffer RAM: 30704 KiB
  Total Swap Space: 11717628 KiB
  Free Swap Space: 10386108 KiB
  Number of Processes: 288

[Memory is still in a leaked state, despite the unload of chan_sip.]

At this point I kill -9 the asterisk process, restart asterisk, and:

ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
  System Uptime: 516 hours
  Total RAM: 4032804 KiB
  Free RAM: 3528008 KiB
  Buffer RAM: 34972 KiB
  Total Swap Space: 11717628 KiB
  Free Swap Space: 11643084 KiB
  Number of Processes: 294

Suddenly memory usage is down from ~4GB to ~0.5GB, using the exact same configuration, merely by killing and restarting Asterisk.

On further review, I may have been incorrect in saying that memory is leaked during a "sip reload". It may be that memory is constantly being leaked (e.g. with call setup and tear-down) and that the reload process slows down because, when a "sip reload" is executed, it has to walk through all the leaked memory.
By: Gregory Massel (gmza) 2018-02-02 09:58:36.384-0600

I am starting to think that this is not in chan_sip at all, but in stasis. In chan_sip, unlink_peers_from_tables() was modified from Ast 11 to Ast 13 and the comment reads:

/*
 * We must remove the ref outside of the peers container to prevent
 * a deadlock condition when unsubscribing from stasis while it is
 * invoking a subscription event callback.
 */

stasis is new in 13 versus 11, and this could also explain why the delays are aggravated when presence is enabled.

By: Gregory Massel (gmza) 2018-02-02 18:42:49.993-0600

Some more observations:
1. If one removes the sip peers and does a "sip reload", zero memory appears to be recovered, despite thousands of peers being pruned. It appears that the sip peers are still kept in memory indefinitely, just marked as pruned/inactive.
2. If one then re-adds the sip peers and does a "sip reload", there is an immediate and significant increase in memory use. It appears that the entire peer list is now twice as long.

Using a "memory show summary" before and after:

31642795 bytes in 51503 allocations in file chan_sip.c
17943596 bytes in 171564 allocations in file stasis.c
117356836 bytes in all allocations

31697870 bytes in 51566 allocations in file chan_sip.c
28899257 bytes in 222323 allocations in file stasis.c
138646601 bytes in all allocations

This shows that the memory leak is actually in stasis.c. This explains why the issue isn't present in Asterisk 11.25.3.

By: Joshua C. Colp (jcolp) 2018-02-02 18:50:36.870-0600

That doesn't mean the memory leak is in stasis.c. Things can be allocated there and will show up as stasis.c, while the problem can be the lifetime management of it by the module (such as chan_sip). You'd want to use the CLI command which shows you the actual functions that allocated things to determine what was allocated.
By: Gregory Massel (gmza) 2018-02-13 15:46:48.746-0600

My memory detail would indicate that this may be the same issue as ASTERISK-27656, so I'm installing 13.19.1 now and will re-test over the next couple of days to see whether the fix for ASTERISK-27656 helps solve this.

By: Gregory Massel (gmza) 2018-02-13 16:43:04.315-0600

No, the issue persists with Asterisk 13.19.1. It is easily repeatable. In sip.conf, I #include "sip-peers.inc" (which contains ~8000 peers). If I comment out the #include and issue "sip reload", memory is freed; however, if I uncomment it and "sip reload", MORE memory is in use than at first. If I comment it again and "sip reload", memory is freed, but still more memory is in use relative to the previous time it was commented. Each "sip reload" will leak memory, provided that peers are removed and re-added. I've done a "memory show allocations"; however, the output is over 700,000 lines. So I did this after each reload and then diff'ed the output when the peers were removed.
The repeat offenders appear to be:

cache_entry_create() line 201 of stasis_cache.c
cache_entry_create() line 196 of stasis_cache.c
subscription_change_alloc() line 962 of stasis.c
subscription_change_alloc() line 963 of stasis.c

By: Gregory Massel (gmza) 2018-02-13 16:55:39.055-0600

On one reload I got the following:

ctnsip1*CLI> sip reload
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x169b620
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x169b620 (0)
Got 15 backtrace records
#0: [0x45b122] /usr/sbin/asterisk() [0x45b122]
#1: [0x5c68ad] /usr/sbin/asterisk() [0x5c68ad]
#2: [0x5c6d6f] /usr/sbin/asterisk() [0x5c6d6f]
#3: [0x5c129a] /usr/sbin/asterisk() [0x5c129a]
#4: [0x5c177b] /usr/sbin/asterisk() [0x5c177b]
#5: [0x5c2888] /usr/sbin/asterisk(stasis_subscribe_pool+0x138) [0x5c2888]
#6: [0x7f1029a2d6c7] /usr/lib/asterisk/modules/chan_sip.so(+0x716c7) [0x7f1029a2d6c7]
#7: [0x7f1029a336d0] /usr/lib/asterisk/modules/chan_sip.so(+0x776d0) [0x7f1029a336d0]
#8: [0x7f1029a3846a] /usr/lib/asterisk/modules/chan_sip.so(+0x7c46a) [0x7f1029a3846a]
#9: [0x5edffd] /usr/sbin/asterisk() [0x5edffd]
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x169b620
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x169b620 (0)
Got 15 backtrace records
#0: [0x45b122] /usr/sbin/asterisk() [0x45b122]
#1: [0x5ce750] /usr/sbin/asterisk(stasis_message_create+0x70) [0x5ce750]
#2: [0x5c250e] /usr/sbin/asterisk() [0x5c250e]
#3: [0x5c7c40] /usr/sbin/asterisk(stasis_caching_topic_create+0xb0) [0x5c7c40]
#4: [0x5c7f37] /usr/sbin/asterisk(stasis_cp_single_create+0x47) [0x5c7f37]
#5: [0x504229] /usr/sbin/asterisk() [0x504229]
#6: [0x7f1029a2e408] /usr/lib/asterisk/modules/chan_sip.so(+0x72408) [0x7f1029a2e408]
#7: [0x7f1029a336d0] /usr/lib/asterisk/modules/chan_sip.so(+0x776d0) [0x7f1029a336d0]
#8: [0x7f1029a3846a] /usr/lib/asterisk/modules/chan_sip.so(+0x7c46a) [0x7f1029a3846a]
#9: [0x5edffd] /usr/sbin/asterisk() [0x5edffd]
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x169b620
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x169b620 (0)
Got 14 backtrace records
#0: [0x45b122] /usr/sbin/asterisk() [0x45b122]
#1: [0x5ce750] /usr/sbin/asterisk(stasis_message_create+0x70) [0x5ce750]
#2: [0x5c2873] /usr/sbin/asterisk(stasis_subscribe_pool+0x123) [0x5c2873]
#3: [0x5cec02] /usr/sbin/asterisk(stasis_message_router_create_pool+0x62) [0x5cec02]
#4: [0x504246] /usr/sbin/asterisk() [0x504246]
#5: [0x7f1029a2e408] /usr/lib/asterisk/modules/chan_sip.so(+0x72408) [0x7f1029a2e408]
#6: [0x7f1029a336d0] /usr/lib/asterisk/modules/chan_sip.so(+0x776d0) [0x7f1029a336d0]
#7: [0x7f1029a3846a] /usr/lib/asterisk/modules/chan_sip.so(+0x7c46a) [0x7f1029a3846a]
#8: [0x5edffd] /usr/sbin/asterisk() [0x5edffd]
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x169b620
[Feb 14 00:15:05] ERROR[21685]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x169b620 (0)
Got 12 backtrace records
#0: [0x45b122] /usr/sbin/asterisk() [0x45b122]
#1: [0x5ce750] /usr/sbin/asterisk(stasis_message_create+0x70) [0x5ce750]
#2: [0x5c2873] /usr/sbin/asterisk(stasis_subscribe_pool+0x123) [0x5c2873]
#3: [0x7f1029a2d6c7] /usr/lib/asterisk/modules/chan_sip.so(+0x716c7) [0x7f1029a2d6c7]
#4: [0x7f1029a336d0] /usr/lib/asterisk/modules/chan_sip.so(+0x776d0) [0x7f1029a336d0]
#5: [0x7f1029a3846a] /usr/lib/asterisk/modules/chan_sip.so(+0x7c46a) [0x7f1029a3846a]
#6: [0x5edffd] /usr/sbin/asterisk() [0x5edffd]

By: Richard Mudgett (rmudgett) 2018-02-13 17:44:40.959-0600

If you enable BETTER_BACKTRACES and
DONT_OPTIMIZE then the FRACK backtraces could be more helpful to you. There is only one place in chan_sip that calls stasis_subscribe_pool() and that is add_peer_mwi_subs(). The large number of leaked items you are seeing in stasis.c and stasis_cache.c are controlled by code in chan_sip.c. It is chan_sip that is leaking the controlling references.

By: Gregory Massel (gmza) 2018-03-26 16:44:03.810-0500

That makes sense. If I comment out all the "mailbox=xxx" lines in sip.conf, it would appear to solve the memory leak. Of course that breaks all the MWI, but it does confirm what you're saying. Unfortunately it's beyond my abilities to work out a fix for this, and moving to pjsip has been tricky because it still lacks feature parity. I am therefore offering a bug bounty of $800 for a fix by 12 April 2018.

By: Gregory Massel (gmza) 2018-06-20 23:56:25.472-0500

Memory leaks aside, I've edited chan_sip.c to log more information about the time elapsing between various calls within sip_do_reload():

unlink_marked_peers_from_tables(); - 0 seconds
sip_poke_all_peers(); - 5 seconds if qualify=on; 0 seconds if qualify=off
sip_keepalive_all_peers(); - 4 seconds
sip_send_all_registers(); - 0 seconds (note: I have no outbound registrations configured)
sip_send_all_mwi_subscriptions(); - 0 seconds (note: I have no outbound MWI subscriptions configured)

The BIG culprit here is sip_keepalive_all_peers(): it takes a huge amount of time, even though I have "keepalive=no" in my sip.conf. The only way to get the reload under control is to comment out sip_keepalive_all_peers() from sip_do_reload(). sip_poke_all_peers() is also terribly slow; however, at least it can be controlled by configuring "qualify=no". Looking at the code for both sip_poke_all_peers() and sip_keepalive_all_peers(), it doesn't appear to have changed from 11.25.3 to 13.21.1, and the two functions are very similar. This makes me think that AST_SCHED_REPLACE_UNREF() is where the delay is coming in.
Whilst here, sip_keepalive_all_peers() makes reference to "poke peer ref" instead of "keepalive peer ref" (should be harmless, but nonetheless wrong), and it seems to try to schedule keepalives for all peers immediately (0ms into the future), which will result in a flood. Almost all the other related code (e.g. sip_poke_all_peers, sip_send_all_registers) staggers the scheduling to avoid flooding. This is all ancillary, though, as, even if keepalives are disabled, sip_keepalive_all_peers() still iterates through every peer and manages to hang chan_sip for a number of seconds.

By: Gregory Massel (gmza) 2018-06-21 15:25:47.611-0500

Also worth noting that within AST_SCHED_REPLACE_UNREF() (include/asterisk/sched.h) is a call to ast_sched_find_data() (main/sched.c), which changed between Asterisk 12.8.2 and 13.0.0. It seems to be using a heap (wrapped in ast_mutex_lock) in 13 versus a hashtab (with its own locking) in 12.

By: Steve Sether (stevesether) 2020-04-01 16:20:29.306-0500

We've recently found this same issue, or something extremely similar, on Asterisk 16.9.0. We also don't experience this issue on Asterisk 11. Essentially, when I do a sip reload, Asterisk stops processing SIP for some time. In production we had this occur for several minutes; I'm not sure why. I'm able to easily reproduce this in testing. Is there any way to mitigate this situation? Activity on this issue seems to have died about two years ago. I'd hate to go to pjsip at this point since, like the above poster, it requires some work, and we'd rather not switch to a new stack on top of all the other changes in 11->16.

By: Gregory Massel (gmza) 2020-04-04 10:35:35.331-0500

In response to Steve Sether's comment: I was unable to mitigate it adequately and, as more and more security issues with older versions of Asterisk materialised (and were not fixed, owing to the associated versions being end-of-life), I was forced to migrate to res_pjsip. A few suggestions re mitigation: 1.
Shut down Asterisk entirely and restart it routinely at a quiet time (e.g. 4am). This will limit the amount of memory leaked over time [and reduce reload times].
2. Disable MWI (mailbox=xxx) if you can afford to lose the functionality.
3. Disable all NAT keepalives (keepalive=no) and qualification (qualify=no). [Note: this can cause profound issues if many of your endpoints are behind NAT and don't generate keepalives themselves.]
4. Disable any presence (allowsubscribe=no).
5. If you're making frequent changes to sip.conf, keep reloads to a minimum. E.g. rather than reloading ("sip reload") every time a change is made to sip.conf, trigger reloads via a cron job that runs every 15/30/60 minutes and issues a "sip reload" only if the modification timestamp of sip.conf has changed.

In the long term, however, there are numerous reasons aside from this issue to consider migrating to res_pjsip. One of the biggest problems with chan_sip is that it cannot deal with high volumes of registration requests (max. ~50 per second even on the best hardware), which means that even one buggy SIP stack (or malicious user) can overwhelm it. There are tons of Huawei fibre and LTE CPE routers that, if they fail to register (e.g. wrong password or SIP account disabled), retry instantly without any delay, generating hundreds of registrations per second. A single one of these, on its own, can overwhelm chan_sip, let alone a few hundred endpoints or someone with malicious intent using a tool like SIPp. Sure, you can try to mitigate this by using iptables to rate-limit the volume of REGISTER packets per IP; however, even then chan_sip is extremely easy to flood. By comparison, res_pjsip (on recent versions of Asterisk), on reasonably modest hardware, can deal with well over 2,000 registrations per second and won't try to write the same contact information over and over again if it hasn't changed (the sqlite3 AstDB is the weakest link with both SIP stacks).
Coupled with iptables rate limiting (to manage the really malicious endpoints), it's much more difficult to overwhelm res_pjsip. With the above said, res_pjsip will scale FAR better than chan_sip (think 10,000 active endpoints, all registering, NAT keepalives, presence, etc.), but it has its limitations as well. If you're facing an endpoint population of more than 10,000, it may be worth putting the effort into fronting Asterisk with OpenSIPS or Kamailio. If considering this, keep in mind that res_pjsip works much better with proxies (it can publish presence to them, match on SIP headers, etc.), whereas chan_sip isn't as well suited. Migrating to res_pjsip wasn't trivial, but it also wasn't anywhere near as complex as I'd thought it would be. I was able to write scripts and use tools like 'sed' to bulk-modify existing configurations. Your bigger issue now is that, having left it so late, if you upgrade from Asterisk 11 to 16, so many other things have changed that you may spend more time planning around deprecated functions, changed functions and channel variables. From a security and stability perspective, though, it's probably worth the effort of migrating to pjsip and Asterisk 16, particularly if you're dealing with a few thousand endpoints.

By: Steve Sether (stevesether) 2020-04-06 09:59:30.660-0500

Thanks for your advice. We've come to similar conclusions ourselves and are in the process of migrating to pjsip, largely because chan_sip is unsupported by Digium and two years have gone by without what is a major bug being fixed. We already have a proxy solution in place, so the other problems with chan_sip aren't so much of an issue.

By: Andreas Schuler (schulerdev) 2022-07-08 20:00:08.996-0500

Hello to all the other poor devils that have the same problem. With the excellent work of Gregory Massel it was relatively easy to find a solution.
One of the developers removed the hash map that speeds up searching through the scheduler list, to tweak it for 'performance reasons' :D This is the commit: https://github.com/asterisk/asterisk/commit/6bdf2c4eab6524f85fa2245b9d796f5f4c46911a I patched it back in with some small cosmetic changes and now 'sip reload' works like a charm. Have fun! |