Summary: | ASTERISK-27985: PJSIP: Does not respond to INVITES when any taskprocessor queue length is > high water level | ||
Reporter: | xrobau (xrobau) | Labels: | pjsip |
Date Opened: | 2018-07-24 22:16:00 | Date Closed: | 2018-07-26 04:06:13 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | CEL/cel_odbc |
Versions: | 13.22.0 15.4.1 | Frequency of Occurrence | Constant |
Related Issues: | |||
Environment: | FreePBX 14, SNG7 | Attachments: | ( 0) pjsip-ignored.txt |
Description: | We've had sporadic reports of AGIs causing audio glitches on heavy systems (which we believe is related to ASTERISK-26257 but is NOT limited to confbridge, MOH and Playback/Background is also affected).
{code} exten => 998,1,Answer same => n,Dial(Local/999@from-internal-custom/n,300,gm(default)) same => n,Playback(beep) same => n,Goto(1) exten => 999,1,Set(COUNT=0) same => n(loop),GotoIf($[ ${COUNT} > 1000 ]?toomany) same => n,Gosub(testtrigger) same => n,Set(COUNT=$[ ${COUNT} + 1 ]) same => n,Goto(loop) same => n(toomany),Hangup same => n(testtrigger),Dial(SIP/8675309@127.0.0.1:5160,1,g) same => n,Return {code} The above dialplan simulates a call entering the 'from-pstn' context of FreePBX, which then does - after the standard dialplan - a Hangup. This causes Asterisk to create about 20-30 calls per second (but does not trigger the audio issue, unfortunately). It does however trigger a new issue where PJSIP channel INVITES are ignored until the cel_aggregation_topic queue is empty: {code} Processor Processed In Queue Max Depth Low water High water ... skipped lines ... subm:cdr_engine-00000003 5780891 0 2347 4500 5000 subm:cel_aggregation_topic-00000006 4968215 786689 2961396 2700 3000 subm:endpoint_topic_all-cached-00000008 794 0 13 450 500 subm:endpoint_topic_all-cached-0000005a 778 0 13 450 500 subm:manager_topic-00000007 5962603 0 344 2700 3000 {code} While the queue is being processed, pjsip calls are ignored. chan_sip calls are answered and processed correctly, increasing the queue size. | ||
Comments: | By: Asterisk Team (asteriskteam) 2018-07-24 22:16:02.099-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: xrobau (xrobau) 2018-07-24 22:17:59.277-0500 Note in the attachment the INVITE is ignored, but the CANCEL is responded to with a 'Call/Transaction Does Not Exist' error. By: xrobau (xrobau) 2018-07-24 22:23:19.386-0500 Asterisk 15.4.1 also shows the same symptoms. At the time of this command, a PJSIP call was trying to be established {code} Asterisk 15.4.1, Copyright (C) 1999 - 2016, Digium, Inc. and others. Created by Mark Spencer <markster@digium.com> Asterisk comes with ABSOLUTELY NO WARRANTY; type 'core show warranty' for details. This is free software, with components licensed under the GNU General Public License version 2 and other licenses; you are welcome to redistribute it under certain conditions. Type 'core show license' for details. ========================================================================= Connected to Asterisk 15.4.1 currently running on freepbx (pid = 14914) freepbx*CLI> core show taskprocessors Processor Processed In Queue Max Depth Low water High water app_voicemail 0 0 0 450 500 ast_msg_queue 0 0 0 450 500 CCSS_core 0 0 0 450 500 dns_system_resolver_tp 0 0 0 450 500 hep_queue_tp 0 0 0 450 500 iax2_transmit 0 0 0 450 500 pjsip/default-0000000c 8 0 1 450 500 pjsip/default-0000000d 8 0 1 450 500 pjsip/default-0000000e 6 0 1 450 500 pjsip/default-0000000f 6 0 1 450 500 pjsip/default-00000010 7 0 1 450 500 pjsip/default-00000011 7 0 1 450 500 pjsip/default-00000012 7 0 1 450 500 pjsip/default-00000013 7 0 1 450 500 pjsip/distributor-00000020 2 0 1 450 500 pjsip/distributor-00000021 2 0 1 450 500 pjsip/distributor-00000022 0 0 0 450 500 pjsip/distributor-00000023 4 0 1 450 500 pjsip/distributor-00000024 2 0 1 450 500 pjsip/distributor-00000025 0 0 0 450 500 pjsip/distributor-00000026 2 0 1 450 500 pjsip/distributor-00000027 2 0 1 450 500 pjsip/distributor-00000028 4 0 1 450 500 pjsip/distributor-00000029 0 0 0 450 500 pjsip/distributor-0000002a 0 0 0 450 500 pjsip/distributor-0000002b 6 0 1 450 500 pjsip/distributor-0000002c 0 0 0 450 500 pjsip/distributor-0000002d 2 0 1 450 500 pjsip/distributor-0000002e 4 0 1 450 500 pjsip/distributor-0000002f 2 0 1 450 500 pjsip/distributor-00000030 2 0 1 450 500 pjsip/distributor-00000031 0 0 0 450 500 pjsip/distributor-00000032 4 0 1 450 500 pjsip/distributor-00000033 0 0 0 450 500 pjsip/distributor-00000034 0 0 0 450 500 pjsip/distributor-00000035 8 0 2 450 500 pjsip/distributor-00000036 4 0 1 450 500 pjsip/distributor-00000037 0 0 0 450 500 pjsip/distributor-00000038 2 0 1 450 500 pjsip/distributor-00000039 4 0 1 450 500 pjsip/distributor-0000003a 2 0 1 450 500 pjsip/distributor-0000003b 2 0 1 450 500 pjsip/distributor-0000003c 0 0 0 450 500 pjsip/distributor-0000003d 0 0 0 450 500 pjsip/distributor-0000003e 4 0 1 450 500 pjsip/exten_state 0 0 0 450 500 pjsip/messaging 0 0 0 450 500 pjsip/mwi-00000046 0 0 0 450 500 pjsip/mwi-00000047 0 0 0 450 500 pjsip/mwi-00000048 0 0 0 450 500 pjsip/mwi-00000049 0 0 0 450 500 pjsip/mwi-0000004a 0 0 0 450 500 pjsip/mwi-0000004b 0 0 0 450 500 pjsip/mwi-0000004c 0 0 0 450 500 pjsip/mwi-0000004d 0 0 0 450 500 SIP 119 0 2 450 500 SIP-control 243 0 2 450 500 Sorcery 47 0 2 450 500 Sorcery-control 96 0 2 450 500 sorcery/acl-00000053 0 0 0 450 500 sorcery/aor-00000019 2 0 1 450 500 sorcery/asterisk-publication-0000004f 0 0 0 450 500 sorcery/auth-00000014 1 0 1 450 500 sorcery/bucket-00000000 0 0 0 450 500 sorcery/contact-00000018 2 0 1 1350 1500 sorcery/contact_status-0000001a 31 0 1 1350 1500 sorcery/domain_alias-0000001b 0 0 0 450 500 sorcery/endpoint-00000015 2 0 1 450 500 sorcery/file-00000001 0 0 0 450 500 sorcery/global-0000001c 6 0 1 450 500 sorcery/identify-00000045 0 0 0 450 500 sorcery/inbound-publication-00000042 0 0 0 450 500 sorcery/log_mappings-0000000a 0 0 0 450 500 sorcery/nat_hook-00000016 0 0 0 450 500 sorcery/opus-00000009 0 0 0 450 500 sorcery/outbound-publish-0000003f 0 0 0 450 500 sorcery/registration-00000054 3 0 1 450 500 sorcery/resource_list-00000041 0 0 0 450 500 sorcery/subscription_persistence-00000040 0 0 0 450 500 sorcery/system-0000000b 0 0 0 450 500 sorcery/transport-00000017 0 0 0 450 500 stasis-core 1527 0 3 450 500 stasis-core-control 3066 0 4 450 500 subm:ast_bridge_topic_all-cached-00000056 1 0 1 450 500 subm:ast_channel_topic_all-0000005a 196633 0 136 450 500 subm:ast_channel_topic_all-00000060 196632 0 132 450 500 subm:ast_channel_topic_all-cached-00000057 307924 0 242 450 500 subm:ast_channel_topic_all-cached-00000058 307923 0 223 450 500 subm:ast_channel_topic_all-cached-00000059 307922 0 209 450 500 subm:ast_device_state_topic-00000002 11024 0 5 450 500 subm:ast_device_state_topic-00000004 11023 0 8 450 500 subm:ast_device_state_topic-0000005f 11008 0 8 450 500 subm:ast_parking-00000044 1 0 1 450 500 subm:ast_presence_state_topic_all-00000005 1 0 1 450 500 subm:ast_security-0000005d 144 0 1 450 500 subm:ast_system-00000050 40 0 4 450 500 subm:ast_system-00000051 39 0 3 450 500 subm:ast_system-00000055 7 0 1 450 500 subm:ast_system-00000061 1 0 1 450 500 subm:cdr_engine-00000003 309307 0 2126 4500 5000 subm:cel_aggregation_topic-00000006 216908 91018 182802 2700 3000 subm:endpoint_topic_all-cached-00000008 1472 0 17 450 500 subm:endpoint_topic_all-cached-0000005b 1450 0 13 450 500 subm:manager_topic-00000007 319136 0 368 2700 3000 subm:stasis_mwi_topic-0000005c 2 0 1 450 500 subm:stasis_mwi_topic-0000005e 1 0 1 450 500 subp:PJSIP/500-0000001d 1449 0 14 450 500 subp:PJSIP/anonymous-0000001e 3 0 1 450 500 subp:PJSIP/dpma_endpoint-0000001f 3 0 2 450 500 subp:SIP/501-00000052 4 0 1 450 500 110 taskprocessors {code} By: Richard Mudgett (rmudgett) 2018-07-25 07:12:36.966-0500 This is not a bug. It is intentional that chan_pjsip stop accepting new work such as incoming calls when *any* task processor queue depth goes over the high water trigger limit. This condition remains in effect until all queues go below the low water limit. Because chan_pjsip is multi-threaded, it is the *only* defense mechanism that chan_pjsip has against an overloaded system. If chan_pjsip does not do this then incoming requests for new calls could pile up work in the queues until the system ran out of memory. In this test situation you have created dialplan that bypasses the feedback mechanism and floods the system with CEL events. If you leave it running you will eventually run out of memory. By: xrobau (xrobau) 2018-07-25 15:34:01.664-0500 > If chan_pjsip does not do this then incoming requests for new calls could pile up work in the queues until the system ran out of memory. I understand your reasoning there, but, that means that something as simple as a large number of listeners to AMI Events can cause PJSIP to lock up. Here's a production customer machine (that is suffering from the audio issue that i'm trying to replicate) {code} subm:manager_topic-00000006 8537986 0 5449 2700 3000 {code} That has breached the high water mark at some time in the last 12 hours (last restart), which means that pjsip was effectively frozen for a period of time. According to https://wiki.asterisk.org/wiki/display/AST/Asterisk+13+Configuration_stasis it doesn't look like that's a configurable parameter - but this machine isn't doing anything unusual APART from servicing a large number of endpoints (which were recently moved FROM PJSIP to chan_sip because of strange happenings). By: Richard Mudgett (rmudgett) 2018-07-25 16:50:49.719-0500 To be clear: PJSIP is not frozen nor is it locked up while the system is in a taskprocessor alert condition because a queue reached the high water mark. *Existing* calls will continue to respond to SIP messages. It will not accept incoming *new* work such as new calls at this time. When all taskprocessor queues go below the low water mark then the taskprocessor alert condition is cleared and PJSIP resumes normal operation. The taskprocessor alert condition cannot affect audio as it only discards incoming SIP messages that are not associated with existing dialogs or transactions. This is *not* the cause of the audio issue you are trying to replicate. I have seen the subm:manager_topic hit the high water mark in load testing where it had a much higher queue length than what you show. It clears quickly (within a second) when the load is removed. The subm:cel_aggregation_topic also clears quickly when you stop flooding it. However, these two specific topics could take upwards of a minute if the queue length is around a million. Queues give elasticity to the system to handle activity bursts. If the activity doesn't let up the system will fall over. FYI: ASTERISK-26806 Recently went out in a release which greatly speeds up PJSIP when there are a lot of endpoints in the system. By: xrobau (xrobau) 2018-07-25 16:55:49.258-0500 After thinking about this a bit more, I am 100% on board with the concept, but, I would like (and this is probably a feature request) to be able to configure the hard-coded numbers, or even add a 'Machine Size' multiplier in stasis.conf - or wherever you guys think it is appropriate. For example, something running on a RPi would be massively overloaded way before a modern 32 core machine is, but the hard-coded numbers aren't able to be configured to represent that (or are they? I couldn't find any place, but I only RTFM'ed, I didn't look at the code). However, there's no extremely visible notification of this to the end user. I would expect that asterisk should be syslog'ing at a crit or emerg level that it's overloaded and stuff is broken, so that it can be picked up and fixed/managed/whatever. At the moment, there's only a single line about taskprocessors in the log, which is super easy to miss. By: xrobau (xrobau) 2018-07-25 17:49:43.593-0500 > This is not the cause of the audio issue you are trying to replicate. Just to confirm, yes. This has nothing to do with the audio issue. Would you like me to create a new ticket about that, to braindump what I currently know, or, would you prefer me to comment on ASTERISK-26257? (Basically, asterisk freezes briefly when AGI() is called, causing audio glitches, but I'm having great difficulty actually getting a machine into that state artificially) By: Richard Mudgett (rmudgett) 2018-07-25 18:32:08.508-0500 Allowing fine tuning adjustments to stasis taskprocessors starts getting into the realm of the absurd in configurability. If you are able to tweak the levels you also have to consider how long the system may need to work off the excess tasks in the queue to get back to normal operation. If you subsequently upgrade to a new version you may have to re-tweak the levels. Also taskprocessors can come and go from release to release. There are internal routines available to set the levels on taskprocessors. Only select taskprocessors tend to need their levels adjusted from the default 450/500 levels. As you can see in the CLI "core show taskprocessors" output a few of them do have their levels adjusted by the code to higher levels. A couple patches were submitted by the community to adjust some taskprocessor levels by configuration. The app_voicemail taskprocessor is set by the tps_queue_high and tps_queue_low options in voicemail.conf. The pjsip/mwi taskprocessor levels are set by mwi_tps_queue_high and mwi_tps_queue_low in pjsip.conf. (I'm not sure if the submitter would still need to tweak the pjsip/mwi levels after ASTERISK-26806 came out in a release.) None of the other taskprocessor levels are currently user configurable. As far as the speed of the machine vs the hard-coded queue levels are concerned. I think the levels would be nearly the same among various machines anyway. A faster machine can put more tasks into the queue but it can also clear the queue faster. I think a more likely reason there could be different levels needed between installations is the what the installation is used for. (Different calling patterns, applications, etc.) As far as notifying the user, there is no "crit" or "emerg" log level defined. There is the WARNING message when a taskprocessor hits the high water trigger level. There is a DEBUG level 3 message when a taskprocessor drops to the low water trigger level after hitting the high water trigger. There is no message when all taskprocessors have dropped below the low water trigger levels. As far as missing the notification is concerned that's what scripts grepping the logs do best. At any rate, patches welcome. :) By: Richard Mudgett (rmudgett) 2018-07-25 18:53:43.943-0500 {quote} Just to confirm, yes. This has nothing to do with the audio issue. Would you like me to create a new ticket about that, to braindump what I currently know, or, would you prefer me to comment on ASTERISK-26257? (Basically, asterisk freezes briefly when AGI() is called, causing audio glitches, but I'm having great difficulty actually getting a machine into that state artificially) {quote} Yes create a new issue. You haven't provided much information about how and where the AGI is being used on this issue and this issue has turned into a discussion about PJSIP not responding to new INVITEs when it detects the system is overloaded. Don't hijack a long dead issue that is against a no longer supported version and closed with a system level work around. However, from the little you have mentioned about the AGI issue it sounds more like a system level problem than an Asterisk problem. AGI's are expensive as they spawn a new process to execute the script. |