[Home]

Summary:ASTERISK-27985: PJSIP: Does not respond to INVITES when any taskprocessor queue length is > high water level
Reporter:xrobau (xrobau)Labels:pjsip
Date Opened:2018-07-24 22:16:00Date Closed:2018-07-26 04:06:13
Priority:MajorRegression?
Status:Closed/CompleteComponents:CEL/cel_odbc
Versions:13.22.0 15.4.1 Frequency of
Occurrence
Constant
Related
Issues:
Environment:FreePBX 14, SNG7Attachments:( 0) pjsip-ignored.txt
Description:We've had sporadic reports of AGIs causing audio glitches on heavy systems (which we believe is related to ASTERISK-26257 but is NOT limited to confbridge, MOH and Playback/Background is also affected).

{code}
exten => 998,1,Answer
same => n,Dial(Local/999@from-internal-custom/n,300,gm(default))
same => n,Playback(beep)
same => n,Goto(1)

exten => 999,1,Set(COUNT=0)
same => n(loop),GotoIf($[ ${COUNT} > 1000 ]?toomany)
same => n,Gosub(testtrigger)
same => n,Set(COUNT=$[ ${COUNT} + 1 ])
same => n,Goto(loop)
same => n(toomany),Hangup
same => n(testtrigger),Dial(SIP/8675309@127.0.0.1:5160,1,g)
same => n,Return
{code}

The above dialplan simulates a call entering the 'from-pstn' context of FreePBX, which then does - after the standard dialplan - a Hangup.

This causes Asterisk to create about 20-30 calls per second (but does not trigger the audio issue, unfortunately).  It does however trigger a new issue where PJSIP channel INVITES are ignored until the cel_aggregation_topic queue is empty:

{code}
Processor                                      Processed   In Queue  Max Depth  Low water High water
... skipped lines ...
subm:cdr_engine-00000003                         5780891          0       2347       4500       5000
subm:cel_aggregation_topic-00000006              4968215     786689    2961396       2700       3000
subm:endpoint_topic_all-cached-00000008              794          0         13        450        500
subm:endpoint_topic_all-cached-0000005a              778          0         13        450        500
subm:manager_topic-00000007                      5962603          0        344       2700       3000
{code}

While the queue is being processed, pjsip calls are ignored.  chan_sip calls are answered and processed correctly, increasing the queue size.

Comments:By: Asterisk Team (asteriskteam) 2018-07-24 22:16:02.099-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: xrobau (xrobau) 2018-07-24 22:17:59.277-0500

Note in the attachment the INVITE is ignored, but the CANCEL is responded to with a 'Call/Transaction Does Not Exist' error.

By: xrobau (xrobau) 2018-07-24 22:23:19.386-0500

Asterisk 15.4.1 also shows the same symptoms.  At the time of this command, a PJSIP call was trying to be established

{code}
Asterisk 15.4.1, Copyright (C) 1999 - 2016, Digium, Inc. and others.
Created by Mark Spencer <markster@digium.com>
Asterisk comes with ABSOLUTELY NO WARRANTY; type 'core show warranty' for details.
This is free software, with components licensed under the GNU General Public
License version 2 and other licenses; you are welcome to redistribute it under
certain conditions. Type 'core show license' for details.
=========================================================================
Connected to Asterisk 15.4.1 currently running on freepbx (pid = 14914)
freepbx*CLI> core show taskprocessors

Processor                                      Processed   In Queue  Max Depth  Low water High water
app_voicemail                                          0          0          0        450        500
ast_msg_queue                                          0          0          0        450        500
CCSS_core                                              0          0          0        450        500
dns_system_resolver_tp                                 0          0          0        450        500
hep_queue_tp                                           0          0          0        450        500
iax2_transmit                                          0          0          0        450        500
pjsip/default-0000000c                                 8          0          1        450        500
pjsip/default-0000000d                                 8          0          1        450        500
pjsip/default-0000000e                                 6          0          1        450        500
pjsip/default-0000000f                                 6          0          1        450        500
pjsip/default-00000010                                 7          0          1        450        500
pjsip/default-00000011                                 7          0          1        450        500
pjsip/default-00000012                                 7          0          1        450        500
pjsip/default-00000013                                 7          0          1        450        500
pjsip/distributor-00000020                             2          0          1        450        500
pjsip/distributor-00000021                             2          0          1        450        500
pjsip/distributor-00000022                             0          0          0        450        500
pjsip/distributor-00000023                             4          0          1        450        500
pjsip/distributor-00000024                             2          0          1        450        500
pjsip/distributor-00000025                             0          0          0        450        500
pjsip/distributor-00000026                             2          0          1        450        500
pjsip/distributor-00000027                             2          0          1        450        500
pjsip/distributor-00000028                             4          0          1        450        500
pjsip/distributor-00000029                             0          0          0        450        500
pjsip/distributor-0000002a                             0          0          0        450        500
pjsip/distributor-0000002b                             6          0          1        450        500
pjsip/distributor-0000002c                             0          0          0        450        500
pjsip/distributor-0000002d                             2          0          1        450        500
pjsip/distributor-0000002e                             4          0          1        450        500
pjsip/distributor-0000002f                             2          0          1        450        500
pjsip/distributor-00000030                             2          0          1        450        500
pjsip/distributor-00000031                             0          0          0        450        500
pjsip/distributor-00000032                             4          0          1        450        500
pjsip/distributor-00000033                             0          0          0        450        500
pjsip/distributor-00000034                             0          0          0        450        500
pjsip/distributor-00000035                             8          0          2        450        500
pjsip/distributor-00000036                             4          0          1        450        500
pjsip/distributor-00000037                             0          0          0        450        500
pjsip/distributor-00000038                             2          0          1        450        500
pjsip/distributor-00000039                             4          0          1        450        500
pjsip/distributor-0000003a                             2          0          1        450        500
pjsip/distributor-0000003b                             2          0          1        450        500
pjsip/distributor-0000003c                             0          0          0        450        500
pjsip/distributor-0000003d                             0          0          0        450        500
pjsip/distributor-0000003e                             4          0          1        450        500
pjsip/exten_state                                      0          0          0        450        500
pjsip/messaging                                        0          0          0        450        500
pjsip/mwi-00000046                                     0          0          0        450        500
pjsip/mwi-00000047                                     0          0          0        450        500
pjsip/mwi-00000048                                     0          0          0        450        500
pjsip/mwi-00000049                                     0          0          0        450        500
pjsip/mwi-0000004a                                     0          0          0        450        500
pjsip/mwi-0000004b                                     0          0          0        450        500
pjsip/mwi-0000004c                                     0          0          0        450        500
pjsip/mwi-0000004d                                     0          0          0        450        500
SIP                                                  119          0          2        450        500
SIP-control                                          243          0          2        450        500
Sorcery                                               47          0          2        450        500
Sorcery-control                                       96          0          2        450        500
sorcery/acl-00000053                                   0          0          0        450        500
sorcery/aor-00000019                                   2          0          1        450        500
sorcery/asterisk-publication-0000004f                  0          0          0        450        500
sorcery/auth-00000014                                  1          0          1        450        500
sorcery/bucket-00000000                                0          0          0        450        500
sorcery/contact-00000018                               2          0          1       1350       1500
sorcery/contact_status-0000001a                       31          0          1       1350       1500
sorcery/domain_alias-0000001b                          0          0          0        450        500
sorcery/endpoint-00000015                              2          0          1        450        500
sorcery/file-00000001                                  0          0          0        450        500
sorcery/global-0000001c                                6          0          1        450        500
sorcery/identify-00000045                              0          0          0        450        500
sorcery/inbound-publication-00000042                   0          0          0        450        500
sorcery/log_mappings-0000000a                          0          0          0        450        500
sorcery/nat_hook-00000016                              0          0          0        450        500
sorcery/opus-00000009                                  0          0          0        450        500
sorcery/outbound-publish-0000003f                      0          0          0        450        500
sorcery/registration-00000054                          3          0          1        450        500
sorcery/resource_list-00000041                         0          0          0        450        500
sorcery/subscription_persistence-00000040              0          0          0        450        500
sorcery/system-0000000b                                0          0          0        450        500
sorcery/transport-00000017                             0          0          0        450        500
stasis-core                                         1527          0          3        450        500
stasis-core-control                                 3066          0          4        450        500
subm:ast_bridge_topic_all-cached-00000056              1          0          1        450        500
subm:ast_channel_topic_all-0000005a               196633          0        136        450        500
subm:ast_channel_topic_all-00000060               196632          0        132        450        500
subm:ast_channel_topic_all-cached-00000057        307924          0        242        450        500
subm:ast_channel_topic_all-cached-00000058        307923          0        223        450        500
subm:ast_channel_topic_all-cached-00000059        307922          0        209        450        500
subm:ast_device_state_topic-00000002               11024          0          5        450        500
subm:ast_device_state_topic-00000004               11023          0          8        450        500
subm:ast_device_state_topic-0000005f               11008          0          8        450        500
subm:ast_parking-00000044                              1          0          1        450        500
subm:ast_presence_state_topic_all-00000005             1          0          1        450        500
subm:ast_security-0000005d                           144          0          1        450        500
subm:ast_system-00000050                              40          0          4        450        500
subm:ast_system-00000051                              39          0          3        450        500
subm:ast_system-00000055                               7          0          1        450        500
subm:ast_system-00000061                               1          0          1        450        500
subm:cdr_engine-00000003                          309307          0       2126       4500       5000
subm:cel_aggregation_topic-00000006               216908      91018     182802       2700       3000
subm:endpoint_topic_all-cached-00000008             1472          0         17        450        500
subm:endpoint_topic_all-cached-0000005b             1450          0         13        450        500
subm:manager_topic-00000007                       319136          0        368       2700       3000
subm:stasis_mwi_topic-0000005c                         2          0          1        450        500
subm:stasis_mwi_topic-0000005e                         1          0          1        450        500
subp:PJSIP/500-0000001d                             1449          0         14        450        500
subp:PJSIP/anonymous-0000001e                          3          0          1        450        500
subp:PJSIP/dpma_endpoint-0000001f                      3          0          2        450        500
subp:SIP/501-00000052                                  4          0          1        450        500

110 taskprocessors
{code}

By: Richard Mudgett (rmudgett) 2018-07-25 07:12:36.966-0500

This is not a bug.  It is intentional that chan_pjsip stop accepting new work such as incoming calls when *any* task processor queue depth goes over the high water trigger limit.  This condition remains in effect until all queues go below the low water limit.  Because chan_pjsip is multi-threaded, it is the *only* defense mechanism that chan_pjsip has against an overloaded system.  If chan_pjsip does not do this then incoming requests for new calls could pile up work in the queues until the system ran out of memory.

In this test situation you have created dialplan that bypasses the feedback mechanism and floods the system with CEL events.  If you leave it running you will eventually run out of memory.

By: xrobau (xrobau) 2018-07-25 15:34:01.664-0500

>  If chan_pjsip does not do this then incoming requests for new calls could pile up work in the queues until the system ran out of memory.

I understand your reasoning there, but, that means that something as simple as a large number of listeners to AMI Events can cause PJSIP to lock up.  Here's a production customer machine (that is suffering from the audio issue that i'm trying to replicate)

{code}
subm:manager_topic-00000006                      8537986          0       5449       2700       3000
{code}

That has breached the high water mark at some time in the last 12 hours (last restart), which means that pjsip was effectively frozen for a period of time. According to https://wiki.asterisk.org/wiki/display/AST/Asterisk+13+Configuration_stasis it doesn't look like that's a configurable parameter - but this machine isn't doing anything unusual APART from servicing a large number of endpoints (which were recently moved FROM PJSIP to chan_sip because of strange happenings).



By: Richard Mudgett (rmudgett) 2018-07-25 16:50:49.719-0500

To be clear: PJSIP is not frozen nor is it locked up while the system is in a taskprocessor alert condition because a queue reached the high water mark.  *Existing* calls will continue to respond to SIP messages.  It will not accept incoming *new* work such as new calls at this time.  When all taskprocessor queues go below the low water mark then the taskprocessor alert condition is cleared and PJSIP resumes normal operation.  The taskprocessor alert condition cannot affect audio as it only discards incoming SIP messages that are not associated with existing dialogs or transactions.  This is *not* the cause of the audio issue you are trying to replicate.

I have seen the subm:manager_topic hit the high water mark in load testing where it had a much higher queue length than what you show.  It clears quickly (within a second) when the load is removed.  The subm:cel_aggregation_topic also clears quickly when you stop flooding it.  However, these two specific topics could take upwards of a minute if the queue length is around a million.  Queues give elasticity to the system to handle activity bursts.  If the activity doesn't let up the system will fall over.

FYI: ASTERISK-26806 Recently went out in a release which greatly speeds up PJSIP when there are a lot of endpoints in the system.

By: xrobau (xrobau) 2018-07-25 16:55:49.258-0500

After thinking about this a bit more, I am 100% on board with the concept, but, I would like (and this is probably a feature request) to be able to configure the hard-coded numbers, or even add a 'Machine Size' multiplier in stasis.conf - or wherever you guys think it is appropriate.

For example, something running on a RPi would be massively overloaded way before a modern 32 core machine is, but the hard-coded numbers aren't able to be configured to represent that (or are they? I couldn't find any place, but I only RTFM'ed, I didn't look at the code).

However, there's no extremely visible notification of this to the end user. I would expect that asterisk should be syslog'ing at a crit or emerg level that it's overloaded and stuff is broken, so that it can be picked up and fixed/managed/whatever.  At the moment, there's only a single line about taskprocessors in the log, which is super easy to miss.



By: xrobau (xrobau) 2018-07-25 17:49:43.593-0500

>  This is not the cause of the audio issue you are trying to replicate.

Just to confirm, yes. This has nothing to do with the audio issue. Would you like me to create a new ticket about that, to braindump what I currently know, or, would you prefer me to comment on ASTERISK-26257? (Basically, asterisk freezes briefly when AGI() is called, causing audio glitches, but I'm having great difficulty actually getting a machine into that state artificially)

By: Richard Mudgett (rmudgett) 2018-07-25 18:32:08.508-0500

Allowing fine tuning adjustments to stasis taskprocessors starts getting into the realm of the absurd in configurability.  If you are able to tweak the levels you also have to consider how long the system may need to work off the excess tasks in the queue to get back to normal operation.  If you subsequently upgrade to a new version you may have to re-tweak the levels.  Also taskprocessors can come and go from release to release.  There are internal routines available to set the levels on taskprocessors.  Only select taskprocessors tend to need their levels adjusted from the default 450/500 levels.  As you can see in the CLI "core show taskprocessors" output a few of them do have their levels adjusted by the code to higher levels.

A couple patches were submitted by the community to adjust some taskprocessor levels by configuration.  The app_voicemail taskprocessor is set by the tps_queue_high and tps_queue_low options in voicemail.conf.  The pjsip/mwi taskprocessor levels are set by mwi_tps_queue_high and mwi_tps_queue_low in pjsip.conf.  (I'm not sure if the submitter would still need to tweak the pjsip/mwi levels after ASTERISK-26806 came out in a release.)  None of the other taskprocessor levels are currently user configurable.

As far as the speed of the machine vs the hard-coded queue levels are concerned.  I think the levels would be nearly the same among various machines anyway.  A faster machine can put more tasks into the queue but it can also clear the queue faster.  I think a more likely reason there could be different levels needed between installations is the what the installation is used for.  (Different calling patterns, applications, etc.)

As far as notifying the user, there is no "crit" or "emerg" log level defined.  There is the WARNING message when a taskprocessor hits the high water trigger level.  There is a DEBUG level 3 message when a taskprocessor drops to the low water trigger level after hitting the high water trigger.  There is no message when all taskprocessors have dropped below the low water trigger levels.  As far as missing the notification is concerned that's what scripts grepping the logs do best.

At any rate, patches welcome. :)

By: Richard Mudgett (rmudgett) 2018-07-25 18:53:43.943-0500

{quote}
Just to confirm, yes. This has nothing to do with the audio issue. Would you like me to create a new ticket about that, to braindump what I currently know, or, would you prefer me to comment on ASTERISK-26257? (Basically, asterisk freezes briefly when AGI() is called, causing audio glitches, but I'm having great difficulty actually getting a machine into that state artificially)
{quote}

Yes create a new issue.  You haven't provided much information about how and where the AGI is being used on this issue and this issue has turned into a discussion about PJSIP not responding to new INVITEs when it detects the system is overloaded.  Don't hijack a long dead issue that is against a no longer supported version and closed with a system level work around.  However, from the little you have mentioned about the AGI issue it sounds more like a system level problem than an Asterisk problem.  AGI's are expensive as they spawn a new process to execute the script.