Summary: | ASTERISK-27821: SEGV in res_pjsip_mwi/stasis | ||
Reporter: | Purchasing Agent (hawkeye) | Labels: | pjsip |
Date Opened: | 2018-04-20 06:22:37 | Date Closed: | |
Priority: | Major | Regression? | |
Status: | Open/New | Components: | pjproject/pjsip |
Versions: | 13.19.1 | Frequency of Occurrence | Frequent |
Related Issues: | |||
Environment: | SNG7/FreePBX 14 with asterisk 13.19.1 | Attachments: | ( 0) backtrace.txt ( 1) backtrace.txt ( 2) backtrace-4-22-18.txt ( 3) backtrace-4-23-18-asterisk-13.19.1.txt ( 4) backtrace-threads.txt ( 5) core.host.domain.tld-2018-04-22T03-32-25-0400-full.txt ( 6) full.log ( 7) musiconhold_additional.conf ( 8) taskprocessors.txt |
Description: | Intermittently all PJSIP extensions loose registration. There are < 20 extensions and the server is not doing much
. Server has 3GB memory and 4 vCPUs In the evening our music on hold server shuts down until 8am. When it does cli> will spew non stop warnings because it cant get to the server: {noformat} WARNING[30830]: res_musiconhold.c:775 monmp3thread: poll() failed: Interrupted system call When the above happens the system load goes much higher. {noformat} Running core debug log it does display: {noformat} DEBUG[1469]: res_pjsip/pjsip_distributor.c:538 distributor: Taskprocessor overload alert: Ignoring 'Request msg SUBSCRIBE/cseq=1 (rdata0x7fcd5d5d9ed0)'. {noformat} After running # fwconsole restart, all PJSIP endpoints register. Ran the following after fwconsole restart {noformat} *CLI> core show taskprocessors Processor Processed In Queue Max Depth Low water High water app_voicemail 639 0 37 450 500 ast_msg_queue 0 0 0 450 500 CCSS_core 0 0 0 450 500 hep_queue_tp 0 0 0 450 500 iax2_transmit 0 0 0 450 500 pjsip/default-0000000c 132 0 1 450 500 pjsip/default-0000000d 133 0 1 450 500 pjsip/default-0000000e 134 0 1 450 500 pjsip/default-0000000f 133 0 1 450 500 pjsip/default-00000010 135 0 1 450 500 pjsip/default-00000011 136 0 1 450 500 pjsip/default-00000012 132 0 1 450 500 pjsip/default-00000013 128 0 1 450 500 pjsip/distributor-00000028 308 0 4 450 500 pjsip/distributor-00000029 151 0 3 450 500 pjsip/distributor-0000002a 214 0 4 450 500 pjsip/distributor-0000002b 201 0 3 450 500 pjsip/distributor-0000002c 170 0 4 450 500 pjsip/distributor-0000002d 533 0 8 450 500 pjsip/distributor-0000002e 233 0 3 450 500 pjsip/distributor-0000002f 159 0 4 450 500 pjsip/distributor-00000030 244 0 4 450 500 pjsip/distributor-00000031 255 0 5 450 500 pjsip/distributor-00000032 415 0 8 450 500 pjsip/distributor-00000033 369 0 8 450 500 pjsip/distributor-00000034 239 0 3 450 500 pjsip/distributor-00000035 157 0 6 450 500 pjsip/distributor-00000036 226 0 3 450 500 pjsip/distributor-00000037 228 0 3 450 500 pjsip/distributor-00000038 132 0 4 450 500 pjsip/distributor-00000039 203 0 3 450 500 pjsip/distributor-0000003a 123 0 2 450 500 pjsip/distributor-0000003b 271 0 5 450 500 pjsip/distributor-0000003c 196 0 4 450 500 pjsip/distributor-0000003d 159 0 3 450 500 pjsip/distributor-0000003e 291 0 5 450 500 pjsip/distributor-0000003f 207 0 6 450 500 pjsip/distributor-00000040 130 0 2 450 500 pjsip/distributor-00000041 144 0 4 450 500 pjsip/distributor-00000042 266 0 5 450 500 pjsip/distributor-00000043 235 0 8 450 500 pjsip/distributor-00000044 239 0 4 450 500 pjsip/distributor-00000045 457 0 6 450 500 pjsip/distributor-00000046 280 0 5 450 500 pjsip/messaging 0 0 0 450 500 pjsip/mwi-0000004f 324 0 3 450 500 pjsip/mwi-00000050 9 0 1 450 500 pjsip/mwi-00000051 1 0 1 450 500 pjsip/mwi-00000052 1 0 1 450 500 pjsip/mwi-00000053 1 0 1 450 500 pjsip/mwi-00000054 1 0 1 450 500 pjsip/mwi-00000055 1 0 1 450 500 pjsip/mwi-00000056 1 0 1 450 500 SIP 6732 0 21 450 500 SIP-control 13578 0 23 450 500 Sorcery 1064 0 2 450 500 Sorcery-control 2134 0 3 450 500 sorcery/acl-00000084 0 0 0 450 500 sorcery/aor-00000019 1 0 1 450 500 sorcery/asterisk-publication-0000004e 0 0 0 450 500 sorcery/auth-00000014 0 0 0 450 500 sorcery/bucket-00000000 0 0 0 450 500 sorcery/contact-00000018 167 0 3 1350 1500 sorcery/contact_status-0000001a 934 0 3 1350 1500 sorcery/domain_alias-0000001b 0 0 0 450 500 sorcery/endpoint-00000015 1 0 1 450 500 sorcery/file-00000001 0 0 0 450 500 sorcery/global-0000001c 5 0 1 450 500 sorcery/identify-0000004d 0 0 0 450 500 sorcery/inbound-publication-00000049 0 0 0 450 500 sorcery/log_mappings-0000000a 0 0 0 450 500 sorcery/nat_hook-00000016 0 0 0 450 500 sorcery/opus-00000009 0 0 0 450 500 sorcery/outbound-publish-0000004b 0 0 0 450 500 sorcery/phoneprov-00000081 0 0 0 450 500 sorcery/registration-00000082 1 0 1 450 500 sorcery/resource_list-00000048 0 0 0 450 500 sorcery/subscription_persistence-00000047 0 0 0 450 500 sorcery/system-0000000b 0 0 0 450 500 sorcery/transport-00000017 0 0 0 450 500 stasis-core 14755 0 47 450 500 stasis-core-control 29765 0 136 450 500 subm:ast_bridge_topic_all-cached-00000086 1 0 1 450 500 subm:ast_channel_topic_all-00000089 2 0 1 450 500 subm:ast_channel_topic_all-0000008f 1 0 1 450 500 subm:ast_channel_topic_all-cached-00000085 5 0 2 450 500 subm:ast_channel_topic_all-cached-00000087 4 0 2 450 500 subm:ast_channel_topic_all-cached-00000088 3 0 2 450 500 subm:ast_device_state_topic-00000002 1233 0 4 450 500 subm:ast_device_state_topic-00000004 1232 0 16 450 500 subm:ast_device_state_topic-0000008e 835 0 2 450 500 subm:ast_parking-0000004c 1 0 1 450 500 subm:ast_presence_state_topic_all-00000005 4 0 3 450 500 subm:ast_security-0000008a 1160 0 6 450 500 subm:ast_system-00000060 37 0 1 450 500 subm:ast_system-00000080 36 0 3 450 500 subm:ast_system-00000083 4 0 1 450 500 subm:ast_system-00000090 1 0 1 450 500 subm:cdr_engine-00000003 8 0 1 4500 5000 subm:cel_aggregation_topic-00000006 8 0 1 2700 3000 subm:endpoint_topic_all-cached-00000008 533 0 10 450 500 subm:endpoint_topic_all-cached-0000008c 381 0 4 450 500 subm:manager_topic-00000007 3283 0 6 2700 3000 subm:stasis_mwi_topic-0000008b 922 0 6 450 500 subm:stasis_mwi_topic-0000008d 916 0 3 450 500 subp:192@default-00000062 2 0 1 450 500 subp:260@device-000001a4 11 0 1 450 500 subp:260@device-000001a5 10 0 1 450 500 subp:260@device-000001ae 7 0 1 450 500 subp:260@device-000001c0 5 0 1 450 500 subp:270@device-000000bd 60 0 2 450 500 subp:270@device-000000c2 57 0 2 450 500 subp:270@device-000000c4 56 0 2 450 500 subp:270@device-000000c6 55 0 2 450 500 subp:270@device-000000c7 54 0 2 450 500 subp:270@device-000000ca 53 0 2 450 500 subp:270@device-000000da 52 0 2 450 500 subp:270@device-000000dc 22 0 2 450 500 subp:270@device-000000f1 21 0 2 450 500 subp:270@device-000000f2 20 0 2 450 500 subp:270@device-000000f3 19 0 2 450 500 subp:270@device-000000f5 18 0 2 450 500 subp:270@device-000000f8 17 0 2 450 500 subp:270@device-00000103 16 0 2 450 500 subp:270@device-00000141 15 0 2 450 500 subp:270@device-00000142 14 0 2 450 500 subp:270@device-00000143 13 0 2 450 500 subp:270@device-00000144 12 0 2 450 500 subp:270@device-00000145 11 0 2 450 500 subp:270@device-00000146 10 0 2 450 500 subp:270@device-00000150 9 0 2 450 500 subp:270@device-000001a8 8 0 2 450 500 subp:270@device-000001aa 7 0 2 450 500 subp:270@device-000001ab 5 0 1 450 500 subp:270@device-000001ac 6 0 2 450 500 subp:270@device-000001ad 5 0 1 450 500 subp:270@device-000001b1 3 0 1 450 500 subp:270@device-000001b2 2 0 1 450 500 subp:270@device-000001c4 1 0 1 450 500 subp:290@device-000001bb 48 0 2 450 500 subp:290@device-000001bc 46 0 1 450 500 subp:290@device-000001bd 45 0 1 450 500 subp:290@device-000001be 44 0 1 450 500 subp:290@device-000001c1 43 0 1 450 500 subp:290@device-000001c3 42 0 1 450 500 subp:290@device-000001c5 41 0 1 450 500 subp:290@device-000001c6 41 0 1 450 500 subp:290@device-000001d8 20 0 1 450 500 subp:290@device-000001da 18 0 1 450 500 subp:290@device-000001db 16 0 1 450 500 subp:290@device-000001dc 15 0 1 450 500 subp:290@device-000001dd 14 0 1 450 500 subp:290@device-000001de 13 0 1 450 500 subp:290@device-000001df 12 0 1 450 500 subp:290@device-000001e0 11 0 1 450 500 subp:290@device-000001e1 10 0 1 450 500 subp:290@device-000001e2 4 0 1 450 500 subp:301@default-00000064 2 0 1 450 500 subp:306@device-00000057 1 0 1 450 500 subp:308@device-000001d1 61 0 2 450 500 subp:308@device-000001d2 60 0 2 450 500 subp:308@device-000001d3 60 0 2 450 500 subp:308@device-000001d4 60 0 2 450 500 subp:308@device-000001d5 56 0 1 450 500 subp:308@device-000001d6 50 0 1 450 500 subp:308@device-000001d7 24 0 1 450 500 subp:308@device-000001d9 22 0 1 450 500 subp:308@device-000001e3 20 0 1 450 500 subp:308@device-000001e4 18 0 1 450 500 subp:308@device-000001e5 16 0 1 450 500 subp:308@device-000001e6 15 0 1 450 500 subp:308@device-000001e7 14 0 1 450 500 subp:308@device-000001e8 13 0 1 450 500 subp:308@device-000001e9 12 0 1 450 500 subp:308@device-000001ea 7 0 1 450 500 subp:308@device-000001eb 4 0 1 450 500 subp:308@device-000001ec 2 0 1 450 500 subp:310@default-00000066 2 0 1 450 500 subp:312@default-00000068 2 0 1 450 500 subp:315@device-0000005e 2 0 1 450 500 subp:370@device-000001a6 18 0 1 450 500 subp:370@device-000001a7 17 0 1 450 500 subp:370@device-000001a9 16 0 1 450 500 subp:370@device-000001af 15 0 1 450 500 subp:370@device-000001b0 14 0 1 450 500 subp:370@device-000001bf 8 0 1 450 500 subp:651@device-0000005c 1 0 1 450 500 subp:90321@device-0000006b 1 0 1 450 500 subp:PJSIP/125-00000020 7 0 3 450 500 subp:PJSIP/260-0000001e 35 0 3 450 500 subp:PJSIP/270-0000001f 36 0 2 450 500 subp:PJSIP/290-0000001d 98 0 3 450 500 subp:PJSIP/306-00000021 9 0 3 450 500 subp:PJSIP/308-00000026 69 0 3 450 500 subp:PJSIP/315-00000022 3 0 3 450 500 subp:PJSIP/360-00000023 7 0 2 450 500 subp:PJSIP/370-00000027 57 0 3 450 500 subp:PJSIP/650-00000024 14 0 3 450 500 subp:PJSIP/651-00000025 3 0 3 450 500 subp:SIP/192-00000061 2 0 2 450 500 subp:SIP/301-00000063 2 0 1 450 500 subp:SIP/310-00000065 2 0 2 450 500 subp:SIP/312-00000067 2 0 2 450 500 subp:SIP/321-00000069 2 0 1 450 500 subp:SIP/90321-0000006a 2 0 1 450 500 subp:SIP/bp-0000006c 4 0 1 450 500 subp:SIP/bp2-0000006d 4 0 2 450 500 subp:SIP/bp3-0000006e 4 0 2 450 500 subp:SIP/client1-0000006f 4 0 2 450 500 subp:SIP/client2-00000070 4 0 2 450 500 subp:SIP/client3-00000071 4 0 2 450 500 subp:SIP/itsp1-00000072 4 0 2 450 500 subp:SIP/client4-00000073 4 0 2 450 500 subp:SIP/client5-00000074 4 0 2 450 500 subp:SIP/client6-00000075 4 0 2 450 500 subp:SIP/client7-00000076 4 0 2 450 500 subp:SIP/client8-00000077 4 0 2 450 500 subp:SIP/client9-00000078 4 0 2 450 500 subp:SIP/client10-00000079 4 0 2 450 500 subp:SIP/client11-0000007a 4 0 2 450 500 subp:SIP/client12-0000007b 4 0 2 450 500 subp:SIP/client13-0000007c 4 0 2 450 500 subp:SIP/client14-0000007d 4 0 2 450 500 subp:SIP/client15-0000007e 4 0 2 450 500 subp:SIP/client16-0000007f 4 0 2 450 500 225 taskprocessors {noformat} | ||
Comments: | By: Asterisk Team (asteriskteam) 2018-04-20 06:22:38.902-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Richard Mudgett (rmudgett) 2018-04-20 11:13:47.619-0500 * Please follow the asterisk issue guidelines https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines We'll need the asterisk logs showing the start of the {{WARNING\[30830]: res_musiconhold.c:775 monmp3thread: poll() failed: Interrupted system call}} messages. Also the {{musiconhold.conf}} file showing how you have MOH configured. * Running "core show taskprocessors" after restarting does not show which taskprocessor has hit the high water mark levels or is currently in overload alert because you reset the statistics. Please *attach* a core show taskprocessors output captured while the problem is in progress. * Please attach debug output to the issue as .txt files instead of pasting a few hundred lines in comments or the issue description. By: Purchasing Agent (hawkeye) 2018-04-20 13:40:27.985-0500 Hi Richard, added musiconhold_additional.conf and taskprocessors.txt. tasprocessors.was taken while pjsip is ignoring reg requests. Not sure what other things you want but the 30830 is not found now. The full.log shows the last reg attempt and then the first pjsip ignoring reg. By: Richard Mudgett (rmudgett) 2018-04-20 14:48:12.445-0500 * The taskprocessors output shows symptoms of blocked threads on the pjsip/distributor-xxxxxxxx taskprocessors. It could be the ASTERISK-27706 deadlock where the fix just went out in 13.21.0-rc1 yesterday. However, I suspect the threads are blocked waiting on frames from your mp3 MOH class. To confirm we will need a full backtrace \[1]\[2] (thread apply all bt full) while Asterisk is running when it is ignoring the requests. Since you are using FreePBX you will probably need to follow their guide to getting symbols installed or compile and install asterisk yourself \[3]. A backtrace without symbols is useless. The backtrace files attached to ASTERISK-27706 provides some examples of what a single thread with symbols should look like. If the treads in your file have a bunch of question marks ?? then you don't have symbols. * The 30830 number you mention from the warning message is the thread identifier assigned by the OS. That identifier will be different for every run. \[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace \[2] https://wiki.freepbx.org/display/SUP/Providing+Great+Debug \[3] https://wiki.asterisk.org/wiki/display/AST/Installing+Asterisk+From+Source By: Purchasing Agent (hawkeye) 2018-04-21 21:07:50.612-0500 Left MOH source on. Early morning of 4/21/18 asterisk produced a few core. files. Uploading backtrace-threads.txt and backtrace.txt Have NOT run fwconsole restart before producing the above files. By: Purchasing Agent (hawkeye) 2018-04-21 21:10:31.785-0500 The attached files were created before restarting asterisk. Note that MOH source has been enabled for > 24hrs now. By: Purchasing Agent (hawkeye) 2018-04-21 22:43:17.465-0500 Hi Richard, Have checked our up and running backup pbx which also uses the same MOH source. That PBX does not have any core. files. All phones are registered to it as well and do not become unregistered. The backup pbx is running Asterisk 13.18.3. Just ran yum update on problematic pbx and now it's running Asterisk 13.19.2 Thanks. By: Purchasing Agent (hawkeye) 2018-04-21 23:20:26.825-0500 since uploading other files, another core. has been created. By: Purchasing Agent (hawkeye) 2018-04-21 23:34:43.855-0500 Added more debug packages to server. This log may contain more useful information. By: Purchasing Agent (hawkeye) 2018-04-23 11:53:32.396-0500 This backtrace is from FreePBX 13 server running Asterisk 13.19.1 By: Purchasing Agent (hawkeye) 2018-04-24 09:35:18.358-0500 Hi, have another pbx server which is running Asterisk 13.18.4. We have put this server into production as our primary sip server. Its been running for ~ 14 hours and overnight the MOH server was not operational. There has NOT been any core. files generated, and pjsip accounts do NOT loose registration. By: Richard Mudgett (rmudgett) 2018-04-25 18:01:12.426-0500 I don't think the MP3 MOH streaming messages is directly related to the problem. I think the streaming log messages make the problem more likely to happen because it increases the load on the system processing those spamming log messages. From the [^core.host.domain.tld-2018-04-22T03-32-25-0400-full.txt] and [^backtrace-threads.txt] files, the crashes and your system losing registrations appear to be the same thing being expressed differently depending upon timing. The crashes happen when a solicited MWI subscription replaces the corresponding unsolicited MWI subscription and a stasis topic pointer is NULL. The losing registrations problem appears to be similar but instead of crashing those threads are waiting for a stasis MWI subscription to terminate but nothing gets the terminate message so they wait forever. These blocked threads then build up a backlog of taskprocessor tasks which eventually hit the high water mark which then blocks processing new requests in chan_pjsip. Unsolicited MWI subscriptions are enabled by defining a mailbox on an endpoint. Solicited MWI subscriptions are enabled by defining a mailbox on an AOR. If the unsolicited MWI setting turns out to be incorrect, there is a mwi_subscribe_replaces_unsolicited endpoint option that can be enabled to replace the unsolicited subscription with the solicited subscription. The mwi_subscribe_replaces_unsolicited option defaults to no. Based on the crash backtraces you are enabling both solicited and unsolicited MWI with the ability to replace the unsolicited subscription with the solicited subscription. Do you really need to configure unsolicited MWI? As a possible work around you could enable only the solicited MWI since your devices appear to be soliciting for MWI. By: Purchasing Agent (hawkeye) 2018-04-29 11:46:11.221-0500 Hi Richard, thank you for your investigation to this matter. Your solution of changing MWI Subscription Type to Solicited from the FreePBX default of Auto did the trick. The PJSIP extensions have stopped loosing registration and that includes when our MOH server is not running. Thanks again! By: Fraser Allan (fraserallan) 2018-09-10 10:43:14.163-0500 Hello, Is there any update on this bug? I am having an issue where all PJSIP endpoints drop off and PJSIP does not respond to registration attempts. I then have to issue fwconsole restart. After speaking with Sangoma support (we use FreePBX), they found the following in my Asterisk logs and believe that this is related to this bug. {quote} [2018-09-08 20:35:27] DEBUG[33443]: res_pjsip/pjsip_distributor.c:538 distributor: Taskprocessor overload alert: Ignoring 'Request msg DEBUG[1469]: res_pjsip/pjsip_distributor.c:538 distributor: Taskprocessor overload alert: ignoring 'Request msg SUBSCRIBE/cseq=1 (rdata0x7fcd5d5d9ed0)' {quote} By: Richard Mudgett (rmudgett) 2018-09-10 11:18:50.847-0500 The issue is in queue, please be patient, and we will get to it as time permits and developer resources become available. Any updates are posted on the issue. In the mean time you could try the work around described in an earlier comment that worked for the original reporter. |