Summary: | ASTERISK-26941: ARI WebSocket forcibly closed due to fatal write error on repeated mute/unmute requests | ||
Reporter: | Colin (dixoncb) | Labels: | |
Date Opened: | 2017-04-12 05:36:41 | Date Closed: | 2019-04-17 15:52:31 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Resources/res_http_websocket |
Versions: | 14.3.0 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Ubuntu 16.04.2 LTS on Windows Azure VM | Attachments: | |
Description: | We're testing a Stasis conference application and seeing forced WebSocket closes from res_http_websocket.c. Testing with SIPp, we're calling mute/unmute asynchronously for batches of callers (POST /channels/\{channelId}/mute).
We're running with about 60 SIPp connections. Other batch operations (for example move callers between bridges) complete correctly. This issue seems to be specific to mute and unmute commands. Asterisk does not crash, but the Stasis app effectively becomes moribund. We're muting and unmuting callers as one group, using async tasks which we marshal before any further requests are sent to ARI. We set up a test server to get some forensic information, using SIPp to simulate 60 connections. We are finding that about 10-15 successful mute/unmute operations take place from each group, before the WebSocket and HTTP connections become moribund. I can provide a debug log file. Some salient events that we've picked out when we do a group mute/unmute, however, are: 1) A large number of webocket writes (in the order of n x n, where n is the number of clients being muted/unmuted), like: {noformat} [Apr 10 14:36:29] DEBUG[35342] taskprocessor.c: Taskprocessor 'subm:voice_2-00000065' cleared the high water alert. [Apr 10 14:36:29] DEBUG[35342] res_http_websocket.c: Writing websocket string of length 668 [Apr 10 14:36:29] DEBUG[35342] res_http_websocket.c: Writing websocket text frame, length 668 {noformat} 2) After a small number of successful mute/unmute operations, the websocket faults, and closes {noformat} Apr 10 14:36:30] DEBUG[35342] res_http_websocket.c: Closing WS with 1011 because we can't fulfill a write request [Apr 10 14:36:30] DEBUG[35342] utils.c: Timed out trying to write [Apr 10 14:36:30] DEBUG[35342] res_http_websocket.c: WebSocket connection from 'xxx.xxx.xxx.xxx:18133' forcefully closed due to fatal write error {noformat} 3) A large number of attempted writes, followed by deactivation of the Stasis application {noformat} [Apr 10 14:36:30] DEBUG[35342] res_http_websocket.c: Writing websocket text frame, length 668 [Apr 10 14:36:30] NOTICE[35342] ari/ari_websockets.c: Problem occurred during websocket write to :18133, websocket closed [Apr 10 14:36:30] WARNING[35340] ari/ari_websockets.c: WebSocket read error: Software caused connection abort [Apr 10 14:36:30] VERBOSE[35340] stasis/app.c: Deactivating Stasis app 'voice_2' {noformat} We can replicate this at will, and we have extensive verbose/debug logs if these are needed. | ||
Comments: | By: Asterisk Team (asteriskteam) 2017-04-12 05:36:42.510-0500 The severity of this issue has been automatically downgraded from "Blocker" to "Major". The "Blocker" severity is reserved for issues which have been determined to block the next release of Asterisk. This severity can only be set by privileged users. If this issue is deemed to block the next release it will be updated accordingly during the triage process. By: Asterisk Team (asteriskteam) 2017-04-12 05:36:42.818-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Rusty Newton (rnewton) 2017-04-13 11:22:26.318-0500 Are you attempting to reconnect to the websocket? I'm not the best for debugging ARI, but after talking with the developers it seems the behavior you describe is expected. It is the responsibility of your application to reconnect if it gets disconnected, and a disconnect is expected if the write timeout is exceeded. {noformat} ; Default write timeout to set on websockets. This value may need to be adjusted ; for connections where Asterisk must write a substantial amount of data and the ; receiving clients are slow to process the received information. Value is in ; milliseconds; default is 100 ms. ;websocket_write_timeout = 100 {noformat} You should examine the rate at which you are reading and handling events and adjust the write timeout if necessary. Your client should reconnect to the websocket if disconnected. Let me know if I'm making sense. Also, feel free to jump on #asterisk, #asterisk-dev, #asterisk-ari on irc.freenode.net if you want to talk about this in real-time. There are a number of regular ARI users and developers in those rooms. By: Colin (dixoncb) 2017-04-13 11:45:21.451-0500 Thank you for your fast response. We'll take a look at the timeout. We are reconnecting on disconnection, but Stasis has become inactive by then, and our reconnect never happens (see the last line in (3) above). We're don't know if this is a result of the WebSocket error, or if both are symptoms of something else. What is noticeable (see block 1 above - "res_http_websocket.c: Writing websocket string of length 668") is that as we increase the number of callers we're muting, we see a rise in the number of WebSocket writes which is not linear - it's more exponential. So we see huge swathes of this line in our log files immediately before the disconnection. We'll be back to you when we've changed the timeout and re-run the tests. By: Rusty Newton (rnewton) 2017-04-13 17:00:34.373-0500 Thanks, let us know. By: Asterisk Team (asteriskteam) 2017-04-28 12:00:00.578-0500 Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1]. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines By: Colin (dixoncb) 2017-04-30 05:03:14.465-0500 We've spent some time experimenting. Increasing the timeout seems to be effective - we can mute and unmute 60 callers without a crash if we increase the ARI WebSocket timeout to 1 second - but only when we don't write channel variables. In our app we write additional state back to channel variables, so in the case of a disconnection of the app from Stasis, our app can reconnect and 'audit' the channels in Asterisk by reading the additional state, and thus dynamically restore functionality. So, for example, if we mute a channel, we write a variable to that channel with some additional service information. This is read on reconnection, or on the start of a new instance of our app to resume the service. When we mute 60 channels, in bridges, we write 60 channel variables, and this crashes Stasis. We're looking at working around this by using a different caching mechanism - for example a Redis cloud cache. It seems that bulk serial writing of channel variables following a bulk channel operation (mute, move between bridges) is the cause of the crash. When we write the channel variables, we're still seeing the pattern in (1), (2) and (3) above - a large number of WebSocket writes on the Asterisk side, followed by a WebSocket fail in websockets.c, followed by the deactivation of the Stasis app. We are then entirely unable to reconnect via REST or WebSocket without restarting the entire Asterisk instance. Given that Asterisk is still functioning, and we have live callers, this is not workable - our aim to reconnect, not to restart. Is this normal behaviour - i.e. is there any way we can avoid the Stasis application becoming moribund if the WebSocket faults in websockets.c? The question here is that in the case of a WebSocket fault, our application will attempt to reconnect at 3-second intervals. How does ARI behave? Finally, some of the ARI event messages we receive back when we carry out bridge operations are extremely verbose. "ChannelEnteredBridge", for example sends the Channel IDs of all other channels present in the bridge. When we are moving 30 calls between bridges, each new event sends effectively its own ID, and all the others that are already there - so by the thirtieth message we have a lot of superfluous information. Is there any way we can 'unregister' for events from Stasis? We've looked at 'decline' in stasis.conf, but it doesn't seem granular enough to correspond with the events we are receiving. By: Asterisk Team (asteriskteam) 2017-04-30 05:03:15.022-0500 This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable. By: Joshua C. Colp (jcolp) 2017-05-03 14:18:31.536-0500 There is no ability to filter Stasis messages at a granular level. As for this issue - is your application consuming the events at a slow pace, such that things build up? In other cases this has been determined to be a problem and bumping up the timeout just increases the tolerance. By: Joshua C. Colp (jcolp) 2017-05-03 14:20:00.887-0500 As for your question about reconnection - this is something that is being done by others without issue. A new connection will replace an existing connection, but you will not receive any missed events. By: Colin (dixoncb) 2017-05-04 09:49:30.718-0500 Thank you. For most events, our app is simply doing a log trace (we only handle talking, DTMF and Statis start/end events). I am however replacing the .Net Task-based dispatcher that handles the events with a Producer/Consumer queue. This will allow the WebSocket read thread to return immediately. The strategies of increasing the timeout, and of not writing Channel Variables seems to stop the disconnection, so thanks for that. Just to confirm, when we see the log message " stasis/app.c: Deactivating Stasis app 'voice_2' ", we can't reconnect on the WebSocket connection. Are you saying that we should be able to do this without restarting Asterisk or any of its modules? We're using the .Net WebSockets4Net library, and it may be that there's a problem there... By: Joshua C. Colp (jcolp) 2017-05-04 10:19:10.379-0500 Yes, you should be able to reconnect without doing anything to Asterisk. By: Asterisk Team (asteriskteam) 2017-05-19 12:00:01.932-0500 Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1]. [1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines |