[Home]

Summary:ASTERISK-27385: channel.c: Exceptionally long queue length queuing
Reporter:Abhay Gupta (agupta)Labels:pjsip
Date Opened:2017-10-31 09:59:05Date Closed:2017-12-22 04:57:44.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Core/Bridging
Versions:13.18.2 Frequency of
Occurrence
Related
Issues:
Environment:Ubuntu 16.04 with asterisk 13.17.2 using Staisis Attachments:( 0) abhay.txt
( 1) asterisk_top.png
( 2) asterisk_logs.txt
( 3) core-asterisk-running-2017-11-08T12-30-19+0530-brief.txt
( 4) core-asterisk-running-2017-11-08T12-30-19+0530-full.txt
( 5) core-asterisk-running-2017-11-08T12-30-19+0530-locks.txt
( 6) core-asterisk-running-2017-11-08T12-30-19+0530-thread1.txt
( 7) full
( 8) gdb.txt
( 9) pjsip_settings.txt
(10) server_info.txt
(11) stasis.conf
(12) stucked_call.txt
(13) task_noproblem.txt
(14) taskprocessors
Description:Asterisk becomes unstable with message like

[Oct 31 10:37:37] WARNING[5608][C-00001b6b] channel.c: Exceptionally long queue length queuing to Local/2048@sxxxxxxg-000016ba;1

There are two staisis customer-dial and agent-dial . The steps seems to be as follows
1. A Sip channel is added in customer-dial
2. A manager originate command adds a channel between customer-dial and agent-dial while the existing channel is hanging up in customer-dial
3. Both channels of customer-dial are bridged while some clearing is happening on the channel which is getting hangup / deleted .

This problem appears
Comments:By: Asterisk Team (asteriskteam) 2017-10-31 09:59:05.342-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Abhay Gupta (agupta) 2017-10-31 10:00:47.374-0500

Full log with ari debug just before the issue till the time issue occurs and the gdb attached at the time of occurrence of this issue .

By: George Joseph (gjoseph) 2017-10-31 15:23:52.132-0500

When you say "unstable" what does that mean?  Will it recover if left alone to catch up?

More debugging info will help...
"pjsip show settings"
"core show taskprocessors"  (this may be big but look for any taskprocessors with high queue depths)
The contents of stasis.conf.

What's the system load when this happens?
How many CPUs, how much memory on the system.
While the issue is happening, run "top" and capture the stats for the asterisk process.






By: Abhay Gupta (agupta) 2017-10-31 23:43:08.030-0500

The system has 12 cores and 48GB RAM . All the details are attached when the problem is not there .
File pjsip_settings has output of pjsip show settings
File stasis.conf is attached and ithas default settings
File server_info has of core and RAM
File tast_noproblem is how the taskprocessor looks like when the problem is not there .
Rest information i will send at the time of issue .

By: Abhay Gupta (agupta) 2017-10-31 23:49:16.609-0500

If the server is continued to run with the long queue error it aborts with the error

Excessive refcount 100000 reached on ao2 object 0x560c507ac6c8", file=0x560c4f6d447b "astobj2.c", line=518

gdb is attached .

By: Abhay Gupta (agupta) 2017-11-01 02:50:17.891-0500

Log with exact sequence of ARI DEBUG which lead to the issue .

By: Abhay Gupta (agupta) 2017-11-01 12:50:37.476-0500

top commands does not change much during problem . taskprocess also does not show queue  and the same is uploaded taken at the time of the issue .

By: Abhay Gupta (agupta) 2017-11-08 01:18:43.723-0600

We were able to simulate the problem . The steps are as follows

1. Originate a number and put that in stasis customer-dial
2. Bridge created and channel inserted in bridge
3. Originate local channel to connect to another stasis agent-dial
4. Originate a SIP extension and put both PJSIP and Local channel in bridge
5. Now do not hangup the channel
6. Repeat the same process and problem of Exceptionally long queue length appears .

Attaching full logs and all the files generated from ast_coredumper


By: Abhay Gupta (agupta) 2017-11-08 03:26:28.019-0600

At the time of issue , bridge looked like this
{
 "technology": "softmix",
 "id": "1001",
 "bridge_type": "mixing",
 "bridge_class": "stasis",
 "video_mode": "talker",
 "channels": [
   "1510127272.150",
   "1510127273.152",
   "1510127383.157"
 ],
 "creator": "Stasis",
 "name": ""
}

and the name of channels in bridge were Local/1001@xyz-0000001f;2, Local/agentmanual@xyz-00000020, Local/1001@xyz-00000022;2

The problem happens only when in a bridge two channels have a similar device id like Local/1001@xyz in this case .



By: Abhay Gupta (agupta) 2017-11-14 01:14:52.859-0600

We upgraded asterisk to 13.18.2 but though the frequency has reduced so certainly some condition which was occurring has been between 13.17.2 and 13.18.2 but we are still getting errors

[Nov 13 19:02:55] ERROR[26473] res_pjsip.c: Endpoint '2020': Could not create dialog to invalid URI '2020'.  Is endpoint registered and reachable?
[Nov 13 19:02:55] ERROR[26473] chan_pjsip.c: Failed to create outgoing session to endpoint '2020'
[Nov 13 19:02:55] WARNING[13840] app_dial.c: Unable to create channel of type 'PJSIP' (cause 3 - No route to destination)
[Nov 13 19:03:37] WARNING[16753][C-0001ba9e] res_stasis_playback.c: 1510599794.325307: Playback failed for sound:54//Redirect_1510552223
[Nov 13 19:03:41] WARNING[18268][C-0001bad0] file.c: Unexpected control subclass '1101'
[Nov 13 19:05:45] WARNING[26897][C-0001bcb6] res_stasis_playback.c: 1510599909.326771: Playback failed for sound:54/Redirect_1510552223
[Nov 13 19:05:53] WARNING[29159][C-0001bcd1] res_stasis_playback.c: 1510599921.326939: Playback failed for sound:54/Redirect_1510552223
[Nov 13 19:06:01] WARNING[28732][C-0001bce8] res_stasis_playback.c: 1510599927.326997: Playback failed for sound:54/Redirect_1510552223
[Nov 13 19:06:36] WARNING[32013][C-0001bd55] res_stasis_playback.c: 1510599952.327282: Playback failed for sound:54/Redirect_1510552223
[Nov 13 19:06:41] WARNING[905][C-0001bdc2] res_stasis_playback.c: 1510599979.327626: Playback failed for sound:54/Redirect_1510552223
[Nov 13 19:07:06] WARNING[5990][C-0001be69] channel.c: Exceptionally long queue length queuing to Local/2020@xyz-000182a1;1
[Nov 13 19:07:06] WARNING[5990][C-0001be69] channel.c: Exceptionally long queue length queuing to Local/2020@xyz-000182a1;1
[Nov 13 19:07:06] WARNING[5990][C-0001be69] channel.c: Exceptionally long queue length queuing to Local/2020@xyz-000182a1;1
[Nov 13 19:07:06] WARNING[5990][C-0001be69] channel.c: Exceptionally long queue length queuing to Local/2020@xyz-000182a1;1

By: Abhay Gupta (agupta) 2017-11-16 03:20:33.258-0600

I have identified a issue but do not know how to debug it further . The problem comes only when two channels try to enter a simple_bridge

T 2017/11/15 14:31:41.304479 10.10.200.46:5038 -> 10.10.200.46:58930 [AP]
 Event: BridgeEnter
 Privilege: call,all
 BridgeUniqueid: AD111520171431300246
..BridgeType: stasis
..BridgeTechnology: simple_bridge
..BridgeCreator: Stasis
..BridgeName: <unknown>
..BridgeNumChannels: 2
..BridgeVideoSourceMode: talker
..Channel: Local/2010@xyz-00002eca;1
..ChannelState: 6
..ChannelStateDesc: Up
..CallerIDNum: <unknown>
..CallerIDName: <unknown>
..ConnectedLineNum: <unknown>
..ConnectedLineName: <unknown>
..Language: en
..AccountCode: AD111520171431300246
..Context: xyz
..Exten: 2010
..Priority: 1
..Uniqueid: 1510756301.36555
..Linkedid: 1510756291.36433....

In the above when a channel Local/2010@xyz-00002eca;1 tries to enter a bridge AD111520171431300246 when BridgeNumChannels: 2 and the bridge does not change itself to a softmix bridge we get this error . Kindly let me know how to resolve the issue

By: Richard Mudgett (rmudgett) 2017-11-16 06:55:22.168-0600

How was the bridge created?  Was it created to be able to change?

By: Abhay Gupta (agupta) 2017-11-16 07:27:39.489-0600

Bridge is created by ARI command with POST bridge command giving only the bridgeId . There is no option of fixing the bridge type it seems , please guide if this is the right way to create a bridge and the monitoring is enabled on this channel . We then add channels to the bridge using the ari commands only


By: Benjamin Keith Ford (bford) 2017-11-30 15:09:25.230-0600

Hey [~agupta], I just tested adding 2 channels that called into the application via Stasis to a simple bridge and the bridge type changes to native_rtp. Both users can hear one another and no WARNING message appears. This was done on 13 branch. I created the bridge with client.bridges.create(bridgeId='bridgeName'). This is what you are doing to produce the WARNING message, correct?
If you enter 'module show like bridge' on the Asterisk CLI, what is the output? Before and after the interaction with ARI?

By: Abhay Gupta (agupta) 2017-12-06 06:09:53.847-0600

The steps that i used to reproduce this issue was resolved when Asterisk was updated to 13.18.2 but still this problem comes which i am unable to reproduce . Only thing that i could see was some locks that were resolved in 13.18.2 related to media on registration and so i suppose something similar is a issue pending in some other portion .

If you can help let me know any code that i can put in certain functions in source which could help trace the issue i can do so but i am not able to enable core show locks as the server is heavily loaded

By: Benjamin Keith Ford (bford) 2017-12-06 09:21:04.493-0600

Is it still the same issue? "Exceptionally long queue length queuing to..."? And it happens randomly now? Also, is there any particular point you notice it to happen more often? Like after a certain number of calls, or when a large number of calls are in progress?

By: Abhay Gupta (agupta) 2017-12-06 09:58:44.432-0600

Yes issue remains the same of Exceptionally long queue length . We are not able to establish what causes this issue but this happens anytime and number of calls are almost consistent and channels are being made and hangup at a high pace on this server .

By: Benjamin Keith Ford (bford) 2017-12-06 15:51:52.525-0600

We'll need some more information in order to proceed with the issue. Are you able to monitor your system and collect more debug to see if there's any pattern or condition that needs to be met in order for this issue to show up? If you're unable to locate the cause of the problem, we can try looking at some specific areas in the source code and determine what it could be.

By: Abhay Gupta (agupta) 2017-12-08 03:08:31.179-0600

I will try to get the debug logs . whenever the problem appears the amount of logs are very high and so will take some time to get them in the meanwhile can you please answer my query on a portion of code that i have not understood in channel.c .
We have a function
int __ast_answer(struct ast_channel *chan, unsigned int delay)
in that function case AST_STATE_RING: we have a do while , a infinite for loop(for (;;)) and in that for loop we have this code

while ((cur = AST_LIST_REMOVE_HEAD(&frames, frame_list))) {
                               if (res == 0) {
                                       ast_queue_frame_head(chan, cur);
                               }
                               ast_frfree(cur);
                       }

which will call ast_queue_frame and the lines Exceptionally long queue can come if we have frames in queue . So once it gets in there it remain in this while loop so how will this loop break ?

By: Joshua C. Colp (jcolp) 2017-12-08 05:28:06.821-0600

The logic you are referring to will not loop indefinitely.

"frames" is a list of frames that were read in while waiting for media to flow. The logic you've provided moves those frames from the "frames" list to the list on the channel. It removes frames from the "frames" list until there are none left, after which the loop will be exited.

By: Abhay Gupta (agupta) 2017-12-22 00:18:55.015-0600

This problem was found linked to the the issue 27299 with closure of socket due to use of websockets . The issue can be closed after closure of ASTERISK-27299

By: Joshua C. Colp (jcolp) 2017-12-22 04:57:44.706-0600

Closed per reporter.