[Home]

Summary:ASTERISK-27299: Asterisk Hangs with Bad file descriptor on read()
Reporter:Abhay Gupta (agupta)Labels:pjsip webrtc
Date Opened:2017-09-27 13:04:57Date Closed:2017-12-20 07:48:26.000-0600
Priority:MajorRegression?
Status:Closed/CompleteComponents:Bridges/bridge_native_rtp
Versions:13.17.1 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-27417 Endpoints not getting registered
Environment:Ubuntu 16.04 with asterisk 13.17.1 Attachments:( 0) 11.log
( 1) 12.log
( 2) 3858.threads.log
( 3) latestGDB
( 4) modules.conf.sample
( 5) output2.txt
( 6) sip-3-0002949d.log
( 7) taskprocessSterl.txt
Description:Asterisk stops responding and only have errors of alertpipe
{noformat}
Sep 26 15:36:05] WARNING[3133] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@asterisk-00013cae;1 to process?
[Sep 26 15:36:05] WARNING[3133] alertpipe.c: read() failed: Bad file descriptor
[Sep 26 15:36:05] WARNING[3133] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@asterisk-00013cae;1 to process?
[Sep 26 15:36:05] WARNING[3133] alertpipe.c: read() failed: Bad file descriptor
[Sep 26 15:36:05] WARNING[3133] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@asterisk-00013cae;1 to process?
[Sep 26 15:36:05] WARNING[3133] alertpipe.c: read() failed: Bad file descriptor
[Sep 26 15:36:05] WARNING[3133] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@asterisk-00013cae;1 to process?
{noformat}
Comments:By: Asterisk Team (asteriskteam) 2017-09-27 13:04:57.743-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Abhay Gupta (agupta) 2017-09-27 13:06:55.471-0500

The file shows the DEBUG Logs from the time it was working to the time it became unstable and the configuration has calls being originated and pass to staisis for bridging and recording .

By: Abhay Gupta (agupta) 2017-09-27 13:26:38.961-0500

Task process shows a queue with

subm:rtp_topic-000000a8                            93062       1963          3        450        500

and gdb taken at the time shows it to be deadlocked .

By: Rusty Newton (rnewton) 2017-09-27 17:43:13.786-0500

Is this happening somewhat randomly in production , or are you able to trigger it or reproduce?

By: Abhay Gupta (agupta) 2017-09-27 19:38:35.534-0500

Happening randomly in production.
Have not been able to identify the trigger as nothing different seems to be happening apart from call origination and bridging through staisis which heppens full day.

By: Rusty Newton (rnewton) 2017-10-06 10:04:18.664-0500

Please gather the locks output described in the following documentation:

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock

By: Asterisk Team (asteriskteam) 2017-10-20 12:00:01.760-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: shaurya jain (shaurya) 2017-11-14 05:48:41.733-0600

Same Issue has been encountered on 13.18.2 as well.

Attached are the error logs and gdb logs as well.After these error logs endpoints were not able to registered.

Regards,
Shaurya Jain

By: Joshua C. Colp (jcolp) 2017-11-14 05:56:11.233-0600

[~shaurya] As you are not the original reporter please file a new issue.

By: Abhay Gupta (agupta) 2017-11-16 05:33:41.239-0600

Continued to get such errors but asterisk recovered automatically . How to debug this now

[Nov 16 10:59:52] WARNING[8306] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@sterling-00003928;1 to process?
[Nov 16 10:59:52] WARNING[8306] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 10:59:52] WARNING[8306] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@sterling-00003928;1 to process?
[Nov 16 10:59:52] WARNING[8306] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 10:59:52] WARNING[8306] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@sterling-00003928;1 to process?
[Nov 16 10:59:52] WARNING[8306] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 10:59:52] WARNING[8306] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@sterling-00003928;1 to process?
[Nov 16 10:59:52] WARNING[8306] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 10:59:52] WARNING[8306] bridge_channel.c: Weird.  No frame from bridge for Local/agentmanual@sterling-00003928;1 to process?
[Nov 16 10:59:52] WARNING[8306] alertpipe.c: read() failed: Bad file descriptor

[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor
[Nov 16 11:00:01] WARNING[9297] alertpipe.c: read() failed: Bad file descriptor

By: Asterisk Team (asteriskteam) 2017-11-16 05:33:41.626-0600

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Abhay Gupta (agupta) 2017-11-16 05:34:28.068-0600

Now asterisk version is latest 13.18.2

By: Richard Mudgett (rmudgett) 2017-11-16 07:09:29.191-0600

Asking again.  You have a deadlock.  Please gather the locks output described in the following documentation:

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock


By: Abhay Gupta (agupta) 2017-11-16 07:17:39.816-0600

There is no way to be intimated and get the trace of Asterisk since the issue comes for a few seconds and then it recovers automatically . This error also came 10:59:52 to 11:00:01 so a total of 9 seconds .


By: Richard Mudgett (rmudgett) 2017-11-16 07:25:09.353-0600

Removed attachments that are now attached to ASTERISK-27417.

By: Aaron An (aaron) 2017-11-21 08:50:48.341-0600

It is not  a "dead lock", all lock will freed after a few minutes, and pjsip will rework normally. Today it takes 24 minutes to recover.

By: Aaron An (aaron) 2017-11-21 08:57:16.932-0600

related logs:

[Nov 21 16:42:53] ERROR[27334][C-00014a62] bridge_channel.c: FRACK!, Failed assertion !AST_LIST_EMPTY(&bridge_channel->wr_queue) (0)
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: Got 21 backtrace records
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #0: [0x747c21] /usr/sbin/asterisk(__ast_assert_failed+0xba) [0x747c21]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #1: [0x4b662e] /usr/sbin/asterisk() [0x4b662e]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #2: [0x4c0460] /usr/sbin/asterisk() [0x4c0460]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #3: [0x4c14a7] /usr/sbin/asterisk() [0x4c14a7]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #4: [0x4c20a8] /usr/sbin/asterisk(bridge_channel_internal_join+0x8d8) [0x4c20a8]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #5: [0x496362] /usr/sbin/asterisk(ast_bridge_join+0x465) [0x496362]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #6: [0x5a9e46] /usr/sbin/asterisk(ast_bridge_call_with_flags+0x349) [0x5a9e46]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #7: [0x5a9fde] /usr/sbin/asterisk(ast_bridge_call+0x3f) [0x5a9fde]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #8: [0x7f43398d546f] /usr/lib/asterisk/modules/app_dial.so(+0x1746f) [0x7f43398d546f]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #9: [0x7f43398d5adc] /usr/lib/asterisk/modules/app_dial.so(+0x17adc) [0x7f43398d5adc]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #10: [0x6751f3] /usr/sbin/asterisk(pbx_exec+0x1e5) [0x6751f3]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #11: [0x6549ae] /usr/sbin/asterisk() [0x6549ae]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #12: [0x65a741] /usr/sbin/asterisk(ast_spawn_extension+0x13b) [0x65a741]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #13: [0x65bf34] /usr/sbin/asterisk() [0x65bf34]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #14: [0x65ec41] /usr/sbin/asterisk() [0x65ec41]
[Nov 21 16:42:53] VERBOSE[27334][C-00014a62] logger.c: #15: [0x7428d6] /usr/sbin/asterisk() [0x7428d6]
[Nov 21 16:42:53] WARNING[27334][C-00014a62] bridge_channel.c: Weird.  No frame from bridge for PJSIP/sip-3-0002949d to process?
[Nov 21 16:42:53] WARNING[27334][C-00014a62] alertpipe.c: read() failed: Bad file descriptor

some other logs:
[Nov 21 16:43:10] ERROR[27348][C-00014a62] bridge_channel.c: We couldn't write alert pipe for 0x7f42d805b1b0(PJSIP/sip-3-0002949d)... something is VERY wrong
[Nov 21 16:43:10] ERROR[27348][C-00014a62] bridge_channel.c: We couldn't write alert pipe for 0x7f42d805b1b0(PJSIP/sip-3-0002949d)... something is VERY wrong
[Nov 21 16:43:10] ERROR[27348][C-00014a62] bridge_channel.c: We couldn't write alert pipe for 0x7f42d805b1b0(PJSIP/sip-3-0002949d)... something is VERY wrong
[Nov 21 16:43:10] ERROR[27348][C-00014a62] bridge_channel.c: We couldn't write alert pipe for 0x7f42d805b1b0(PJSIP/sip-3-0002949d)... something is VERY wrong

By: Abhay Gupta (agupta) 2017-11-21 20:55:35.437-0600

This only seems to occur when channels are being made and released at a high pace . Was not able to simulate the condition but most of the times asterisk recovers if the channel in question is released through a softhangup it seems . Recovery in 24 minutes were huge would have crashed under load in such a case could you confirm if the same happened in your case and recovery happened when PJSIP/sip-3-0002949d was hangup from logs .

By: Aaron An (aaron) 2017-11-21 21:25:51.626-0600

3858.threads.log is the gdb output of the command below:
gdb asterisk `pidof asterisk` --batch --ex "set height 0" --ex='thread apply all bt' >> `pidof asterisk`.threads.log

By: Aaron An (aaron) 2017-11-21 21:42:59.076-0600

When PJSIP hangs, opensips proxy all sip calls to other asterisks. so with in 24 minutes, there are almost none of new sip request to it. and in my case the load is not high, CPU is normal, CPS is normal, calls is about 210.

By: Aaron An (aaron) 2017-11-21 21:45:40.374-0600

grep 'PJSIP/sip-3-0002949d' /var/log/asterisk/full-20171122  >sip-3-0002949d.log

By: Aaron An (aaron) 2017-11-21 21:54:40.196-0600

It seems that at 17:07:06 when all the calls are ended, asterisk recovery.
[root@ip-10-10-59-250 asterisk]# grep 'PJSIP/700022580101074-0002965b' /var/log/asterisk/full-20171122
[Nov 21 16:43:48] VERBOSE[28054][C-00014b40] app_stack.c: PJSIP/700022580101074-0002965b Internal Gosub(webrtc_call_start_dest,700022580101074,1) start
[Nov 21 16:43:48] VERBOSE[28054][C-00014b40] pbx.c: Executing [700022580101074@webrtc_call_start_dest:1] Set("PJSIP/700022580101074-0002965b", "PJSIP_HEADER(add,X-Asterisk-call_type)=4") in new stack
[Nov 21 16:43:48] NOTICE[28054][C-00014b40] app_stack.c: PJSIP/700022580101074-0002965b Abnormal 'Gosub(webrtc_call_start_dest,700022580101074,1)' exit.  Popping routine return locations.
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] app_dial.c: PJSIP/700022580101074-0002965b is ringing
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] app_dial.c: PJSIP/700022580101074-0002965b answered PJSIP/sip-6-0002965a
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] app_stack.c: PJSIP/700022580101074-0002965b Internal Gosub(webrtc_bridge_dest,~~s~~,1) start
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] pbx.c: Executing [~~s~~@webrtc_bridge_dest:1] NoOp("PJSIP/700022580101074-0002965b", "webrtc_bridge_dest") in new stack
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] pbx.c: Executing [~~s~~@webrtc_bridge_dest:2] Return("PJSIP/700022580101074-0002965b", "") in new stack
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] app_stack.c: Spawn extension (default, 700022580101074, 1) exited non-zero on 'PJSIP/700022580101074-0002965b'
[Nov 21 16:43:51] VERBOSE[28054][C-00014b40] app_stack.c: PJSIP/700022580101074-0002965b Internal Gosub(webrtc_bridge_dest,~~s~~,1) complete GOSUB_RETVAL=
[Nov 21 16:43:51] VERBOSE[28078][C-00014b40] bridge_channel.c: Channel PJSIP/700022580101074-0002965b joined 'simple_bridge' basic-bridge <0beef8f6-1616-4914-b761-c0b8c7bac6d1>
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF begin '0' received on PJSIP/700022580101074-0002965b
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF begin passthrough '0' on PJSIP/700022580101074-0002965b
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF end '0' received on PJSIP/700022580101074-0002965b, duration 100 ms
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF end accepted with begin '0' on PJSIP/700022580101074-0002965b
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF end '0' detected to have actual duration 79 on the wire, emulation will be triggered on PJSIP/700022580101074-0002965b
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF end '0' has duration 79 but want minimum 80, emulating on PJSIP/700022580101074-0002965b
[Nov 21 16:44:04] DTMF[28078][C-00014b40] channel.c: DTMF end emulation of '0' queued on PJSIP/700022580101074-0002965b
[Nov 21 17:07:06] WARNING[28054][C-00014b40] channel.c: Unable to write to alert pipe on PJSIP/700022580101074-0002965b (qlen = 0): Bad file descriptor!
[Nov 21 17:07:06] WARNING[1305] channel.c: Unable to write to alert pipe on PJSIP/700022580101074-0002965b (qlen = 1): Bad file descriptor!
[Nov 21 17:07:06] VERBOSE[28078][C-00014b40] bridge_channel.c: Channel PJSIP/700022580101074-0002965b left 'simple_bridge' basic-bridge <0beef8f6-1616-4914-b761-c0b8c7bac6d1>

By: Aaron An (aaron) 2017-11-22 08:48:47.004-0600

today asterisk hangs again, this time it doesn't recovery. I have upload two gdb logs, 11.log is the first one and 10 second later dump the second one 12.log.

By: Aaron An (aaron) 2017-11-22 09:01:38.697-0600

The server is blocking now, I don't restart it to continue find out the root cause. any one who wants to trace this issue, can login this server by SSH. You can use GDB to see what happened.
Contact me with Email:anjb@ti-net.com.cn

By: Aaron An (aaron) 2017-11-22 09:17:14.825-0600

I see Thead  298 is blocked in read().  read() don't return and blocked all the time.

(gdb) break alertpipe.c:108 thread 298
Breakpoint 1 at 0x43f6c3: file alertpipe.c, line 108.
(gdb) thread 298
[Switching to thread 298 (Thread 0x7f3b9ae9f700 (LWP 16465))]
#0  0x00007f3cd3d3b49d in read () from /lib64/libpthread.so.0
(gdb) n
Single stepping until exit from function read,
which has no line number information.
... here never return.


info about Thread 298:

Thread 298 (Thread 0x7f3b9ae9f700 (LWP 16465)):

#0  0x00007f3cd3d3b49d in read () from /lib64/libpthread.so.0
#1  0x000000000043f6dc in ast_alertpipe_read (alert_pipe=0x7f3cac0e0c60) at alertpipe.c:108
#2  0x0000000000540b01 in ast_channel_internal_alert_read (chan=0x7f3cac0e0300) at channel_internal_api.c:1260
#3  0x00000000005162de in __ast_read (chan=0x7f3cac0e0300, dropaudio=0) at channel.c:3895
#4  0x0000000000519bb2 in ast_read (chan=0x7f3cac0e0300) at channel.c:4384
#5  0x00000000004c0b24 in bridge_handle_trip (bridge_channel=0x7f3ca027f610) at bridge_channel.c:2397
#6  0x00000000004c1414 in bridge_channel_wait (bridge_channel=0x7f3ca027f610) at bridge_channel.c:2567
#7  0x00000000004c20a8 in bridge_channel_internal_join (bridge_channel=0x7f3ca027f610) at bridge_channel.c:2713
#8  0x0000000000496362 in ast_bridge_join (bridge=0x7f3ca018a1f0, chan=0x7f3cac0e0300, swap=0x0, features=0x7f3b9ae9ae00,
   tech_args=0x0, flags=(AST_BRIDGE_JOIN_PASS_REFERENCE | AST_BRIDGE_JOIN_INHIBIT_JOIN_COLP)) at bridge.c:1712
#9  0x00000000005a9e46 in ast_bridge_call_with_flags (chan=0x7f3cac0e0300, peer=0x7f3ca038a590, config=0x7f3b9ae9bf10, flags=0)
   at features.c:672
#10 0x00000000005a9fde in ast_bridge_call (chan=0x7f3cac0e0300, peer=0x7f3ca038a590, config=0x7f3b9ae9bf10) at features.c:711

By: Abhay Gupta (agupta) 2017-11-22 11:25:31.765-0600

In __ast_read in channel.c we have ast_channel_lock(chan);

Once it is locked and the code is on read() then this channel will continue to be deadlocked and probably that is why we have issues from Exceptionally long queue to No frame from bridge and other alertpipe errors which continue to fill logger and cause performance issues till the channel is released .

By: Abhay Gupta (agupta) 2017-11-22 11:33:29.728-0600

We also have exactly the same issue in the GDB

Thread 222 (Thread 0x7f31d7f27700 (LWP 18843)):
#0  0x00007f33961ab6ad in read () at ../sysdeps/unix/syscall-template.S:84
No locals.


By: Richard Mudgett (rmudgett) 2017-11-22 12:10:16.842-0600

[~aaron] and [~agupta] please follow the issue guidelines \[1] and stop pasting large blocks of backtraces in the comments.  Attach them to the issue with a .txt exten with the More->Attach-Files pull down menu.  You can refer to files in the comments like this \[^filename] where \[^11.log] becomes a hot link like this [^11.log].

[~aaron] What Asterisk version are you using?  The line numbers do not match the version posted by [~agupta].  Turning off optimization helps a lot when making sense of what is happening.

\[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

By: Aaron An (aaron) 2017-11-22 19:44:39.179-0600

Hi Richard, The Asterisk version is 13.16.0.

By: Aaron An (aaron) 2017-12-03 21:26:07.466-0600

Asterisk hangs every 3-5/days, is there any solution? or what additional information should i supply?

By: Joshua C. Colp (jcolp) 2017-12-04 05:39:09.537-0600

This issue has been triaged and accepted. There is noone currently working on it so there are no additional comments.

By: Aaron An (aaron) 2017-12-07 07:53:33.782-0600

hi Joshua Colp, I think this may relate to this change.
https://gerrit.asterisk.org/#/c/5497/
I confirm that in my environment the eventfd is used. We Have HAVE_EVENTFD enabled.
Can you explain this change? and I see read() or write with size(uint64_t) instead of sizeof(char) why?

By: Joshua C. Colp (jcolp) 2017-12-07 07:59:39.330-0600

I did not write the original change and can not recall. [~seanbright] is the author.

By: Aaron An (aaron) 2017-12-07 08:28:17.761-0600

The hang is just the lost of nonblocking status with FD.
so before find out somewhere else bug in the code , it is better to keep the check there.

the comment on previous code of function channel_internal_alert_check_nonblock.
/* For some odd reason, the alertpipe occasionally loses nonblocking status,
* which immediately causes a deadlock scenario.  Detect and prevent this. */

Sean Bright says:

I've also removed what I believe to be an erroneous code block that
checked the non-blocking flag on the pipe ends for each read. If the
file descriptor is 'losing' its non-blocking mode, it is because of a
bug somewhere else in our code.

By: Aaron An (aaron) 2017-12-07 08:40:18.766-0600

I will add the nonblock_check back before the alertpipe read(). and see what happened. if it solved the problem I can patch it on the gerrit.

By: Sean Bright (seanbright) 2017-12-07 08:41:14.162-0600

[~aaron], the rationale for the change was detailed in the review that you linked. The change from {{sizeof(char)}} to {{sizeof(uint64_t)}} was made due to requirements of [the {{eventfd}} API|http://man7.org/linux/man-pages/man2/eventfd.2.html].

File descriptors do not magically lose their {{O_NONBLOCK}} flag, so if that is actually what is happening here, there is code elsewhere in Asterisk that is erroneously doing it. The correct solution is to find out when, where, and why that is happening and fix it, not to roll back in a hack.

If putting that code block back in your code fixes the issue, that's great, but I don't think we should be doing that for all users.

Have you actually confirmed that the FD is losing the {{O_NONBLOCK}} flag or are you just assuming?

By: Sean Bright (seanbright) 2017-12-07 08:46:58.014-0600

[~aaron], attach your {{modules.conf}} please. And what version of Asterisk are you using?

By: Abhay Gupta (agupta) 2017-12-07 10:09:45.823-0600

Are you looking at some specific modules , because i have all the modules except for cdr and cel now .

By: Aaron An (aaron) 2017-12-07 20:40:17.617-0600

modules.conf

By: Aaron An (aaron) 2017-12-07 20:43:22.117-0600

Sean Bright, I actually confirmed that the FD is losing the O_NONBLOCK flag, please see my older comments. When Asterisk hangs I use gdb to attach to the 'pidof asterisk' process, The Thread is blocked at read(), that cause the current channel locked and thus lead to other channels locked.
My asterisk version is just 13.16.0.
Yes Sean Bright you are right, it is better to find out the root cause of losing NONBLOCK flag of FD.
I have the production environment and may hangs every day, and I can leave the asterisk process there to keep the hang status for you. you can ssh login and gdb to see what happened.
BTW, many times when I use gdb to get the stack information, asterisk will recover immediately without restart, it is very strange, isn't it?

By: Abhay Gupta (agupta) 2017-12-07 21:13:01.489-0600

Aaron i know how it recovers . In my case i have seen that as soon as that rogue channel is question is Hangup the issue resolves . On hangup the socket is closed and so the blocking read immediately comes out causing the recovery .

By: Abhay Gupta (agupta) 2017-12-08 02:12:01.965-0600

Aaron can you try 13.18.3 and see if the issue is resolved .

By: Aaron An (aaron) 2017-12-18 03:07:44.010-0600

Hi, update information here: asterisk remain hangs every few days. I take a lot of investigation about  the lose of NONBLOCK flag, It seems that other thread closes the fd holding by alertpipe.
for example:
channel A has alertpipe read fd=3, but other thread B call close(3) by some unknown reason. and maybe thread C get a new fd of value=3, so I can see the "Bad File Descriptor" error and the "Losing NONBLOCK flag" randomly.
The key issue is some module closes the FD not belong to it.

By: Aaron An (aaron) 2017-12-18 05:42:36.300-0600

I found a place that will close the FD twice. Asterisk 13.16.0 support rtcp-mux feature so rtcp use the same UDP port of rtp. But in res_rtp_asterisk.c function ast_rtp_destroy, we don't check if rtcp->s == rtp->s and close it twice, if in heavy load, free the FD twice will lead the Bad file descriptor issue. I will fix it in my production platform, and patch it to the gerrit. Thanks all the people watching this.

look at this code.
{noformat}
/* Close our own socket so we no longer get packets */
if (rtp->s > -1) {
close(rtp->s);
}

/* Destroy RTCP if it was being used */
if (rtp->rtcp) {
/*
* It is not possible for there to be an active RTCP scheduler
* entry at this point since it holds a reference to the
* RTP instance while it's active.
*/
close(rtp->rtcp->s);
ast_free(rtp->rtcp->local_addr_str);
ast_free(rtp->rtcp);
}
{noformat}


By: Abhay Gupta (agupta) 2017-12-18 11:12:09.885-0600

Right Aaron , i am also using webrtc and rtcp-mux and the problem only comes when at high pace webrtc makes call and hangup the same and that is when this problem appears .

By: Aaron An (aaron) 2017-12-18 21:13:15.369-0600

https://gerrit.asterisk.org/#/c/7618/

By: Friendly Automation (friendly-automation) 2017-12-20 07:48:27.795-0600

Change 7669 merged by Jenkins2:
res_rtp_asterisk:  Avoid close the rtp/rtcp fd twice.

[https://gerrit.asterisk.org/7669|https://gerrit.asterisk.org/7669]

By: Friendly Automation (friendly-automation) 2017-12-20 07:56:36.261-0600

Change 7668 merged by Jenkins2:
res_rtp_asterisk:  Avoid close the rtp/rtcp fd twice.

[https://gerrit.asterisk.org/7668|https://gerrit.asterisk.org/7668]

By: Friendly Automation (friendly-automation) 2017-12-20 07:56:37.997-0600

Change 7618 merged by Jenkins2:
res_rtp_asterisk:  Avoid close the rtp/rtcp fd twice.

[https://gerrit.asterisk.org/7618|https://gerrit.asterisk.org/7618]