Summary: | ASTERISK-27205: res_rtp_asterisk: Asterisk crash with netlink error |
Reporter: | Abhay Gupta (agupta) | Labels: | |
Date Opened: | 2017-08-18 02:12:19 | Date Closed: | 2017-12-22 00:28:45.000-0600 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Resources/res_pjsip, Resources/res_rtp_asterisk |
Versions: | 13.15.0, 13.17.0 | Frequency of Occurrence | Frequent |
Related Issues: | |
Environment: | Linux Ubuntu kernel 4.4.0-83 with glibc 2.23 and asterisk using internal PJSIP | Attachments: | (0) 14_50.txt, (1) 17_08.txt, (2) 24_08_12_00_bt_log, (3) bt.txt, (4) filename.txt, (5) full_100000, (6) gdb_18aug.txt, (7) log.txt, (8) logs.txt, (9) Screen_Shot_2017-09-12_at_12.34.05_PM.png, (10) Screen_Shot_2017-09-12_at_12.36.06_PM.png |
Description: | A segmentation fault causes Asterisk to crash; the backtrace is attached. A netlink error appears to be the cause, as described in the glibc 2.23 changelog:
https://abi-laboratory.pro/tracker/changelog/glibc/2.23/log.html
getaddrinfo now detects certain invalid responses on an internal netlink socket. If such responses are received, an affected process will terminate with an error message of "Unexpected error <number> on netlink descriptor <number>" or "Unexpected netlink response of size <number> on descriptor <number>". The most likely cause for these errors is a multi-threaded application which erroneously closes and reuses the netlink file descriptor while it is used by getaddrinfo.
The resulting signal is not caught, so Asterisk crashes. |
Comments: | By: Asterisk Team (asteriskteam) 2017-08-18 02:12:21.038-0500
Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Abhay Gupta (agupta) 2017-08-18 02:15:02.380-0500
This is the backtrace, which shows the netlink error and the resulting abort.

By: George Joseph (gjoseph) 2017-08-18 07:44:46.840-0500
Do you have a simple scenario that we can use to reproduce the issue? If you still have that core file, can you do a "thread apply all bt full" on it and attach the results (assuming you had DONT_OPTIMIZE and BETTER_BACKTRACES set in the asterisk compile options)?

By: Abhay Gupta (agupta) 2017-08-18 08:18:30.343-0500
No, we do not have a scenario, as it happens on a production server. The current core dump was taken without DONT_OPTIMIZE and BETTER_BACKTRACES. I have attached it anyway, but only as "thread apply all bt": the process had a huge number of threads, for reasons unknown to me, and "thread apply all bt full" would have produced a huge file. Over the weekend I will update the Asterisk version and set both flags to get better traces when call traffic resumes and it crashes again.

By: Abhay Gupta (agupta) 2017-08-23 01:51:28.229-0500
Two different crashes. Attached are bt, bt full, and thread apply all bt full.

By: Rusty Newton (rnewton) 2017-08-23 15:43:32.457-0500
If possible, please also gather a debug log when you capture the new trace: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
Turning up logging may generate large log files so you will need to rotate them. We'll only need the last few thousand lines. No need to attach a huge file. Be sure to sanitize any private information if necessary since this is from a production system.

By: Abhay Gupta (agupta) 2017-08-24 09:47:46.435-0500
Attached are bt, bt full, thread apply all bt full, and full logs with debug.

By: Abhay Gupta (agupta) 2017-09-06 04:38:49.261-0500
At the time of the issue, these lines are always visible in the attached logs:
res_timing_timerfd.c: Call to timerfd_gettime() using handle 197 error: Bad file descriptor
res_timing_timerfd.c: Call to timerfd_gettime() using handle 197 error: Invalid argument
[Sep 5 18:12:41] WARNING[25849][C-00028088] channel.c: Unable to write to alert pipe on Local/agentmanual@XYZ-00027bc1;1 (qlen = 94): Invalid argument!
[Sep 5 18:12:41] WARNING[25849][C-00028088] channel.c: Unable to write to alert pipe on Local/agentmanual@XYZ-00027bc1;1 (qlen = 95): Invalid argument!
These are followed by a taskprocessor queue full error.

By: Abhay Gupta (agupta) 2017-09-12 01:17:52.908-0500
The issue seems to be linked to too many STUN requests; I have observed the same in tcpdump, as mentioned in bug ASTERISK-25317.

By: Abhay Gupta (agupta) 2017-09-12 02:10:20.971-0500
Attached screenshots show a host of STUN requests that are not serviced by pjnath, resulting in the issue.

By: Abhay Gupta (agupta) 2017-10-08 20:41:51.945-0500
Upgraded Asterisk to 13.17.2, but this error still remains. Attaching the latest gdb output.

By: Abhay Gupta (agupta) 2017-10-25 08:15:36.066-0500
Netlink error still appears.

By: George Joseph (gjoseph) 2017-10-30 07:52:48.222-0500
What happens if you compile without res_timing_timerfd (or use a "noload" statement in modules.conf)? Do you still get the crash?
NOTE: You do need at least 1 timing interface like res_timing_dahdi or res_timing_pthread.

By: Abhay Gupta (agupta) 2017-10-30 08:13:37.985-0500
We do not have DAHDI installed, so timerfd is being used; we have not tried disabling it. But yes, this problem occurs only on the setup where we have no DAHDI and only SIP trunks are in use.

By: George Joseph (gjoseph) 2017-10-30 09:25:25.693-0500
How about trying res_timing_pthread?

By: Abhay Gupta (agupta) 2017-10-30 20:59:38.284-0500
We never changed the default timing setting, mainly because on all Digium blogs pthread is considered the least stable and timerfd the most stable, and timerfd is the default on new kernels.

By: Abhay Gupta (agupta) 2017-10-31 03:47:36.947-0500
I have installed DAHDI and its timer is now showing up. Will let you know if that helps in solving the issue.

By: George Joseph (gjoseph) 2017-10-31 05:47:16.079-0500
OK, Thanks.

By: Sebastian Gutierrez (sum) 2017-11-22 16:28:36.484-0600
Any update on this?

By: Joshua C. Colp (jcolp) 2017-11-22 16:35:20.116-0600
If there are any updates or someone needs additional information it will be posted here. The issue is in the "Open" state so it has been accepted.

By: Sebastian Gutierrez (sum) 2017-12-20 08:26:56.409-0600
As additional information, I can confirm that all my servers using WebRTC with chan_sip and rtcpmux=yes have this crash. I had to move them to rtcpmux=no, and the crash no longer occurs; tested on more than 15 servers with high traffic.

By: Joshua C. Colp (jcolp) 2017-12-20 08:34:51.725-0600
[~sum] I have a feeling that the fix for ASTERISK-27299 may have also fixed this, based on the comment you just left. The file descriptor that was in use by netlink may have gotten closed accidentally.

By: Sebastian Gutierrez (sum) 2017-12-20 08:47:11.161-0600
I will try the latest 13 branch that has this fix and will update this issue.

By: Abhay Gupta (agupta) 2017-12-22 00:22:07.861-0600
This problem has not recurred since the update to 13.18.4. |
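For readers who want to try the diagnostic step suggested in the thread (keeping res_timing_timerfd from loading through a "noload" statement), a modules.conf fragment might look like the sketch below. It assumes the stock `[modules]` section with autoload enabled, and that another timing module such as res_timing_pthread or res_timing_dahdi is built and available, since at least one timing interface is required.

```ini
; modules.conf (sketch; assumes autoload is in use)
[modules]
autoload = yes
; Keep the timerfd timing module from loading so that another
; timing interface (for example res_timing_pthread) is used instead.
noload => res_timing_timerfd.so
```

Asterisk needs a restart for the change to take effect. Per the final comments in the thread, the crash stopped after the upgrade to 13.18.4, so this fragment is useful only for isolating the timing module on affected versions.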