ASTERISK-27205: res_rtp_asterisk: Asterisk crash with netlink error

Summary: ASTERISK-27205: res_rtp_asterisk: Asterisk crash with netlink error

Reporter: Abhay Gupta (agupta)
Labels:

Date Opened: 2017-08-18 02:12:19
Date Closed: 2017-12-22 00:28:45.000-0600

Priority: Major
Regression?

Status: Closed/Complete
Components: Resources/res_pjsip Resources/res_rtp_asterisk

Versions: 13.15.0 13.17.0
Frequency of Occurrence: Frequent

Related Issues:
is duplicated by ASTERISK-27376 Crash Asterisk chan_sip rtcp_mux

is duplicated by ASTERISK-27420 Asterisk crash - Unexpected error 9 on netlink descriptor 99

Environment: Ubuntu Linux, kernel 4.4.0-83, glibc 2.23, Asterisk using internal PJSIP

Attachments: ( 0) 14_50.txt
( 1) 17_08.txt
( 2) 24_08_12_00_bt_log
( 3) bt.txt
( 4) filename.txt
( 5) full_100000
( 6) gdb_18aug.txt
( 7) log.txt
( 8) logs.txt
( 9) Screen_Shot_2017-09-12_at_12.34.05_PM.png
(10) Screen_Shot_2017-09-12_at_12.36.06_PM.png

Description: There is a segmentation fault that causes a crash; the bt is attached. A netlink error seems to be the cause, as can be seen from this entry in the glibc changelog:

https://abi-laboratory.pro/tracker/changelog/glibc/2.23/log.html

getaddrinfo now detects certain invalid responses on an internal netlink
socket. If such responses are received, an affected process will
terminate with an error message of "Unexpected error <number> on netlink
descriptor <number>" or "Unexpected netlink response of size <number> on
descriptor <number>". The most likely cause for these errors is a
multi-threaded application which erroneously closes and reuses the netlink
file descriptor while it is used by getaddrinfo.

This signal is not caught, which causes Asterisk to crash.
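
The failure mode named in the glibc changelog entry (a descriptor closed and then reused while another user still holds the number) can be sketched outside Asterisk. The following is an illustrative Python sketch only, not Asterisk or glibc code: the function name `stale_close_demo` is hypothetical, pipes stand in for the netlink socket, and it assumes a POSIX system with no other thread opening descriptors at the same moment. The resulting EBADF is errno 9 on Linux, the same number that appears in the title of ASTERISK-27420.

```python
import errno
import os


def stale_close_demo():
    """Return (reused, err): whether the fd number was reused, and the
    errno seen by the second owner after a stale close()."""
    # Owner A opens a pipe, remembers the read end's number, and closes it.
    a_read, a_write = os.pipe()
    stale_fd = a_read
    os.close(a_read)  # first, legitimate close

    # Owner B (standing in for getaddrinfo's netlink socket) opens a new
    # descriptor. POSIX hands out the lowest free number, so B's read end
    # gets the number A just released.
    b_read, b_write = os.pipe()
    reused = (b_read == stale_fd)
    os.write(b_write, b"x")  # make sure a valid read would not block

    # Owner A closes its stale number a second time. Because the number was
    # reused, this closes owner B's descriptor instead.
    try:
        os.close(stale_fd)
    except OSError:
        pass  # only reached if the number was not reused

    # Owner B now uses its descriptor and records any error.
    err = 0
    try:
        os.read(b_read, 1)
    except OSError as exc:
        err = exc.errno

    # Clean up whatever is still open.
    for fd in (a_write, b_write, b_read):
        try:
            os.close(fd)
        except OSError:
            pass
    return reused, err


if __name__ == "__main__":
    reused, err = stale_close_demo()
    print("fd number reused:", reused)
    print("error is EBADF:", err == errno.EBADF)
```

In a multi-threaded process the two "owners" are different threads, so the stale close and the reuse can interleave at any time, which is why such crashes tend to show up only under load.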

Comments: By: Asterisk Team (asteriskteam) 2017-08-18 02:12:21.038-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: Abhay Gupta (agupta) 2017-08-18 02:15:02.380-0500

This is the bt, which shows the netlink error and the resulting abort.
By: George Joseph (gjoseph) 2017-08-18 07:44:46.840-0500

Do you have a simple scenario that we can use to reproduce the issue?
If you still have that core file, can you do a "thread apply all bt full" on it and attach the results (assuming you had DONT_OPTIMIZE and BETTER_BACKTRACES set in the asterisk compile options)?

By: Abhay Gupta (agupta) 2017-08-18 08:18:30.343-0500

No, we do not have a scenario, as it happens on a production server. The current core dump was taken without DONT_OPTIMIZE and BETTER_BACKTRACES. I have attached it now anyway, but only as "thread apply all bt": it has a huge number of threads, for reasons not known to me, and "thread apply all bt full" would have been a huge file. On the weekend I will update the Asterisk version and set both flags to get better traces when calling resumes and it crashes again.
By: Abhay Gupta (agupta) 2017-08-23 01:51:28.229-0500

Two different crashes; attached are bt, bt full, and thread apply all bt full.
By: Rusty Newton (rnewton) 2017-08-23 15:43:32.457-0500

If possible, please also gather a debug log when you capture the new trace:

https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Turning up logging may generate large log files so you will need to rotate them. We'll only need the last few thousand lines. No need to attach a huge file. Be sure to sanitize any private information if necessary since this is from a production system.
By: Abhay Gupta (agupta) 2017-08-24 09:47:46.435-0500

Attached are bt, bt full, thread apply all bt full, and full logs with debug.
By: Abhay Gupta (agupta) 2017-09-06 04:38:49.261-0500

At the time of the issue, these lines are always visible, as in the attached logs:

res_timing_timerfd.c: Call to timerfd_gettime() using handle 197 error: Bad file descriptor
res_timing_timerfd.c: Call to timerfd_gettime() using handle 197 error: Invalid argument

[Sep 5 18:12:41] WARNING[25849][C-00028088] channel.c: Unable to write to alert pipe on Local/agentmanual@XYZ-00027bc1;1 (qlen = 94): Invalid argument!
[Sep 5 18:12:41] WARNING[25849][C-00028088] channel.c: Unable to write to alert pipe on Local/agentmanual@XYZ-00027bc1;1 (qlen = 95): Invalid argument!

followed by a taskprocessor queue full error.

By: Abhay Gupta (agupta) 2017-09-12 01:17:52.908-0500

The issue seems to be linked to too many STUN requests; I have observed the same in tcpdump, as mentioned in ASTERISK-25317.

By: Abhay Gupta (agupta) 2017-09-12 02:10:20.971-0500

Screenshots showing a large number of STUN requests that are not serviced by pjnath, resulting in the issue.
By: Abhay Gupta (agupta) 2017-10-08 20:41:51.945-0500

Upgraded Asterisk to 13.17.2, but this error still remains. Attaching the latest gdb output.
By: Abhay Gupta (agupta) 2017-10-25 08:15:36.066-0500

Netlink error still appears
By: George Joseph (gjoseph) 2017-10-30 07:52:48.222-0500

What happens if you compile without res_timing_timerfd (or use a "noload" statement in modules.conf)?
Do you still get the crash?

NOTE: You do need at least 1 timing interface like res_timing_dahdi or res_timing_pthread.

By: Abhay Gupta (agupta) 2017-10-30 08:13:37.985-0500

We do not have DAHDI installed, so timerfd is being used; we have not tried disabling it. But yes, this problem occurs only on the setup where we have no DAHDI and only SIP trunks are in use.
By: George Joseph (gjoseph) 2017-10-30 09:25:25.693-0500

How about trying res_timing_pthread?

By: Abhay Gupta (agupta) 2017-10-30 20:59:38.284-0500

I never changed the default timing setting, mainly because on all the Digium blogs pthread is considered the least stable and timerfd the most stable, and timerfd is the default on new kernels.
By: Abhay Gupta (agupta) 2017-10-31 03:47:36.947-0500

I have installed DAHDI and its timer is showing up now. Will let you know if that helps in solving the issue.
By: George Joseph (gjoseph) 2017-10-31 05:47:16.079-0500

OK, Thanks.
By: Sebastian Gutierrez (sum) 2017-11-22 16:28:36.484-0600

Any update on this?
By: Joshua C. Colp (jcolp) 2017-11-22 16:35:20.116-0600

If there are any updates or someone needs additional information it will be posted here. The issue is in the "Open" state so it has been accepted.
By: Sebastian Gutierrez (sum) 2017-12-20 08:26:56.409-0600

As additional information, I can confirm that all my servers using WebRTC over chan_sip with rtcpmux=yes have this crash. I had to move them to rtcpmux=no, and the crash no longer occurs; tested on more than 15 servers with high traffic.

By: Joshua C. Colp (jcolp) 2017-12-20 08:34:51.725-0600

[~sum] I have a feeling that the fix for ASTERISK-27299 may have also fixed this based on your comment you just left. The file descriptor that was in use by netlink may have gotten closed accidentally.
By: Sebastian Gutierrez (sum) 2017-12-20 08:47:11.161-0600

I will try the latest 13 branch, which has this fix, and will update this issue.
By: Abhay Gupta (agupta) 2017-12-22 00:22:07.861-0600

This problem has not recurred since the system was updated to 13.18.4.