Summary: ASTERISK-24925: Crash within pjproject (libpjnath) pj_stun_session_on_rx_pkt
Reporter: Stefan Engström (StefanEng86)
Labels:
Date Opened: 2015-03-30 10:33:39
Date Closed: 2015-07-16 15:46:45
Priority: Major
Regression?:
Status: Closed/Complete
Components: Channels/chan_pjsip, pjproject/pjsip
Versions: 13.1.0
Frequency of Occurrence:
Related Issues:
Environment: pjproject version 2.2, Asterisk version 13.1.0, OS is 64-bit Fedora 20.
Attachments: ( 0) add_bt.txt
( 1) Backtraces_April_28_to_May_1st.zip
( 2) crash-coredump-with-debuginfo-toupload
( 3) crash-coredump-with-debuginfo-toupload.txt
( 4) webrtcstundebug.pdf
( 5) wiresharksnapshotstunburst.PNG
Description: Not yet reproducible. The use case is a call to a WebRTC peer, i.e. a chan_sip peer with transport=wss and icesupport=yes.

Will try to debug this issue myself first, and will add more data continuously.



Comments:

By: Rusty Newton (rnewton) 2015-03-31 16:06:18.953-0500

Stefan, I'll let the issue sit in Waiting For Feedback. You can "Send Back" once you have backtraces and reproduction data for the issue. Thanks!

Even if you can't reproduce it, go ahead and post backtraces if you can get useful ones with the compiler flags DONT_OPTIMIZE and BETTER_BACKTRACES enabled.
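
Those flags are set through menuselect; assuming you're building from source, something like the following (the source path is just an example):

    $ cd /usr/src/asterisk
    $ make menuselect     # enable DONT_OPTIMIZE and BETTER_BACKTRACES
                          # under "Compiler Flags"
    $ make && make install

then wait for the next crash and grab the core file.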

By: Stefan Engström (StefanEng86) 2015-04-09 04:10:55.448-0500

I am uploading two sketches I made while trying to understand the underlying process in which the crash happened. The STUN (ICE) process behaves very strangely (in my unprofessional opinion), and I don't know whose fault that is. There are at least two odd things I noticed:
a) When we call a WebRTC peer that uses Chrome, that peer sends 10-100 STUN requests per second from the moment it receives the SDP offer until it answers the call with OK. Perhaps not a problem, but Firefox does not do this. Does that happen for anyone else?
b) When we call a WebRTC peer that uses Chrome and that peer sends an OK with an SDP answer, then upon receiving the answer, Asterisk (through pjproject) sends 10-100 _identical_ STUN requests within the same millisecond, to the same RTP/RTCP ports that are later used for RTP media (see the Wireshark snapshot and the other sketch). Does this happen for anyone else, or is it just my environment that's bad? I tried both versions 2.2 and 2.3, but only on Fedora 20 with x86_64, and only on virtual machines.

I finally got the debuginfo symbols to work for pjproject, so when the next crash happens I'm going to upload a coredump with useful info.

By: Stefan Engström (StefanEng86) 2015-04-20 06:27:05.473-0500

Finally got another crash... I took a quick look at pjnath's stun_session.c, and it looks like in all versions of pjproject (2.2, 2.3, 2.4), the function check_cached_response does a while (t != &sess->cached_response_list) { ... } where t is never tested for NULL. Adding t && to the while condition might have prevented this crash?
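
Roughly, the loop looks like this (paraphrasing the pjnath source from memory; type and field names may not be exact):

    /* pj_list is a circular doubly-linked list: the head acts as a
     * sentinel, and pj_list_init() points next/prev back at the head,
     * so a well-formed list terminates the walk without any NULL test. */
    pj_stun_tx_data *t = sess->cached_response_list.next;
    while (t != &sess->cached_response_list) {
        /* ... compare t's transaction ID against the incoming request ... */
        t = t->next;
    }

    /* The suggested guard, which would at least avoid dereferencing
     * a NULL node if the list is corrupt: */
    while (t && t != &sess->cached_response_list) {
        /* ... */
        t = t->next;
    }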

By: Dade Brandon (dade) 2015-04-24 21:25:22.851-0500

I'm getting this same crash in 11.17.1. I haven't had the chance to catch this crash with debug symbols enabled in pjproject, but wanted to add a "me too" to this.

I don't think it's a problem in pjproject.

Based on reviewing sess->cached_response_list and the pjproject list-management inlines, t should never be null on that line unless the list was never initialized; however, your backtrace clearly shows it as null. My gut feeling is that Asterisk is sending two STUN packets through this code path, from different threads, before pj_stun_session_create has a chance to run pj_list_init(&sess->cached_response_list) in the first thread.
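
To illustrate the interleaving I have in mind (a sketch only; arguments elided, and the two-thread scenario is my hypothesis, not something proven from the dumps):

    /* Thread A: session setup in progress */
    pj_stun_session_create(/* ... */);
        /* internally ends with
         * pj_list_init(&sess->cached_response_list),
         * which sets next = prev = &cached_response_list */

    /* Thread B: a read callback delivers a STUN packet on the same
     * session before Thread A's pj_list_init() has run */
    pj_stun_session_on_rx_pkt(/* ... */);
        /* -> check_cached_response() reads
         * sess->cached_response_list.next, which is still NULL
         * -> the SEGV in the backtrace */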

Adding a NULL test to the t != &sess->cached_response_list check on the line you reference, aside from being outside the scope of this Jira, would more likely just defer the crash, into pj_stun_session_destroy at the latest.

I guess whether this is an Asterisk issue or a pjproject issue would come down to whether or not pjproject has documented that per-component RTP read callbacks are not thread-safe.

Unfortunately I won't be the one fixing this; I'm not experienced enough with multithreaded development to resolve it confidently.

By: Rusty Newton (rnewton) 2015-04-27 12:43:14.578-0500

Re-attaching Stefan's debug as .txt so it can be viewed via browser.

By: Dade Brandon (dade) 2015-04-27 19:39:06.440-0500

In case this helps: although I don't have pjproject debug symbols on this server instance, this backtrace appears to be related. It's in destroy_tdata, the inline function used to release the pjproject lists, which depends on the same invariant - that .next is never null - and there are only two destroy_tdata calls in pj_stun_session_destroy. I'd like to assume this is the same sess->cached_response_list and that the corruption of this list occurs in the same way as in my previous debug references, but of course it could be the pending_request_list. If you know of a way for me to retrieve more useful data from this core dump, let me know what commands to run.
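
For reference, unlinking a node from a pj_list dereferences both neighbors unconditionally, so a NULL link faults immediately; roughly (paraphrasing the pjlib inline, not verbatim):

    /* roughly what pj_list_erase(node) does: */
    node->prev->next = node->next;   /* faults if node->prev is NULL */
    node->next->prev = node->prev;   /* faults if node->next is NULL */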

By: Dade Brandon (dade) 2015-04-27 19:39:54.040-0500

My last comment was in reference to the add_bt.txt that I attached; I didn't realize that Jira would not frame it adequately within the context of my comment.

By: Dade Brandon (dade) 2015-05-02 01:18:33.627-0500

I'm attaching 34 backtraces from the past week; I think they all concern the same issue. As I mentioned, there is not only the SEGV in check_cached_response called from pj_stun_session_on_rx_pkt, but also one coming from destroy_tdata. There are also backtraces from SSL asserts and malloc asserts (double-free errors, "corrupted double-linked list", "corrupted unsorted chunks").

These issues were not present before we switched our user base to WebRTC.

If there's anything I can run against these pcaps to provide further assistance, or if someone can give me instructions for enabling GDB support for the pjproject functions, I am open to helping as much as possible. I'm keeping these cores and the original executables around.
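
So far I've only pulled the basics out of the cores, e.g. (binary and core paths here are just examples):

    $ gdb /usr/sbin/asterisk /tmp/core.12345
    (gdb) bt full
    (gdb) thread apply all bt
    (gdb) info registers

but without pjproject debug symbols the frames inside libpjnath are only addresses.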


By: Dade Brandon (dade) 2015-05-02 01:21:45.784-0500

As a side note, all of the ast_rtp_instance_set_read_format and ast_rtp_instance_get_stats crashes have instance->engine == NULL and are null pointer dereferences.

Let me know if you feel any group of these should be filed as a separate bug report, but given that it all started after our WebRTC migration, I suspect a memory corruption somewhere is causing all of these crash locations.

By: Rusty Newton (rnewton) 2015-07-09 18:43:38.878-0500

Before we dig into this issue: do these issues still occur after the fixes for ASTERISK-25103, or, alternatively, when testing with the 13 branch and pjsip 2.4?



By: Stefan Engström (StefanEng86) 2015-07-11 10:29:23.013-0500

Our check_cached_response crash only ever happened in production, and we do not have a large WebRTC user base at the moment. I believe we will upgrade to pjproject 2.4 when 13.5 is released. Dade Brandon, does your pjproject also spam STUN requests when answering calls, as in my Wireshark screenshot? I can't figure out why; anyone got ideas?

By: Rusty Newton (rnewton) 2015-07-15 14:16:24.480-0500

[~dade] do the crashes you posted here still occur after the fixes for ASTERISK-25103?

By: Dade Brandon (dade) 2015-07-15 15:00:33.730-0500

We've caught one, using the latest patch. I haven't fully reviewed it, but IIRC it was in the destroy-session function. It was an OpenSSL assert, so someone on our end was going to upgrade OpenSSL on that server. We are doing a very slow patch rollout this round. I'll be back from vacation after this weekend and will post everything I've got.

By: Rusty Newton (rnewton) 2015-07-16 15:46:37.698-0500

Well, I don't want this issue to turn into a dumping ground for general crash debugging. We need to keep clear which particular crash this issue is for. Since we are not sure whether the original crashes for this issue in particular are still happening, I'm going to close this issue out and ask that everyone open a new issue for any crashes they get after upgrading OpenSSL, Asterisk, or pjproject (2.4).

Stefan, if the STUN request spam happens independently of any crashing, then open a new issue for that as well.



By: Bobby Hakimi (bobbymc) 2015-11-03 16:39:55.137-0600

@dade what version of OpenSSL did you upgrade to?