Summary: ASTERISK-28793: Asterisk 13.32.0 crash in pjsip_tx_data_add_ref
Reporter: Josep B (icr)
Labels: webrtc
Date Opened: 2020-03-30 05:52:58
Date Closed:
Priority: Minor
Regression?:
Status: Open/New
Components: Channels/chan_pjsip Third-Party/pjproject
Versions: 13.32.0
Frequency of Occurrence: One Time
Related Issues:
Environment: We are using pjsip 2.9 with asterisk 13.32.0, using webrtc transport with 'rel100' activated. There are about 170 SIP endpoints connected and 150 simultaneous calls. Some end points are connected to WIFI networks.
Attachments:
( 0) 0001-sip_100rel-Additional-null-pointer-validation-to-avo.zip
( 1) SegFault_20200326.zip
( 2) segfault20200330.zip
( 3) segfault20200331.zip
( 4) segfault20200401.tar.gz
Description: Hi,

We are using Asterisk 13.32.0 with the bundled pjsip 2.9, using the webrtc transport with 'rel100' activated. There are about 170 SIP endpoints connected and 150 simultaneous calls.

We had a crash last week and have not been able to reproduce it yet. The stack of the segfaulting thread is:

#0  pjsip_tx_data_add_ref (tdata=0x0) at ../src/pjsip/sip_transport.c:512
#1  0x00007f8bfd51ee12 in on_retransmit (timer_heap=<optimized out>, entry=0x7f8b9437d748) at ../src/pjsip-ua/sip_100rel.c:599
#2  0x00007f8bfd5d2fa7 in pj_timer_heap_poll (ht=0x36f0850, next_delay=next_delay@entry=0x7f8beb697ce0) at ../src/pj/timer.c:659
#3  0x00007f8bfd536dad in pjsip_endpt_handle_events2 (endpt=0x36f0568, max_timeout=max_timeout@entry=0x7f8beb697d40, p_count=p_count@entry=0x0) at ../src/pjsip/sip_endpoint.c:716
#4  0x00007f8bfd536ec7 in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout@entry=0x7f8beb697d40) at ../src/pjsip/sip_endpoint.c:777
#5  0x00007f8b877a6f30 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4465
#6  0x00007f8bfd5bc000 in thread_main (param=0x379a3a8) at ../src/pj/os_core_unix.c:541
#7  0x00007f8bfb609e65 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00007f8bfa9ab88d in clone () from /usr/lib64/libc.so.6

It seems to be related to timers and/or rel100.

We attach additional information.

We ensured that related pjsip timer fixes (#2230 and #2172) were applied.

Additionally, we think the issue could be related to network latencies / problems, because some end points are connected to WIFI networks.

Does anyone know if this is a known issue?
Can anyone help us?
Comments:

By: Asterisk Team (asteriskteam) 2020-03-30 05:53:00.811-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Josep B (icr) 2020-03-30 05:59:25.563-0500

SegFault information

By: Benjamin Keith Ford (bford) 2020-03-30 14:06:15.538-0500

Looks like the tdata is not present, for whatever reason. I'll open up a ticket for this.

By: Josep B (icr) 2020-04-01 12:18:07.934-0500

When we analyzed the segfault core, we came to the same conclusion: tdata is not present. But we don't know how to fix the problem.

The problem has been reproduced two more times. I'll attach the stacks in case they help.

I saw that you have an internal issue (SWP-11065) linked as a clone. Is there any fix, workaround or information you can share with us? Can we help in some way to try to solve the issue?

By: Joshua C. Colp (jcolp) 2020-04-01 12:28:33.379-0500

There is no additional data on the SWP issue; it serves planning purposes for the Sangoma Asterisk Team. Any comments and such are placed on here, and there is no one actively working on this issue. If that changes then the assignee will change.

By: Josep B (icr) 2020-04-28 05:10:31.495-0500

Hi,

After applying pjsip #2350, the problem still reproduces.

We implemented a patch that avoids the null pointer segfault and adds some logging so we can confirm when the problem is avoided.

After testing it for a week, the logs confirm that the problem was avoided two times and the platform seems stable.

We know this is probably not the right solution but at least it seems to avoid this segfault.

We will share the patch: 0001-sip_100rel-Additional-null-pointer-validation-to-avo.zip
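
For readers without the attachment, here is a rough, self-contained sketch of the kind of guard described above. It is not the attached patch and does not use real pjsip types: `tx_data`, `rel_state`, `tx_data_add_ref`, `on_retransmit_guarded` and `demo` are hypothetical stand-ins, assumed only for illustration. The idea is that the retransmit timer callback checks for a missing tdata, logs it, and returns instead of dereferencing a null pointer as in frame #0 of the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a transmit buffer; NOT the real pjsip_tx_data. */
typedef struct tx_data {
    int ref_cnt;
} tx_data;

/* Hypothetical stand-in for the retransmission state owned by a timer entry. */
typedef struct rel_state {
    tx_data *pending;   /* response waiting to be retransmitted, or NULL */
    unsigned skipped;   /* number of times the guard fired */
} rel_state;

/* Dereferences its argument, so passing NULL here would segfault. */
static void tx_data_add_ref(tx_data *tdata)
{
    tdata->ref_cnt++;
}

/* Timer callback sketch: returns 0 if it proceeded, -1 if the guard fired. */
static int on_retransmit_guarded(rel_state *st)
{
    if (st == NULL) {
        return -1;
    }
    if (st->pending == NULL) {
        /* Log so that avoided crashes can be counted afterwards. */
        fprintf(stderr, "100rel: retransmit timer fired without tdata, skipping\n");
        st->skipped++;
        return -1;
    }
    tx_data_add_ref(st->pending);
    /* A real implementation would retransmit the response here. */
    return 0;
}

/* Fires the callback once with a tdata and once without; returns skip count. */
unsigned demo(void)
{
    tx_data td = { 1 };
    rel_state st = { &td, 0 };

    assert(on_retransmit_guarded(&st) == 0);   /* normal path */
    assert(td.ref_cnt == 2);

    st.pending = NULL;                         /* tdata went away before the timer fired */
    assert(on_retransmit_guarded(&st) == -1);  /* guarded path, no crash */

    return st.skipped;
}
```

A guard like this only hides the symptom; it does not explain why the timer fires after the tdata is gone, which is why a proper fix would still need to come from the pjsip side.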

If any developer wants to collaborate, we are happy to help collect information or run tests to find a more suitable solution.

By: Josep B (icr) 2020-05-12 05:23:04.368-0500

If anyone is interested in this issue, it is now being handled on the pjsip tracker: https://github.com/pjsip/pjproject/issues/2387.