Summary: ASTERISK-28941: segfault in pjsip timer
Reporter: Alan Graham (zerohalo)
Labels:
Date Opened: 2020-06-09 09:49:31
Date Closed: 2020-06-23 10:44:07
Priority: Major
Regression?
Status: Closed/Complete
Components: pjproject/pjsip
Versions: 16.8.0 16.9.0 16.10.0
Frequency of Occurrence: Frequent
Related Issues: is related to ASTERISK-28161 Removal of Previous Patch Causes PJSIP Timer Issues
Environment: Debian Stretch
Attachments:
Description: At least daily, sometimes multiple times daily, Asterisk crashes in this thread:

{noformat}
Thread 1 (Thread 0x7f9f6e98b700 (LWP 53)):
#0  0x00007f9fdc30f711 in copy_node (ht=0x561e6933d960, slot=0, moved_node=0x7f9fb006b0f0) at ../src/pj/timer.c:137
#1  0x00007f9fdc30f9ee in reheap_down (ht=0x561e6933d960, moved_node=0x7f9fd009c680, slot=0, child=1) at ../src/pj/timer.c:185
#2  0x00007f9fdc30fd34 in remove_node (ht=0x561e6933d960, slot=0) at ../src/pj/timer.c:252
#3  0x00007f9fdc310694 in pj_timer_heap_poll (ht=0x561e6933d960, next_delay=0x7f9f6e98ae10) at ../src/pj/timer.c:643
#4  0x00007f9fdc259c3e in pjsip_endpt_handle_events2 (endpt=0x561e6933d678, max_timeout=0x7f9f6e98ae80, p_count=0x0) at ../src/pjsip/sip_endpoint.c:716
#5  0x00007f9fdc259d84 in pjsip_endpt_handle_events (endpt=0x561e6933d678, max_timeout=0x7f9f6e98ae80) at ../src/pjsip/sip_endpoint.c:777
#6  0x00007f9f91580936 in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4715
#7  0x00007f9fdc2f71d4 in thread_main (param=0x561e69569b68) at ../src/pj/os_core_unix.c:541
#8  0x00007f9fda7854a4 in start_thread (arg=0x7f9f6e98b700) at pthread_create.c:456
#9  0x00007f9fd938fd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
{noformat}

It looks similar to older issues like ASTERISK-27187, though that issue used external pjsip and we're using bundled.

I have complete core and BT available, if needed.
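
For readers unfamiliar with the data structure in frames #0 to #3, the following is a minimal sketch of a binary min-heap timer queue. It is not pjproject's timer.c: the struct layout, the fixed capacity, and the helpers `timer_heap_schedule` and `timer_heap_poll` are assumptions made for illustration, and only the names `copy_node`, `reheap_down` and `remove_node` are borrowed from the backtrace. It shows why polling the heap ends up in `copy_node`: removing the earliest timer moves the last entry to the root and sifts it down, dereferencing every node pointer it touches along the way. In a structure like this, a stale or freed node pointer left in the array would be dereferenced at exactly that point.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIMERS 16           /* hypothetical fixed capacity */

typedef struct timer_node {
    long   expires;             /* absolute expiry time, arbitrary units */
    size_t slot;                /* current index of this node in the heap */
} timer_node;

typedef struct timer_heap {
    timer_node *heap[MAX_TIMERS];
    size_t      size;
} timer_heap;

/* Store a node in a slot and record that slot inside the node. */
static void copy_node(timer_heap *ht, size_t slot, timer_node *moved)
{
    ht->heap[slot] = moved;
    moved->slot = slot;
}

/* Sift a node towards the root while it expires before its parent. */
static void reheap_up(timer_heap *ht, timer_node *moved, size_t slot)
{
    while (slot > 0) {
        size_t parent = (slot - 1) / 2;
        if (ht->heap[parent]->expires <= moved->expires)
            break;
        copy_node(ht, slot, ht->heap[parent]);
        slot = parent;
    }
    copy_node(ht, slot, moved);
}

/* Sift a node towards the leaves while a child expires before it. */
static void reheap_down(timer_heap *ht, timer_node *moved, size_t slot)
{
    size_t child = 2 * slot + 1;
    while (child < ht->size) {
        /* Pick the child that expires first. */
        if (child + 1 < ht->size &&
            ht->heap[child + 1]->expires < ht->heap[child]->expires)
            child++;
        if (moved->expires <= ht->heap[child]->expires)
            break;
        copy_node(ht, slot, ht->heap[child]);
        slot = child;
        child = 2 * slot + 1;
    }
    copy_node(ht, slot, moved);
}

/* Add a timer; returns 0 on success, -1 if the heap is full. */
static int timer_heap_schedule(timer_heap *ht, timer_node *node)
{
    if (ht->size >= MAX_TIMERS)
        return -1;
    reheap_up(ht, node, ht->size++);
    return 0;
}

/* Remove the node at a slot by moving the last entry into its place. */
static timer_node *remove_node(timer_heap *ht, size_t slot)
{
    assert(slot < ht->size);
    timer_node *removed = ht->heap[slot];
    timer_node *last = ht->heap[--ht->size];

    if (slot < ht->size) {
        size_t parent = slot ? (slot - 1) / 2 : 0;
        if (slot > 0 && last->expires < ht->heap[parent]->expires)
            reheap_up(ht, last, slot);
        else
            reheap_down(ht, last, slot);
    }
    return removed;
}

/* Pop every timer that has expired by 'now', up to 'max' entries. */
static size_t timer_heap_poll(timer_heap *ht, long now,
                              timer_node **fired, size_t max)
{
    size_t count = 0;
    while (ht->size > 0 && count < max && ht->heap[0]->expires <= now)
        fired[count++] = remove_node(ht, 0);
    return count;
}
```

Scheduling timers that expire at 30, 10 and 20 and then polling at time 25 returns the 10 and 20 entries in that order and leaves the 30 entry at slot 0. Every step relies on the pointers in `heap[]` still being valid, which is the invariant a crash in `copy_node` suggests was broken.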
Comments:

By: Asterisk Team (asteriskteam) 2020-06-09 09:49:31.904-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: George Joseph (gjoseph) 2020-06-10 08:17:22.467-0500

Hi Alan, we've been chasing issues in that pjproject timer code for some time now. Teluu (the maintainers of pjproject) released pjproject version 2.10 with some timer fixes, but we found issues with that release, so we hadn't updated Asterisk with it. Just yesterday, they gave us a patch to test, so we're going to try that today and see if we can get something for you to test, maybe tomorrow.



By: Alan Graham (zerohalo) 2020-06-10 13:07:22.121-0500

Thanks, George. I did pull Kevin's patch for bundling 2.10 from Gerrit and have been load testing with some success so far. Unfortunately, load testing in my dev environments has yet to produce the SEGV with 2.09 or 2.10; naturally, it only happens in production. I did see the patch that was added to issue 2443 of pjproject, but haven't been able to test it yet. I'm having trouble getting it to work through the pjsip patch mechanism in the build.

In any case, I'm happy to continue testing and thank you!

By: Kevin Harwell (kharwell) 2020-06-11 13:33:23.286-0500

Adding a link to make it easier for others that might want to pull the patch for 2.10:

https://gerrit.asterisk.org/c/asterisk/+/14413 (16 branch)

Also, I just finished running more tests and updated it after verifying a patch pjproject sent us for something that was causing an issue with SIP sessions. That shouldn't have affected anything timer related, though.

[~zerohalo] Thanks for testing!

By: Alan Graham (zerohalo) 2020-06-16 12:12:30.316-0500

We've completed several load tests against these patches on 16.9 (will be trying later versions soon) without triggering the timer issue, so we're going to attempt to move this into a production environment. Will report in the next several days/week.

By: Kevin Harwell (kharwell) 2020-06-16 17:17:10.370-0500

Good news so far. I'm going to put this into "waiting for feedback" since we're waiting on more testing.

By: Alan Graham (zerohalo) 2020-06-23 09:38:29.816-0500

We've been running this in production for a full week now and we've yet to see any new occurrences of the timer segv, when typically we'd see it a couple-three times a week, easily. This is an ARI-only environment in k8s, so that's the extent to which we've been able to test it. We've closed our internal ticket for it, so we're satisfied that this is fixed.

By: Kevin Harwell (kharwell) 2020-06-23 10:43:36.102-0500

Good news! And thanks for testing, and letting us know.