
Summary: ASTERISK-28869: pjsip: Crash in timer when sending request
Reporter: Chris (jnz)
Labels: webrtc
Date Opened: 2020-05-04 10:32:30
Date Closed: 2020-10-01 06:22:51
Priority: Major
Regression?:
Status: Closed/Complete
Components: pjproject/pjsip
Versions: 16.5.0, 16.9.0, 16.10.0
Frequency of Occurrence: Frequent
Related Issues:
Environment: Google Cloud; Asterisk running on n1-standard-4 (4 vCPUs, 15 GB memory, Ubuntu 18.04.2 LTS). Multiple Asterisk instances using realtime config (sharing the same MySQL instance). PJSIP devices. The Asterisk nodes run behind Kamailio+RTPEngine. Softphones are WebRTC-based and connect to Kamailio via WebSockets.
Attachments: (0) 73fgqxw2.txt
             (1) coredump.tar.gz
Description: We have been running into random segfaults in Asterisk (we have tried versions 16.5, 16.9, and now 16.10). The segfaults seem random but only appear to happen under a load of 15+ concurrent calls; they never occur after hours, when the system is in use but under a lighter load. Performance on the Asterisk machines at the time of a crash looks fine, and we are not seeing any errors or warnings in the logs.

Originally we started with one Asterisk node, but due to the crashes we ended up adding multiple Asterisk nodes (behind Kamailio) to let us take more calls. It seems like receiving or transmitting any SIP message (publishing device state, qualifying an AOR, hangup, etc.) can cause a segfault.

The segfaults always occur inside pop_freelist in the pjsip library. The timer heap's (ht) timer_ids_freelist always appears corrupt/out of range at the time of the segfault, e.g.:

ht = {pool = 0x5616898c0140, max_size = 262142, cur_size = 23, max_entries_per_poll = 10,
      lock = 0x561689d02440, auto_delete_lock = 1, heap = 0x7f45530e7038,
      timer_ids = 0x561689c02448, timer_ids_freelist = 1587220789, callback = 0x0}

Thread 1 backtrace: https://pastebin.com/73fgqxw2
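
For context on where this fails: in pjproject releases before 2.10, the timer heap recycles timer IDs through a freelist threaded through the timer_ids array, with timer_ids_freelist as the head of the chain. The following is a simplified sketch of that mechanism (based on pjlib/src/pj/timer.c but paraphrased and trimmed, not the verbatim library source); it illustrates why a corrupted head value like 1587220789, far beyond max_size = 262142 in the dump above, turns the next pop_freelist call into an out-of-bounds read and a segfault.

{code}
/* Simplified sketch of the pre-2.10 pjlib timer-id freelist. Field names
 * match the ht dump above; other fields (pool, lock, heap, callback, ...)
 * are omitted for brevity. Free slots in timer_ids[] store the negated
 * index of the next free slot. */

typedef int pj_timer_id_t;

typedef struct pj_timer_heap_t {
    unsigned       max_size;           /* capacity: 262142 in the dump   */
    unsigned       cur_size;           /* entries in use: 23 in the dump */
    pj_timer_id_t *timer_ids;          /* id slots doubling as freelist  */
    pj_timer_id_t  timer_ids_freelist; /* head of the free chain         */
} pj_timer_heap_t;

/* Called (with the heap lock held) whenever a timer is scheduled, e.g.
 * for the transaction timers behind every SIP request Asterisk sends. */
static pj_timer_id_t pop_freelist(pj_timer_heap_t *ht)
{
    pj_timer_id_t new_id = ht->timer_ids_freelist;

    /* If the head has been stomped (1587220789 here, vs. a max_size of
     * 262142), this indexes far past the end of timer_ids[] -> SIGSEGV. */
    ht->timer_ids_freelist = -ht->timer_ids[new_id];

    return new_id;
}

/* Returning an id to the freelist; a double-cancel or use-after-free of
 * a timer entry can corrupt the chain that pop_freelist later follows. */
static void push_freelist(pj_timer_heap_t *ht, pj_timer_id_t old_id)
{
    ht->timer_ids[old_id]  = -ht->timer_ids_freelist;
    ht->timer_ids_freelist = old_id;
}
{code}

The 2.10 timer refactoring mentioned later in this thread reworked this bookkeeping, which is consistent with the crashes disappearing after the upgrade.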

We have the full core dumps/logs available, but they contain customer data. The Asterisk nodes are compiled with the DONT_OPTIMIZE and BETTER_BACKTRACES menuselect options.
Comments:

By: Asterisk Team (asteriskteam) 2020-05-04 10:32:31.540-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2020-05-04 10:45:19.203-0500

Can you sanitize the sensitive information out of the additional logs so they can be attached?

By: Chris (jnz) 2020-05-05 12:46:56.598-0500

Core dump files

By: Chris (jnz) 2020-05-05 12:51:00.915-0500

Attached the core dump files plus a snippet of the full Asterisk log leading up to the crash.

To add more information about our setup: the dialplan is pretty simple. It plays a greeting, starts MixMonitor for certain queues, then hands off to our Stasis app, which handles starting/stopping MOH and assigning the call to a queued agent by dialing the agent and then bridging the two channels (via ARI). A rough sketch of that flow follows.
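
For illustration only, a minimal extensions.conf along the lines described; the context, pattern, and Stasis app name here are hypothetical, not taken from the reporter's system:

{code}
; Hypothetical sketch of the flow described above, not the reporter's
; actual dialplan.
[inbound]
exten => _X.,1,Answer()
 same => n,Playback(custom/greeting)       ; play greeting
 same => n,MixMonitor(${UNIQUEID}.wav)     ; record for certain queues
 same => n,Stasis(queue-app,${EXTEN})      ; hand off to the ARI/Stasis app
 same => n,Hangup()
{code}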

By: Chris (jnz) 2020-05-18 12:56:08.504-0500

Just wanted to update this. I know this setup isn't supported, but I noticed that the latest PJSIP release, 2.10, included some timer refactoring, so I compiled Asterisk against that and it seems to have resolved these segfaults (no issues for over a week, where we would normally have had a handful of crashes).

By: Joshua C. Colp (jcolp) 2020-10-01 06:22:51.826-0500

I'm closing this out as fixed since we're now on 2.10, and after the refactor the timer issues appear to have been solved.