[Home]

Summary:ASTERISK-28689: res_pjsip: Crash when locking group lock when sending stateful response
Reporter:Jesse Ross (jmross)Labels:webrtc
Date Opened:2020-01-13 21:50:49.000-0600Date Closed:2022-10-31 12:31:21
Priority:MajorRegression?
Status:Closed/CompleteComponents:pjproject/pjsip
Versions:13.20.0 13.30.0 Frequency of
Occurrence
Frequent
Related
Issues:
is duplicated byASTERISK-30070 pjsip: Deadlock between bridge, channel, and RTP
is duplicated byASTERISK-30183 SegFault / TCP Stack
is duplicated byASTERISK-30212 Getting random segfault in libasteriskpj.so.2
is duplicated byASTERISK-30498 Segfault / 16.22.0
is duplicated byASTERISK-29522 segfault in pjsip
is duplicated byASTERISK-29298 Crash in WebRTC Registration on 18.2
is duplicated byASTERISK-29316 Asterisk service is failed due to segmentation fault about twice a week
is duplicated byASTERISK-29860 Segfault in Asterisk 18.8.0
is duplicated byASTERISK-30148 18.12.1 Crash / Sigabrt On Pjsip Transaction Destroy Under Load
is duplicated byASTERISK-30169 res_pjsip: assertion on send_msg
is related toASTERISK-27037 Segfault in pjsip
Environment:Ubuntu 16.04 4.4.0-119-genericAttachments:( 0) console
( 1) console
( 2) core.tar.gz
( 3) core2.tar.gz
( 4) core-brief.txt
( 5) core-brief.txt
( 6) core-full.txt
( 7) core-full.txt
( 8) core-locks.txt
( 9) core-locks.txt
(10) core-thread1.txt
(11) core-thread1.txt
Description:Hi,

We are getting crashes like this randomly:

Jan 13 21:33:21 voip kernel: [46434830.037815] traps: asterisk[24067] bq. general protection ip:7fc0ecf762b5 sp:7fc02d86ebc8 error:0 in libasteriskpj.so.2[7fc0ece5f000+16d000]
Jan 13 21:44:48 voip kernel: [46435517.854842] asterisk[26382]: segfault at 18 ip 00007f3a5467ba28 sp 00007f39990b4bd0 error 4 in libasteriskpj.so.2[7f3a54564000+16d000]

It didn't happen at all before but started to do it a couple times a day. Now it is doing it around 20 times per day. But only in the evening for some reason...

I've attached the core dump info.

I upgraded to 13.30 from 13.20 and installed the latest libpri and dahdi but it didn't help.

Thanks
Comments:By: Asterisk Team (asteriskteam) 2020-01-13 21:50:50.557-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2020-01-14 05:08:08.229-0600

What version of Asterisk are the backtraces from? Can you attach the console output of what was going on at the time? What is the usage pattern? What transports are in use? What does the pjsip.conf configuration look like?

By: Jesse Ross (jmross) 2020-01-14 08:19:31.226-0600

bq. What version of Asterisk are the backtraces from?

13.30.0

bq. What is the usage pattern?

This crash happened at 10:38 PM where the usage is low - probably around 10 calls per hour. A lot of our calls are incoming over DAHDI.

bq. What transports are in use?

We have UDP, TCP, and TLS enabled. Most of the endpoints (75%) are using TLS

bq. What does the pjsip.conf configuration look like?

Most of the endpoints are defined in realtime ODBC. The pjsip.conf is mostly for defining trunks (voip.ms, Thinktel, Northern911). I'm not sure how much I can post here. Here's the file without those trunks:

{quote}
[global]
type = global
user_agent = <REDACTED>

[transport-udp]
type = transport
protocol = udp
bind = 0.0.0.0

[transport-tcp]
type = transport
protocol = tcp
bind = 0.0.0.0

[transport-tls]
type = transport
protocol = tls
bind = 0.0.0.0:5061
cert_file = /etc/letsencrypt/live/<REDACTED>/cert.pem
priv_key_file = /etc/letsencrypt/live/<REDACTED>/privkey.pem
method = tlsv1
{quote}

By: Joshua C. Colp (jcolp) 2020-01-14 08:28:53.267-0600

Can you please run the ast_coredumper script as follows:

{noformat}
ast_coredumper --tarball-coredumps
{noformat}

This will include the core dump and other files in the tarball, allowing us to examine the state of things.

By: Jesse Ross (jmross) 2020-01-14 08:37:28.999-0600

I just uploaded the console output. We are getting a lot of "res_pjsip/pjsip_distributor.c: Request 'REGISTER' from..." messages. I think some hacker is trying to register. I firewalled that IP with UFW but it's still there. I have fail2ban running but maybe it's not configured correctly. Not sure if that is related or not...


bq. ast_coredumper --tarball-coredumps

I will do this.

By: Jesse Ross (jmross) 2020-01-14 08:44:09.590-0600

-I uploaded the new core*.txt files using --tarball-coredumps but I get this error trying to upload core.tar.gz.-

-"core.tar.gz is too large to attach. Attachment is 20.85 MB but the largest allowed attachment is 10.00 MB."-


The core file from last night got overwritten by another crash... The ones I just uploaded (and deleted) were from another crash.

I will have to reupload the core*.txt files and console output from the latest crash that I still have the core from.

By: Joshua C. Colp (jcolp) 2020-01-14 08:57:47.612-0600

I have increased the maximum attachment size so you should be okay as well.

By: Jesse Ross (jmross) 2020-01-14 09:08:32.222-0600

I've uploaded core.tar.gz.

I'll upload another console output.

This crash is from Jan 13 22:55:37.

By: Jesse Ross (jmross) 2020-01-14 09:59:48.613-0600

I added more console output from 22:55:00 to 22:56:59. There's not much there though... there didn't seem to be any calls happening at that time.

By: Jesse Ross (jmross) 2020-01-14 10:35:32.499-0600

Also, I want to note again that it hasn't happened since 10:55 last night. It seems to only happen between the hours of 4 PM and 11 PM. Maybe a specific endpoint is registered at that time or is doing something to trigger it?

By: Jesse Ross (jmross) 2020-01-14 13:30:41.042-0600

Not sure if this helps, but I looked in 'messages' instead of 'all' - I had verbose(30) instead of verbose(5) there - and found this:

{quote}
...
\[2020-01-13 22:55:37\] NOTICE\[12237\] res_pjsip/pjsip_distributor.c: Request 'REGISTER' from '"6214" <sip:6214@209.5.238.132>' failed for '185.53.88.98:5060' (callid: 2122440826) - No matching endpoint found
\[2020-01-13 22:55:37\] WARNING\[12192\] pjproject:                            SSL 6 \[SSL_ERROR_ZERO_RETURN\] (Read) ret: 0 len: 32000
\# asterisk start up...
{quote}

This following happened at the same timestamp as the crash (time that kern.log says the crash happened - 22:45:30)
but there was a page or so of "res_pjsip/pjsip_distributor.c:" stuff before the asterisk start up log.

{quote}
\[2020-01-13 22:45:30\] WARNING\[11217\] pjproject:                            SSL 6 \[SSL_ERROR_ZERO_RETURN\] (Read) ret: 0 len: 32000
{quote}

It also crashed at 22:38:58:

{quote}
\[2020-01-13 22:38:57\] WARNING\[8495\] pjproject:                     SSL 6 \[SSL_ERROR_ZERO_RETURN\] (Read) ret: 0 len: 32000
{quote}

Also, I got my firewall/fail2ban situation under control so my future log files won't be so messy if you need them.

By: Joshua C. Colp (jcolp) 2020-01-16 05:23:59.519-0600

It doesn't look like the core file actually appeared in the tarball, potentially because of naming. Can you individually attach it or provide a link to download it?

Is it possible to also get a debug log so I can trace the history of the thing causing the crash? This can be done by setting debug to go to a file in logger.conf, setting "core set debug 9", and also setting a log_level of "5" in pjproject.conf startup section.

By: Jesse Ross (jmross) 2020-01-16 08:31:18.420-0600

-Here's the core:-

-It's over 200 MB and I can't upload it here.-

Nevermind... I'll tar it then upload it here...

Also, it hasn't crashed since Monday night. It seems to have stopped when I got the firewall and fail2ban to work right. I think the message in the Asterisk logs changed at some point and the regex for fail2ban stopped finding these lines. I have debug logging enabled right now but I don't have that log when it crashed before.

By: Sean Bright (seanbright) 2020-01-16 10:56:05.944-0600

Pretty close to the linked issue ASTERISK-27037. The backtraces are basically the same save for a call to {{register_aor_core}}.

By: Joshua Elson (joshelson) 2021-02-18 12:17:16.081-0600

So we've been working on tracking this down, as this one is pretty disruptive to the fleet, and we're approaching being certain that this condition is only triggered when realtime is used. Looking at the three related tickets on this, it appears that everyone is using realtime, although not everyone is using WebRTC. We initially enabled realtime as a part of another upgrade and assumed that 18.2 was the trigger for this condition, but thus far, we can observe crashes on all 18.x releases but only when realtime is enabled, and not thus far when we switch back to a static config.

Like the others, this occurs under a high rate of registration or failed/hung registration attempts. Still working on seeing if we can get a full lab repro that we can share, but perhaps this is helpful or jogs the memory of someone...

By: Joshua Elson (joshelson) 2021-03-02 16:35:06.057-0600

Just as another note: we are now 100% sure that this is related to pjsip with endpoints stored in realtime, independent of the database driver. We've backed all the use of realtime out on our end to static files and we are at zero crashes now. The additional bug submission today also references a realtime configuration so this seems consistent with our and every other experience.

We're working on a docker scenario that hopefully can reproduce this, but if anyone has any thoughts on repro or possible fixes, we're very ready to help develop and test them.

By: Joshua C. Colp (jcolp) 2021-03-02 16:44:10.219-0600

Is it only endpoints stored in realtime?

By: Joshua Elson (joshelson) 2021-03-02 17:02:22.302-0600

That is correct [~jcolp]. We just have ps_aors, ps_auths, ps_endpoints, ps_endpoint_id_ips, and an empty ps_globals table in realtime. Everything else, especially transports, are in pjsip.conf.

By: Tiago Galvão (Tiagoggsouza) 2021-07-19 16:04:53.999-0500

Joshua C. Colp, we have some news about this situation?

By: Joshua C. Colp (jcolp) 2021-07-19 17:06:32.848-0500

Are you asking for an update? If so, any update would be posted here.

By: Joshua Elson (joshelson) 2021-07-19 17:14:51.066-0500

We have not had a chance to revisit this yet, but we are 100% sure at this point it is related to pjsip peers in realtime. With identical configuration, just moved to the filesystem to create/update peers, this bug doesn't occur. If anyone else has cycles to look at the issue, we could certainly test patches, or we'll get back to this at some point soon here, I hope.

By: Tiago Galvão (Tiagoggsouza) 2021-07-19 17:19:50.944-0500

yes, i would like to understand how digium fix this kind of problem and a new version.

By: Joshua C. Colp (jcolp) 2021-07-19 17:22:55.341-0500

Sangoma is not currently working on this issue.

By: Tiago Galvão (Tiagoggsouza) 2021-09-14 14:51:19.769-0500

are there  some schedule to fix it?

By: Joshua C. Colp (jcolp) 2021-09-15 04:20:33.192-0500

There is no timeframe on such a thing.

By: Tiago Galvão (Tiagoggsouza) 2021-09-16 09:33:09.444-0500

ok, thanks..

By: Aaron An (aaron) 2022-06-23 21:05:17.162-0500

We are using pjsip relatime configuration, and exactly the same crash randomly.  Joshua C.Colp can you support some information about this issue? maybe we can make a patch to fix it.


By: Joshua C. Colp (jcolp) 2022-06-24 03:44:42.297-0500

Any information would be on this issue, I have nothing additional.

By: RKAG (rkag) 2022-09-08 04:30:53.041-0500

Is there something we can provide to help in further narrowing down this issue? We are currently seeing a crash every 1-2 days in our setup. Asterisk 18.14 with PJSIP Realtime.

By: Joshua C. Colp (jcolp) 2022-09-08 04:35:54.042-0500

If any additional information is needed, then a comment will be left here. The same applies when this starts getting worked on - this will end up getting assigned.

By: Aaron An (aaron) 2022-09-12 01:10:10.607-0500

I have checked recent pjproject commits and found those four may be related to this isuse.
1.
https://github.com/pjsip/pjproject/commit/90a16c523bfdf4d43c10506c972c5fd4250b2856
2.
https://github.com/pjsip/pjproject/commit/92d2e8df9ac902d4fadbd7483891d4440f695bc1
3.
https://github.com/pjsip/pjproject/commit/e1edb64d4d95f04e89b67b733806a49c502e5782
4.
https://github.com/pjsip/pjproject/commit/3980472947aa43da0a87da4677ef8dba6d4b7df4

By: RKAG (rkag) 2022-09-12 01:38:33.903-0500

These four patches are all included in PJSIP 2.12.1 which is already bundled with Asterisk 18.14 (https://github.com/asterisk/asterisk/commit/d052418b94f6f070e81201c82f330465a97e9eeb). Asterisk 18.14 is still containing this issue.

By: Aaron An (aaron) 2022-09-12 03:56:44.329-0500

@RKAG thanks for your information, we are currently running Asterisk18.10 online.
from the coredump analysis, it sames that tls or wss transport was destroyed when sending response.

By: Friendly Automation (friendly-automation) 2022-10-31 12:31:23.848-0500

Change 19516 merged by Friendly Automation:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19516|https://gerrit.asterisk.org/c/asterisk/+/19516]

By: Friendly Automation (friendly-automation) 2022-10-31 12:49:41.850-0500

Change 19515 merged by Friendly Automation:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19515|https://gerrit.asterisk.org/c/asterisk/+/19515]

By: Friendly Automation (friendly-automation) 2022-10-31 13:37:50.235-0500

Change 19514 merged by Friendly Automation:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19514|https://gerrit.asterisk.org/c/asterisk/+/19514]

By: Friendly Automation (friendly-automation) 2022-11-01 05:50:55.901-0500

Change 19512 merged by George Joseph:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19512|https://gerrit.asterisk.org/c/asterisk/+/19512]

By: Friendly Automation (friendly-automation) 2022-11-01 05:51:23.726-0500

Change 19413 merged by George Joseph:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19413|https://gerrit.asterisk.org/c/asterisk/+/19413]

By: Friendly Automation (friendly-automation) 2022-11-01 05:51:44.203-0500

Change 19513 merged by George Joseph:
res_pjsip: prevent crash on websocket disconnect

[https://gerrit.asterisk.org/c/asterisk/+/19513|https://gerrit.asterisk.org/c/asterisk/+/19513]