[Home]

Summary:ASTERISK-30244: res_pjsip_pubsub: Occasional crash when TCP/TLS connection terminated and subscription persistence is removed
Reporter:nappsoft (nappsoft)Labels:patch
Date Opened:2022-09-23 10:36:22Date Closed:2022-12-01 11:49:08.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_pjsip_pubsub
Versions:18.14.0 Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) ast.diff
( 1) ast.log
( 2) ast2.log
( 3) crash.txt
Description:We are observing crashes from time to time on less than one percent of systems. The segfault always happens at the same address in res_pjsip.so

What they have in common: while we use lots of clients with TCP and TLS connections, only on these systems we have phones (Yealink) connected which subscribe over TLS and/or TCP.

I can reproduce the crash from time to time when rebooting the phone. The crash happens as soon as it is up again and renewing the subscriptions.

I was now able to create a backtrace, which you will find attached.
Comments:By: Asterisk Team (asteriskteam) 2022-09-23 10:36:28.400-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: nappsoft (nappsoft) 2022-09-23 10:36:48.146-0500

Backtrace

By: nappsoft (nappsoft) 2022-09-26 04:21:31.857-0500

Some additional notes:

- the issue only affects reliable transports (what is clear as this codepath does only exist in conjunction with reliable transports)
- the way I can reproduce this is by not doing a normal reboot of the phone (in which case the phone would shutdown the transport first) but by unplugging and replugging power
- when removing "ast_sip_transport_monitor_unregister" from subscription_persistence_remove in res_pjsip_pubsub.c asterisk doesn't crash (and somehow I didn't find any negative side-effects) => so maybe you can tell me what happens when not unregistering the monitor? will this cause any leaks and shouldn't be done as a workaround? For the moment we are better off with a system with undetectable side-effects than with one with frequent crashes...

As you might ask: we do not use pjsip bundled, but we use the same pjsip version (2.12.1) and use the same config_site.h and patches.

By: Joshua C. Colp (jcolp) 2022-09-26 04:43:52.128-0500

Not unregistering the monitor will result in a slow memory leak.

I think the underlying issue here is that the transport has gone away, but pubsub still has a pointer to it but no reference. It probably should have been terminated when the transport went away, and not later when the transport is no longer valid.

By: nappsoft (nappsoft) 2022-09-28 07:20:46.623-0500

Thanks for your feedback. At least the memory leak seems not to be severe: I've activated the patched version a bit more than two days ago on the two mentioned systems and the increase in memory consumption over time seems to be quite small.

As I was able to reliably let asterisk crash with the mentioned method (success rate >60%) this seems to be a possible DOS vulnerability for authenticated sessions

By: Joshua C. Colp (jcolp) 2022-09-28 07:58:52.511-0500

I've locked this down just in case.

By: George Joseph (gjoseph) 2022-09-29 09:48:07.023-0500

I've been trying to reproduce the issue by registering a bunch of phones using TCP with each phone having a bunch of BLF subscriptions, then just power cycling the POE inserter.  No luck.

* What distribution and version are you running?
* I see you're using MUSL...what version?
* When in the process does the crash happen?  When the phones come down or when they come back up again?
* How long are you waiting before re-powering the phones?
* Can you set the following and reproduce?
logger.conf:
{code}
debug => [plain]debug-crash
{code}
From the CLI:
{code}
CLI> core set debug 4 res_pjsip_pubsub.c
CLI> core set debug 4 res_pjsip/pjsip_transport_events.c
{code}
Then attach the /var/log/asterisk/debug-crash file to the issue.

* Also run ast_coredumper on the new coredump:
{code}
$ sudo /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps --tarball-config <path to coredump file>
{code}
The resulting tarball will be too large to attach to the issue so you can host it on Google Drive, Dropbox or the hosting site of your choice and email the link to me with the subject "ASTERISK-30244: Coredump Tarball".

Thanks!



By: George Joseph (gjoseph) 2022-09-29 09:48:55.370-0500

{panel:title=Private Submission of Information Disclaimer}You have indicated that you wish to submit unredacted information privately. It is not recommended to do this as it will substantially restrict the number of individuals who can help with your issue, as submitted information is only available to Sangoma. Note that submission of such information does not change the priorization of this issue. If you still wish to proceed you may do so by sending it to asteriskteam@digium.com with the issue number in the subject. For large files please send a link where they can then be downloaded. By sending this information you agree to the Website Terms of Use available on the Sangoma website at https://www.sangoma.com/legal/. Any exchange of private information between you and an Asterisk community member outside of the Asterisk JIRA is not subject to the Website Terms of Use and should be privately discussed between yourself and the Asterisk community member.{panel}



By: nappsoft (nappsoft) 2022-09-29 10:10:22.668-0500

- We use musl 1.2.3 (so the newest version)
- we use our own distribution
- the crash happens when the phone comes back up again (if I interpreted the log messages correctly it seems like the transport_monitor unregister happens more or less at the same time when the phone re-registers after power cycling)
- I repower them immediately (while I was debugging remotely where I didn't have access to the phone I used some iptables rules so that the phone was not able to close the socket, rebootet the phone over the webinterface and removed the iptables rules again. This way I was able to simulate a power cycle process in a network point of view)

I will have to prepare a different system for debugging as I was debugging this on a productive system, might take some days.

Worth to mention: appart from that I really can't observe any kind of asterisk crashes. We run about 2500 asterisk systems and I frequently scan all systems for segfault messages. In the past there were quite a few issues, but all the other ones I was able to fix by submitting a patch to either the asterisk team or the pjsip team.

So this is really the last issue left for the moment (what is a great thing, so congratulations to you and the other asterisk devs for the maturation of asterisk, which was known to be quite buggy 20 years ago when I started with voip...). I didn't dig

By: George Joseph (gjoseph) 2022-09-29 10:52:32.855-0500

Gotcha and thanks!.

Since you run your own distro, you can dispense with the ast_coredumper thing.  It won't be usable since I won't be able to replicate an environment to examine it in.

One of the things I was attempting to discover though was whether the Asterisk transport manager detected the disconnect before or after the phone reconnected.  This we could tell from the debug-crash log I requested.  

I'll leave this issue in "waiting fro feedback" for now but I'm still trying a few things to see if I can reproduce.


By: nappsoft (nappsoft) 2022-09-30 02:28:50.257-0500

I wasn't lucky before with reproducing on the testsystem as well... However: I was using a Yealink T58A first, while on my tests I was using a T46s. However: the customer where I could discover the same issue uses T58A as well. But when I was registering the same T46s I also had no luck with reproducing the issue, until I now subscribed to enough Hints (was using 6 before, now 10) => maybe this is a sort of racecondition which only happens when the distruction of the subscriptions needs a certain amount of time: it seems like the first few callbacks run just fine, but then the reliable transport gets destroyed maybe before the last callback can run?

See the logs attached

By: nappsoft (nappsoft) 2022-09-30 02:50:58.524-0500

Added a second logfile (ast2.log)

=> here I was able to reboot the device several times without a crash, the last time then asterisk crashed. Maybe having a look at the difference might be interesting?

By: nappsoft (nappsoft) 2022-09-30 04:34:14.726-0500

Just curious whether I get the code right:

- when the transport shuts down, sub_tree_transport_cb is called
- after that asterisk pushes a task to the serializer (sub_tree_subscription_terminate_cb)
- this then sends pjsip_evsub_terminate
- because of this (after termination of the evsub) the pubsub_on_evsub_state callback ist called

=> and this is where the subscription_persistence_remove routine is called

Of course there are other events that lead to a termination of the subscription (for example if not renewed or whatever) but in the special event of transport shutdown: isn't it clear that it can happen that the transport is already invalid (as the transport shutdown is what is causing the evsub termination) when the subscription_persistence_remove callback is being called? Or is it granted that the transport is valid until pjsip has executed all subscription termination callbacks? => if the later is not the case it might be better to keep a copy of the transport->obj_name in the sub_tree and using this one in ast_sip_transport_monitor_unregister than relying on a still valid transport?

By: nappsoft (nappsoft) 2022-09-30 05:36:30.036-0500

Just added a draft for a fix. The attached patch fixes the issue for me it seems without risking a memory leak

By: George Joseph (gjoseph) 2022-10-03 05:45:07.462-0500

Thanks for the patch.  Please _don't_ post it to Gerrit or it'll become public before we're ready.  I'll add a bit to it and post it to the security repository then post a copy of the final patch here for you to re-test.



By: nappsoft (nappsoft) 2022-10-03 05:51:44.910-0500

sure, it is incomplete anyway (the pjsip.h part is lacking (the definition of the register_transport_monitor_unregister2 wrapper)).

my patch is up and running on over 100 systems for the moment and didn't cause any issues, but feel free to fix the bug in a different way, as my implementation is enlarging the memory footprint a bit...

By: Friendly Automation (friendly-automation) 2022-12-01 11:49:08.957-0600

Change 19632 merged by Benjamin Keith Ford:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19632|https://gerrit.asterisk.org/c/asterisk/+/19632]

By: Friendly Automation (friendly-automation) 2022-12-01 11:51:53.882-0600

Change 19633 merged by Benjamin Keith Ford:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19633|https://gerrit.asterisk.org/c/asterisk/+/19633]

By: Friendly Automation (friendly-automation) 2022-12-01 11:52:07.760-0600

Change 19651 merged by Benjamin Keith Ford:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19651|https://gerrit.asterisk.org/c/asterisk/+/19651]

By: Friendly Automation (friendly-automation) 2022-12-01 11:52:27.729-0600

Change 19634 merged by Benjamin Keith Ford:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19634|https://gerrit.asterisk.org/c/asterisk/+/19634]

By: Friendly Automation (friendly-automation) 2022-12-01 11:52:37.195-0600

Change 19635 merged by Benjamin Keith Ford:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19635|https://gerrit.asterisk.org/c/asterisk/+/19635]

By: Friendly Automation (friendly-automation) 2022-12-03 10:24:13.837-0600

Change 19642 merged by Friendly Automation:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19642|https://gerrit.asterisk.org/c/asterisk/+/19642]

By: Friendly Automation (friendly-automation) 2022-12-03 10:24:28.797-0600

Change 19641 merged by Friendly Automation:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19641|https://gerrit.asterisk.org/c/asterisk/+/19641]

By: Friendly Automation (friendly-automation) 2022-12-03 10:24:36.264-0600

Change 19640 merged by Friendly Automation:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19640|https://gerrit.asterisk.org/c/asterisk/+/19640]

By: Friendly Automation (friendly-automation) 2022-12-03 10:24:41.667-0600

Change 19645 merged by Friendly Automation:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19645|https://gerrit.asterisk.org/c/asterisk/+/19645]

By: Friendly Automation (friendly-automation) 2022-12-03 10:28:04.201-0600

Change 19660 merged by Friendly Automation:
pjsip_transport_events: Fix possible use after free on transport

[https://gerrit.asterisk.org/c/asterisk/+/19660|https://gerrit.asterisk.org/c/asterisk/+/19660]