ASTERISK-27170: pjproject: Unsafe usage of gethostbyname causing memory corruption

[Home]

Summary: ASTERISK-27170: pjproject: Unsafe usage of gethostbyname causing memory corruption

Reporter: nappsoft (nappsoft) Labels:

Date Opened: 2017-08-02 09:47:21 Date Closed: 2017-10-24 09:33:04

Priority: Major Regression?

Status: Closed/Complete Components: PBX/General

Versions: 13.16.0 13.17.0 Frequency of
Occurrence

Related
Issues:

Environment: 64bit linux musl 1.1.16-git Attachments: ( 0) backtrace4.txt
( 1) crashlog.txt
( 2) gethostbyname_r.diff
( 3) trace_cel_crash.txt
( 4) trace.txt
( 5) valgrind2.txt
( 6) valgrind4.txt
( 7) vgcore.24994-brief.txt
( 8) vgcore.24994-full.txt
( 9) vgcore.24994-locks.txt
(10) vgcore.24994-thread1.txt

Description: From time to time asterisk crashes when a component is trying to allocate memory. According to the sip traces this seem to happen mainly soon (sometimes some milliseconds, sometimes a few seconds) after a call in which a PickupChan operation was involved has been finished.

Comments: By: Asterisk Team (asteriskteam) 2017-08-02 09:47:22.539-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].
By: nappsoft (nappsoft) 2017-08-04 06:28:27.488-0500

I can't reproduce the crash with a modified softphone that does the same "late submission" of the answers to the NOTIFY messages, so this doesn't seem to be the root cause of the crash.

However I've made an interesting observation in the SIP traces we have from this crash: currently our systems:

- currently we have exactly two type of asterisk crashes on several systems (according to the segfault message in dmesg): one is crashing in libpjsip.so (pj_sockaddr_in_set_str_addr), one is crashing in libc (malloc or other memory operations). The pjsip crash seems to happen after a REFER (when hanging up the channels/sending the BYE mesasges), the other crash happens at the end of a call (as well while sending the BYE messages)
- we had both kinds of crashes on the same day on one system. In both cases one of the channels made a pickup with PickupChan over agi
- there were only 7 PickupChan operations during the whole day, but lots of other calls

=> so the fact that in both cases a Channel was involved that got picked up might be significant. I'll try to get other samples and to reproduce the issue
By: Rusty Newton (rnewton) 2017-08-07 16:28:01.268-0500

I'm not sure we will be able to help much without a trace. If it is impossible to get a trace, at least see if you can get a debug log and a complete SIP trace. Showing more about what is going on exactly may help others who are running into the same issue (and who can provide a trace) find this issue.
By: nappsoft (nappsoft) 2017-08-10 04:21:53.361-0500

Well I still try to get more information. I had now to wait for 8 days to observe new crashes (there weren't any crashes in the meantime). It crashed two times today, both times at the same libc.so address as always (it was always either libc.so or libpjsip.so).

The interesting thing: there were exactly 3 calls which involved PickupChan, two of them crashed. So I really seem to be on the right track as I have 4 samples of crashes now and in each one a PickupChan was involved (something that isn't done all to often...) and each time asterisk crashed on hangup!

I'll continue to try to reproduce the issue myself in a gdb session.
By: George Joseph (gjoseph) 2017-08-10 06:07:05.642-0500

It appears that you're NOT using the bundled version of pjproject. We can't really troubleshoot pjproject related issues when you use an external pjproject. Rebuild asterisk with the {{--with-pjproject-bundled}} option supplied to {{./configure}}, then follow the instructions at https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace for building asterisk with the {{DONT_OPTIMIZE}} and {{BETTER_BACKTRACES}} options.

By: nappsoft (nappsoft) 2017-08-10 10:28:39.361-0500

I'm wondering how you came to the 'you're not using bundled pjproject' assumption?!
By: George Joseph (gjoseph) 2017-08-10 11:25:46.574-0500

You mentioned that libpjsip.so was in the stack traces in your last comment. That's from an external pjproject install. The bundled version is statically linked to asterisk and does not link to any libpj* shared libraries.

By: nappsoft (nappsoft) 2017-08-10 12:11:46.192-0500

Yeah, I did. But we had the same issues with the bundled version in the past. Currently we use pjproject trunk with the same config_site.h and configure flags as you are using in the bundled version so I do not understand why the bundled version should be supported why an identically compiled version is not.

As the crashes (with external pjsip) always happened at the same point in the library independent of the version we were using I doubt that this is a pjproject issue. But well, close this bug like all the crash reports from users during the last 3 weeks (there are several reports of crashes in the bug tracker, all ended with "waiting for feedback"). If you take the time and read through the last four issues I've reported during the last 12 months you will see that I only report issues that are issues and usually in a way (with quite some pre-analysis) that makes fixing the issue easy. During the same time I've also reported some issues to the pjproject (including patches).

But well. I can keep these things for me in the future if you prefer.
By: George Joseph (gjoseph) 2017-08-10 15:54:04.124-0500

The difference is that we don't know what's in pjproject trunk that we don't have in bundled. We'd be just guessing at what trunk was at the time you compiled vs what it is when we troubleshoot. Maybe someone committed something in trunk that doesn't work for us. It's happened before. That's why we created bundled and why we're starting to get strict about using bundled if you want support. If there's a specific pjproject commit that's in trunk that you think _needs_ to be in bundled, just let us know.

We don't have to close the issue but there's nothing in _this_ issue to troubleshoot. No logs or backtraces. I'll give you a pass on bundled but you must compile asterisk with DONT_OPTIMIZE and BETTER_BACKTRACES and compile pjproject with its symbols left intact, then provide a backtrace. Failing that, a scenario we can reproduce would be fine. Otherwise we got nuthin. :)

By: nappsoft (nappsoft) 2017-08-11 06:11:12.071-0500

I'm still not able to debug with BETTER_BACKTRACES and such things as this is a production embedded system with limited amount of memory. But I was able (after 4 hours of testing...) to reproduce the issue on a clone of the system (still with "normal" backtraces). Have a look at the following part of the cli output, this happens immediately before the crash (does the dialog id have anything to do with the (identical) transaction id for the hangup that occured 14 seconds before?:

[Aug 11 12:25:44] DEBUG[13474][C-000001bd]: res_rtp_asterisk.c:4560 ast_rtcp_interpret: Got RTCP report of 64 bytes
[Aug 11 12:25:44] DEBUG[13474][C-000001bd]: res_rtp_asterisk.c:4560 ast_rtcp_interpret: Got RTCP report of 64 bytes
[Aug 11 12:25:44] DEBUG[19605]: res_pjsip/pjsip_distributor.c:492 distributor: Searching for serializer associated with dialog dlg0x3045be8 for Request msg UPDATE/cseq=13844 (rdata0x3309ea8)
[Aug 11 12:25:44] DEBUG[19605]: res_pjsip/pjsip_distributor.c:500 distributor: Found serializer pjsip/distributor-000000f8 associated with dialog dlg0x3045be8

=> 0x3045be8 is from a call that had been hung up quite a while ago:

[Aug 11 12:25:30] DEBUG[13407][C-000001b7]: channel.c:2681 ast_hangup: Channel 0x27e42d0 'PJSIP/5119-00000252' hanging up. Refs: 2
[Aug 11 12:25:30] DEBUG[13407][C-000001b7]: chan_pjsip.c:1991 hangup_cause2sip: AST hangup cause 16 (no match found in PJSIP)
[LWP 13407 exited]
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2517 handle_outgoing_request: Method is BYE
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2422 __print_debug_details: Function session_inv_on_tsx_state_changed called on event TSX_STATE
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2436 __print_debug_details: The state change pertains to the endpoint '5119(PJSIP/5119-00000252)'
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2444 __print_debug_details: The inv session does NOT have an invite_tsx
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2447 __print_debug_details: The UAC BYE transaction involved in this state change is 0x3045be8
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2451 __print_debug_details: The current transaction state is Calling
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2453 __print_debug_details: The transaction state change event is TX_MSG
[Aug 11 12:25:30] DEBUG[6034]: res_pjsip_session.c:2458 __print_debug_details: The current inv state is CONFIRMED
[Aug 11 12:25:30] DEBUG[6034]: channel.c:2233 ast_channel_destructor: Channel 0x27e42d0 'PJSIP/5119-00000252' destroying

have a look into the complete log (at least of the last 14 seconds before the crash) and the backtrace
By: George Joseph (gjoseph) 2017-08-11 07:49:53.116-0500

Unfortunately, there's nothing useful in the backtrace without the symbols, pjproject's especially, and I didn't see anything out of the ordinary in the log

pjproject maintains pools of pre-allocated memory which it continually reuses for different things so unless musl is doing something goofey, the re-use of that pointer is fine.

Can you reproduce the crash scenario on demand now or is it still hit or miss? If you can reproduce on demand, can you package up a bare-minimum scenario for us to try?

By: nappsoft (nappsoft) 2017-08-11 09:58:04.445-0500

No I can't reproduce it that easy (just needed another hour with hundreds of call start/stops/pickups) as it seems to be a race condition causing some sort of memory corruption.
This time it didn't crash in a malloc made by pjsip but in a call to malloc made by sqlite3 when trying to insert cel logs (went out of asterisk libraries at this point: write_cel (event=<optimized out>) at cel_sqlite3_custom.c:270). However: as no other software on these (lighttpd, ghostscript, php, dropbear ssh, our own c daemons) is segfaulting I do not think that it is an issue with musl.
By: nappsoft (nappsoft) 2017-08-14 02:44:40.160-0500

I doubt that this is of any help, but I've attached a valgrind log of a crash
By: George Joseph (gjoseph) 2017-08-14 08:42:01.779-0500

Well, _something_ is going on but at this point, without good backtraces there's nothing we can do. Since there now seem to be malloc issues elsewhere, DONT_OPTIMIZE, BETTER_BACKTRACES and *MALLOC_DEBUG* are the options to use. Since we've modified the bundled pjproject to handle respect MALLOC_DEBUG, you must use bundled as well.
By: nappsoft (nappsoft) 2017-08-14 09:33:58.906-0500

Well then we need to let this unfixed for the moment... While DONT_OPTIMIZE (currently the DONT_OPTIMIZE version is running) isn't a problem at all, there are tons of compilation issues with MALLOC_DEBUG for which we currently do not have time to investigate. For BETTER_BACKTRACES our distribution does currently not have libbfd (usually it takes some time to integrate new libs when cross-compiling) and there are some cross-compile issues with bundled pjproject as well, at least it was like that last time.
By: nappsoft (nappsoft) 2017-08-15 04:09:27.259-0500

Some things to mention:

- all observed crashes (whether they crash wenn logging the cel event or in pjsip) have in common that they happen immediately or soon after the end of calls (I'll also upload a backtrace of a CEL case shown in valgrind2.txt)
- currently I'm not able to reproduce the crash in valgrind with a version compiled with DONT_OPTIMIZE (we use gcc 4.9.3)

=> we'll build some bots to automate the test-scenario to see whether DONT_OPTIMIZE could be at least used as a workaround
By: nappsoft (nappsoft) 2017-08-15 06:10:55.649-0500

Just out of curiosity one question: is there a way to force a hangup for a certain channel being completed before the hangup of the bridged channel?. First my current test-scenario: A calls B, C picksup the ringing channel for B with PickupChan

Now usually I see the HANGUP CEL entry for Channel C being made before the CEL entry for channel A (no matter, whether I hangup phone A or C first or whether I do a channel request hangup for one or the other channel). But according to both crashlogs it's the other way round in the crashing scenarios: I see debug log for the CEL entry for Channel A but no entry for channel C (but in the second case, when libsqlite3 was crashing, this happened while writing out the CEL log entry for channel C. This is not in the log because of output buffers I guess). I just don't manage to reproduce the "different hangup order" scenario, maybe this would help to reproduce the crash...
By: nappsoft (nappsoft) 2017-08-16 09:37:39.238-0500

FYI: today we had the 5th crash on the live system since I've started making SIP-traces. In every single case the customer dialed *8 what makes the agi script execute PickupChan. In all of the 5 cases the system crashed just after this particular call ended (+/- 0-1 seconds later, always when a thread tried to allocate memory). (So this is still all we know and the strange thing is that it's still hard to reproduce the crash even though I know some of the necessary conditions but obviously not all of them...)

As I noticed when running agi debug that the agi process number changes when doing "Exec PickupChan" in agi I decided to no longer exec PickupChan in the agi script directly => I'll deploy a version to the customer tonight that will Set a channel variable with the name of the channel to pickup, set a dialplan priority in which the PickupChan operation will be done and leave the agi script. Maybe this helps, will let you know.
By: George Joseph (gjoseph) 2017-08-16 09:45:47.718-0500

AFAIK, there isn't a way to change the hangup order directly.

Interesting find about PickupChan and AGI. Maybe there's some interaction on between the two. Let us know how moving it out of the AGI works out.

By: nappsoft (nappsoft) 2017-08-17 05:55:35.571-0500

Unfortunatelly this does not help. Btw: on the test-system I've managed to change the hangup order for some experiments by hacking a bit in the asterisk sources, but this had no influence (I just delayed the thread that was cleaning up the channel with the extension name in question a bit). Still was not able to let the version compiled with DONT_OPTIMIZE crash inside of gdb or valgrind, I just managed to let it crash outside of debuggers ;) (But this way it's somehow never creating a coredump even though asterisk has been started with -g and ulimit -Hc and -Sc have been set to unlimited, strange, but I'll have a look at that).

One side node (what might have nothing to do with this problem here at all, but just wanted to mention): when starting the software with valgrind I often need to do that 3-4 times as it usually crashes while loading the modules. Not always in the same module but always during the same operation (just to mention: apart from the above discuessed behavior in some rare situations our distribution bundled with asterisk 13.16/17 is running rock solid on hundreds of virtual machines and 11.25.1 was even running for months without a crash on any system at any time, so I doubt that we have a basic memory allocation problem in musl or so):

valgrind output (no matter whether I set a bigger stack size or not):

.
.
.
==11086== Conditional jump or move depends on uninitialised value(s)
==11086== at 0x401ADBC: strlen (in /lib/libc.so)
==11086== by 0x533F3D: load_modules (loader.c:1355)
==11086== by 0x45BD83: asterisk_daemon (asterisk.c:4692)
==11086== by 0x45B30E: main (asterisk.c:4444)
==11086==
==11086== Conditional jump or move depends on uninitialised value(s)
==11086== at 0x401ADBC: strlen (in /lib/libc.so)
==11086== by 0x531269: find_resource (loader.c:405)
==11086== by 0x53302B: load_resource (loader.c:1040)
==11086== by 0x5334DF: load_resource_list (loader.c:1166)
==11086== by 0x534073: load_modules (loader.c:1376)
==11086== by 0x45BD83: asterisk_daemon (asterisk.c:4692)
==11086== by 0x45B30E: main (asterisk.c:4444)
==11086==
==11086== Thread 7:
==11086== Invalid read of size 1
==11086== at 0x4053151: ??? (in /lib/libc.so)
==11086== by 0x402472F: __copy_tls (in /lib/libc.so)
==11086== by 0x7DBFF: ???
==11086== by 0x7DFFF: ???
==11086== by 0x10D95FFF: ??? (in /usr/lib/asterisk/modules/res_timing_pthread.so)
==11086== by 0xFFF: ???
==11086== by 0x10E13A3F: ???
==11086== by 0x4054D5D: pthread_create (in /lib/libc.so)
==11086== by 0x6DB082F: ???
==11086== by 0x10D96FFF: ???
==11086== by 0x6DB0737: ???
==11086== by 0x8E4A8F: ???
==11086== Address 0x12db6438 is not stack'd, malloc'd or (recently) free'd
==11086==
==11086==
==11086== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==11086== Access not within mapped region at address 0x12DB6438
==11086== at 0x4053151: ??? (in /lib/libc.so)
==11086== by 0x402472F: __copy_tls (in /lib/libc.so)
==11086== by 0x7DBFF: ???
==11086== by 0x7DFFF: ???
==11086== by 0x10D95FFF: ??? (in /usr/lib/asterisk/modules/res_timing_pthread.so)
==11086== by 0xFFF: ???
==11086== by 0x10E13A3F: ???
==11086== by 0x4054D5D: pthread_create (in /lib/libc.so)
==11086== by 0x6DB082F: ???
==11086== by 0x10D96FFF: ???
==11086== by 0x6DB0737: ???
==11086== by 0x8E4A8F: ???
==11086== If you believe this happened as a result of a stack
==11086== overflow in your program's main thread (unlikely but
==11086== possible), you can try to increase the size of the
==11086== main thread stack using the --main-stacksize= flag.
==11086== The main thread stack size used in this run was 50003968.

gdb output:

(gdb) bt full
#0 0x0000000004053151 in memcpy () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#1 0x0000000004024730 in __copy_tls () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#2 0x000000000007dc00 in ?? ()
No symbol table info available.
#3 0x000000000007e000 in ?? ()
No symbol table info available.
#4 0x0000000010d96000 in ?? ()
No symbol table info available.
#5 0x0000000000001000 in ?? ()
No symbol table info available.
#6 0x0000000010e13a40 in ?? ()
No symbol table info available.
#7 0x0000000004054d5e in pthread_create () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#8 0x0000000006db0830 in ?? ()
No symbol table info available.
#9 0x0000000010d97000 in ?? ()
No symbol table info available.
#10 0x0000000006db0738 in ?? ()
No symbol table info available.
#11 0x00000000008e4a90 in consoles ()
No symbol table info available.
#12 0x00000000005f44f2 in ast_inet_ntoa (ia=...) at utils.c:617
buf = 0x8e4a90 <consoles+16> "\260\212\071\a"
#13 0x00000000079e0860 in ?? ()
No symbol table info available.
#14 0x0000000000000000 in ?? ()
No symbol table info available.

By: nappsoft (nappsoft) 2017-08-17 06:28:16.742-0500

Ok, I've just updated from the latest official musl release (1.16) to git-head as there were two changes concerning threads, the startup problems now happen at a different place. means: I'll have a look at musl for the moment (the two versions should be binary compatible I guess, will check that).
By: George Joseph (gjoseph) 2017-08-17 06:33:15.599-0500

Well, __copy_tls is copy thread local storage, another memory operation and in one of the backtraces you attached, the SEGV isn't actually caused by a write to an invalid location (which is usually the case) but an invalid size of 8 bytes so I'm still thinking there's something going on with memory management. Recap for me please (because it's 5:30am and my eyes aren't fully open yet) when did the issue start happening? If it's only been recently, can you narrow it down with a 'git bisect'? Maybe we can at least find the commit that was the trigger.

For the backtraces, check {{/proc/sys/kernel/core_pattern}} and see where the default location for core dumps is. I usually recommend setting it to somewhere specific like {{/var/log/coredumps/core-%e-%t}} which will result in {{/var/log/coredumps/core-asterisk-<timestamp>}}. Also, I'd recommend disabling the abrt* stuff if you have it enabled.

We haven't seen any issues starting under valgrind either. Exactly what valgrind options do you use. I'll test it.

By: nappsoft (nappsoft) 2017-08-17 06:51:39.711-0500

let me test over the next few days (weekend included), I hope to have more useful information till then.
By: George Joseph (gjoseph) 2017-08-17 07:29:06.004-0500

Sounds good. When I was talking about doing a git bisect, I was talking about asterisk, but if you can do it with musl as well, that's good.

By: nappsoft (nappsoft) 2017-08-18 05:55:53.761-0500

FYI:
the startup bug was caused by musl, the following change in git-head fixed the issue: http://git.musl-libc.org/cgit/musl/commit/?id=27b3fd68f67b674440d21ea7ca5cf918d2e1559f
However the following change in git-head breaks startup completely as long as one doesn't blacklist some modules: http://git.musl-libc.org/cgit/musl/commit/?id=6476b8135760659b25c93ff9308425ca98a9e777

I'll check now with musl git head with dynlink.c reverted back to the latest version before 6476b8135760659b25c93ff9308425ca98a9e777 to see whether the issues persist.

About bisecting asterisk: this won't help I guess as we have these kind of crashes since many versions (since switching to chan_pjsip), but usually only in special situations, like network issues, sip switch issues or similar. That's why I never had a deeper look into it. However one single (new) customer has frequent crashes. Phones and Softphones in use are the same as at other places, the configuration is almost identical (we have some sort of database that is "provisioning" the config files on startup and most of the logic is implemented in agi-scripts doing actions based on the database), the same software versions (by booting the virtual machines get a binary firmware blob over tftp), that's why I focussed on features this customer is using others don't use... and this seems to be pickupChan, which was as well involved in the 6th observed crash from which we have pcap files.

However: who knows, maybe some changes in musl git head do their magic (especially some "fix uninitialized value on failure in XY" changes). Will let you know after further tests.
By: nappsoft (nappsoft) 2017-08-20 03:36:56.026-0500

I still have the same issue after upgrading musl. I had a look at the source code for PickupChan and as it's one of the few places where ast_channel_snapshot_create is involved I was wondering whether this issue could be ralted to the following unresolved issues that had been closed due to lack of activity?

=> ASTERISK-26192 and ASTERISK-27150
By: nappsoft (nappsoft) 2017-08-20 05:08:48.477-0500

I finally managed to let the application crash with the DONT_OPTIMIZE version with valgrind. backtrace4.txt and valgrind4.txt uploaded
By: nappsoft (nappsoft) 2017-08-22 07:51:39.371-0500

Could you please let me know whether the new traces might be of any help or whether debugging like this is just a waste of time?
By: nappsoft (nappsoft) 2017-08-23 03:31:35.331-0500

Yesterday and today we have observed two more crashed in the live system, in both cases it happened again immediately after the call in which a PickupChan operation was involved had been hung up. => So this simply can't be a coincidence anymore. A pickup was involved into all observed crashes (7 during the last 16 days) since we started taking sip traces and as these operations happen in less than 1% of the calls (according to the number of calls per day and the pickups visible in the sip traces) it must have something to do with this operation...
By: George Joseph (gjoseph) 2017-08-23 06:55:23.838-0500

Can you run /var/lib/asterisk/scripts/ast_coredumper on the core file that you used to get backtrace4.txt? This way we can see what else is going on in that asterisk instance that might have caused the corruption.

By: nappsoft (nappsoft) 2017-08-23 11:05:45.857-0500

FYI: as this is not the only customer who uses the Pickup Function put the only one showing these symptoms we'll also try to avoid time consuming name resolutions as this is currently the only difference I have in mind (apart from the number of mwi subscriptions) compared to other systems (this system has a cname that most clients use, the real hostname is always in the hosts file, but not the cname in this case, so it's at least worth a try to put the other hostname in there as well).
By: nappsoft (nappsoft) 2017-08-24 06:47:17.094-0500

One question as I did not yet manage to let the testsystem crash after making sure that hostnames in FROM headers can be resolved locally (about 500 times reproduced the test scenario): Is it possible that two threads can use the message_ip_updater code at the same time? On line 283 we have the following:

if (pj_strcmp2(&uri->host, "localhost") && pj_sockaddr_parse(pj_AF_UNSPEC(), 0, &uri->host, &ip) == PJ_SUCCESS) {

pj_sockaddr_parse seems to be using pj_sockaddr_in_set_str_addr which is using pj_gethostbyname which in our case is using gethostbyname which is not reentrant what means that it might not be threadsafe... However: musl internally is using gethostbyname2 which is using gethostbyname2_r internally for each call to gethostbyname, so this should not be an issue if it is working as expected.

However I might have found a possible reason for memory corruption there. The first free(h) doesn't look safe... does it?:

struct hostent *gethostbyname2(const char *name, int af)
{
static struct hostent *h;
size_t size = 63;
struct hostent *res;
int err;
do {
free(h);
h = malloc(size+=size+1);

By: nappsoft (nappsoft) 2017-08-24 07:59:19.753-0500

(forget about the cited code: this is a static struct so it should be safe)
By: George Joseph (gjoseph) 2017-08-24 08:27:55.460-0500

Yes, message_ip_updater can be called by multiple threads at the same time so it's quite possible that the underlying gethostbyname is at fault. I'll do a little test later today.

gethostbyname2_r is thread safe but it makes no difference because it's called from an unsafe gethostbyname2. To be thread safe, you'd have to call gethostbyname2_r directly.

By: George Joseph (gjoseph) 2017-08-24 08:28:55.830-0500

The free may be "safe" but gethostbyname2 is certainly NOT.

By: nappsoft (nappsoft) 2017-08-24 08:33:42.581-0500

Yeah, I got to the same conclusion in the meantime. I'll modify the pjsip function in question to use gethostbyname_r. As I've seen in the bugtracker there are other users who had similar troubles before changing to Asterisk 14 (what is currently not an option for us) so this might be worth a try.
By: nappsoft (nappsoft) 2017-08-24 09:54:16.701-0500

This might make pj_gethostbyname threadsafe
By: George Joseph (gjoseph) 2017-08-24 10:38:41.444-0500

Let us know if this fixes the problem. Regardless, when you're happy with this patch let us know so we can add it to bundled pjproject and submit it upstream to the pjproject guys for you.

By: nappsoft (nappsoft) 2017-08-24 13:39:03.420-0500

I've tested a bit with the patch applied and it seems to solve the issue. (This is still not a final conclusion as it could still be a coincidence... I did about 500 testcalls with pickup. Yes, I do have some other hobbies ;) )

However it's more a hack than a clean solution as the memory management in this modified function is quite stupid considering the fact that pjsip works with memory pools wherever it's possible. However: stability matters more for me than overhead... Also I didn't check whether any function using pj_gethostbyname is relying on the error code, as h_error might contain wrong values in some race conditions. And last but not least it might be better to use getaddrinfo instead of gethostbyname_r. But this can do whoever has time to solve this in a clean way, I'm happy for the moment with my patch ;)
By: nappsoft (nappsoft) 2017-08-28 09:52:36.783-0500

Looks good so far, no more crashes on the live system since the update (3-4 more days without a crash might give some certainty). I've applied the patch to about 30 systems yesterday, didn't see any regressions there. Will give an update in a few days.
By: nappsoft (nappsoft) 2017-08-31 05:43:24.759-0500

Still no crash since applying my patch (after 3.5 workdays) on the system which crashed at least once every second day during the last few weeks. So the source of the problem seems to have been identified and I think that you should either take my patch or work on a cleaner solution. I'll also write a mail to the pjsip mailing list.
By: Joshua C. Colp (jcolp) 2017-08-31 05:49:03.261-0500

This issue has been accepted and is already in queue to be looked at.
By: nappsoft (nappsoft) 2017-08-31 06:10:17.822-0500

Good to know! I'm sorry, should not have sounded like a command ;)
By: Kevin Harwell (kharwell) 2017-10-23 12:04:16.192-0500

The issue here is that it appears musl (or the version you are using) does not support the _getaddrinfo_ method, which is a thread safe alternative to _gethostbyname_. The pjproject configure script uses the following to see if _getaddrinfo_ is supported by the platform:
{noformat}
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking if getaddrinfo() is available" >&5
$as_echo_n "checking if getaddrinfo() is available... " >&6; }
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
int
main ()
{
getaddrinfo(0, 0, 0, 0);
;
return 0;
}
_ACEOF
if ac_fn_c_try_compile "$LINENO"; then :
$as_echo "#define PJ_SOCK_HAS_GETADDRINFO 1" >>confdefs.h
{noformat}
If it is then _getaddrinfo_ is used by pjsip's _pj_getaddrinfo_ function. Otherwise it falls back to using _gethostbyname_.

If it is possible I'd highly recommend using a library that supports the _getaddrinfo_ function. Barring that you'll have to make the pjsip code thread safe, which you've done with your patch.

Unfortunately, the patch as-is is not in a state to be submitted upstream. According to the man page for [gethostbyname|http://man7.org/linux/man-pages/man3/gethostbyname.3.html] the "_r" implementations are gnu only extensions.

In order for us to submit this patch upstream to pjproject the gnu extension calls will need to be wrapped in #defines. Another option would be to not use the extension(s), but use some kind of locking mechanism.

By: nappsoft (nappsoft) 2017-10-23 23:51:51.133-0500

This is complete nonsense: of course musl supports getaddrinfo (also the version used by me: checking if getaddrinfo() is available... yes). You must have looked into pjproject and musl wrongly, as it is quite obvious that in the described case pj_gethostbyname is used (I've put into one message the complete sequence of involved functions, nothing is influenced by PJ_SOCK_HAS_GETADDRINFO there... So I really don't know what you're talking about).

I've fixed the issue for my needs more than two months ago (see submitted patch), everyhting is working perfect since then. It took some minutes to fix it after I've discovered the reasons for the memory corruptions, so I don't understand why the asterisk devs didn't integrate a fix before the pjsip team fixed it (after days without anything happening I've also reported the bug to the pjsip team, which fixed the issue immediatly (fixed in 2.7)). So as you bundle pjproject 2.7 now, the issue should be fixed. Have a look at the changes made by the pjsip team: https://trac.pjsip.org/repos/changeset/5644/pjproject
By: Kevin Harwell (kharwell) 2017-10-24 09:32:40.698-0500

{quote}
You must have looked into pjproject and musl wrongly, as it is quite obvious that in the described case pj_gethostbyname is used
{quote}
Ah yes you are correct. I did miss the call from sock_common.c. I was describing what happens when a call is made to the pj_getaddrinfo method, which you were not using.
{quote}
It took some minutes to fix it after I've discovered the reasons for the memory corruptions, so I don't understand why the asterisk devs didn't integrate a fix before the pjsip team fixed it
{quote}
We have many issue to work on and we get to issues when we can. Your fix, while sufficient for your needs, was not an acceptable fix for everyone. But it appears the pjproject team cleaned it up and fixed the issue, which we're glad to hear. Thanks for letting us know!