[Home]

Summary:ASTERISK-28384: res_xmpp: Crash when distribute_events=yes and res_mwi_devstate loads
Reporter:abelbeck (abelbeck)Labels:patch
Date Opened:2019-04-14 14:51:22Date Closed:
Priority:MinorRegression?Yes
Status:Open/NewComponents:Resources/res_mwi_devstate Resources/res_xmpp
Versions:13.26.0 16.3.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:AstLinux 1.3.x development Linux 3.16.64 x86_64 XMPP server via prosody 0.10.3Attachments:( 0) asterisk-13-res_mwi_devstate-load-order-fix.patch
( 1) asterisk-13-xmpp-fix.patch
( 2) config.log
( 3) menuselect.makeopts
Description:When Asterisk starts, typically every 2-20 times it will either segfault:
{noformat}
asterisk[2408]: segfault at 2b6ce0021000 ip 00002b6cc88b8f91 sp 00002b6ccababb70 error 4 in libssl.so.1.0.0[2b6cc8892000+57000]
{noformat}
or generate a syslog:
{noformat}
WARNING[3061]: res_xmpp.c:3870 in xmpp_client_thread: [asterisk] Socket read error
{noformat}

*What I know:*
Only effects Asterisk 13.26.0 / 16.3.0 with res_mwi_devstate.so allowed to load *and* res_xmpp.so configured with "distribute_events=yes".  If either "noload=>res_mwi_devstate.so" or "distribute_events=no" the crash never occurs.

*What I suspect:*
Looking at this code:
{noformat}
res_xmpp.c: xmpp_init_event_distribution()
res_mwi_devstate.c: load_module()
{noformat}

There may be some startup race condition or interaction that causes the problem.

I tested using this shell command on our system:
{noformat}
while true; do service asterisk stop ; sleep 2 ; rm /var/log/asterisk/full ; service asterisk init ; sleep 4 ; grep 'asterisk.*seg' /var/log/messages; done
{noformat}
A crash will occur within a minute or two with the above test script.

*What I can do:*
If you have a patch to test, I'll be happy to give it a try, given how hard it may be to reproduce this given the XMPP pubsub environment.

*What I can't do:*
Sorry, with our stripped binary embedded system, I can't offer any 'gdb' backtrace help.
Comments:By: Asterisk Team (asteriskteam) 2019-04-14 14:51:24.177-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2019-04-15 07:08:53.758-0500

I've labbed this up and have been unable to reproduce the problem. Each time the state is published as expected, and no crash occurs.

Can you reproduce this on a non-embedded system and get a backtrace?

By: abelbeck (abelbeck) 2019-04-15 07:29:54.551-0500

Hi Joshua, are you stopping and starting asterisk ever few seconds as I noted in the Description ?

Over the years differences in XMPP servers has caused a crash in Asterisk (now fixed in prosody), but given that res_xmpp has not changed in awhile, I'm thinking a startup interaction with the new res_mwi_devstate should be the focus.

{quote}
Can you reproduce this on a non-embedded system and get a backtrace?
{quote}
Sorry, no.


By: Joshua C. Colp (jcolp) 2019-04-15 07:39:24.678-0500

Yes, 40 times in fact. I also tried unloading and loading the res_mwi_devstate module and the res_xmpp module after that repeatedly to see if that would change things with no impact. The res_mwi_devstate module uses all core APIs to operate, the same APIs that other things use to provide device state. The res_xmpp module is also written to not even receive such information until after authentication has occurred and the session is active. It's certainly possible there's a race condition there but nothing immediately stands out. Without a backtrace I don't see anyone spending much time on this issue if any.

By: Joshua C. Colp (jcolp) 2019-04-15 07:45:33.447-0500

What are the precise embedded system specs? What is the voicemail configuration and how many voice mailboxes are there? Can you also attach the output of config.log and menuselect.makeopts?

By: abelbeck (abelbeck) 2019-04-15 07:47:28.880-0500

OK, in your test did you see any log like this ?
{noformat}
WARNING[3061]: res_xmpp.c:3870 in xmpp_client_thread: [asterisk] Socket read error
{noformat}

Also you might try running the stop/start test or 5-10 minutes to be sure, my tests are on lower end hardware, but I was able to reproduce it on two different hardware types.

By: Joshua C. Colp (jcolp) 2019-04-15 07:50:46.216-0500

No, I saw nothing like that no matter what I did.

By: Joshua C. Colp (jcolp) 2019-04-15 07:53:21.501-0500

The further details I mentioned would be nice to ensure my attempts are as accurate as possible.

By: abelbeck (abelbeck) 2019-04-15 08:00:16.602-0500

{quote}
What are the precise embedded system specs?
{quote}
Box1: PC Engines APU2 - 4 GB RAM, CPU: AMD GX-412TC SOC (4-core) @ 1000 MHz
Box2: Jetway NF9HG-2930 - 2 GB RAM, CPU: Intel Celeron N2930 (4-core) @ 1830 MHz

Using prosody 0.10.3 as a localhost XMPP server.

{quote}
What is the voicemail configuration and how many voice mailboxes are there?
{quote}
Very few, 1 or 2 boxes, mostly defaults.


By: Joshua C. Colp (jcolp) 2019-04-15 08:06:05.241-0500

And the version of OpenSSL and Iksemel?

By: abelbeck (abelbeck) 2019-04-15 08:15:16.912-0500

OpenSSL 1.0.2r

iksemel  1.5-pre1  (/usr/lib/libiksemel.so.3.1.1) archived from http://iksemel.googlecode.com/files
I looked for a newer, more official version of iksemel, no luck, but this has always worked in the past.


By: abelbeck (abelbeck) 2019-04-15 08:20:12.421-0500

Attached: Build system config.log and menuselect.makeopts

By: Joshua C. Colp (jcolp) 2019-04-15 08:21:18.336-0500

I've acknowledged this issue but as previously mentioned I don't know if or when anyone will work on it. Generally with backtraces attached they have a higher chance, but I'm unable to acquire one personally.

By: abelbeck (abelbeck) 2019-04-15 08:23:31.601-0500

Note the attached config.log was built with --without-pjproject, but exact same results with pjproject and associated modules built and loaded.

By: abelbeck (abelbeck) 2019-04-15 08:26:59.167-0500

Thanks Joshua, I'm willing to perform ugly patch hacks to help provide more clues, as distribute_events=yes calls the xmpp_init_event_distribution() function.

By: Joshua C. Colp (jcolp) 2019-04-15 08:30:05.070-0500

I'm not actively working on the issue, I was just doing initial triage.

By: abelbeck (abelbeck) 2019-04-15 13:21:52.749-0500

Attached: asterisk-13-xmpp-fix.patch

Not sure if this is a production quality fix, but it does solve the issue noted here.

By simply adding sleep(2) before calling xmpp_init_event_distribution() solves the issue.

I tested various delay amounts, (Refer to hardware description above)

sleep/usleep delay:
0.5 sec - failed/APU2,  failed/Jetway NF9HG-2930
1.0 sec - failed/APU2,  worked/Jetway NF9HG-2930
1.5 sec - worked/APU2,  worked/Jetway NF9HG-2930
2.0 sec - worked/APU2,  worked/Jetway NF9HG-2930

As such this patch uses 2 seconds.

By: Sean Bright (seanbright) 2019-04-16 06:37:41.926-0500

You're free to run that patch in your environment. If you're ever able to get a usable backtrace we'll be sure to take a look.

By: abelbeck (abelbeck) 2019-04-16 10:50:42.532-0500

Attached: asterisk-13-res_mwi_devstate-load-order-fix.patch

Possibly related, the new res_mwi_devstate module defines the load priority as AST_MODPRI_DEVSTATE_PROVIDER but does not set the AST_MODFLAG_LOAD_ORDER flag.  This patch sets AST_MODFLAG_LOAD_ORDER.

Unfortunately, this only make things worse wrt this issue.  Possibly Joshua could re-run his test with this patch added and see if the crash is reproducible.

Regardless, this patch is probably what was intended for res_mwi_devstate.

By: abelbeck (abelbeck) 2019-07-17 10:12:17.466-0500

Update:

Our project just upgraded to OpenSSL 1.1.1, as such our previous iksemel 1.5 no longer compiles with native openssl support, so our project reverted to iksemel 1.4 (with Debian patches) without any native SSL support (we don't include GNUTLS).

Given these new OpenSSL and iksemel versions, I gave this issue another look...
Sadly, no change, same segfault in the newer libssl.so.1.1
{noformat}
asterisk[9775]: segfault at 2adf30021000 ip 00002adf1f066aa9 sp 00002adf1e23d920 error 4 in libssl.so.1.1[2adf1f046000+71000]
{noformat}

BTW, please correct me if I am wrong, but Asterisk does not use any native "iksemel" SSL support, but rather Asterisk wraps iksemel with OpenSSL when needed.  As for use in Asterisk, iksemel does not need to be built with any SSL support.

Sorry I can't provide a usable backtrace.  Feel free to close this issue, or keep it open for users with the same issue that can provide a backtrace.