ASTERISK-24472: Asterisk Crash in OpenSSL when calling over WSS from JSSIP

[Home]

Summary: ASTERISK-24472: Asterisk Crash in OpenSSL when calling over WSS from JSSIP

Reporter: Badalian Vyacheslav (slavon) Labels: Security

Date Opened: 2014-10-30 07:30:44 Date Closed: 2014-12-10 07:36:46.000-0600

Priority: Critical Regression?

Status: Closed/Complete Components: Resources/res_http_websocket

Versions: 11.13.1 11.14.1 Frequency of
Occurrence

Related
Issues:

Environment: Opera 20.0.1387.77. Use: DTLS, WSS, Valid SSL certificate Client - jssip 0.3.0 Attachments: ( 0) ASTERISK-24472-11-round-3.diff
( 1) ASTERISK-24472-null-3.diff
( 2) ASTERISK-24472-null-4.diff
( 3) ASTERISK-24472-websocket-read-bail-2.diff
( 4) backtrace_openssl_debug1.txt
( 5) backtrace_openssl_debug2.txt
( 6) backtrace_openssl_debug3.txt
( 7) backtrace_openssl_debug4.txt
( 8) backtrace_openssl_debug5.txt
( 9) backtrace.txt
(10) backtrace2.txt
(11) backtrace3.txt
(12) locks3.txt
(13) rt3592.patch
(14) valgrind_srtp15_1.txt
(15) valgrind_srtp15_XXX10.txt
(16) valgrind_srtp15_XXX11.txt
(17) valgrind_srtp15_XXX6.txt
(18) valgrind.txt
(19) valgrind10.txt
(20) valgrind2.txt
(21) valgrind3.txt
(22) valgrind4.txt
(23) valgrind7.txt
(24) ws_rewrite_2.diff
(25) ws_rewrite_3.diff
(26) ws_rewrite.diff

Description: Valgrind and GDB backtrace (3 pices) attached bellow

CentOS x86_64 release 6.6 (Final)

OpenSSL> version
OpenSSL 1.0.1e-fips 11 Feb 2013

# rpm -qa | grep openssl
openssl-devel-1.0.1e-30.el6_6.2.x86_64
openssl-debuginfo-1.0.1e-30.el6_6.2.x86_64
openssl-1.0.1e-30.el6_6.2.x86_64

Comments: By: Badalian Vyacheslav (slavon) 2014-10-30 07:39:15.783-0500

Also have this errors

[2014-10-30 15:37:26] ERROR[17876]: /home/obs.agira/asterisk-11.13.1/include/asterisk/utils.h:569 _ast_realloc: Memory Allocation Failure in function __ast_websocket_read at line 465 of res_http_websocket.c
[2014-10-30 15:37:26] WARNING[17876]: res_http_websocket.c:466 __ast_websocket_read: Failed allocation: 0x7effb03d3950, 0, 0

after it:

*** glibc detected *** asterisk: double free or corruption (!prev): 0x00007eff8838f5a0 ***
Segmentation fault (core dumped)

BT attach bellow

By: Badalian Vyacheslav (slavon) 2014-10-30 07:45:26.343-0500

If we was call from Сhrome server is not down. We think that was then we call from Opera 20.0.1387.77.

Use: DTLS, WSS, Valid SSL certificate
Client - jssip 0.3.0
By: Badalian Vyacheslav (slavon) 2014-10-30 08:06:08.349-0500

Added some files but with OpenSSL Debug info loaded
By: Badalian Vyacheslav (slavon) 2014-10-30 08:08:22.255-0500

Added valgrind test. 5+ calls from one client. Browser Opera
By: Rusty Newton (rnewton) 2014-11-05 12:01:07.208-0600

You may go ahead and file this issue with the OpenSSL project as well just in case the bug is there. When you do, please go ahead and add a link to that bug here.
By: Matt Jordan (mjordan) 2014-11-05 16:46:23.615-0600

So, the patch I've just attached ( {{ASTERISK-24472-websocket-read-bail.diff}} ) may not solve all of the problems. However, your error message of a {{realloc}} failing is disconcerting: we failed to allocate a big enough buffer to deal with the message that just arrived, and instead of bailing on the websocket connection, we are actually returning 0 with a payload length of 0.

Unfortunately, up in {{chan_sip}}, that looks like success - so it attempts to process things with a payload length of 0, which eventually causes what appears to be a memory corruption. Yikes.

This patch is a trial - if the {{realloc}} fails, we tell the higher level to not process the read request. That *should* at least handle the failure in the memory allocation better in {{chan_sip}}, which *may* avoid the crashes you're seeing. It may result in the call being cancelled (in fact, I suspect it will), but that's not terribly surprising: we failed to allocate memory, and something has to give. I'm hopeful it may at least not crash after that point.

If you can test with this patch under valgrind as well, that'd be appreciated. That may at least clear Asterisk out of this - although you may still have problems in OpenSSL.
By: Badalian Vyacheslav (slavon) 2014-11-06 04:47:12.386-0600

Link to bugreport https://rt.openssl.org/Ticket/Display.html?id=3592&user=guest&pass=guest

We will try patch today
By: Badalian Vyacheslav (slavon) 2014-11-06 06:17:44.963-0600

Attached 2 valgrind test after patch applyed
By: Badalian Vyacheslav (slavon) 2014-11-06 06:18:31.260-0600

Tested and valgrind attached
By: Badalian Vyacheslav (slavon) 2014-11-06 06:25:38.908-0600

Crash with patch. After 20 min uptime. About 100 calls
By: Badalian Vyacheslav (slavon) 2014-11-06 06:28:58.988-0600

Again. After few minutes of work
By: Badalian Vyacheslav (slavon) 2014-11-06 06:47:07.663-0600

also now i apply ASTERISK-24333 patch because in valgrind see some error.
Its does not go to git?
By: Badalian Vyacheslav (slavon) 2014-11-06 07:30:09.254-0600

lanch valgrind for 24 hours test.

I look to code. Maybe need return if Payload = 0 before malloc?

In code we have warnings:
[2014-10-30 15:36:50] WARNING[17665] res_http_websocket.c: Failed allocation: 0x7effb0144ca0, 0, 0
[2014-10-30 15:37:26] WARNING[17876] res_http_websocket.c: Failed allocation: 0x7effb03d3950, 0, 0
[2014-10-30 15:38:00] WARNING[18045] res_http_websocket.c: Failed allocation: 0x7effb02fe010, 0, 0

By: Badalian Vyacheslav (slavon) 2014-11-06 07:37:48.083-0600

==62557== Address 0x5fb4fa0 is 0 bytes after a block of size 512 alloc'd

Its look to you allocated 512 bytes but realloc payload data to 0 bytes because *payload = 0
By: Matt Jordan (mjordan) 2014-11-06 08:50:25.584-0600

First, the patch on ASTERISK-24333 doesn't matter, as that isn't the same issue as this one. As pointed out, that is a memory corruption in {{libsrtp}} itself.

Are you sure the patch was applied correctly? Looking at your valgrind output, {{chan_sip}} is still attempting to get the string after {{ast_websocket_read}}, which should not be possible if the {{realloc}} failed (which it did in your trace) and returned {{-1}} (which it does with the patch).

Do you know that you are sending a 0 byte message inside a WebSocket frame? The {{payload_len}} is set to 0 if the {{realloc}} fails - if it is 0 prior to that point, I would expect things to still be "okay" although sending an empty WebSocket message would be rather odd.
By: Badalian Vyacheslav (slavon) 2014-11-06 09:07:14.134-0600

i think so
{code}
[root@ds-asterisk01 asterisk-11.13.1]# diff ./../asterisk-11.13.1_gcc_old/res/res_http_websocket.c res/res_http_websocket.c -u
--- ./../asterisk-11.13.1_gcc_old/res/res_http_websocket.c 2014-06-26 16:06:22.000000000 +0400
+++ res/res_http_websocket.c 2014-11-06 14:58:37.199623246 +0300
@@ -467,7 +467,7 @@
session->payload, session->payload_len, *payload_len);
*payload_len = 0;
ast_websocket_close(session, 1009);
- return 0;
+ return -1;
}

/* Per the RFC for PING we need to send back an opcode with the application data as received */

[root@ds-asterisk01 asterisk-11.13.1]# md5sum res/res_http_websocket.so /usr/lib/asterisk/modules/res_http_websocket.so
643340c9f6bb09dd9c22b88800234350 res/res_http_websocket.so
643340c9f6bb09dd9c22b88800234350 /usr/lib/asterisk/modules/res_http_websocket.so

{code}

Mybe its next part of code?

{code}
if ((*opcode == AST_WEBSOCKET_OPCODE_PING) && (ast_websocket_write(session, AST_WEBSOCKET_OPCODE_PONG, *payload, *payload_len))) {
*payload_len = 0;
ast_websocket_close(session, 1009);
return 0;
}
{code}

You close socket but return true and chan_sip still work
By: Badalian Vyacheslav (slavon) 2014-11-06 09:12:28.924-0600

Realloc can't be fail. its strange. Server have more than 32 gb free memory and all compiled by x64. Maybe wrong address or anything other?

We have many calls in test. Maybe some calls after memory corruption do realloc fail, but valgrind simple say - payload was 512 bytes (init size of adress) and now 0 bytes. You was realloc it to 0 bytes becouse payload_len = 0 and you do realloc(...,0)...

I thnik its two different questions. First - why chan_sip do next work and second - why allocated buffer 512 of bytes stay 0 bytes (may be its normal and valgrind do panic) :)
By: Matt Jordan (mjordan) 2014-11-06 09:26:37.750-0600

You may be right - the PING may be catching it here.

Well, {{realloc}} can fail even if you have a lot of memory - if you don't have enough contiguous memory to allocate a particular chunk, it will fail even if you technically have sufficient memory.

Since you're able to reproduce this pretty easily (and neither Rusty nor I can), let's try the round 2 of the patch - it will only attempt a {{realloc}} if we have a payload body, will bail out if the write on the PING fails, and only attempts to write a body out if we have a payload.
By: Badalian Vyacheslav (slavon) 2014-11-06 09:52:52.847-0600

Ok. Tomorow i will test it. :)
By: Badalian Vyacheslav (slavon) 2014-11-07 07:59:07.946-0600

After patch 2
By: Badalian Vyacheslav (slavon) 2014-11-07 07:59:27.784-0600

added valgrind after test.

Warnings like this
Failed allocation: 0x7effb0144ca0, 0, 0
is not showed.

But valgrind say about realloc to 0.... strange
By: Matt Jordan (mjordan) 2014-11-07 09:02:19.545-0600

Yeah, I'm at a loss here. I'm not sure how we aren't getting that message but still ending up with valgrind complaining about the {{realloc}}. We're also still clearly setting the string in {{chan_sip}}, which makes no sense.

For that to occur (with the patch I've just attached), {{* payload_len}} *must* be non-zero:

{noformat}
if (*payload_len && !(new_payload = ast_realloc(session->payload, (session->payload_len + *payload_len)))) {
{noformat}

The only thing I can think to do at this point it to throw some serious debug into {{ast_http_websocket}} code to try and figure out what in the blue blazes is going on in here.

I'll attach a patch with that in a few minutes. When you re-test with it, let's get:
* A pcap of the message traffic
* A debug log ( {{mylog => debug,verbose,notice,warning,error}} ) showing the traffic leading up to the corruption/crash

By: Matt Jordan (mjordan) 2014-11-07 09:06:43.347-0600

Just as a further question: are there any other WARNINGs or ERRORs leading up to the crash?
By: Matt Jordan (mjordan) 2014-11-07 09:35:42.912-0600

Patch (ASTERISK-24472-11-round-3.diff) with more debugging added. Hopefully this gives us the pointer to what is causing the problem.

By: Badalian Vyacheslav (slavon) 2014-11-07 10:05:20.082-0600

We use WSS.... pcap does not help....
By: Badalian Vyacheslav (slavon) 2014-11-07 16:35:30.893-0600

I rewrite you code. Test for small load look good. On mondey i will try its for high load and post patch if all be ok.
By: Badalian Vyacheslav (slavon) 2014-11-10 06:49:36.478-0600

Valgrind after my patch
By: Badalian Vyacheslav (slavon) 2014-11-10 06:52:39.448-0600

Patch to fix memleak.

Also i rewrite READ part
-its more C style development
-its catch more life situation
-its have more readable logic
-Maybe need fix "russian english" in text
Tested. 200 calls from different WebRTC clients and browsers
By: Badalian Vyacheslav (slavon) 2014-11-10 10:20:46.490-0600

{code}
- if (!(req.data = ast_str_create(payload_len + 1))) {
+ if (!(req.data = ast_str_create(payload_len))) {
goto end;
}

- if (ast_str_set(&req.data, -1, "%s", payload) == AST_DYNSTR_BUILD_FAILED) {
+ if (ast_str_set(&req.data, payload_len, "%s", payload) == AST_DYNSTR_BUILD_FAILED) {
deinit_req(&req);
goto end;
{code}

Please look to this code. Its may be secutrity bug!

if
- payload_len = 30
- paylad[0] = [some data of 30 bytes without '\0' at end]
- paylad[payload_len] = [some data of XXXX bytes] - this can be happened after realloc. We not shure that paylad[payload_len] is zero memory part.

and you do
- ast_str_set(&req.data, -1, "%s", payload)
WoW WoW WoW.... payload without '\0' now have lenght 30 + XXXX bytes.... and "-1" param.... its can be very very long string....

By: Badalian Vyacheslav (slavon) 2014-11-10 10:24:50.029-0600

Patch tested on big load.... have one crash by OpenSSL... but by assert... it is not memory corruprion... we will more test it....
By: Matt Jordan (mjordan) 2014-11-10 14:17:21.635-0600

[~slavon]: Unfortunately, your patch doesn't follow the project coding guidelines. See here for more information:

https://wiki.asterisk.org/wiki/display/AST/Coding+Guidelines

I'd really rather not bundle general improvements to {{res_http_websocket}} - which are subject to a lot of debate - with the actual bug fix for this memory corruption. If you'd like to split that up into two patches, that'd be fine.

Were additional changes necessary to the patch I posted to fix the actual memory corruption?
By: Badalian Vyacheslav (slavon) 2014-11-10 16:57:41.711-0600

I write this pach to be more linear logic.
But after my patch was have some results as begin.

My patch help me found problem.

Valgrind say 2 lines
==15529== Invalid read of size 1
==15529== Address 0x7ed57d0 is 0 bytes after a block of size 512 alloc'd

But its problem of chicken and eggs. We was think that have problem becouse memory realloc to 0 bytes and then reading but its false.
Problem (that valgrind showed) becouse:
1. Invalid read memory
2. Realoc failed memory to small part (and address cleaned becouse have X bytes and we needed to Y.... Y < X - ITS NORMAL BY CODE LOGIC)
Its all that show valgrind but its not problem. First problem is Invalid read.

Part 2 of my patch its simple this part:

{code}
if (opcode == AST_WEBSOCKET_OPCODE_TEXT || opcode == AST_WEBSOCKET_OPCODE_BINARY) {
struct sip_request req = { 0, };

- if (!(req.data = ast_str_create(payload_len + 1))) {
+ if (!(req.data = ast_str_create(payload_len))) {
goto end;
}

- if (ast_str_set(&req.data, -1, "%s", payload) == AST_DYNSTR_BUILD_FAILED) {
+ if (ast_str_set(&req.data, payload_len, "%s", payload) == AST_DYNSTR_BUILD_FAILED) {
deinit_req(&req);
goto end;
{code}

{code}
+ if (!(new_payload = ast_realloc(session->payload, (session->payload_len + 1)))) {
+ ast_log(LOG_WARNING, "Failed allocation: %p, %zu, %"PRIu64"\n",
+ session->payload, session->payload_len, *payload_len);
+ ast_websocket_close(session, 1009);
+ return -1;
}

+ session->payload = new_payload;
+ session->payload[session->payload_len] = '\0';
+ session->payload_len += 1;
{code}

This was fix memory leak for copy string without length after realloc to memory where next data block (after payload) started not from '\0';
Only this part fix problem.
All other parts of patch do for us more accurate catching RFC situations and get code more readable and understanding.

By: Badalian Vyacheslav (slavon) 2014-11-11 17:41:45.307-0600

Any updates?
By: Badalian Vyacheslav (slavon) 2014-11-12 13:09:59.367-0600

Patch tested on 10 concurent calls. Total calls - 1000. One crash on OpenSSL assert after 964 calls. Before patch we get crash after 30-50 calls.
After patch asterisk get more stability.

I will try to fix SRTP leak and maybe it fix openssl asstet crash.

By: Badalian Vyacheslav (slavon) 2014-11-22 07:18:38.583-0600

Fix code guildlines

Please review patch. Its fix memory leak in WS and we not have crashes or leaks in WS after apply patch.
Patch tested on 2 production servers. More 100 000 calls. Status - stable.

By: Badalian Vyacheslav (slavon) 2014-11-22 08:07:05.604-0600

Updated.
1. Fix small Codeguidless in my code
2. Fix small space typo Codeguidless in chan_sip.c "){" -> ") {"
By: Joshua C. Colp (jcolp) 2014-11-22 15:12:35.653-0600

Can you please try this alternative patch?

While I realize yours fixes things I'd like to fix the crash first and then we can explore changing stuff further.
By: Joshua C. Colp (jcolp) 2014-11-24 06:40:07.185-0600

Updated patch with some res_http_websocket tweaks.
By: Badalian Vyacheslav (slavon) 2014-11-24 06:49:55.543-0600

I wiil try but not shure... it is production servers with lage calls (more 20k per day).

{code}
+ free(session->payload);
+ session->payload = NULL;
{code}

1. I think you must use ast_free
2. session->payload is auto free in res_http_websocket.c:session_destroy_fn if we exit from read while and destroy object.
3. If new payload_len < old payload_len realloc destroy data and memcpy replace it by new
By: Badalian Vyacheslav (slavon) 2014-11-24 06:53:56.777-0600

Added valgrind test 300+ calls with my patch (ws_rewrite.diff) and libsrtp 1.5.0
By: Joshua C. Colp (jcolp) 2014-11-24 06:56:21.237-0600

1. Fixed
2. The loop won't exit with this patch, payloads with 0 are handled fine (I've tested it)
3. That's on purpose - unless it is fragmented each new message adjusts the size of session->payload, it's only when a frame is across multiple messages that it will grow the payload to store the new data at the end
By: Badalian Vyacheslav (slavon) 2014-11-24 06:58:56.299-0600

All errors in WSS and CHAN_SIP look to bug in Openssl (aesni_cbc_encrypt ) and cloned structures from it.

I think my patch is stable for deploy it to production.
I will test it on one production server without WSS(TLS).
I will post if after patch and update libsrtp to 1.5.0 we will not see crashes more...
By: Joshua C. Colp (jcolp) 2014-11-24 07:02:22.883-0600

Whose patch are you testing - yours or mine?
By: Badalian Vyacheslav (slavon) 2014-11-24 07:06:59.967-0600

I will try test you patch if bussness say OK to get resources, but not shure....
By: Badalian Vyacheslav (slavon) 2014-11-24 07:07:28.972-0600

> Whose patch are you testing - yours or mine?
My patch
By: Joshua C. Colp (jcolp) 2014-11-24 07:10:12.763-0600

Can you explain what you fixed, how it was causing OpenSSL to crash, and how you fixed it?
By: Joshua C. Colp (jcolp) 2014-11-24 07:42:51.256-0600

Based on my own analysis I believe that while chan_sip expecting a NULL terminator is a bug the real issue is the handling of payload length 0. Given the following scenario:

1. Websocket client connects
2. Websocket client sends text of payload length 25
3. Websocket client sends text of payload length 0
4. Websocket client disconnects

The code will incorrectly treat this as an error when in reality realloc has freed the underlying payload. As a result when the session is destroyed the payload will be freed a *SECOND* time. If the memory is now in use elsewhere this will cause problems.

My code fixes this by only doing realloc if there is a new payload, otherwise the old payload is freed and NULLed out.
By: Badalian Vyacheslav (slavon) 2014-11-28 21:29:14.967-0600

In WSS mode we have assert
https://rt.openssl.org/Ticket/Display.html?id=3592
login: guest / guest

In WS mode now we fix two memory errors. I continue test my patch and after test your
please look to ASTERISK-24566 and ASTERISK-24538

By: Badalian Vyacheslav (slavon) 2014-12-01 06:28:07.231-0600

We catch openssl DTLS bug with openssl team.
answer from openssl team:
{quote}
Thanks! That's a big help. I have managed to reproduce this. If when querying
the underlying BIO the MTU size comes back with the ridiculously small value of
13 then this problem can occur. Other ridiculously small values can cause it to
go into an infinite loop which is also bad!!!

Now I've just got to work out the fix...
{quote}
By: Badalian Vyacheslav (slavon) 2014-12-02 20:04:13.392-0600

OpenSSL give me patch to test crashes on WSS.
I go to test WSS mode

WS mode after my patch look stable. Only ASTERISK-24566 look bad in valgrind. Valgrind output after my patch i attach below.
By: Badalian Vyacheslav (slavon) 2014-12-02 20:10:23.045-0600

Test 2 days.
3 concurent WS calls. While (true). All calls hangup after 15 seconds and recall again. In answer channel is MOH.
More 30 000 calls. Valgrind looks good. Voice look good.

in log only 2 types errors and warnoings like this
{code}
[2014-12-01 18:38:44] ERROR[12425] utils.c: fwrite() returned error: Broken pipe

[2014-12-01 18:31:13] WARNING[10699][C-000000b3] chan_sip.c: sip_xmit of 0x330184c0 (len 455) to 172.16.10.102:60531 returned -1: Success
{code}

All other is clean.
By: Badalian Vyacheslav (slavon) 2014-12-03 10:07:58.397-0600

Joshua Colp, sorry... i was try to test but get deadlock from ASTERISK-24571 ....
I untar *vanila version 1.14.1* and apply your patch {{ASTERISK-24472-null-4.diff}} but astersik deadlocked after 5 minuts of work....

By: Badalian Vyacheslav (slavon) 2014-12-04 06:23:48.890-0600

We do < 1000 calls with you patch. After get deadlock. Results from valgrind and core show locks attached.
By: Badalian Vyacheslav (slavon) 2014-12-04 06:26:09.738-0600

But you patch will fix inital issue as we see in valgriind...
By: Badalian Vyacheslav (slavon) 2014-12-09 21:44:16.033-0600

Last valgrind test with {{your}} patch!