[Home]

Summary:ASTERISK-19373: Segmentation Fault in ast_udptl_write() due to bad memcpy() call
Reporter:Rob Gagnon (rgagnon)Labels:
Date Opened:2012-02-16 11:47:40.000-0600Date Closed:2012-08-20 10:26:07
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/T.38
Versions:10.1.2 Frequency of
Occurrence
Occasional
Related
Issues:
is related toASTERISK-19762 Segfault in ast_frdup when invalid data length specified in duplicated frame
is related toASTERISK-19468 Crash seems to be related to sip fax calls
Environment:CentOS 5.7 x86_64 24GB Intel Xeon X5650 2.67GHzAttachments:( 0) asterisk-1.8.3.2-t38-null-frame-crash.patch
( 1) ASTERISK-19373.diff
( 2) T19373_bt_full.txt
( 3) T19373_bt_full2.txt
( 4) T19373_sip_trace.pcap
Description:During Fax gateway processing over SIP, Asterisk core dumps very rarely, but we did finally correlate a core dump to an event in the logs.

gdb output:
~~~~~~~~~~~
Core was generated by `/usr/sbin/asterisk -f -U asterisk -G asterisk -vvvg -c'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003cb987c312 in memcpy () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cb987c312 in memcpy () from /lib64/libc.so.6
#1  0x0000000000544c18 in ast_udptl_write ()
#2  0x00002aaab33ca304 in sip_write (ast=0x15be71a8, frame=0x2aaab8aea008) at chan_sip.c:6577
#3  0x0000000000472c33 in ast_write ()
#4  0x0000000000477d11 in ast_generic_bridge ()
#5  0x0000000000478d1e in ast_channel_bridge ()
#6  0x00000000004a8ad5 in ast_bridge_call ()
#7  0x00002aaab2f72e88 in dial_exec_full (chan=0x2aaab87a7018, data=<value optimized out>,
   peerflags=0x41593590, continue_exec=0x0) at app_dial.c:2829
#8  0x00002aaab2f74ea6 in dial_exec (chan=0x169f3670,
   data=0x2000000000 <Address 0x2000000000 out of bounds>) at app_dial.c:2929
#9  0x00000000004f807f in pbx_exec ()
#10 0x00002aaaaab2104b in handle_exec (chan=0x2aaab87a7018, agi=0x41594830, argc=3, argv=0x41593720)
   at res_agi.c:2495
#11 0x00002aaaaab2002c in agi_handle_command (chan=0x2aaab87a7018, agi=0x41594830,
   buf=<value optimized out>, dead=0) at res_agi.c:3390
#12 0x00002aaaaab25f72 in run_agi (chan=0x2aaab87a7018, data=<value optimized out>,
   enhanced=<value optimized out>, dead=0) at res_agi.c:3581
#13 agi_exec_full (chan=0x2aaab87a7018, data=<value optimized out>, enhanced=<value optimized out>,
   dead=0) at res_agi.c:3873
#14 0x00000000004f807f in pbx_exec ()
#15 0x00000000005078bb in pbx_extension_helper ()
#16 0x000000000050a68b in __ast_pbx_run ()
#17 0x000000000050ca7b in pbx_thread ()
#18 0x000000000054712c in dummy_start ()
#19 0x0000003cba00673d in start_thread () from /lib64/libpthread.so.0
#20 0x0000003cb98d44bd in clone () from /lib64/libc.so.6

(gdb) up
#1  0x0000000000544c18 in ast_udptl_write ()

(gdb) up
#2  0x00002aaab33ca304 in sip_write (ast=0x15be71a8, frame=0x2aaab8aea008) at chan_sip.c:6577
6577    chan_sip.c: No such file or directory.
       in chan_sip.c

(gdb) p ast->tech_pvt                                                                                    
$2 = (void *) 0x15c66bf8

(gdb) p *0x15c66bf8
$3 = 0

Message in log file at the same time:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Feb 14 08:17:53] WARNING[29141] udptl.c: UDPTL (SIP/XXXXXXXXX-00002a8f): UDPTL asked to send 373781248 bytes of IFP when far end only prepared to accept 375 bytes; data loss will occur.You may need to override the T38FaxMaxDatagram value for this endpoint in the channel driver configuration.

T38FaxMaxDatagram should be 400 here, as it is set via sip.conf:
t38pt_udptl=yes,fec,maxdatagram=400

I would think the "373781248 bytes" is the problem.  This was probably a faulty packet size sent from one side, or it could have been a faulty math problem in the server maybe?

I also suspected a faulty memory module on the server, however, the server is one of 32 blades running in an HP blade system, and the internal HP hardware monitor indicates no problems with the devices.

===================================================
I'm going to investigate this issue myself as well, but wanted to post the ticket in case there are more people with similar problems, or if someone more familiar with this area of the code can see a simple solution.

I'm also about to install Asterisk with DEBUG_THREADS and DONT_OPTIMIZE enabled in order to obtain possibly more information if and when another core happens.
Comments:By: Matt Jordan (mjordan) 2012-02-16 15:15:06.313-0600

I've acknowledged this issue as there's enough to go on here; however, please to do upload a full backtrace.  Additionally, in this case, a packet capture of the SIP/T.38 traffic that produced this problem would also be useful.

By: Rob Gagnon (rgagnon) 2012-02-16 16:03:04.445-0600

Full Backtrace

By: Rob Gagnon (rgagnon) 2012-02-16 16:32:49.675-0600

SIP Trace uploaded...

I don't have a UDP trace of the T38 traffic as that would be a huge load on the server at the moment.  We don't know yet how to replicate the problem in order to know when to enable a packet capture.

We do however keep all SIP messaging traces, so maybe that could be a little help.

In the trace, Asterisk is the server with the 149.five.46.eight IP Address

By: Rob Gagnon (rgagnon) 2012-04-02 12:58:22.286-0500

Extremely similar back trace from another core dump that occurred on Asterisk 10.2.1.  Uploading here so similarities and differences between this and the first "bt_full.txt" file can be compared

By: Damjan Jovanovic (damjan-ecn) 2012-04-05 04:57:49.042-0500

Problem is caused by the following:
* some bad packet arrives on a T.38 port (seems to be an RTCP packet? I don't have it around any more).
* the parser in udptl_rx_packet() interprets it badly, resulting in an ast_frame having positive datalen but NULL data.ptr.
* thus somewhere else in Asterisk, either in ast_udptl_write() like in the above trace, or in a frdup(), a memcpy() attempts to copy a positive number of bytes from a NULL pointer
* SIGSEGV isn't caught -> "Segmentation fault"

I have a patch that returns ast_null_frame if any ast_frame returned from udptl_rx_packet() has that bad property (of NULL data.ptr and positive datalen), and in my limited testing it seems to prevent the crash, but this will probably still corrupt the T.38 session, as udptl_rx_packet() will wrongly update rx_seq_no for the ast_udptl. I can upload the patch when my license agreement is approved, but I think the parser in udptl_rx_packet() should be made more robust to bad packets instead.

Oh and by the way, this is some kind of regression, as Asterisk 1.4.21 DOESN'T have this bug.


By: Rob Gagnon (rgagnon) 2012-04-05 10:37:48.859-0500

Damjan,

Excellent description.  Bad packets were my initial instinct as well, but regular work always interferes with trying to find the amount of time to trace things like this :-)

I don't believe this is a regression issue.  Any of the segfaults we get for Faxes happen during Fax gateway operation, which is the only way we pass faxes on our platform.  Asterisk versions prior to version 10 didn't support gateway mode.

By: Damjan Jovanovic (damjan-ecn) 2012-04-10 04:02:16.586-0500

The regression would be in main/udptl.c, the file that is used by both T.38 gateway and passthrough, that's why I've found the problem with Asterisk 1.8 too.


By: Damjan Jovanovic (damjan-ecn) 2012-04-11 10:04:40.207-0500

A patch. In a week of heavy testing in production, this stops the crashes. It works by checking for a NULL frame->data.ptr when frame->datalen is non-zero, and returning ast_null_frame instead of the bad frame. But it doesn't actually fix the problem in udptl_rx_packet() so faxes could still fail.


By: Rob Gagnon (rgagnon) 2012-06-18 18:36:51.075-0500

Posting a follow-up.  We had 4 servers core dump once each today in the exact manner as posted previous.  This is now under version 10.5.0 that we crashed.

The origin of the crash always starts out in sip_write() for the frametype AST_FRAME_MODEM as suspected by Damjan above.

I had patched udptl.c in various places (including ast_udptl_write()) to log write an error if conditions are not proper before dropping in to udptl_build_packets() function, but the log message never appeared.

Note in the backtrace that we go from ast_udptl_write() into a failed memcpy() call.  There are no memcpy() calls in that function, and I could not locate any macros that might trigger a memcpy() call without jumping to another function which would then end up on the backtrace.

By: Kinsey Moore (kmoore) 2012-08-06 09:01:07.070-0500

Uploaded updated version of the patch.

By: Kinsey Moore (kmoore) 2012-08-09 09:49:54.576-0500

Rob,
Would you mind trying the patch on ASTERISK-19762 (ASTERISK-19762_fix.diff)?  I suspect that this is the same bug reported in that issue.

By: Rob Gagnon (rgagnon) 2012-08-09 11:59:13.683-0500

I can... I've been running currently with the ASTERISK-19373.diff patch in order to test it out.

Should I run with that AND the 19762 patch, or just one of them?

By: Kinsey Moore (kmoore) 2012-08-09 12:03:19.322-0500

Just one of them.  The ASTERISK-19373.diff patch should be unnecessary.

By: Rob Gagnon (rgagnon) 2012-08-09 14:31:53.774-0500

Ok.  Will report back when a difference is noted.  It does however take a number of days, and sometimes weeks before this bug would appear, so it might be a while.

By: Kinsey Moore (kmoore) 2012-08-17 08:30:18.035-0500

Since it has been about a week now, how are things looking on your end?

By: Rob Gagnon (rgagnon) 2012-08-17 10:57:26.157-0500

We upgraded 8 of our servers from 10.5.2 (without the patch) to 10.7.0 (with the 19762 patch) on 8/9/2012.  Without any issues appearing by the 14th, we upgraded the rest of them, and so far have no seg faults caused by this issue.

This is generally good news, and reading the code of the patch, looks like it should solve the problem.  For full disclosure though, I think our T.38 call volume is much lower than it has been due to our platform not always being able to handle the fax calls because of the bug.  We are still waiting for more of our customers to start routing more of their T.38 calls through us.

We've also had updates from CentOS for glibc that may have been a contributing issue as well.  I would bet we could close this issue and re-visit it if anything comes up again.