[Home]

Summary:ASTERISK-19851: Asterisk Crashes in chan_sip when failing to create ast_str in init_resp
Reporter:Sebastian Gutierrez (sum)Labels:
Date Opened:2012-05-08 10:29:41Date Closed:2012-09-17 14:34:02
Priority:MajorRegression?
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:1.8.10.1 Frequency of
Occurrence
Related
Issues:
Environment:Centos 6.2 dahdi 2.6.1 asterisk-18.10Attachments:( 0) Backtrace.rtf
( 1) backtrace.txt
( 2) backtrace10p1.txt
( 3) backtrace10p2.txt
( 4) backtrace11.txt
( 5) backtrace2.txt
( 6) backtrace3.txt
( 7) backtrace4.txt
( 8) backtrace5.txt
( 9) backtrace6.txt
(10) backtrace7.txt
(11) backtrace8.txt
(12) backtrace9.txt
(13) BackTraceAndOthers.txt
(14) BACKTRACEDUMP.rtf
(15) backtracenueva1.txt
(16) coreshowlocks.rtf
(17) debug.tar.gz
(18) debug2.tar.gz
(19) valgrind.txt
Description:1.8.10

System crashes.

[Edit] - update by Matt Jordan

Per the backtrace, the crash occurs in malloc_consolidate when called by ast_calloc.  At the time of the crash, chan_sip is requesting an ast_str be created with an initial size of SIP_MIN_PACKET, which is 4096 bytes.
Comments:By: Sebastian Gutierrez (sum) 2012-05-08 10:30:55.169-0500

I know this is not optimized but the system that I´m using gets very slow on optimized.
let me know if this is useful.

By: Matt Jordan (mjordan) 2012-05-08 10:56:52.201-0500

The backtrace does contain valid information.

However, it appears as if the crash occurred due to your system not being able to allocate sufficient memory on the heap for the SIP response packet.  How much memory is available on the system running Asterisk?

By: Sebastian Gutierrez (sum) 2012-05-08 11:02:52.237-0500

this is the free of the server

[root@SERVER ~]# free
            total       used       free     shared    buffers     cached
Mem:       8050748    2168432    5882316          0     198088    1644976
-/+ buffers/cache:     325368    7725380
Swap:     10289144          0   10289144



By: Matt Jordan (mjordan) 2012-05-08 11:08:51.416-0500

We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information

Interesting - if you can reproduce the error, can you attach a DEBUG log using the instructions provided above.  Please make sure that 'sip set debug on' is also enabled.

By: Sebastian Gutierrez (sum) 2012-05-08 12:44:22.756-0500

I will try to get a sip debug file, although is gona be large due to this is a production environment.


By: Matt Jordan (mjordan) 2012-05-08 12:53:32.918-0500

If the file is too large, do the following:

* Use split to separate the log into chunks
* Determine what chunks correspond to the time at which the crash occurs.  Tarball those up and attach them to the issue

We don't need the entire run history, just the SIP dialog history and what Asterisk was doing when the crash happened.

By: Sebastian Gutierrez (sum) 2012-05-09 10:31:18.650-0500

This time the machine enter in deadlock as you can see on the txt the netstat to the port 5060 the udp queue start growing not being able to processs more, in the tar.gz you have full log with sip debug, at the end you can see when it stop working. I now have this same environment with don´t optimeze and  debug threads so if it happens again you can have all the information avaiable.

By: Sebastian Gutierrez (sum) 2012-05-09 10:54:30.561-0500

it crashes again I uploaded a backtrace full with dont optimize.

By: Sebastian Gutierrez (sum) 2012-05-09 11:16:32.235-0500

and this debug is for the crash with sip debug on

By: Sebastian Gutierrez (sum) 2012-05-09 14:15:22.176-0500

I can upgrade tonight to other version, maybe 1.8.12??? if this is better for moving this issue forward

By: Matt Jordan (mjordan) 2012-05-09 14:27:49.841-0500

I don't think going to a variety of point releases of asterisk 1.8 is going to change your behavior.  Your latest backtrace that you attached the issue shows another memory corruption issue, this time in manager (although this time it terminated with an abort, as opposed to a seg fault).

Looking at the logs, one thing that is very odd are:

[May  9 06:36:28] VERBOSE[10315] chan_sip.c:
<--- SIP read from UDP:192.168.11.109:5062 --->


<------------->

There are a lot of these.  Something appears to be sending empty packets to the Asterisk server.  Can you determine what is doing that, and why?

By: Richard Mudgett (rmudgett) 2012-05-09 19:12:49.228-0500

Matt:
Many implementations send cr/lf packets to keep NAT traversal pin-holes open.  This is the lightweight NAT feature just implemented in trunk.

By: Sebastian Gutierrez (sum) 2012-05-10 09:35:39.670-0500

Again another crash I moved to 1.8.11 last night, this seems again a manager crash??

full backtrace with dont optimize uploaded

By: Matt Jordan (mjordan) 2012-05-10 11:46:41.674-0500

Richard:
I know that's the lightweight NAT feature, and it shouldn't cause any problems if a phone sends that to Asterisk and it doesn't support it.  When weird stuff is happening, however, its not usually a bad approach to eliminate the things that Asterisk does not explicitly support in the version he's using.

sum:
I took a look at the latest backtrace, and it appears as if a memory corruption is occurring when the period purge_events is called on a tcptls manager connection.  However, on its face, that shouldn't be possible - each event to be purged is freed after its removed from the list, and the list should never contain an event that has already been freed.  I walked through the manager code and couldn't find a mechanism wherein an event would be left on the list after it had been deallocated.

Given the other backtrace you attached, wherein an ast_str couldn't be allocated, there still does not appear to be an obvious problem in Asterisk.  

The next step in attempting to determine what the cause of this is to use valgrind to get the most information possible regarding the memory corruptions.  We shouldn't need DEBUG_THREADS enabled, but MALLOC_DEBUG and DONT_OPTIMZIE will be needed.  Please follow the instructions on the wiki [1] and attach an output of the valgrind.txt file that is created to this issue.

https://wiki.asterisk.org/wiki/display/AST/Valgrind



By: Sebastian Gutierrez (sum) 2012-05-10 11:55:52.357-0500

Matt:  I dont have anything connecting with tls... to the manager, is only connected directly from socket 5038. Running under valgrind is impossible, remember this is a production environment, I may be able to add malloc debug but I dont know if will give you info on a backtrace

By: Sebastian Gutierrez (sum) 2012-05-10 12:40:27.116-0500

Another crash, this time asterisk enter in deadlock when I try to get the information of core show locks asterisk crashed, again a full BT

By: Sebastian Gutierrez (sum) 2012-05-10 14:26:57.685-0500

backtraces of crashes and valgrind in a test environment

By: Sebastian Gutierrez (sum) 2012-05-10 15:56:43.655-0500

other crash now seems a different place.. I have one more day to test with 1.8 according to the client, 10 crashes in one day

By: Sebastian Gutierrez (sum) 2012-05-10 17:22:18.138-0500

other involving SIP....

By: Matt Jordan (mjordan) 2012-05-14 12:46:55.255-0500

Thanks for testing this out in your test environment.

Can you put your sip.conf and extensions.conf on this issue?

Thanks

Matt

By: Sebastian Gutierrez (sum) 2012-05-14 12:49:43.424-0500

Matt: I compiled asteirsk without the function developed by me, and downgraded to 1.8.8 to see if the problem persists and it persist, the system was on sip deadlock, and a core show locks dumped asterisk, this is the backtrace generated

By: Sebastian Gutierrez (sum) 2012-05-14 12:52:13.110-0500

sip.conf

[general]

allowguest=no
allowoverlap=no
bindport=5060
bindaddr=0.0.0.0
srvlookup=yes
alwaysauthreject=yes
rtcachefriends=yes
qualifyfreq=30
rtptimeout=60
callcounter=yes
allowsubscribe=no
notifyringing=yes
notifyhold=yes
notifybusy=yes
limitonpeers=yes
subscribecontext=default
transport=udp
videosupport=no

sip peers are realtime


extensions.conf is very large but as an example of outgoing calls:

exten=> _1XXXXXXXX,1,Set(HASH(Result)=${ODBC_Data(select did\,name from queues where name in (select queue_name from queue_members where membername =  '${CDR(accountcode)}')  and direction = 'Outbound')})
exten=> _1XXXXXXXX,2,Set(__OUTQUEUE=${HASH(Result,name)})
exten=> _1XXXXXXXX,3,Set(CALLERID(num)=${HASH(Result,did)})
exten=> _1XXXXXXXX,4,Set(CALLERPRES()=allowed_passed_screen)
exten=> _1XXXXXXXX,5,Set(__AGENT=${CDR(accountcode)})
exten=> _1XXXXXXXX,6,GUID(guid)
exten=> _1XXXXXXXX,7,Set(CDR(guid)=${guid})
exten=> _1XXXXXXXX,8,Set(CDR(type)=record)
exten=> _1XXXXXXXX,9,Set(__DIALED=${EXTEN})
exten=> _1XXXXXXXX,10,MixMonitor(${guid}.gsm,b,)
exten=> _1XXXXXXXX,11,Set(CDR(campaign)=${OUTQUEUE})
exten=> _1XXXXXXXX,12,Set(CDR(accountcode)=${AGENT})
exten=> _1XXXXXXXX,13,Dial(DAHDI/g1/${EXTEN:1},60,tT,)
exten=> _1XXXXXXXX,14,Hangup()

Never had crashes with 1.6.2

By: Sebastian Gutierrez (sum) 2012-05-14 12:58:05.202-0500

and another dump, I wont be able to follow on 1.8 much time... there are some crashes that they all look the same

1) crashes on manager tls... I have tls off on manager.conf
2) sip deadlocks that crash on core show locks

the system uses chanspy, maybe is important to know for the sip deadlocks

By: Sebastian Gutierrez (sum) 2012-05-17 15:53:48.162-0500

I´m adding new data that can be usefull, the system is now with asterisk-1.8.10 and with manager.c and manager.h from the head of trunk, this seems more stable than all the other versions tested, yesterday no dump at all (I was having many every day on other releases), today I had 2 consecutives crashes and no more crashes (with 8000 calls processed after de 2 crashes ) The first crash backtrace10p1 (seems in mixmonitor) and the safe_asterisk start asterisk again and after 1 min, the second crash (seems on manager events), this have other software that reconectes when the conection to the manager is down so maybe is some cleanup that is not managed well while starting again so for that the error on purge_events that seems to crash asterisk.  

By: Sebastian Gutierrez (sum) 2012-05-18 10:40:15.958-0500

this is a coreshowlocks obtained of a sip deadlock from today.


By: Richard Mudgett (rmudgett) 2012-05-18 11:09:42.056-0500

Without looking further, the core-show-locks file shows an ast_read either locking a channel and forgetting to release the lock or locking the channel and then waiting on a poll that is not going to happen.

By: Sebastian Gutierrez (sum) 2012-05-24 16:17:36.598-0500

any update on this? let me know if I can help further, I still have my client with this version, so is possible to test patches.

thanks!

By: Sebastian Gutierrez (sum) 2012-05-28 18:01:16.705-0500

attached a new dump, this time seems different in mixmonitor, Im also seeing this messages time to time

May 28 15:46:19] ERROR[28099]: lock.c:438 __ast_pthread_mutex_unlock: audiohook.c line 705 (audio_audiohook_write_list): Error releasing mutex: Operation not permitted
[May 28 15:46:19] ERROR[28099]: lock.c:280 __ast_pthread_mutex_lock: audiohook.c line 688 (audio_audiohook_write_list): Error obtaining mutex: State not recoverable
[May 28 15:46:19] ERROR[28099]: lock.c:407 __ast_pthread_mutex_unlock: audiohook.c line 705 (audio_audiohook_write_list): mutex '&(audiohook)->lock' freed more times than we've locked!

By: Sebastian Gutierrez (sum) 2012-05-30 14:06:50.537-0500

just to add more info a memtest passed 100% ok, so the memory of the server is ok.

By: Sebastian Gutierrez (sum) 2012-06-04 13:39:24.430-0500

again more crashes, this one seems related to realtime odbc, I´ve just moved from the deprecated mysql for cdr and realtime, but I´m finding that the deprecated are far more stable than this ones, as for cdr I opend this issue ASTERISK-19957, as for this I´m attaching it here due to always after a crash safe_asterisk make asterisk up again and when that happend some minutes later I have another crash of purge_events on manager (this may lead to the problem) for now I´ve moved back to cdr_mysql due to loss of data on adaptative_odbc, as for realtime Im letting this until I have other crash that seems related then I will go back to res_config_mysql.

By: Sebastian Gutierrez (sum) 2012-09-17 14:34:02.865-0500

moved to 10 and this bug is not present