[Home]

Summary:ASTERISK-14050: [patch] Asterisk loses SIP phones, possible deadlock, 1.6.1.0
Reporter:Marc M. Adkins (madkins)Labels:
Date Opened:2009-05-01 11:03:26Date Closed:2009-05-06 12:41:03
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_timing_pthread
Versions:Frequency of
Occurrence
Related
Issues:
is related toASTERISK-21389 res_timing_pthread fails to return from write, causing timer dependent operations to block indefinitely
Environment:Attachments:( 0) core-locks.txt
( 1) desktop.dir.txt
( 2) extensions.conf
( 3) locked.txt
( 4) openreaddir.c
( 5) server.dir.txt
( 6) sip.conf
Description:I have two Cisco 7905g SIP phones connected to an Asterisk 1.6.1.0 instance running on a 64 bit Xen instance of Debian 4.0.  My initial configuration was more complex, but I've removed a lot of the complexity searching for the problem.

Basically, I can start the Asterisk server and pick up a SIP phone and call either a test extension or the other phone.  Works fine.  If I leave it alone for a time ... say over a long lunch or overnight ... I come back and the phones won't connect to Asterisk.

This repeats reliably but at unknown intervals.


****** ADDITIONAL INFORMATION ******

Using tcpdump I see SIP packets hitting the server, but not being answered:

09:00:32.229953 IP 10.104.230.41.sip > vspbx1.devint.marchex.com.sip: SIP, length: 773
09:00:32.369973 IP 10.104.230.40.sip > vspbx1.devint.marchex.com.sip: SIP, length: 447
09:00:32.667864 00:01:e8:57:a5:a1 (oui Unknown) > 01:00:0c:cc:cc:cd (oui Unknown) SNAP Unnumbered, ui, Flags [Command], length 50
09:00:32.777186 IP 10.104.230.41.sip > vspbx1.devint.marchex.com.sip: SIP, length: 333
09:00:33.278492 IP 10.104.230.41.sip > vspbx1.devint.marchex.com.sip: SIP, length: 333

(230.41 is the phone I was testing in this instace, vspbx1.devint.marchex.com is the server running asterisk)

I will attach some more data presently.
Comments:By: Marc M. Adkins (madkins) 2009-05-01 11:05:14

core locks after an overnight test ... came in this morning and asterisk locked up again

By: Marc M. Adkins (madkins) 2009-05-01 11:09:11

i have asterisk compiled with debug on and using gdb i used 'info threads' and 'thread apply all bt' to show the backtraces from various threads

i wasn't running with sip debug on so i don't have that, sorry, but it would likely have been quite large on an overnight run

By: Marc M. Adkins (madkins) 2009-05-01 11:11:18

attaching sip.conf and extensions.conf as well, note that peer connection was commented out before last night's test, up until now i was thinking that might have been part of the problem

By: Marc M. Adkins (madkins) 2009-05-01 11:13:25

i have been messing with settings all week, have tried various things, thus far none of my changes have prevented the problem, i'm more than willing to try pretty much anything but i only get two reliable tests a day (lunch and overnight) so it's been slow going

i may try tracing some source code, but i'm really hoping this was a cockpit error and i get flamed about some misbegotten setting i screwed up ;-)

By: Marc M. Adkins (madkins) 2009-05-01 16:16:13

timing thread

 init_timing_thread starts do_timing as separate thread
 do_timing calls ao2_callback with run_timer as argument
> __ao2_callback locks pthread_timers container
 run_timer is called from __ao2_callback
> run_timer locks specific timer from pthread_timers container
 tries to write byte to timer pipe
 ...and hangs..

monitor thread

 sip_request_call and sip_reload invoke restart_monitor
 restart_monitor starts do_monitor if necessary
 do_monitor calls ast_io_wait with an io context
 ast_io_wait calls back to sipsock_read
 sipsock_read calls handle_request_do
> handle_request_do locks netlock
 handle_request_do calls handle_incoming
 handle_incoming calls handle_request_invite
 handle_request_invite calls sip_new
 sip_new calls ast_channel_alloc
 ast_channel_alloc calls ast_timer_open
> ast_timer_open locks timing_interfaces
 ast_timer_open invokes a callback on a data structure
   which (eventually) invokes pthread_timer_open
> pthread_timer_open attempts to lock pthread_timers
 ...and there we go...

i haven't been able to figure out why the pipe is not read ... i'm going to instrument the code some and do more testing ... appreciate any comments

By: Marc M. Adkins (madkins) 2009-05-01 18:01:50

i was given some advice about using nat=yes, but that was actually a mistake, i had set nat=no in the general level of sip.conf but not removed it from the individual sip phones, doing so now

By: Marc M. Adkins (madkins) 2009-05-01 18:19:45

in the meantime, i instrumented res_timing_pthread.c...

i added code to track how many bytes were written to and read from each pipe inside of res_timing_pthread.c, adding code to read_pipe and write_byte

there's one FD (11 on my box) that only seems to be written to, never read from, and when it hits 65536 all debug statements stop, this seems like a problem, like there's nothing setup to read from that pipe

i also instrumented the pthread_timer_xxx functions (e.g. pthread_timer_open), but they don't get called until an actual phone call is made ... which begs the question ... why is write_byte being called so many times if the pthread_timer_open initialization is never called?

still exploring that

By: Marc M. Adkins (madkins) 2009-05-01 18:21:04

btw, filling up the pipe (if that is indeed what is happening) takes roughly an hour, which fits with what i've been observing thus far

By: Marc M. Adkins (madkins) 2009-05-01 19:01:01

so it's the do_timing call of ao2_callback w/run_timer as an argument that does it, apparently run_timer is called (a lot) to return a match value and each time it is called it writes a byte to a pipe that is (afaik) never read from

just duplicated the behavior w/nat=no throughout so that was a red herring

there's also a pile of messages at the end after it locks up that make it look like a call is in some sort of progress:

[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: write_byte(11)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: done writing byte 11 (65533)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: write_byte(11)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: done writing byte 11 (65534)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: write_byte(11)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: done writing byte 11 (65535)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: write_byte(11)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: done writing byte 11 (65536)
[May  1 16:58:59] DEBUG[13588] res_timing_pthread.c: write_byte(11)
[May  1 16:59:05] DEBUG[13585] chan_sip.c: **** Received REGISTER (2) - Command in SIP REGISTER
[May  1 16:59:05] DEBUG[13585] chan_sip.c: Trying to put 'SIP/2.0 20' onto UDP socket destined for 10.104.230.40:5060
[May  1 16:59:05] DEBUG[13582] devicestate.c: No provider found, checking channel drivers for SIP - Batman
[May  1 16:59:05] DEBUG[13582] chan_sip.c: Checking device state for peer Batman
[May  1 16:59:05] DEBUG[13582] devicestate.c: Changing state for SIP/Batman - state 1 (Not in use)
[May  1 16:59:05] DEBUG[13582] devicestate.c: device 'SIP/Batman' state '1'
[May  1 16:59:17] DEBUG[13585] acl.c: Found IP address for this socket
[May  1 16:59:17] VERBOSE[13585] netsock.c:   == Using SIP RTP CoS mark 5
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Setting NAT on RTP to Off
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Allocating new SIP dialog for 837895375@10.104.230.41 - INVITE (With RTP)
[May  1 16:59:17] DEBUG[13585] chan_sip.c: **** Received INVITE (5) - Command in SIP INVITE
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Setting NAT on RTP to Off
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Trying to put 'SIP/2.0 40' onto UDP socket destined for 10.104.230.41:5060
[May  1 16:59:17] DEBUG[13585] chan_sip.c: **** Received ACK (6) - Command in SIP ACK
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Stopping retransmission on '837895375@10.104.230.41' of Response 1: Match Found
[May  1 16:59:17] DEBUG[13585] chan_sip.c: **** Received INVITE (5) - Command in SIP INVITE
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Setting NAT on RTP to Off
[May  1 16:59:17] DEBUG[13585] chan_sip.c: We're settling with these formats: 0xc (ulaw|alaw)
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Checking SIP call limits for device Robin
[May  1 16:59:17] DEBUG[13585] chan_sip.c: Updating call counter for incoming call
[May  1 16:59:17] DEBUG[13585] res_timing_pthread.c: pthread_timer_open()
[May  1 16:59:17] DEBUG[13585] res_timing_pthread.c: pipes:  19/20

By: Gergely Dömsödi (doome) 2009-05-04 11:25:48

I had the same issue. And after a while I recognized a growing number of sip channels (20000+) waiting for an OPTIONS reply.

After adding qualify=no to the sip.conf the problem was solved.

Doesn't this help to You, too?

By: Marc M. Adkins (madkins) 2009-05-04 17:54:27

No, that doesn't help.

This thing gets more and more twisted.  I'm seeing different behavior on two different systems.  Both are running Debian 4.  I've downloaded source for 1.6.1.0 and built per normal directions on both.

On my desktop, running 32-bit D4, I don't see the problem.  On the server, running 64-bit D4, it is easily repeatable.

Right now it's looking like the order of loading of modules chan_iax2.so and res_timing_pthread.so is changing the behavior between the two.  If res_timing_pthread.so is loaded first then it is possible for chan_iax2.so to add itself to the existing timing thread, otherwise it can't happen.  When it adds to the existing timing thread (which happens on our server but not my desktop) the timing thread fires repeatedly until the pipe is filled up.

More questions to answer, I suppose.

By: Marc M. Adkins (madkins) 2009-05-04 18:44:18

on my desktop the main/loader.c load_modules() function shows (via trace statements) a regular, pseudo-alphabetized order for modules in the modules directory (autoload=yes).  This allows chan_iax2 to load prior to res_timing_pthread which avoids the pipe filling up problem

on the server the same function shows a 'random' order for modules (again autoload=yes) which just happens to always load res_timing_pthread prior to chan_iax2 so that the timer is invoked during the iax load module code and the pipe filling up occurs

i can't say why the order is different on the two systems

if i use 'preload => res_timing_pthread.so i can get the correct (?) ordering on my deskto and i'm running a test now to see if it locks up as i expect, it won't be done until tomorrow (bus schedule, sorry)

i suspect i could use 'preload => chan_iax2.so' on the server and get an incorrect ordering that would avoid the lockup, but that seems like the wrong solution

since i'm not currently using IAX for anything, i think i'm going to turn it off in modules.conf for the nonce, which may well prevent the issue entirely

hazarding a guess, it may be that if i actually had any IAX connections their normal function would invoke the proper _disable_continuous and/or _ack timer callback(s) and thus pull data from the full pipe, which would work as everyone expects

By: Marc M. Adkins (madkins) 2009-05-05 11:37:31

yup, setting 'preload => res_timing_pthread.so' in modules.conf causes the problem i've been having on the server to replicate on the desktop

without the specific ordering the problem won't occur, so i'm thinking that perhaps this issue has been around all along but never showed up before due to the directory ordering being different on different platforms

to summarize, in modules.conf use:

[modules]
autoload=yes
preload => res_timing_pthread.so

and eventually the chan_iax2.so timer will fill up its outgoing thread, unless (presumably) one actually has IAX communication

to fix, in modules.conf use:

noload => chan_iax2.so

By: Marc M. Adkins (madkins) 2009-05-05 11:39:29

i would change the category of this bug if i could (or knew how) as it now seems to have nothing to do with SIP code

it seems like this should never be able to happen, but i'm unclear on what a good fix would be, and now that i have a work-around i'm going to stop looking

if there's a good place where i should document this behavior i'm happy to write something up, but i need a pointer thereto

By: Marc M. Adkins (madkins) 2009-05-05 12:06:50

ok, one more bit, i grabbed a simple C program from the internet that just lists a directory, built and ran it on both my desktop and the server, just to show what i think the root cause of the module loading order issue is

both systems are running Debian 4.0, both are using libc.so.6 v 2.3.6, both are configured with ext3 file systems

the server is 64-bit on XEN, the desktop is 32-bit, don't know of any other differences

i'll attach the test program and directory listings

By: Marc M. Adkins (madkins) 2009-05-05 12:10:07

had to stop and sign a license agreement to upload that file, even though it wasn't actually a code submission and i hadn't checked the box indicating same, don't know if that was intended

By: Marc M. Adkins (madkins) 2009-05-05 12:18:45

of course the code should never actually depend on the directory traversal order ;-)

so there may be two issues here:

1) some modules depend on others and that relationship isn't (afaik) clearly documented anywhere (please correct me if i'm wrong, love to have that as a resource) and the appearance of issues is in some cases dependent on the directory traversal order which is unpredictable

2) the res_timing_pthread code will eventually lockup if the pipe is not periodically cleared or read from and there appears to be no fallback for this situation in cases such as this, which may occur when a module is loaded and creates a timer object and then never calls an _ack or _disable_continuous on it

both of these appear to be existing problems and may require separate solutions, as it appears that the chan_iax2.so module may not properly configure itself without a hint in the modules.conf

again ... i may have missed some key documentation ... i confess in advance to ignorance as well as laziness, impatience, and hubris

By: Digium Subversion (svnbot) 2009-05-06 12:38:52

Repository: asterisk
Revision: 192808

U   trunk/channels/chan_iax2.c

------------------------------------------------------------------------
r192808 | file | 2009-05-06 12:38:52 -0500 (Wed, 06 May 2009) | 10 lines

Fix a bug where a timer would be created but not acknowledged.

This scenario crept up if chan_iax2 was loaded with no configuration file present.
It would create a timer and tell it to go at an interval but the thread that normally
acknowledges it would not be created because no configuration file was present. The timer
will now be closed if no configuration file is present.

(closes issue ASTERISK-14050)
Reported by: madkins

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=192808

By: Digium Subversion (svnbot) 2009-05-06 12:39:54

Repository: asterisk
Revision: 192809

_U  branches/1.6.1/
U   branches/1.6.1/channels/chan_iax2.c

------------------------------------------------------------------------
r192809 | file | 2009-05-06 12:39:54 -0500 (Wed, 06 May 2009) | 17 lines

Merged revisions 192808 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r192808 | file | 2009-05-06 14:38:51 -0300 (Wed, 06 May 2009) | 10 lines
 
 Fix a bug where a timer would be created but not acknowledged.
 
 This scenario crept up if chan_iax2 was loaded with no configuration file present.
 It would create a timer and tell it to go at an interval but the thread that normally
 acknowledges it would not be created because no configuration file was present. The timer
 will now be closed if no configuration file is present.
 
 (closes issue ASTERISK-14050)
 Reported by: madkins
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=192809

By: Digium Subversion (svnbot) 2009-05-06 12:41:02

Repository: asterisk
Revision: 192810

_U  branches/1.6.2/
U   branches/1.6.2/channels/chan_iax2.c

------------------------------------------------------------------------
r192810 | file | 2009-05-06 12:41:02 -0500 (Wed, 06 May 2009) | 17 lines

Merged revisions 192808 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r192808 | file | 2009-05-06 14:38:51 -0300 (Wed, 06 May 2009) | 10 lines
 
 Fix a bug where a timer would be created but not acknowledged.
 
 This scenario crept up if chan_iax2 was loaded with no configuration file present.
 It would create a timer and tell it to go at an interval but the thread that normally
 acknowledges it would not be created because no configuration file was present. The timer
 will now be closed if no configuration file is present.
 
 (closes issue ASTERISK-14050)
 Reported by: madkins
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=192810