[Home]

Summary:ASTERISK-17123: [patch] fix SIP indicate deadlocks when lots of state changes
Reporter:cmaj (cmaj)Labels:
Date Opened:2010-12-16 16:21:06.000-0600Date Closed:2011-02-24 02:47:21.000-0600
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) backtrace-threads-110128.txt
( 1) chan_sip_fix_deadlocks_bug_18491.txt
( 2) core_show_locks.txt
( 3) coreshowlocks.gelo.12-39.txt
Description:Running 1.8.2-rc1 and was able to deadlock several times during failed SIP state change indications.  Found by testing 4 SIP phones in close proximity and rapidly, yet manually, transferring blind, parking calls, picking them up at another phone, etc.  Also, was using the Bridge() app for buddy pickup purposes.

****** ADDITIONAL INFORMATION ******

sip_set_rtp_peer() in chan_sip needed some ast_channel locking & unlocking.
Comments:By: Leif Madsen (lmadsen) 2010-12-20 08:47:06.000-0600

Some dialplan examples and call flows of how to reproduce this may also be useful.

By: cmaj (cmaj) 2010-12-20 13:18:13.000-0600

Phone situation:
----------------

Registered 4 phones, 5081-5084.

5081 & 5082 context=the-parkers
5083 & 5084 context=the-stealers

Extension 5082 registers hints for 5081 & 5083 & 701 (parking lot space 1).


Dialplan examples:
------------------

[the-stealers]
exten => _XXXX,1,Set(DB(lastcall/${EXTEN})=${CHANNEL})
exten => _XXXX,n,Dial(SIP/${EXTEN})
exten => _*XXXX,1,Bridge(${DB(lastcall/${EXTEN:1})})
exten => _XXXX,hint,SIP/${EXTEN}

[the-parkers]
include => the-stealers
exten => 701,1,ParkedCall(${EXTEN})
exten => 702,1,ParkedCall(${EXTEN})
exten => 701,hint,park:701@parkedcalls
exten => 702,hint,park:702@parkedcalls
exten => 700,1,NoOp
exten => 700,n,Set(PARKINGEXTEN=701)
exten => 700,n,Park()
exten => 700,n,Set(PARKINGEXTEN=702)
exten => 700,n,Park()


Call flow:
----------

The most important element of the flow to uncover this locking bug was the speed at which the transfers are taking place.  Try this:

1. From 5083 dial 5081 - do not answer at 5081.
2. While still ringing, dial *5081 from 5082 to pickup the call.
3. 5082 blind xfers to 700 (using REFER via custom "Park" soft key.)
4. 5083 gets announcement "701" then MOH.
5. When MOH starts, 5082 dials 701 to pickup the parked call.
6. Repeat steps 3-5 three times.  Rapidly.
7. 5082 blind xfers to 5084.
8. 5084 answers and IMMEDIATELY blind xfers to 5081.  (Not even enough time to say "hello.")
9. Hangup this call from 5083.
10. Repeat steps 1-9 one time (so you do all this twice).  Then it locks.  See core_show_locks.txt attachment from 2010-12-16.


Further comments:
-----------------

Problem I believe is in queue'd up sip_notify() calls -- or perhaps the lack of queue'ing of such calls.  This bug does not manifest itself w/o the "hint" line in extensions.conf so it is very much in the NOTIFY handlers.



By: Jeff Peeler (jpeeler) 2011-01-26 18:17:41.000-0600

Since the private is locked in sip_set_rtp_peer, this looks like the right fix. Several calls to sip_set_rtp_peer from remote_bridge_loop have the channels locked, but then others do not.

By: gelo (gelo) 2011-01-27 06:26:13.000-0600

We're experiencing deadlocks on a production server (aprox once a day).
I'm attaching a core show locks, which imo seems to be related to this issue, although we are not using hints.

If you could please confirm the relationship, I'll apply the patch today and let you know if * stops freezing.

By: Jeff Peeler (jpeeler) 2011-01-27 12:09:55.000-0600

Gelo, yep this is the same deadlock. Apply the patch and report back.

By: Sébastien Couture (sysreq) 2011-01-27 13:21:56.000-0600

Would that also affect 1.6.2?

By: Sébastien Couture (sysreq) 2011-01-27 13:26:34.000-0600

Nevermind.. the ast_channel_lock/unlock are not present in the 1.6.2 version either, so I guess that it does affect it too.

By: gelo (gelo) 2011-01-28 02:12:56.000-0600

Patch applied, fingers crossed

Just asking, shouldn't the patch have 'License OK'?

By: gelo (gelo) 2011-01-28 06:42:14.000-0600

We've just had a crash, I can see it's related to sip_set_rtp_peer function, that is why I'm attaching debug information here.

Asterisk was running fine until last week, when some queue agents began making transfers, then deadlock issues started. This crash happened right after one of these transfers.

Also, I had to recompile without DEBUG_THREADS, as htop showed CPU usage around 90%, asterisk process reaching 395% (AWESOME!). Users complaint about sound delays, so I removed the flag and CPU usage went down to aprox 1%, asterisk process around 70% (can't figure out how this is possible with just one quad core)



By: Jeff Peeler (jpeeler) 2011-01-28 14:32:11.000-0600

I don't know what caused the crash, can you run it under valgrind? I don't think the crash is related to the patch.

By: gelo (gelo) 2011-01-31 05:43:37.000-0600

I'm afraid I cannot, as it is a production server and I can't risk to run valgrind.
Furthermore, we're downgrading to 1.4, so I won't be able to help debug this issue.

By: cmaj (cmaj) 2011-02-01 17:54:54.000-0600

I've since "solved" this by removing Bridge() in favor of Pickup().

By: cmaj (cmaj) 2011-02-03 12:27:25.000-0600

gelo, I double-checked your core locks and mine from the OP, and it looks like these are two different (yet related) problems. Your deadlocks seem to be in the masquerade functions, mine are not.

Do you have some snippet of dialplan still remaining? Or maybe a quick few steps of usage scenario that was causing this?

By: Jeff Peeler (jpeeler) 2011-02-03 17:41:23.000-0600

I'm going to go ahead and commit this. It's too bad gelo couldn't provide further information/testing, but the fix is fine.

By: Digium Subversion (svnbot) 2011-02-03 17:49:29.000-0600

Repository: asterisk
Revision: 306215

U   branches/1.8/channels/chan_sip.c

------------------------------------------------------------------------
r306215 | jpeeler | 2011-02-03 17:49:28 -0600 (Thu, 03 Feb 2011) | 20 lines

Fix SIP deadlock involving state changes.

Once again a call to pbx_builtin_getvar_helper (and pbx_builtin_setvar_helper)
has caused locking problems. Both of these functions lock the channel when
the channel argument is passed in!

In this case, the suspected problem (the backtrace makes it impossible to tell)
was the private being locked in sip_set_rtp_peer and then:
transmit_reinvite_with_sdp
try_suggested_sip_codec
  pbx_builtin_getvar_helper
(Traced to verify that the fix was only required in 1.8 and later.)

(closes issue ASTERISK-17123)
Reported by: cmaj
Patches:
     chan_sip_fix_deadlocks_bug_18491.txt uploaded by cmaj (license 830)
Tested by: cmaj


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=306215

By: Digium Subversion (svnbot) 2011-02-03 17:50:10.000-0600

Repository: asterisk
Revision: 306216

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r306216 | jpeeler | 2011-02-03 17:50:09 -0600 (Thu, 03 Feb 2011) | 26 lines

Merged revisions 306215 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

........
 r306215 | jpeeler | 2011-02-03 17:49:28 -0600 (Thu, 03 Feb 2011) | 20 lines
 
 Fix SIP deadlock involving state changes.
 
 Once again a call to pbx_builtin_getvar_helper (and pbx_builtin_setvar_helper)
 has caused locking problems. Both of these functions lock the channel when
 the channel argument is passed in!
 
 In this case, the suspected problem (the backtrace makes it impossible to tell)
 was the private being locked in sip_set_rtp_peer and then:
 transmit_reinvite_with_sdp
  try_suggested_sip_codec
    pbx_builtin_getvar_helper
 (Traced to verify that the fix was only required in 1.8 and later.)
 
 (closes issue ASTERISK-17123)
 Reported by: cmaj
 Patches:
       chan_sip_fix_deadlocks_bug_18491.txt uploaded by cmaj (license 830)
 Tested by: cmaj
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=306216