[Home]

Summary:ASTERISK-25213: [patch]Possibility of deadlock in chan_sip INVITE early Replace code
Reporter:Walter Doekes (wdoekes)Labels:
Date Opened:2015-06-30 09:56:27Date Closed:2015-07-04 19:12:22
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:11.18.0 11.19.0 Frequency of
Occurrence
Frequent
Related
Issues:
is related toASTERISK-25212 [patch]Segfault when using DEBUG_FD_LEAKS
Environment:Attachments:( 0) callpickup.sh
( 1) callpickup-alice.xml
( 2) callpickup-bob-waits-and-tells-charlie.xml
( 3) callpickup-charlie.xml
( 4) callpickup-extensions.conf
( 5) callpickup-sip.conf
( 6) deadlock-detect.txt
( 7) issueA25213-unlock_refer_around_sip_new.patch
( 8) issueA25213-unlock_refer_around_sip_new-PROOF.patch
Description:It seems there is a lock order problem with the following code in chan_sip:

{code}
static int handle_request_invite(struct sip_pvt *p, struct sip_request *req, struct ast_sockaddr *addr, uint32_t seqno, int *recount, const char *e, int *nounlock)
...
               /* This locks both refer_call pvt and refer_call pvt's owner!!!*/
               if (!error && ast_strlen_zero(pickup.exten) && (p->refer->refer_call = get_sip_pvt_byid_locked(replace_id, totag, fromtag)) == NULL) {
                       ast_log(LOG_NOTICE, "Supervised transfer attempted to replace non-existent call id (%s)!\n", replace_id);
                       transmit_response_reliable(p, "481 Call Leg Does Not Exist (Replaces)", req);
                       error = 1;
               } else {
                       refer_locked = 1;
               }
...
               if (!replace_id && (gotdest != SIP_GET_DEST_EXTEN_FOUND)) {     /* No matching extension found */
...
               } else {
...
                       /* First invitation - create the channel.  Allocation
                        * failures are handled below. */

                       c = sip_new(p, AST_STATE_DOWN, S_OR(p->peername, NULL), NULL, p->logger_callid);
{code}

The above code:
- locks a channel
- then calls sip_new, which eventually locks the channel list (through __ast_channel_alloc)

This order is wrong, because most of the other code does the inverse:
- locks the channel list
- locks a channel

This is a problem when:
- one thread piece of code iterates over the channel list; like say a MASTER_CHANNEL() function call
- another thread enters the shown code and obtains the channel lock before thread one has iterated over the locked channel

Now the two threads start waiting on each other.

Reproducing, see the attached files, starting with {{callpickup}}:
{noformat}
$ ls /etc/asterisk/ -1
asterisk.conf
callpickup-alice.xml
callpickup-bob-waits-and-tells-charlie.xml
callpickup-charlie.xml
callpickup.sh
extensions.conf
logger.conf
modules.conf
sip.conf
{noformat}

Run like this:
{noformat}
shell1# ./callpickup.sh
shell2# /usr/local/bin/sipp -s devil -ap devil2 -i 127.0.0.1 -nostdin \
 -nd -default_behaviors bye -sf /etc/asterisk/callpickup-alice.xml \
 -p 5073 -key tel devil 127.0.0.1
shell3# ulimit -n 2048; ulimit -c unlimited; asterisk -c
{noformat}

With DEBUG_THREADS, DONT_OPTIMIZE and DETECT_DEADLOCKS, on my system this takes less than a minute to trigger a deadlock (which may result in a crash, see ASTERISK-25212).
Sometimes it doesn't crash, in that case, it shows something like:

{noformat}
# grep locked /tmp/deadlock-detect.txt
[Jun 30 16:34:43] ERROR[30200][C-00000226]: lock.c:295 __ast_pthread_mutex_lock: chan_sip.c line 8976 (sip_pvt_lock_full): 'c[x]' was locked here.
[Jun 30 16:34:43] ERROR[30195][C-00000214]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30201][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30190][C-0000021d]: lock.c:295 __ast_pthread_mutex_lock: devicestate.c line 502 (ast_devstate_changed_literal): '&(&state_changes)->lock' was locked here.
[Jun 30 16:34:43] ERROR[30199][C-0000021a]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30197][C-00000217]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30193][C-00000211]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'c' was locked here.
[Jun 30 16:34:43] ERROR[30191][C-0000020e]: lock.c:295 __ast_pthread_mutex_lock: chan_local.c line 455 (local_queue_frame): 'chan' was locked here.
[Jun 30 16:34:45] ERROR[29779]: lock.c:295 __ast_pthread_mutex_lock: astobj2.c line 1095 (internal_ao2_callback): 'iter->c' was locked here.
{noformat}

I guess the refer_call should be unlocked before we enter sip_new; and then re-acquired.

Cheers,
Walter Doekes
OSSO B.V.
Comments:By: Asterisk Team (asteriskteam) 2015-06-30 09:56:28.682-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Rusty Newton (rnewton) 2015-06-30 15:34:56.687-0500

Thanks for being so thorough. I was able to reproduce using the scripts and configs you provided.

By: Walter Doekes (wdoekes) 2015-07-01 02:06:05.110-0500

Thanks for checking and reproducing, Rusty.

This attached proof of concept patch {{issueA25213-unlock_refer_around_sip_new-PROOF.patch}} ran (under valgrind) the entire night without deadlocks \[1\]. This morning I removed the patch again, and it deadlocked in 30 seconds \[2\].

\[1\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
 Call-rate(length)   Port   Total-time  Total-calls  Remote-host
 10.0(0 ms)/1.000s   5073   53245.13 s       164551  127.0.0.1:5060(UDP)

 0 new calls during 1.004 s period      1 ms scheduler resolution
 45 calls (limit 45)                    Peak was 45 calls, after 4 s
 0 Running, 109 Paused, 22 Woken up
 329130 dead call msg (discarded)       0 out-of-call msg (discarded)        
 2 open sockets                        

                                Messages  Retrans   Timeout   Unexpected-Msg
      Pause [   1000ms]         164551                        0        
     INVITE ---------->         164551    419958    104                
        100 <----------         0         0         0         0        
        401 <----------         164427    0         0         0        
        407 <----------         0         0         0         0        
        180 <----------         0         0         0         0        
        183 <----------         0         0         0         0        
        200 <----------         0         0         0         0        
        ACK ---------->         164427    0                            
     INVITE ---------->         164427    331580    4                  
...
{noformat}

\[2\]:
{noformat}
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
 Call-rate(length)   Port   Total-time  Total-calls  Remote-host
 10.0(0 ms)/1.000s   5073      43.17 s           74  127.0.0.1:5060(UDP)

 0 new calls during 1.004 s period      1 ms scheduler resolution
 45 calls (limit 45)                    Peak was 45 calls, after 4 s
 0 Running, 76 Paused, 13 Woken up
 18 dead call msg (discarded)           0 out-of-call msg (discarded)        
 2 open sockets                        

                                Messages  Retrans   Timeout   Unexpected-Msg
      Pause [   1000ms]         74                            0        
     INVITE ---------->         74        353       4                  
        100 <----------         0         0         0         0        
        401 <----------         42        0         0         0        
        407 <----------         0         0         0         0        
        180 <----------         0         0         0         0        
        183 <----------         0         0         0         0        
        200 <----------         0         0         0         0        
        ACK ---------->         42        0                            
     INVITE ---------->         42        96        0                  
...
{noformat}

On to fix the XXX part. I don't think the resulting patch will win any prizes, but in the good ol' chan_sip it may be sufficient.

By: Walter Doekes (wdoekes) 2015-07-01 05:15:27.578-0500

Attaching {{issueA25213-unlock_refer_around_sip_new.patch}}.

That cleans up some of the mess and adds the appropriate call destruction code (I think).

Ran valgrind on it. Looks good.

I didn't explicitly test the edge case though.