Summary: | ASTERISK-17431: [patch] Deadlock with attended transfer of SIP call | ||
Reporter: | Alec Davis (alecdavis) | Labels: | |
Date Opened: | 2011-02-17 19:25:41.000-0600 | Date Closed: | 2011-02-25 12:58:13.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/RTP |
Versions: | 1.8.2 | Frequency of Occurrence | |
Related Issues: | |||
Environment: | Attachments: | ( 0) bug18837.diff.txt ( 1) bug18837.diff2.txt ( 2) bug18837.diff3.txt ( 3) bug18837-trunk.diff.txt ( 4) bug18837-trunk.diff2.txt ( 5) bug18837-trunk.diff3.txt ( 6) CoreShowLocks.txt ( 7) CoreShowLocks-test.txt ( 8) coreshowlocks-trunk-r308680.txt ( 9) SIP-refer-deadlock.txt | |
Description: | 3 SIP phones. A calls B, and B answers on line 1. B puts A on hold by selecting line2. B calls C, and C answers. B initiates transfer of line1 to line2, phone uses REFER. ****** ADDITIONAL INFORMATION ****** sip.conf directmedia/canreinvite not defined, thus defaults to yes. also deadlocks with latest trunk downloaded today. | ||
Comments: | By: Alec Davis (alecdavis) 2011-02-17 19:41:48.000-0600 summary: of SIP-refer-deadlock.txt === Thread ID: -1292625008 (do_monitor started at [24470] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 23964 handle_request_do &netlock 0xb6796e80 (1) === ---> Lock #1 (channel.c): MUTEX 6211 ast_do_masquerade channels 0x8d4e0c8 (1) === ---> Lock #2 (channel.c): MUTEX 6214 ast_do_masquerade original 0xbd98f48 (1) === ---> Lock #3 (channel.c): MUTEX 6234 ast_do_masquerade clonechan <u>0xb24bf7d0</u> (1) === ---> Waiting for Lock #4 (chan_sip.c): MUTEX 6164 sip_fixup p <u>0xb24bab10</u> (1) === --- ---> Locked Here: chan_sip.c line 27632 (sip_set_rtp_peer) === ------------------------------------------------------------------- === === Thread ID: -1315861616 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (chan_sip.c): MUTEX 27632 sip_set_rtp_peer p <u>0xb24bab10</u> (1) === ---> Waiting for Lock #1 (pbx.c): MUTEX 9467 pbx_builtin_getvar_helper chan <u>0xb24bf7d0</u> (1) === --- ---> Locked Here: channel.c line 6234 (ast_do_masquerade) === ------------------------------------------------------------------- === === Thread ID: -1316602992 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Waiting for Lock #0 (channel.c): MUTEX 6211 ast_do_masquerade channels 0x8d4e0c8 (1) === --- ---> Locked Here: channel.c line 6211 (ast_do_masquerade) === ------------------------------------------------------------------- === ======================================================================= By: Irontec (irontec) 2011-02-18 01:43:26.000-0600 Hi, Do you have directmedia=yes/update and dtmfmode=info for those peers ??? Maybe is related to 0018734. By: Alec Davis (alecdavis) 2011-02-18 02:40:07.000-0600 bug18837.diff.txt glueX->update_peer points to sip_set_rtp_peer which will lock the pvt without any care for the channel lock. Now before update_peer is called, use correct locking order, channel then pvt. By: Alec Davis (alecdavis) 2011-02-18 02:53:01.000-0600 Irontec: I have the following on the peers DirectMedia : Yes DTMFmode : rfc2833 By: Irontec (irontec) 2011-02-18 03:50:41.000-0600 Sorry, but with this config we havn't got any lock.. (We can't reproduce the lock) The problems are with dtmfmode=info: -With directmedia=yes there are two Invites from Asterisk. http://lists.digium.com/pipermail/asterisk-dev/2011-February/047827.html -With directmedia=update we have deadlocks. See ASTERISK-1848734 I'll try to reproduce your deadlock without the patch. By: Alec Davis (alecdavis) 2011-02-18 04:02:01.000-0600 ASTERISK-17101 transferts-bug-lock.txt has same locks as ours. A summary of ASTERISK-17101 transferts-bug-lock.txt below: ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: -1291523216 (do_monitor started at [24482] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 23976 handle_request_do &netlock 0xb671c800 (1) === ---> Waiting for Lock #1 (channel.c): MUTEX 6211 ast_do_masquerade channels 0x966b1c8 (1) === --- ---> Locked Here: channel.c line 6211 (ast_do_masquerade) === ------------------------------------------------------------------- === === Thread ID: -1355433104 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (chan_sip.c): MUTEX 27649 sip_set_rtp_peer p <b>0x9680738</b> (1) === ---> Waiting for Lock #1 (pbx.c): MUTEX 9467 pbx_builtin_getvar_helper chan <b>0xb71f8d10</b> (1) === --- ---> Locked Here: channel.c line 6234 (ast_do_masquerade) === ------------------------------------------------------------------- === === Thread ID: -1355682960 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 6211 ast_do_masquerade channels 0x966b1c8 (1) === ---> Lock #1 (channel.c): MUTEX 6214 ast_do_masquerade original 0xb71e8220 (1) === ---> Lock #2 (channel.c): MUTEX 6234 ast_do_masquerade clonechan <b>0xb71f8d10</b> (1) === ---> Waiting for Lock #3 (chan_sip.c): MUTEX 6164 sip_fixup p <b>0x9680738</b> (1) === --- ---> Locked Here: chan_sip.c line 27649 (sip_set_rtp_peer) === ------------------------------------------------------------------- === ======================================================================= By: Irontec (irontec) 2011-02-18 04:30:13.000-0600 Hi, I've applied the patch (1.8.2.3) and done an attended transfer with directmedia=update and dtmfmode=info, but we already have a deadlock... See CoreShowLocks.txt With dtmfmode=rfc2833 We don't have any problem without the patch. :( By: Alec Davis (alecdavis) 2011-02-18 04:30:26.000-0600 ASTERISK-17338 ShowLocks.txt also has same locking sequence. Summary below: === ------------------------------------------------------------------- === === Thread ID: -1355433104 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (chan_sip.c): MUTEX 27649 sip_set_rtp_peer p <b>0x9680738</b> (1) === ---> Waiting for Lock #1 (pbx.c): MUTEX 9467 pbx_builtin_getvar_helper chan <b>0xb71f8d10</b> (1) === --- ---> Locked Here: channel.c line 6234 (ast_do_masquerade) === ------------------------------------------------------------------- === === Thread ID: -1355682960 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (channel.c): MUTEX 6211 ast_do_masquerade channels 0x966b1c8 (1) === ---> Lock #1 (channel.c): MUTEX 6214 ast_do_masquerade original 0xb71e8220 (1) === ---> Lock #2 (channel.c): MUTEX 6234 ast_do_masquerade clonechan <b>0xb71f8d10</b> (1) === ---> Waiting for Lock #3 (chan_sip.c): MUTEX 6164 sip_fixup p <b>0x9680738</b> (1) === --- ---> Locked Here: chan_sip.c line 27649 (sip_set_rtp_peer) By: Alec Davis (alecdavis) 2011-02-18 04:34:39.000-0600 Irontec: I don't see CoreShowLocks.txt, is it on another bug report? By: Irontec (irontec) 2011-02-18 04:35:33.000-0600 I've just uploaded. Sorry. By: Alec Davis (alecdavis) 2011-02-18 04:50:42.000-0600 Just tried with DirectMedia : Yes DTMFmode : info no core locks. -- Called GXP8924 -- SIP/GXP8924-00000003 is ringing -- SIP/GXP8924-00000003 answered SIP/GXP8923-00000002 -- Remotely bridging SIP/GXP8923-00000002 and SIP/GXP8924-00000003 -- Started music on hold, class 'default', on SIP/GXP8924-00000003 -- Stopped music on hold on SIP/GXP8922-00000000 -- Stopped music on hold on SIP/GXP8924-00000003 -- Remotely bridging SIP/GXP8922-00000000 and SIP/GXP8924-00000003 == Spawn extension (trusted, 8923, 1) exited non-zero on 'SIP/GXP8923-00000002<ZOMBIE>' astrid*CLI> core show locks ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === ======================================================================= == Spawn extension (trusted, 8924, 1) exited non-zero on 'SIP/GXP8922-00000000' By: Alec Davis (alecdavis) 2011-02-18 04:54:13.000-0600 And removing the patch, just to check my sanity. DirectMedia : Yes DTMFmode : info Took 2 tries, but then dead lock, same culprits as normal. See CoreShowLocks-test.txt By: Alec Davis (alecdavis) 2011-02-18 04:59:25.000-0600 Irontec: I just got caught out with a deadlocked asterisk, don't forget to kill -9 the process id of asterisk :( By: Irontec (irontec) 2011-02-18 05:00:28.000-0600 You're right but with this config. the xfered call hangs. Asterisk sends 2 consecutive INVITEs (different CSeq): U 10.10.0.165:5060 -> 10.10.0.142:5064 INVITE sip:300@10.10.0.142:5064 SIP/2.0. Via: SIP/2.0/UDP 10.10.0.165:5060;branch=z9hG4bK748b128b. Max-Forwards: 70. From: "200" <sip:200@10.10.0.165>;tag=as08a90e13. To: <sip:300@10.10.0.142:5064>;tag=2be4485acdb9d4a5i4. Contact: <sip:200@10.10.0.165:5060>. Call-ID: 19bb6dd729b44c10706c9aa868aa3902@10.10.0.165:5060. CSeq: 105 INVITE. User-Agent: Asterisk PBX 1.8.2.3. Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH. Supported: replaces, timer. Remote-Party-ID: "200" <sip:200@10.10.0.165>;party=calling;privacy=off;screen=no. Content-Type: application/sdp. Content-Length: 205. . v=0. o=root 2032588221 2032588224 IN IP4 10.10.0.165. s=Asterisk PBX 1.8.2.3. c=IN IP4 10.10.0.165. t=0 0. m=audio 14354 RTP/AVP 8. a=rtpmap:8 PCMA/8000. a=silenceSupp:off - - - -. a=ptime:20. a=sendrecv. U 10.10.0.165:5060 -> 10.10.0.142:5064 INVITE sip:300@10.10.0.142:5064 SIP/2.0. Via: SIP/2.0/UDP 10.10.0.165:5060;branch=z9hG4bK1c16a556. Max-Forwards: 70. From: "200" <sip:200@10.10.0.165>;tag=as08a90e13. To: <sip:300@10.10.0.142:5064>;tag=2be4485acdb9d4a5i4. Contact: <sip:200@10.10.0.165:5060>. Call-ID: 19bb6dd729b44c10706c9aa868aa3902@10.10.0.165:5060. CSeq: 106 INVITE. User-Agent: Asterisk PBX 1.8.2.3. Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH. Supported: replaces, timer. Remote-Party-ID: "100" <sip:100@10.10.0.165>;party=calling;privacy=off;screen=no. Content-Type: application/sdp. Content-Length: 205. . v=0. o=root 2032588221 2032588225 IN IP4 10.10.0.167. s=Asterisk PBX 1.8.2.3. c=IN IP4 10.10.0.167. t=0 0. m=audio 16834 RTP/AVP 8. a=rtpmap:8 PCMA/8000. a=silenceSupp:off - - - -. a=ptime:20. a=sendrecv. And the peer can't handle this: U 10.10.0.142:5064 -> 10.10.0.165:5060 SIP/2.0 500 Internal Server Error. To: <sip:300@10.10.0.142:5064>;tag=2be4485acdb9d4a5i4. From: "200" <sip:200@10.10.0.165>;tag=as08a90e13. Call-ID: 19bb6dd729b44c10706c9aa868aa3902@10.10.0.165:5060. CSeq: 106 INVITE. Via: SIP/2.0/UDP 10.10.0.165:5060;branch=z9hG4bK1c16a556. Server: Cisco/SPA509G-7.4.3. Content-Length: 0. . So, there are no locks but call hangs up. (Using Cisco Spa5XX) Have you tried directmedia=update ?? (dtmfmode=info) That is the issue of 0018734 Thanks. By: Alec Davis (alecdavis) 2011-02-18 05:17:31.000-0600 I'd seen '500 Internal Server Error' earlier today, when using XLITE 4.0 But for testing tonight I've been using 3 Grandstreams - GXP2000s, which are working fine. Looks like either my locking idea is wrong, or there's another underlying issue. By: Alec Davis (alecdavis) 2011-02-20 19:27:58.000-0600 uploaded bug18837.diff2.txt previous bug18837.diff.txt caused BLF subscriptions on some phones to randomly dissappear. This unlocks and locks the pvt lock around the call to transmit_reinvite_with_sdp, which will attempt to lock the chan. The private was being locked in sip_set_rtp_peer and then: transmit_reinvite_with_sdp try_suggested_sip_codec pbx_builtin_getvar_helper tries to lock the chan. By: Alec Davis (alecdavis) 2011-02-20 19:31:00.000-0600 Irontec: Without any patches, I get '500 Internal Server Error' to a CISCO 7940 when in sip.conf sendrpid=yes You do have a difference, the 'Remote-Party-ID' Remote-Party-ID: "200" <sip:200@10.10.0.165>;party=calling;privacy=off;screen=no. Remote-Party-ID: "100" <sip:100@10.10.0.165>;party=calling;privacy=off;screen=no By: Alec Davis (alecdavis) 2011-02-20 19:36:09.000-0600 ASTERISK-17123 core_show_locks.txt loosly related, had similar problem, was setting 'pvt' in sip_set_rtp_peer before 'chan' by pbx_builtin_getvar_helper === Thread ID: 140518501463824 (pbx_thread started at [ 5035] pbx.c ast_pbx_start()) === ---> Lock #0 (chan_sip.c): MUTEX 27630 sip_set_rtp_peer p 0x836b090 (1) === ---> Waiting for Lock #1 (pbx.c): MUTEX 9467 pbx_builtin_getvar_helper chan 0x85614b0 (1) By: Irontec (irontec) 2011-02-21 02:33:31.000-0600 Hi, Yes I know RPID is different in two cases, but is Asterisk who sends two Invites (with different RPID) and seems that some phones can't handle those two consecutive Invites. We've aplied the patch and this issue continues but it seems that the main problem of deadlock is solved. (with directmedia=update and sipInfo) Now I'm doing more probes... Good job! By: Irontec (irontec) 2011-02-21 02:57:39.000-0600 Ok, deadlock seems solved. I've done lots of xfers without any lock (sendrpid=pai,directmedia=update and dtmfmode=Info) With directmedia=yes we still have problems with Spa5XX (500 Internal Server Error) when receives 2 (Re)Invites. Patch rocks!! ;P Thanks! By: Alec Davis (alecdavis) 2011-02-21 03:31:07.000-0600 Irontec: Now have a lookat ASTERISK-17378. If you have DAHDI/IAX trunks then transfer locally between SIP extenstions, SEGFAULT The patch there won't apply, but the lines like below can be manually added. - if (glue0->update_peer(c0, NULL, NULL, NULL, 0, 0)) { + if (glue0 != ast_rtp_instance_get_glue(c0->tech->type)) { + ast_log(LOG_WARNING, "Channel c0->'%s' technology changed, in bridge with c1->'%s'\n", c0->name, c1->name); + } else if (glue0->update_peer(c0, NULL, NULL, NULL, 0, 0)) { You get the idea. By: Irontec (irontec) 2011-02-21 03:37:42.000-0600 We only use SIP channels (gateways, providers...) but I understand that the patch (rtp_engine.c) will be like: --- main/rtp_engine.c (revision 306862) +++ main/rtp_engine.c (working copy) @@ -1202,10 +1202,15 @@ cs[1] = cs[2]; } - if (glue0->update_peer(c0, NULL, NULL, NULL, 0, 0)) { + if (ast_test_flag(c0, AST_FLAG_ZOMBIE)) { + ast_debug(1, "Channel '%s' Zombie cleardown from bridge\n", c0->name); + else if (glue0 != ast_rtp_instance_get_glue(c0->tech->type)) { + ast_log(LOG_WARNING, "Channel c0->'%s' technology changed, in bridge with c1->'%s'\n", c0->name, c1->name); + } else if (glue0->update_peer(c0, NULL, NULL, NULL, 0, 0)) { ast_log(LOG_WARNING, "Channel '%s' failed to break RTP bridge\n", c0->name); } - if (glue1->update_peer(c1, NULL, NULL, NULL, 0, 0)) { + + if (ast_test_flag(c1, AST_FLAG_ZOMBIE)) { + ast_debug(1, "Channel '%s' Zombie cleardown from bridge\n", c1->name); + } else if (glue1->update_peer(c1, NULL, NULL, NULL, 0, 0)) { ast_log(LOG_WARNING, "Channel '%s' failed to break RTP bridge\n", c1->name); } It's OK??? By: Alec Davis (alecdavis) 2011-02-23 14:37:48.000-0600 bug18837.diff3.txt included a reference count, making sure the pvt cannot disappear while pvt is unlocked. By: Matt Riddell (zx81) 2011-02-23 18:46:38.000-0600 Couple of the chunks failed on lastest svn in chan sip and rtp engine but manually tweaked - was pretty close. By: Alec Davis (alecdavis) 2011-02-24 02:23:45.000-0600 regression since r306216 in ASTERISK-17123 summary of coreshowlocks-trunk-r308680.txt ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: -1243620464 (do_monitor started at [24906] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 24390 handle_request_do &netlock 0xb6b84740 (1) === ---> Lock #1 (channel.c): MUTEX 6398 ast_do_masquerade channels 0x9b83b48 (1) === ---> Waiting for Lock #2 (channel.c): MUTEX 6401 ast_do_masquerade original <b>0xa342e80</b> (1) === --- ---> Locked Here: rtp_engine.c line 1287 (ast_rtp_instance_bridge) === ------------------------------------------------------------------- === === Thread ID: -1359201392 (pbx_thread started at [ 5170] pbx.c ast_pbx_start()) === ---> Lock #0 (chan_sip.c): MUTEX 28105 sip_set_rtp_peer chan 0xa2eb710 (1) === ---> Lock #1 (chan_sip.c): MUTEX 28106 sip_set_rtp_peer p <b>0x9eea7b8</b> (1) === ---> Waiting for Lock #2 (pbx.c): MUTEX 9601 pbx_builtin_getvar_helper chan <b>0xa342e80</b> (1) === --- ---> Locked Here: rtp_engine.c line 1287 (ast_rtp_instance_bridge) === ------------------------------------------------------------------- === === Thread ID: -1358955632 (pbx_thread started at [ 5170] pbx.c ast_pbx_start()) === ---> Lock #0 (rtp_engine.c): MUTEX 1287 ast_rtp_instance_bridge c0 <b>0xa342e80</b> (1) === ---> Lock #1 (rtp_engine.c): MUTEX 1288 ast_rtp_instance_bridge c1 0xa364b38 (1) === ---> Waiting for Lock #2 (chan_sip.c): MUTEX 28001 sip_get_rtp_peer p <b>0x9eea7b8</b> (1) === --- ---> Locked Here: chan_sip.c line 28106 (sip_set_rtp_peer) === ------------------------------------------------------------------- === ======================================================================= By: Alec Davis (alecdavis) 2011-02-24 02:44:14.000-0600 bug18837-trunk.diff.txt reverts patch from ASTERISK-17123 and uses unlock/lock pvt around transmit_reinvite_with_sdp with ref counting By: cmaj (cmaj) 2011-02-24 12:09:45.000-0600 With https://issues.asterisk.org/view.php?id=18491 was not aware of dialog_ref / dialog_unref - this is much better than ast_channel_lock. By: Alec Davis (alecdavis) 2011-02-24 20:21:28.000-0600 bug18837-trunk.diff2.txt from reviewboard https://reviewboard.asterisk.org/r/1126 safely lock 'p' then 'p->owner' with deadlock protection. By: cmaj (cmaj) 2011-02-25 00:34:05.000-0600 Those dialog ref/undef were nicer to look at than usleep(1), but the patch seems to hold up pretty good under 1.8.3-rc2 in a variety of transfer scenarios (couldn't reproduce yours exactly though, the last step "B initiates transfer of line1 to line2, phone uses REFER" in particular - not sure what buttons to press to do that.) By: Alec Davis (alecdavis) 2011-02-25 12:18:10.000-0600 bug18837-trunk.diff3.txt updated from reviewboard, typo in sip_set_udptl_peer() By: Digium Subversion (svnbot) 2011-02-25 12:52:55.000-0600 Repository: asterisk Revision: 308945 U branches/1.8/channels/chan_sip.c ------------------------------------------------------------------------ r308945 | alecdavis | 2011-02-25 12:52:54 -0600 (Fri, 25 Feb 2011) | 21 lines Fix Deadlock with attended transfer of SIP call Call path sip_set_rtp_peer (locks chan then pvt) transmit_reinvite_with_sdp try_suggested_sip_codec pbx_builtin_getvar_helper (locks p->owner) But by the time p->owner lock was attempted, seems as though chan and p->owner were different. So in sip_set_rtp_peer, lock pvt first then lock p->owner using deadlocking methods. (closes issue ASTERISK-17431) Reported by: alecdavis Patches: bug18837-trunk.diff3.txt uploaded by alecdavis (license 585) Tested by: alecdavis, Irontec, ZX81, cmaj Review: https://reviewboard.asterisk.org/r/1126 ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=308945 By: Digium Subversion (svnbot) 2011-02-25 12:58:11.000-0600 Repository: asterisk Revision: 308946 _U trunk/ U trunk/channels/chan_sip.c ------------------------------------------------------------------------ r308946 | alecdavis | 2011-02-25 12:58:11 -0600 (Fri, 25 Feb 2011) | 27 lines Merged revisions 308945 via svnmerge from https://origsvn.digium.com/svn/asterisk/branches/1.8 ........ r308945 | alecdavis | 2011-02-26 07:52:53 +1300 (Sat, 26 Feb 2011) | 21 lines Fix Deadlock with attended transfer of SIP call Call path sip_set_rtp_peer (locks chan then pvt) transmit_reinvite_with_sdp try_suggested_sip_codec pbx_builtin_getvar_helper (locks p->owner) But by the time p->owner lock was attempted, seems as though chan and p->owner were different. So in sip_set_rtp_peer, lock pvt first then lock p->owner using deadlocking methods. (closes issue ASTERISK-17431) Reported by: alecdavis Patches: bug18837-trunk.diff3.txt uploaded by alecdavis (license 585) Tested by: alecdavis, Irontec, ZX81, cmaj Review: [https://reviewboard.asterisk.org/r/1126/] ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=308946 |