Summary: | ASTERISK-21409: [patch] - Race condition with IAX2 transfer, 2 releases happen on same call legs. locks up with many threads blocked by iax2_destroy_helper | ||
Reporter: | Alec Davis (alecdavis) | Labels: | |
Date Opened: | 2013-04-11 00:19:50 | Date Closed: | 2013-06-10 02:37:03 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_iax2 |
Versions: | 1.8.15.0 11.3.0 | Frequency of Occurrence | Occasional |
Related Issues: | |||
Environment: | Debian Squeeze | Attachments: | ( 0) coreshowlocks-31may.txt ( 1) full.may08.C-0000034cd.txt ( 2) gdb-31may.txt ( 3) iax2_transfer.diff.txt ( 4) iax2_transfer.diff2.txt ( 5) iax-coreshowlocks-may08-auckland.txt ( 6) iax-lock-asterisk-11.txt ( 7) iax-lock-asterisk-1-8-15.txt ( 8) var_log_messages.txt |
Description: | Intermittently all IAX calls stop over the trunk.
I think this is reproduced when a call is transferred back to the originating site, we have "transfer=on" Also happened with asterisk SVN-branch-11-r382514. Attached are the 1.8.15.0 lockup, and the 11-branch lockup. | ||
Comments: | By: Alec Davis (alecdavis) 2013-04-11 00:25:23.598-0500 In the affected versions for 11 I picked 11.3. Both iax-lock-asterisk files are the same server. iax-lock-asterisk-1.8.15.txt was 7 March. iax-lock-asterisk-11.txt was 21st March. By: Alec Davis (alecdavis) 2013-04-11 01:13:31.100-0500 edit: Attached var_log_messages.txt By: Rusty Newton (rnewton) 2013-04-12 09:50:56.250-0500 Alec, be sure to get DEBUG,VERBOSE messages in your log for the next time this occurs. Can you also remove the debug from your "11/Apr/13 1:13 AM" comment and attach in separate files? By: Rusty Newton (rnewton) 2013-04-12 09:53:51.740-0500 Of course if you can also nail down a way to reproduce this then that is always helpful! By: Alec Davis (alecdavis) 2013-05-07 20:46:33.832-0500 another lockup today. luckily was running debug today:) Note: at the end of iax-coreshowlocks-may08-auckland the idle threads, normally there are 10. edit 3rd June 2013 added below: The releasing of channels is normally only done by one side, except when we lockup. {code} [Apr 23 10:47:56] VERBOSE[20815][C-00000285] chan_iax2.c: -- Releasing IAX2/auckland-17178 and IAX2/auckland-4151 [Apr 26 14:17:29] VERBOSE[20814] chan_iax2.c: -- Releasing IAX2/auckland-1574 and IAX2/auckland-16633 [Apr 26 14:20:08] VERBOSE[20810] chan_iax2.c: -- Releasing IAX2/auckland-1495 and IAX2/auckland-17776 [May 8 09:41:06] VERBOSE[19832][C-00000339] chan_iax2.c: -- Releasing IAX2/auckland-17778 and IAX2/auckland-15665 [May 8 10:03:14] VERBOSE[19829][C-0000033e] chan_iax2.c: -- Releasing IAX2/auckland-17063 and IAX2/auckland-5019 [May 8 10:14:57] VERBOSE[19835] chan_iax2.c: -- Releasing IAX2/auckland-16361 and IAX2/auckland-19586 [May 8 10:27:47] VERBOSE[19832] chan_iax2.c: -- Releasing IAX2/auckland-12923 and IAX2/auckland-16803 <<< these legs >>>> [May 8 10:27:47] VERBOSE[19834][C-0000034d] chan_iax2.c: -- Releasing IAX2/auckland-16803 and IAX2/auckland-12923 <<< and these same legs being released by different threads >>> {code} By: Alec Davis (alecdavis) 2013-05-07 21:13:50.818-0500 C-000034cd.txt debug catching both calls. C-000034c is the first call originate in Wellington, sent to Auckland receptionist over iax. C-000034d is the Auckland receptionist transferring the call back to a Wellington extension 8554. By: Matt Jordan (mjordan) 2013-05-08 08:23:56.219-0500 That's rather strange - it appears to be blocking on the removal of the scheduled callback for the {{lagid}}: {noformat} AST_SCHED_DEL_SPINLOCK(ast_sched_thread_get_context(sched), pvt->pingid, &iaxsl[pvt->callno]); pvt->pingid = DONT_RESCHEDULE; AST_SCHED_DEL_SPINLOCK(ast_sched_thread_get_context(sched), pvt->lagid, &iaxsl[pvt->callno]); pvt->lagid = DONT_RESCHEDULE; {noformat} And in fact, we can see that some thread has stalled out while holding this lock at this location: {noformat} === === Thread ID: 0xaf4c3b70 (iax2_process_thread started at [12405] chan_iax2.c start_network_thread()) === ---> Lock #0 (chan_iax2.c): MUTEX 1920 iax2_destroy_helper &iaxsl[pvt->callno] 0xb698cbe0 (1) /usr/sbin/asterisk(ast_bt_get_addresses+0x19) [0x8139751] /usr/sbin/asterisk(__ast_pthread_mutex_lock+0xac) [0x81329c1] /usr/lib/asterisk/modules/chan_iax2.so(+0x13860) [0xb683a860] /usr/sbin/asterisk() [0x81a922f] /usr/sbin/asterisk() [0x81ba1e3] /lib/i686/cmov/libpthread.so.0(+0x5955) [0xb72f9955] /lib/i686/cmov/libc.so.6(clone+0x5e) [0xb77391de] === ------------------------------------------------------------------- {noformat} We shouldn't be getting stuck on this - at most, we attempt to delete the scheduled ID 10 times: {noformat} int _sched_res = -1; \ while (id > -1 && (_sched_res = ast_sched_del(sched, id)) && ++_count < 10) { \ ast_mutex_unlock(lock); \ usleep(1); \ ast_mutex_lock(lock); \ } \ {noformat} So, the question is: what is this thread waiting on? A gdb backtrace may answer that. By: Alec Davis (alecdavis) 2013-05-31 00:22:06.491-0500 Another lockup today. 2 threads trying to release the same call, notice the swapped around fr->callno and bridgecallno. [May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457 [May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262 {code} [May 31 14:44:01] DEBUG[12985][C-00000536] channel.c: Returning from native bridge, channels: IAX2/auckland-13262, IAX2/auckland-20457 [May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Channel 'IAX2/auckland-13262' ready to transfer [May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Channel 'IAX2/auckland-20457' ready to transfer [May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262 [May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457 [May 31 14:44:01] DEBUG[30824][C-00000536] sched.c: Attempted to delete nonexistent schedule entry 209951! [May 31 14:44:01] ERROR[30824][C-00000536] lock.c: chan_iax2.c line 1918 (iax2_destroy_helper): mutex '&iaxsl[pvt->callno]' freed more times than we've locked! [May 31 14:44:01] ERROR[30824][C-00000536] lock.c: chan_iax2.c line 1918 (iax2_destroy_helper): Error releasing mutex: Operation not permitted [May 31 14:44:01] DEBUG[30818][C-00000536] chan_iax2.c: Ooh, voice format changed to 'alaw' {code} re: chan_iax.c line ~11558, stop_stuff(iaxs[fr->callno]->bridgecallno); Shouldn't it have a lock around it, as iax2_destroy_helper() expects the pvt to be locked. fr-callno is already locked, but the bridgedcall isn't. But this will lead to a deadlock! By: Alec Davis (alecdavis) 2013-05-31 00:27:34.532-0500 gdb-31may.txt attached By: Alec Davis (alecdavis) 2013-05-31 00:31:15.043-0500 coreshowlocks-31may.txt attached. also output from 'iax2 show threads.' where have all the threads gone?? {code} auck1*CLI> iax2 show threads IAX2 Thread Information Idle Threads: Thread 8: state=0, update=1, actions=670328, func='' Active Threads: Thread P4: state=1, update=2989, actions=664243, func='socket_process' Dynamic Threads: 2 of 10 threads accounted for with 0 dynamic threads{code} By: Alec Davis (alecdavis) 2013-05-31 21:16:54.524-0500 Since our previous lockup on 8 May. The Releasing of the IAX channels is normally only done once. {code} [May 8 12:11:25] VERBOSE[17711][C-00000017] chan_iax2.c: -- Releasing IAX2/auckland-18215 and IAX2/auckland-1116 [May 8 12:36:15] VERBOSE[17716][C-0000001c] chan_iax2.c: -- Releasing IAX2/auckland-20953 and IAX2/auckland-5806 [May 8 12:44:46] VERBOSE[17708] chan_iax2.c: -- Releasing IAX2/auckland-2666 and IAX2/auckland-22666 [May 8 13:59:42] VERBOSE[17714][C-0000002f] chan_iax2.c: -- Releasing IAX2/auckland-18982 and IAX2/auckland-4754 [May 8 14:02:00] VERBOSE[17709] chan_iax2.c: -- Releasing IAX2/auckland-1687 and IAX2/auckland-17678 [May 8 14:08:14] VERBOSE[17712] chan_iax2.c: -- Releasing IAX2/auckland-3232 and IAX2/auckland-21935 [May 8 14:25:05] VERBOSE[17715] chan_iax2.c: -- Releasing IAX2/auckland-2375 and IAX2/auckland-17247 [May 8 14:42:11] VERBOSE[17711] chan_iax2.c: -- Releasing IAX2/auckland-2421 and IAX2/auckland-17997 [May 8 14:46:32] VERBOSE[17716] chan_iax2.c: -- Releasing IAX2/auckland-2231 and IAX2/auckland-16676 [May 16 10:47:28] VERBOSE[24320][C-000002cd] chan_iax2.c: -- Releasing IAX2/auckland-18387 and IAX2/auckland-10007 [May 16 13:34:18] VERBOSE[24317][C-000002f6] chan_iax2.c: -- Releasing IAX2/auckland-23355 and IAX2/auckland-10086 [May 16 13:47:45] VERBOSE[24320][C-000002fc] chan_iax2.c: -- Releasing IAX2/auckland-19838 and IAX2/auckland-13305 [May 23 11:09:22] VERBOSE[30826] chan_iax2.c: -- Releasing IAX2/auckland-5967 and IAX2/auckland-17597 [May 23 12:14:39] VERBOSE[30823][C-0000017b] chan_iax2.c: -- Releasing IAX2/auckland-17653 and IAX2/auckland-13270 [May 23 12:15:08] VERBOSE[30823] chan_iax2.c: -- Releasing IAX2/auckland-5609 and IAX2/auckland-17390 [May 23 12:41:15] VERBOSE[30817] chan_iax2.c: -- Releasing IAX2/auckland-3055 and IAX2/auckland-18415 [May 23 12:57:37] VERBOSE[30821][C-00000186] chan_iax2.c: -- Releasing IAX2/auckland-21292 and IAX2/auckland-1604 [May 23 13:25:55] VERBOSE[30822] chan_iax2.c: -- Releasing IAX2/auckland-10039 and IAX2/auckland-19104 [May 23 13:28:32] VERBOSE[30824] chan_iax2.c: -- Releasing IAX2/auckland-1960 and IAX2/auckland-19964 [May 31 13:35:11] VERBOSE[30817][C-0000051e] chan_iax2.c: -- Releasing IAX2/auckland-16855 and IAX2/auckland-9197 [May 31 14:10:53] VERBOSE[30825] chan_iax2.c: -- Releasing IAX2/auckland-8691 and IAX2/auckland-20680 {color:red} *strong*[May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Releasing IAX2/auckland-20457 and IAX2/auckland-13262 <<< these legs >>> *strong*[May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Releasing IAX2/auckland-13262 and IAX2/auckland-20457 <<< both released by 2 threads >>> {color} {code} By: Alec Davis (alecdavis) 2013-06-04 04:02:22.844-0500 How the sequence below happens [May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Channel 'IAX2/auckbdt-13262' ready to transfer [May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Channel 'IAX2/auckbdt-20457' ready to transfer [May 31 14:44:01] VERBOSE[30824][C-00000536] chan_iax2.c: -- Releasing IAX2/auckbdt-20457 and IAX2/auckbdt-13262 [May 31 14:44:01] VERBOSE[30820] chan_iax2.c: -- Releasing IAX2/auckbdt-13262 and IAX2/auckbdt-20457 this race condition happens in chan_iax.c line ~11529 {noformat} case IAX_COMMAND_TXREADY: if ((iaxs[fr->callno]->transferring == TRANSFER_BEGIN) || (iaxs[fr->callno]->transferring == TRANSFER_MBEGIN)) { if (iaxs[fr->callno]->transferring == TRANSFER_MBEGIN) iaxs[fr->callno]->transferring = TRANSFER_MREADY; else iaxs[fr->callno]->transferring = TRANSFER_READY; ast_verb(3, "Channel '%s' ready to transfer\n", iaxs[fr->callno]->owner ? ast_channel_name(iaxs[fr->callno]->owner) : "<Unknown>"); /* If the other thread involved in the transfer kicks in now we'll run the following code twice */ if (iaxs[fr->callno]->bridgecallno) { if ((iaxs[iaxs[fr->callno]->bridgecallno]->transferring == TRANSFER_READY) || (iaxs[iaxs[fr->callno]->bridgecallno]->transferring == TRANSFER_MREADY)) { /* They're both ready, now release them. */ {noformat} The example below is two IAX_COMMAND_TXREADYs are processed slightly after one another (enough time for the routine to exit), one on each leg of a call to be transferred. The code current expects the 1st IAX_COMMAND_TXREADY to arrive, setting 'transferring' to TRANSFER_READY/MREADY and that bridged->transferring is not READY. Then (on the bridged call leg) the 2nd IAX_COMMAND_TXREADY arrives, setting 'transferring' to TRANSFER_READY/MREADY, and bridged->transferring is already TRANSFER_READY/MREADY so then will release them. But when the 1st IAX_COMMAND_TXREADY thread is interrupted before it's got to '->transferring = TRANSFER_MEDIA' by the 2nd IAX_COMMAND_TXREADY, things go wrong, the Releasing code gets run twice. By: Alec Davis (alecdavis) 2013-06-04 04:48:06.915-0500 attached iax2_transfer.diff.txt lock the bridged callno before setting '->transferring=TRANSFER_READY/MRERADY', with appropriate deadlock avoidance. By: Alec Davis (alecdavis) 2013-06-04 06:55:49.652-0500 attached iax2_transfer_diff2.txt refactored IAX_COMMAND_TXREADY decision logic. .1 get a lock on both legs if bridgecallno exists. .2 these tests are now needed as we the pvt is unlocked for Deadlock avoidance .2a check if transferring == MBEGIN then MREADY .2b check if transferring == BEGIN then READY .2c else if transferrring!= above tests then exit .3a remove redundant checks of thisside->MREADY||READY, we can use the bridgedcallno-transferring. .3b when checking bridgedcallno->transferring positively identify MREADY or READY testing. many test call using ORIGINATE(), never wen't wrong CLI output: With locking, never would go wrong :) -- IAX2/auckland-20910 is proceeding passing it to IAX2/auckland-17388 -- IAX2/auckland-20910 answered IAX2/auckland-17388 -- Channel 'IAX2/auckland-20910' ready to transfer -- Channel 'IAX2/auckland-17388' ready to transfer -- Releasing IAX2/auckland-17388 and IAX2/auckland-20910 === Examples without locking: === Without locking bridgecallno I could get the order wrong! -- IAX2/auckland-19071 is proceeding passing it to IAX2/auckland-17663 -- IAX2/auckland-19071 answered IAX2/auckland-17663 -- Channel 'IAX2/auckland-19071' ready to transfer -- Releasing IAX2/auckland-19071 and IAX2/auckland-17663 -- Channel 'IAX2/auckland-17663' ready to transfer Another wrong sequence without locking, server is under load - disk back up. These 2 interleave each thread. -- Channel 'IAX2/auckbdt-19065' ready to transfer -- Channel 'IAX2/auckbdt-20047' ready to transfer -- Releasing IAX2/auckbdt-19065 and IAX2/auckbdt-20047 -- Releasing IAX2/auckbdt-20047 and IAX2/auckbdt-19065 -- Channel 'IAX2/auckbdt-19065' finished transfer (this is just before we exit the case statement) -- Channel 'IAX2/auckbdt-20047' finished transfer |