Summary: ASTERISK-18082: Deadlock of SIP or segfault when doing REFERs
Reporter: James Van Vleet (jvanvleet)
Labels:
Date Opened: 2011-07-01 13:08:22
Date Closed: 2011-08-15 10:20:45
Priority: Critical
Regression?
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions: 1.8.4 1.8.5.0
Frequency of Occurrence:
Related Issues:
is a clone of ASTERISK-17352 [patch] Deadlock of SIP takes out server
is duplicated by ASTERISK-18136 r319652 causes deadlock with REFER
must be merged before resolving ASTERISK-18299 Merge: ASTERISK 18082
Environment:
Attachments:
( 0) asterisk-restart-2011-04-11-1214.txt
( 1) backtrace.1.8.5.rc1.txt
( 2) backtrace.1.8-r325545.txt
( 3) backtrace.txt
( 4) backtrace.txt
( 5) backtrace1.txt
( 6) backtrace2.txt
( 7) backtrace3.txt
( 8) backtrace4.txt
( 9) bug18749.patch
(10) core-show-locks.txt
(11) coreshowlocks1.8.5.0
(12) core-show-locks2.txt
(13) core-show-locks3.txt
(14) core-show-locks4.txt
(15) gdb-bt1.8.5.0.txt
(16) refer_fix_1.8_v1.diff
(17) refer_fix_1.8_v2.diff
Description: This happens with both Asterisk 1.8.0 and 1.8.2.3. The server will take calls for some amount of time (less than a day, usually within a few hours) and then all of a sudden stops responding to SIP. I can see with netstat -na that many of the receive queues (including 5060) are full of waiting packets.

I have compiled asterisk with DEBUG_THREADS and DONT_OPTIMIZE and was able to do both a core show locks and a backtrace for two of the crashes.  

The majority of our calls use SRTP, and the majority also use FastAGI, if that helps.

-James
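For reference, diagnostics like these are typically collected along the following lines on a 1.8-era source build; the binary path and core file location below are illustrative and will vary by installation:

# enable DEBUG_THREADS and DONT_OPTIMIZE under "Compiler Flags", then rebuild
make menuselect
make && make install

# on a deadlock: dump the lock state from the running process
asterisk -rx "core show locks" > core-show-locks.txt

# on a segfault: pull a full backtrace of every thread from the core file
gdb -batch -ex "thread apply all bt full" /usr/sbin/asterisk /tmp/core > backtrace.txt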
Comments:

By: James Van Vleet (jvanvleet) 2011-07-01 13:14:50.628-0500

My other ticket has been closed without anyone being able to test it.

I have run multiple concurrent unattended transfers on both 1.8.5-rc1 and SVN-branch-1.8-r325545. r325545 seems to always segfault, and 1.8.5-rc1 locks up, deadlocks, and segfaults.

By: James Van Vleet (jvanvleet) 2011-07-01 13:22:56.841-0500

Re-uploaded backtraces, as the clone made all of the files look the same (date-wise).

By: David Vossel (dvossel) 2011-07-01 17:57:36.179-0500

Give us new back traces and core show locks for 1.8.5-rc1.

By: James Van Vleet (jvanvleet) 2011-07-05 10:14:33.011-0500

I did. backtrace.1.8.5.rc1.txt is that very thing. No core show locks was possible, as Asterisk segfaulted instead of locking up.

I can try for more traces this afternoon but is this one bad in some way?

By: David Vossel (dvossel) 2011-07-06 09:42:36.261-0500

Sorry, I missed the new backtrace when I was looking over the uploaded files. It is not obvious to me how that crash is occurring; please provide another backtrace of the same crash. This will at least verify that it is consistently crashing in the same place.

By: James Van Vleet (jvanvleet) 2011-07-11 14:21:53.580-0500

An update: I am creating a somewhat more comprehensive automated test of the issues I am having. I hope to have it sorted very soon.

By: James Van Vleet (jvanvleet) 2011-07-14 17:34:03.718-0500

Backtrace and core show locks of the REFER issue on the 1.8.5.0 release.

By: James Van Vleet (jvanvleet) 2011-07-14 17:41:02.226-0500

I have attached a core show locks and a gdb backtrace for the 1.8.5.0 release. Sorry it took so long, but the good news is that I can now synthetically reproduce the issue (instead of crashing my production system) whenever and as fast as you need. In 1.8.0 it seemed to always deadlock; in 1.8.5.0 it seems to segfault about 2/3 of the time and deadlock 1/3 of the time. Attached is a deadlock. Let me know if getting a backtrace of a segfault would also be handy - it is easy now.

By: James Van Vleet (jvanvleet) 2011-07-14 17:43:58.250-0500

I should add that all I do to reproduce is have one SIP endpoint call another and have one endpoint REFER to an extension with a recorded announcement. In the synthetic environment there is no SRTP nor TLS, but the extensions dialed are AGI controlled. The system always fails after about 150-250 of these take place, with about 20 concurrent. No other types of calls are going on at the time.
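A minimal dialplan sketch of the kind of announcement extension described above; the extension number, FastAGI URL, and sound file are placeholders, not taken from the reporter's configuration:

; illustrative extensions.conf fragment: the extension the REFER points at
exten => 7001,1,Answer()
exten => 7001,n,AGI(agi://127.0.0.1:4573/announce)   ; FastAGI control, placeholder URL
exten => 7001,n,Playback(demo-congrats)              ; recorded announcement, placeholder file
exten => 7001,n,Hangup()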

By: David Vossel (dvossel) 2011-07-19 10:54:52.308-0500

That's great news that you have an environment set up that can easily reproduce this.

The deadlock you are witnessing now is different from the previous ones. This is a scenario where, right as one deadlock is fixed, the problem just moves to another location. This is a result of some poor architecture used in processing SIP REFER requests.

There need to be some serious changes made in this area, but those types of changes are difficult to verify without a good testing environment. It sounds like you have one set up, so I'll do my part in making the big changes.

By: David Vossel (dvossel) 2011-07-19 15:11:52.716-0500

Please apply and test refer_fix_1.8_v1.diff

By: James Van Vleet (jvanvleet) 2011-07-19 15:25:01.178-0500

Thanks!  I am on it!

By: James Van Vleet (jvanvleet) 2011-07-19 16:49:11.542-0500

I have done some initial testing. I can't say for certain whether the deadlocks are gone (I will be running a LOT more calls through to determine that), but I am seeing what appears to be an issue. After roughly 200-ish calls I start receiving a fair number of:

[Jul 19 21:21:16] ERROR[26000] astobj2.c: user_data is NULL

errors on the console. I enabled debug 1000 and pulled one up in the log:

[Jul 19 21:21:16] DEBUG[26000] chan_sip.c: = Looking for  Call ID: 79-10281@10.240.62.22 (Checking From) --From tag 10281SIPpTag0079 --To-tag as06a9cd01
[Jul 19 21:21:16] DEBUG[26000] chan_sip.c: **** Received REFER (9) - Command in SIP REFER
[Jul 19 21:21:16] ERROR[26000] astobj2.c: user_data is NULL
[Jul 19 21:21:16] DEBUG[26000] chan_sip.c: Trying to put 'SIP/2.0 603' onto UDP socket destined for 174.129.33.138:5060
[Jul 19 21:21:16] DEBUG[26000] chan_sip.c: SIP message could not be handled, bad request: 79-10281@10.240.62.22

I don't yet know if this log snippet represents all of the "user_data is NULL" errors - it is one example.

At first I assumed it was my load test, but what is interesting is that if I stop the load test, allow all calls to expire, hang up any SIP channels lying around, and get the system back to a "zero calls" state, restarting the load test will cause these errors to come up immediately on the console. If I restart Asterisk, the system runs clean for the first couple hundred calls before showing signs of this again.

I repeated this whole process 3 times to be certain.

I will be doing some more testing tonight and tomorrow but that is the quick update.

By: David Vossel (dvossel) 2011-07-19 16:59:49.930-0500

Awesome, thanks for the update.  I'll look into the errors.

By: David Vossel (dvossel) 2011-07-19 17:14:20.020-0500

That debug message could be the result of a number of things, but I identified one area in the patch that would definitely cause it. Please try refer_fix_1.8_v2.diff and see if you continue to see the errors.

By: James Van Vleet (jvanvleet) 2011-07-21 12:49:14.810-0500

Quick update.  I started testing this morning with the patch applied to SVN 329103.  The great news is that it appears you nailed the "user_data is NULL" issue and the system is more stable compared to when we started.  

I did get a deadlock on one test run after about 1000 calls but I was not running with the debug compile flags set.  I am now re-running with them compiled in to see if I can get it to happen again.

More updates later today.

By: James Van Vleet (jvanvleet) 2011-07-21 20:56:09.892-0500

Still no "user_data is NULL" issues, so that is licked.  I also have not had any deadlocks beyond the one but I intend to test more tomorrow.

There does appear to be an issue where the general response time of the SIP module slows down after about 100-ish REFERs go through the system (although I have not tried straight calls yet; hopefully tomorrow). The general pattern is that after an Asterisk restart the system runs very well for the first 100-ish calls, then I start to see my test harness do retries, and then I start getting timeouts. Stopping the test and starting it again produces retries right off the bat, but restarting Asterisk appears to keep things clean again for a while.

I know this is much more difficult to track down, and I hope to get some more details tomorrow - but I mention it in case there is something you want me to enable or look for; let me know. I disabled optimization and enabled debug threads, so the CPS is really down anyhow and it takes a while to move through the testing - that might even be part of the cause. I hope to do more tomorrow.

By: David Vossel (dvossel) 2011-07-25 10:00:12.508-0500

The biggest hit to performance will be enabling debug threads. That option is extremely expensive in terms of both memory usage and CPU. If you are using a multicore CPU, debug threads makes it behave about on par with, if not a bit worse than, a single-core processor. If you are trying to test performance, you should disable that option.

Also, if you are interested we would like to integrate your REFER test into our automated testsuite.  Is this test tool you are using something you can provide us with?

By: James Van Vleet (jvanvleet) 2011-07-26 13:12:57.842-0500

I would be happy to supply the test cases. I am actually using a version of ims_bench that I modified to support pcap_audio (I will be giving those changes back to the project in the next week or so).

As a general update, the slowdown issues I mentioned are not related to the REFER issue, so at this point I don't have anything bad to say about the REFER fix. I have another day of testing before I should have the final stamp of approval.

By: David Vossel (dvossel) 2011-07-27 16:51:07.556-0500

It's good to hear that the testing results for this issue appear positive right now.

I'm interested in the slowdown issue you are experiencing. If you can gather enough information for us to investigate, please open an issue for it.

By: James Van Vleet (jvanvleet) 2011-08-03 12:54:28.936-0500

Well, I tried and got shot down - even though it can be reproduced and certainly is a bug:

https://issues.asterisk.org/jira/browse/ASTERISK-18223

Thoughts?