Summary: | ASTERISK-17147: [patch] Asterisk produces many zombie processes while under load. | ||
Reporter: | Ernie Dunbar (ernied) | Labels: | |
Date Opened: | 2010-12-21 20:26:48.000-0600 | Date Closed: | 2011-01-19 14:33:32.000-0600 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 20101221__issue18515.diff.txt | |
Description: | Once under a non-trivial load (trivial being the 6 phones in our office that we've been testing with for the past week, non-trivial being around a hundred sip clients and about 7 or 8 simultaneous DAHDI calls), Asterisk starts producing a high volume of zombie processes - say a few hundred a minute. If left alone, this crashes the operating system within a few hours, but not before calls start getting very flaky.

****** ADDITIONAL INFORMATION ******

We're running on the following software:
Debian Linux 5.0 "Lenny"
Asterisk 1.6.2.15
DAHDI Version: 2.4.0 Echo Canceller: MG2
Libpri 1.4.11.4

And on the following hardware:
2.4 Ghz Intel Pentium 4
S875WP1-E Motherboard
2.0 GB DDR400 RAM
Digium Wildcard TE410P/TE405P (1st Gen, with Echo Cancellation) | ||
Comments: | By: Tilghman Lesher (tilghman) 2010-12-21 21:44:42.000-0600
Patch uploaded against 1.6.2. Please test and verify that it fixes the problem for you.

By: Ernie Dunbar (ernied) 2010-12-30 11:49:22.000-0600
This patch does not fix the problem. This issue does not exist when load testing with SIPp either. Could this be an issue in the DAHDI driver?

By: Tilghman Lesher (tilghman) 2010-12-30 13:57:15.000-0600
I am at a loss as to the cause of the zombies, then. Could you show a listing of the zombie processes, when the problem is occurring?

By: Ernie Dunbar (ernied) 2010-12-30 14:09:23.000-0600
Also, this is our procedure for testing under load:
1. Live Asterisk server has a proper T1 connection from our telephone provider connected to an identical Digium Wildcard TE405P.
2. New Asterisk server has a pseudo T1 connection (or perhaps a real one, depending on your perspective I suppose) built from one of the free PRI ports on the Live server.
3. The Live server forwards incoming calls from its T1 connection to the new server if a SIP client is not connected to the Live server. ie, from extensions.conf on the Live server:
exten => 778XXXXXXX,1,Dial(SIP/778XXXXXXX,20)
exten => 778XXXXXXX,n,Dial(DAHDI/g4/778XXXXXXX,20)
4. DNS gets changed to change the hostname of the Live server to use the IP of the New server.
5. SIP clients start logging in when their registration expires.
6. Things go smoothly until around 6 or 7 consecutive DAHDI calls, when Asterisk starts producing zombie processes at a high rate (several hundred per minute).
Typically by step 6, we switch our DNS back and things start going back to normal for our customers.

By: Ernie Dunbar (ernied) 2010-12-30 14:52:10.000-0600
Testing right this moment. Actually, currently it seems that zombie processes cap out at around 256 and then die properly. Here's a reasonable snapshot of the system as the problem happened in one particular instance:

Processes:
ernied@voip2:~$ ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:08 init [2]
2 ? S< 0:00 [kthreadd]
3 ? S< 0:00 [migration/0]
4 ? S< 13:53 [ksoftirqd/0]
5 ? S< 0:00 [watchdog/0]
6 ? S< 24:45 [events/0]
7 ? S< 0:00 [khelper]
39 ? S< 0:02 [kblockd/0]
41 ? S< 0:00 [kacpid]
42 ? S< 0:00 [kacpi_notify]
115 ? S< 0:00 [kseriod]
151 ? S 0:00 [pdflush]
152 ? S 0:11 [pdflush]
153 ? S< 0:00 [kswapd0]
154 ? S< 0:00 [aio/0]
410 ? Ss 0:00 sshd: ernied [priv]
419 ? S 0:01 sshd: ernied@pts/1
421 pts/1 Ss 0:00 -bash
588 ? S< 0:00 [ksuspend_usbd]
589 ? S< 0:00 [khubd]
623 ? S< 0:00 [ata/0]
624 ? S< 0:00 [ata_aux]
688 ? S< 0:00 [scsi_eh_0]
691 ? S< 0:00 [scsi_eh_1]
833 ? S 0:03 /usr/sbin/apache2 -k start
882 ? S< 0:08 [kjournald]
958 ? S<s 0:00 udevd --daemon
1286 ? S< 0:00 [edac-poller]
1398 ? S< 0:00 [kpsmoused]
1651 ? S< 0:00 [kjournald]
1652 ? S< 0:00 [kjournald]
1653 ? S< 0:06 [kjournald]
1654 ? S< 0:34 [kjournald]
1948 ? Sl 0:49 /usr/sbin/rsyslogd -c3
1959 ? Ss 0:00 /usr/sbin/acpid
2128 ? Ss 0:15 /usr/sbin/sshd
2168 ? S 0:00 /bin/sh /usr/bin/mysqld_safe
2205 ? Sl 17:42 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/my
2206 ? S 0:00 logger -p daemon.err -t mysqld_safe -i -t mysqld
2535 ? Ss 0:00 /usr/sbin/exim4 -bd -q30m
2563 ? Ss 0:00 /usr/sbin/atd
2583 ? Ss 0:02 /usr/sbin/cron
2597 ? Ss 0:21 /usr/sbin/apache2 -k start
2658 ? Ss 0:58 /usr/sbin/munin-node
2698 tty1 Ss+ 0:00 /sbin/getty 38400 tty1
2702 tty2 Ss+ 0:00 /sbin/getty 38400 tty2
2704 tty3 Ss+ 0:00 /sbin/getty 38400 tty3
2706 tty4 Ss+ 0:00 /sbin/getty 38400 tty4
2708 tty5 Ss+ 0:00 /sbin/getty 38400 tty5
2709 tty6 Ss+ 0:00 /sbin/getty 38400 tty6
3659 ? Ss 0:00 sshd: ernied [priv]
3666 ? S 0:00 sshd: ernied@pts/3
3667 pts/3 Ss 0:00 -bash
4155 ? S 0:01 /usr/sbin/apache2 -k start
7387 ? S 0:01 /usr/sbin/apache2 -k start
7413 ? S 0:00 /usr/sbin/apache2 -k start
8632 ? S 0:00 /usr/sbin/apache2 -k start
11602 ? S 0:00 /usr/bin/php -q /var/lib/asterisk/agi-bin/a2billing.p
11605 ? Z 0:00 [asterisk] <defunct>
11606 ? Z 0:00 [asterisk] <defunct>
11607 ? Z 0:00 [asterisk] <defunct>
11612 ? Z 0:00 [asterisk] <defunct>
11613 ? Z 0:00 [asterisk] <defunct>
11614 ? Z 0:00 [asterisk] <defunct>
11615 ? Z 0:00 [asterisk] <defunct>
11616 ? Z 0:00 [asterisk] <defunct>
11621 ? Z 0:00 [asterisk] <defunct>
11622 ? Z 0:00 [asterisk] <defunct>
11623 ? Z 0:00 [asterisk] <defunct>
11624 ? Z 0:00 [asterisk] <defunct>
11625 ? Z 0:00 [asterisk] <defunct>
11630 ? Z 0:00 [asterisk] <defunct>
11631 ? Z 0:00 [asterisk] <defunct>
11632 ? Z 0:00 [asterisk] <defunct>
11633 ? Z 0:00 [asterisk] <defunct>
11634 ? Z 0:00 [asterisk] <defunct>
11639 ? Z 0:00 [asterisk] <defunct>
11640 ? Z 0:00 [asterisk] <defunct>
11641 ? Z 0:00 [asterisk] <defunct>
11642 ? Z 0:00 [asterisk] <defunct>
11643 ? Z 0:00 [asterisk] <defunct>
11648 ? Z 0:00 [asterisk] <defunct>
11649 ? Z 0:00 [asterisk] <defunct>
11650 ? Z 0:00 [asterisk] <defunct>
11651 ? Z 0:00 [asterisk] <defunct>
11652 ? Z 0:00 [asterisk] <defunct>
11657 ? Z 0:00 [asterisk] <defunct>
11658 ? Z 0:00 [asterisk] <defunct>
11659 ? Z 0:00 [asterisk] <defunct>
11660 ? Z 0:00 [asterisk] <defunct>
11661 ? Z 0:00 [asterisk] <defunct>
11666 ? Z 0:00 [asterisk] <defunct>
11667 ? Z 0:00 [asterisk] <defunct>
11668 ? Z 0:00 [asterisk] <defunct>
11669 ? Z 0:00 [asterisk] <defunct>
11670 ? Z 0:00 [asterisk] <defunct>
11675 ? Z 0:00 [asterisk] <defunct>
11676 ? Z 0:00 [asterisk] <defunct>
11677 ? Z 0:00 [asterisk] <defunct>
11678 ? Z 0:00 [asterisk] <defunct>
11679 ? Z 0:00 [asterisk] <defunct>
11686 ? Z 0:00 [asterisk] <defunct>
11687 ? Z 0:00 [asterisk] <defunct>
11688 ? Z 0:00 [asterisk] <defunct>
11689 ? Z 0:00 [asterisk] <defunct>
11690 ? Z 0:00 [asterisk] <defunct>
11691 ? Z 0:00 [asterisk] <defunct>
11698 ? Z 0:00 [asterisk] <defunct>
11699 ? Z 0:00 [asterisk] <defunct>
11700 ? Z 0:00 [asterisk] <defunct>
11701 ? Z 0:00 [asterisk] <defunct>
11702 ? Z 0:00 [asterisk] <defunct>
11707 ? Z 0:00 [asterisk] <defunct>
11710 ? Z 0:00 [asterisk] <defunct>
11711 ? Z 0:00 [asterisk] <defunct>
11712 ? Z 0:00 [asterisk] <defunct>
11713 ? Z 0:00 [asterisk] <defunct>
11718 ? Z 0:00 [asterisk] <defunct>
11719 ? Z 0:00 [asterisk] <defunct>
11720 ? Z 0:00 [asterisk] <defunct>
11721 ? Z 0:00 [asterisk] <defunct>
11722 ? Z 0:00 [asterisk] <defunct>
11727 ? Z 0:00 [asterisk] <defunct>
11728 ? Z 0:00 [asterisk] <defunct>
11729 ? Z 0:00 [asterisk] <defunct>
11730 ? Z 0:00 [asterisk] <defunct>
11731 ? Z 0:00 [asterisk] <defunct>
11736 ? Z 0:00 [asterisk] <defunct>
11737 ? Z 0:00 [asterisk] <defunct>
11738 ? Z 0:00 [asterisk] <defunct>
11739 ? Z 0:00 [asterisk] <defunct>
11740 ? Z 0:00 [asterisk] <defunct>
11745 ? Z 0:00 [asterisk] <defunct>
11746 ? Z 0:00 [asterisk] <defunct>
11747 ? Z 0:00 [asterisk] <defunct>
11748 ? Z 0:00 [asterisk] <defunct>
11749 ? Z 0:00 [asterisk] <defunct>
11754 ? Z 0:00 [asterisk] <defunct>
11755 ? Z 0:00 [asterisk] <defunct>
11756 ? Z 0:00 [asterisk] <defunct>
11757 ? Z 0:00 [asterisk] <defunct>
11758 ? Z 0:00 [asterisk] <defunct>
11763 ? Z 0:00 [asterisk] <defunct>
11764 ? Z 0:00 [asterisk] <defunct>
11765 ? Z 0:00 [asterisk] <defunct>
11766 ? Z 0:00 [asterisk] <defunct>
11767 ? Z 0:00 [asterisk] <defunct>
11772 ? Z 0:00 [asterisk] <defunct>
11773 ? Z 0:00 [asterisk] <defunct>
11774 ? Z 0:00 [asterisk] <defunct>
11775 ? Z 0:00 [asterisk] <defunct>
11776 ? Z 0:00 [asterisk] <defunct>
11780 pts/2 S+ 0:00 sleep 1
11781 ? Z 0:00 [asterisk] <defunct>
11782 pts/4 R+ 0:00 ps ax
11783 ? Z 0:00 [asterisk] <defunct>
11933 ? S 0:00 /usr/sbin/apache2 -k start
12431 ? S 0:00 /usr/sbin/apache2 -k start
13089 ? S 0:04 /usr/sbin/apache2 -k start
13167 ? S 0:04 /usr/sbin/apache2 -k start
17667 ? S 0:00 /bin/sh /usr/sbin/safe_asterisk
17672 ? Sl 1037:42 /usr/sbin/asterisk -f -vvvg -c
26738 pts/3 S+ 0:00 tail -f /var/log/asterisk/messages
26739 pts/3 S+ 0:00 grep WARNING
27575 ? Ss 0:00 sshd: ernied [priv]
27585 ? S 0:00 sshd: ernied@pts/4
27586 pts/4 Ss 0:00 -bash
30922 ? S 0:01 /usr/sbin/apache2 -k start
31493 pts/1 S+ 0:00 rasterisk r
32245 ? Ss 0:00 sshd: ernied [priv]
32249 ? S 0:00 sshd: ernied@pts/2
32250 pts/2 Ss 0:02 -bash

Asterisk WARNING messages (from tail -f /var/log/asterisk/messages |grep WARNING):
[Dec 30 12:41:23] WARNING[1082] chan_sip.c: '' is not a valid RTP hold time at line 0. Using default.
[Dec 30 12:41:23] WARNING[1082] chan_sip.c: '' is not a valid RTP hold time at line 0. Using default.

DAHDI Channels in use:
voip2*CLI> dahdi show channels
Chan Extension Context Language MOH Interpret Blocked State
pseudo default default In Service
1 local default In Service
2 6043193773 local default In Service
3 local default In Service
4 1866607630 local default In Service
5 6045158681 local default In Service
6 local default In Service
7 local default In Service
8 local default In Service
9 local default In Service
10 local default In Service
11 local default In Service
12 local default In Service
13 local default In Service
14 local default In Service
15 local default In Service
16 local default In Service
17 local default In Service
18 local default In Service
19 local default In Service
20 local default In Service
21 local default In Service
22 local default In Service
23 local default In Service

By: Ernie Dunbar (ernied) 2010-12-30 14:55:36.000-0600
If anything, zombies start to accumulate when these RTP hold time warning messages appear. Over the course of 1 minute, the zombie processes ramp up to about 300 under light load. I haven't seen it get any worse than that quite yet.

By: Ernie Dunbar (ernied) 2010-12-30 15:36:38.000-0600
Load is ramping up now as the new server is going live, and zombie processes are basically being generated continuously, but get destroyed when their number reaches 300. This is better, but may not be optimal. Otherwise, we aren't getting a lot of complaints about quality of service with the customers that are connected to the new server, so I don't think there's anything to truly complain about right now.

By: Tilghman Lesher (tilghman) 2010-12-30 15:43:44.000-0600
Your zombie output suggests one of the following:
1) You are calling "asterisk -rx" with a command in the dialplan. This shouldn't be a problem, but if this is the case, you should typically be investigating better ways to do this.
2) One of the commands that you are attempting to execute either in the dialplan or in one of the assorted "external" locations (e.g. externnotify in voicemail) does not actually exist, and so the attempt to execute that command fails.
I just verified that zombie processes show the last executable, not the name of the parent. The existence of so many "asterisk" processes points to another problem, of which you are seeing merely a symptom.

By: Digium Subversion (svnbot) 2011-01-19 14:13:26.000-0600
Repository: asterisk
Revision: 302599

U branches/1.6.2/main/app.c

------------------------------------------------------------------------
r302599 | tilghman | 2011-01-19 14:13:25 -0600 (Wed, 19 Jan 2011) | 15 lines

Kill zombies.

When we ast_safe_fork() with a non-zero argument, we're expected to reap our own zombies. On a zero argument, however, the zombies are only reaped when there aren't any non-zero forked children alive. At other times, we accumulate zombies. This code is forward ported from res_agi in 1.4, so that forked children are always reaped, thus preventing an accumulation of zombie processes.

(closes issue ASTERISK-17147)
Reported by: ernied
Patches:
20101221__issue18515.diff.txt uploaded by tilghman (license 14)
Tested by: ernied
------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302599

By: Digium Subversion (svnbot) 2011-01-19 14:24:58.000-0600
Repository: asterisk
Revision: 302634

_U branches/1.8/
U branches/1.8/main/app.c

------------------------------------------------------------------------
r302634 | tilghman | 2011-01-19 14:24:57 -0600 (Wed, 19 Jan 2011) | 22 lines

Merged revisions 302599 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

........
r302599 | tilghman | 2011-01-19 14:13:24 -0600 (Wed, 19 Jan 2011) | 15 lines

Kill zombies.

When we ast_safe_fork() with a non-zero argument, we're expected to reap our own zombies. On a zero argument, however, the zombies are only reaped when there aren't any non-zero forked children alive. At other times, we accumulate zombies. This code is forward ported from res_agi in 1.4, so that forked children are always reaped, thus preventing an accumulation of zombie processes.

(closes issue ASTERISK-17147)
Reported by: ernied
Patches:
20101221__issue18515.diff.txt uploaded by tilghman (license 14)
Tested by: ernied
........
------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302634

By: Digium Subversion (svnbot) 2011-01-19 14:33:31.000-0600
Repository: asterisk
Revision: 302644

_U trunk/
U trunk/main/app.c

------------------------------------------------------------------------
r302644 | tilghman | 2011-01-19 14:33:30 -0600 (Wed, 19 Jan 2011) | 29 lines

Merged revisions 302634 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

................
r302634 | tilghman | 2011-01-19 14:24:57 -0600 (Wed, 19 Jan 2011) | 22 lines

Merged revisions 302599 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

........
r302599 | tilghman | 2011-01-19 14:13:24 -0600 (Wed, 19 Jan 2011) | 15 lines

Kill zombies.

When we ast_safe_fork() with a non-zero argument, we're expected to reap our own zombies. On a zero argument, however, the zombies are only reaped when there aren't any non-zero forked children alive. At other times, we accumulate zombies. This code is forward ported from res_agi in 1.4, so that forked children are always reaped, thus preventing an accumulation of zombie processes.

(closes issue ASTERISK-17147)
Reported by: ernied
Patches:
20101221__issue18515.diff.txt uploaded by tilghman (license 14)
Tested by: ernied
........
................
------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302644 |
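
Illustration of the fix described in the commit message: a parent that forks children must collect their exit status, otherwise each exited child lingers as a <defunct> entry like those in the ps listing. The following is only a minimal sketch of that general POSIX technique in Python, assuming a Linux or other POSIX system. It is not the code committed to main/app.c (which is C), and the helper names spawn_child, reap_children and reap_all are hypothetical, introduced here for illustration.

```python
import os
import time


def spawn_child(exit_code=0):
    """Fork a child that exits immediately; return its PID to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: exit at once, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    return pid


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples for the children reaped.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return at once.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children left at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, code))
    return reaped


def reap_all(expected, timeout=5.0):
    """Poll reap_children() until `expected` children are collected or timeout."""
    collected = []
    deadline = time.monotonic() + timeout
    while len(collected) < expected and time.monotonic() < deadline:
        collected.extend(reap_children())
        time.sleep(0.01)
    return collected
```

Calling reap_children() periodically (or from a SIGCHLD handler) keeps exited children from accumulating. The non-blocking WNOHANG loop is used here so that the caller never stalls waiting on a child that is still running.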