
Summary: ASTERISK-17147: [patch] Asterisk produces many zombie processes while under load.
Reporter: Ernie Dunbar (ernied)
Labels:
Date Opened: 2010-12-21 20:26:48.000-0600
Date Closed: 2011-01-19 14:33:32.000-0600
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) 20101221__issue18515.diff.txt
Description: Once under a non-trivial load (trivial being the 6 phones in our office that we've been testing with for the past week, non-trivial being around a hundred sip clients and about 7 or 8 simultaneous DAHDI calls), Asterisk starts producing a high volume of zombie processes - say a few hundred a minute. If left alone, this crashes the operating system within a few hours, but not before calls start getting very flaky.



****** ADDITIONAL INFORMATION ******

We're running on the following software:

Debian Linux 5.0 "Lenny"
Asterisk 1.6.2.15
DAHDI Version: 2.4.0 Echo Canceller: MG2
Libpri 1.4.11.4

And on the following hardware:

2.4 GHz Intel Pentium 4
S875WP1-E Motherboard
2.0 GB DDR400 RAM
Digium Wildcard TE410P/TE405P (1st Gen, with Echo Cancellation)
Comments:

By: Tilghman Lesher (tilghman) 2010-12-21 21:44:42.000-0600

Patch uploaded against 1.6.2.  Please test and verify that it fixes the problem for you.

By: Ernie Dunbar (ernied) 2010-12-30 11:49:22.000-0600

This patch does not fix the problem.

This issue does not exist when load testing with SIPp either. Could this be an issue in the DAHDI driver?

By: Tilghman Lesher (tilghman) 2010-12-30 13:57:15.000-0600

I am at a loss as to the cause of the zombies, then.  Could you show a listing of the zombie processes, when the problem is occurring?

By: Ernie Dunbar (ernied) 2010-12-30 14:09:23.000-0600

Also, this is our procedure for testing under load:

1. Live Asterisk server has a proper T1 connection from our telephone provider connected to an identical Digium Wildcard TE405P.

2. New Asterisk server has a pseudo T1 connection (or perhaps a real one, depending on your perspective I suppose) built from one of the free PRI ports on the Live server.

3. The Live server forwards incoming calls from its T1 connection to the New server if a SIP client is not connected to the Live server, i.e., from extensions.conf on the Live server:

exten => 778XXXXXXX,1,Dial(SIP/778XXXXXXX,20)
exten => 778XXXXXXX,n,Dial(DAHDI/g4/778XXXXXXX,20)

4. DNS gets changed to change the hostname of the Live server to use the IP of the New server.

5. SIP clients start logging in when their registration expires.

6. Things go smoothly until around 6 or 7 consecutive DAHDI calls, when Asterisk starts producing zombie processes at a high rate (several hundred per minute).

Typically by step 6, we switch our DNS back and things start going back to normal for our customers.

By: Ernie Dunbar (ernied) 2010-12-30 14:52:10.000-0600

Testing right this moment.

Actually, currently it seems that zombie processes cap out at around 256 and then die properly. Here's a reasonable snapshot of the system as the problem happened in one particular instance:

Processes:

ernied@voip2:~$ ps ax
 PID TTY      STAT   TIME COMMAND
   1 ?        Ss     0:08 init [2]  
   2 ?        S<     0:00 [kthreadd]
   3 ?        S<     0:00 [migration/0]
   4 ?        S<    13:53 [ksoftirqd/0]
   5 ?        S<     0:00 [watchdog/0]
   6 ?        S<    24:45 [events/0]
   7 ?        S<     0:00 [khelper]
  39 ?        S<     0:02 [kblockd/0]
  41 ?        S<     0:00 [kacpid]
  42 ?        S<     0:00 [kacpi_notify]
 115 ?        S<     0:00 [kseriod]
 151 ?        S      0:00 [pdflush]
 152 ?        S      0:11 [pdflush]
 153 ?        S<     0:00 [kswapd0]
 154 ?        S<     0:00 [aio/0]
 410 ?        Ss     0:00 sshd: ernied [priv]
 419 ?        S      0:01 sshd: ernied@pts/1
 421 pts/1    Ss     0:00 -bash
 588 ?        S<     0:00 [ksuspend_usbd]
 589 ?        S<     0:00 [khubd]
 623 ?        S<     0:00 [ata/0]
 624 ?        S<     0:00 [ata_aux]
 688 ?        S<     0:00 [scsi_eh_0]
 691 ?        S<     0:00 [scsi_eh_1]
 833 ?        S      0:03 /usr/sbin/apache2 -k start
 882 ?        S<     0:08 [kjournald]
 958 ?        S<s    0:00 udevd --daemon
1286 ?        S<     0:00 [edac-poller]
1398 ?        S<     0:00 [kpsmoused]
1651 ?        S<     0:00 [kjournald]
1652 ?        S<     0:00 [kjournald]
1653 ?        S<     0:06 [kjournald]
1654 ?        S<     0:34 [kjournald]
1948 ?        Sl     0:49 /usr/sbin/rsyslogd -c3
1959 ?        Ss     0:00 /usr/sbin/acpid
2128 ?        Ss     0:15 /usr/sbin/sshd
2168 ?        S      0:00 /bin/sh /usr/bin/mysqld_safe
2205 ?        Sl    17:42 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/my
2206 ?        S      0:00 logger -p daemon.err -t mysqld_safe -i -t mysqld
2535 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
2563 ?        Ss     0:00 /usr/sbin/atd
2583 ?        Ss     0:02 /usr/sbin/cron
2597 ?        Ss     0:21 /usr/sbin/apache2 -k start
2658 ?        Ss     0:58 /usr/sbin/munin-node
2698 tty1     Ss+    0:00 /sbin/getty 38400 tty1
2702 tty2     Ss+    0:00 /sbin/getty 38400 tty2
2704 tty3     Ss+    0:00 /sbin/getty 38400 tty3
2706 tty4     Ss+    0:00 /sbin/getty 38400 tty4
2708 tty5     Ss+    0:00 /sbin/getty 38400 tty5
2709 tty6     Ss+    0:00 /sbin/getty 38400 tty6
3659 ?        Ss     0:00 sshd: ernied [priv]
3666 ?        S      0:00 sshd: ernied@pts/3
3667 pts/3    Ss     0:00 -bash
4155 ?        S      0:01 /usr/sbin/apache2 -k start
7387 ?        S      0:01 /usr/sbin/apache2 -k start
7413 ?        S      0:00 /usr/sbin/apache2 -k start
8632 ?        S      0:00 /usr/sbin/apache2 -k start
11602 ?        S      0:00 /usr/bin/php -q /var/lib/asterisk/agi-bin/a2billing.p
11605 ?        Z      0:00 [asterisk] <defunct>
11606 ?        Z      0:00 [asterisk] <defunct>
11607 ?        Z      0:00 [asterisk] <defunct>
11612 ?        Z      0:00 [asterisk] <defunct>
11613 ?        Z      0:00 [asterisk] <defunct>
11614 ?        Z      0:00 [asterisk] <defunct>
11615 ?        Z      0:00 [asterisk] <defunct>
11616 ?        Z      0:00 [asterisk] <defunct>
11621 ?        Z      0:00 [asterisk] <defunct>
11622 ?        Z      0:00 [asterisk] <defunct>
11623 ?        Z      0:00 [asterisk] <defunct>
11624 ?        Z      0:00 [asterisk] <defunct>
11625 ?        Z      0:00 [asterisk] <defunct>
11630 ?        Z      0:00 [asterisk] <defunct>
11631 ?        Z      0:00 [asterisk] <defunct>
11632 ?        Z      0:00 [asterisk] <defunct>
11633 ?        Z      0:00 [asterisk] <defunct>
11634 ?        Z      0:00 [asterisk] <defunct>
11639 ?        Z      0:00 [asterisk] <defunct>
11640 ?        Z      0:00 [asterisk] <defunct>
11641 ?        Z      0:00 [asterisk] <defunct>
11642 ?        Z      0:00 [asterisk] <defunct>
11643 ?        Z      0:00 [asterisk] <defunct>
11648 ?        Z      0:00 [asterisk] <defunct>
11649 ?        Z      0:00 [asterisk] <defunct>
11650 ?        Z      0:00 [asterisk] <defunct>
11651 ?        Z      0:00 [asterisk] <defunct>
11652 ?        Z      0:00 [asterisk] <defunct>
11657 ?        Z      0:00 [asterisk] <defunct>
11658 ?        Z      0:00 [asterisk] <defunct>
11659 ?        Z      0:00 [asterisk] <defunct>
11660 ?        Z      0:00 [asterisk] <defunct>
11661 ?        Z      0:00 [asterisk] <defunct>
11666 ?        Z      0:00 [asterisk] <defunct>
11667 ?        Z      0:00 [asterisk] <defunct>
11668 ?        Z      0:00 [asterisk] <defunct>
11669 ?        Z      0:00 [asterisk] <defunct>
11670 ?        Z      0:00 [asterisk] <defunct>
11675 ?        Z      0:00 [asterisk] <defunct>
11676 ?        Z      0:00 [asterisk] <defunct>
11677 ?        Z      0:00 [asterisk] <defunct>
11678 ?        Z      0:00 [asterisk] <defunct>
11679 ?        Z      0:00 [asterisk] <defunct>
11686 ?        Z      0:00 [asterisk] <defunct>
11687 ?        Z      0:00 [asterisk] <defunct>
11688 ?        Z      0:00 [asterisk] <defunct>
11689 ?        Z      0:00 [asterisk] <defunct>
11690 ?        Z      0:00 [asterisk] <defunct>
11691 ?        Z      0:00 [asterisk] <defunct>
11698 ?        Z      0:00 [asterisk] <defunct>
11699 ?        Z      0:00 [asterisk] <defunct>
11700 ?        Z      0:00 [asterisk] <defunct>
11701 ?        Z      0:00 [asterisk] <defunct>
11702 ?        Z      0:00 [asterisk] <defunct>
11707 ?        Z      0:00 [asterisk] <defunct>
11710 ?        Z      0:00 [asterisk] <defunct>
11711 ?        Z      0:00 [asterisk] <defunct>
11712 ?        Z      0:00 [asterisk] <defunct>
11713 ?        Z      0:00 [asterisk] <defunct>
11718 ?        Z      0:00 [asterisk] <defunct>
11719 ?        Z      0:00 [asterisk] <defunct>
11720 ?        Z      0:00 [asterisk] <defunct>
11721 ?        Z      0:00 [asterisk] <defunct>
11722 ?        Z      0:00 [asterisk] <defunct>
11727 ?        Z      0:00 [asterisk] <defunct>
11728 ?        Z      0:00 [asterisk] <defunct>
11729 ?        Z      0:00 [asterisk] <defunct>
11730 ?        Z      0:00 [asterisk] <defunct>
11731 ?        Z      0:00 [asterisk] <defunct>
11736 ?        Z      0:00 [asterisk] <defunct>
11737 ?        Z      0:00 [asterisk] <defunct>
11738 ?        Z      0:00 [asterisk] <defunct>
11739 ?        Z      0:00 [asterisk] <defunct>
11740 ?        Z      0:00 [asterisk] <defunct>
11745 ?        Z      0:00 [asterisk] <defunct>
11746 ?        Z      0:00 [asterisk] <defunct>
11747 ?        Z      0:00 [asterisk] <defunct>
11748 ?        Z      0:00 [asterisk] <defunct>
11749 ?        Z      0:00 [asterisk] <defunct>
11754 ?        Z      0:00 [asterisk] <defunct>
11755 ?        Z      0:00 [asterisk] <defunct>
11756 ?        Z      0:00 [asterisk] <defunct>
11757 ?        Z      0:00 [asterisk] <defunct>
11758 ?        Z      0:00 [asterisk] <defunct>
11763 ?        Z      0:00 [asterisk] <defunct>
11764 ?        Z      0:00 [asterisk] <defunct>
11765 ?        Z      0:00 [asterisk] <defunct>
11766 ?        Z      0:00 [asterisk] <defunct>
11767 ?        Z      0:00 [asterisk] <defunct>
11772 ?        Z      0:00 [asterisk] <defunct>
11773 ?        Z      0:00 [asterisk] <defunct>
11774 ?        Z      0:00 [asterisk] <defunct>
11775 ?        Z      0:00 [asterisk] <defunct>
11776 ?        Z      0:00 [asterisk] <defunct>
11780 pts/2    S+     0:00 sleep 1
11781 ?        Z      0:00 [asterisk] <defunct>
11782 pts/4    R+     0:00 ps ax
11783 ?        Z      0:00 [asterisk] <defunct>
11933 ?        S      0:00 /usr/sbin/apache2 -k start
12431 ?        S      0:00 /usr/sbin/apache2 -k start
13089 ?        S      0:04 /usr/sbin/apache2 -k start
13167 ?        S      0:04 /usr/sbin/apache2 -k start
17667 ?        S      0:00 /bin/sh /usr/sbin/safe_asterisk
17672 ?        Sl   1037:42 /usr/sbin/asterisk -f -vvvg -c
26738 pts/3    S+     0:00 tail -f /var/log/asterisk/messages
26739 pts/3    S+     0:00 grep WARNING
27575 ?        Ss     0:00 sshd: ernied [priv]
27585 ?        S      0:00 sshd: ernied@pts/4
27586 pts/4    Ss     0:00 -bash
30922 ?        S      0:01 /usr/sbin/apache2 -k start
31493 pts/1    S+     0:00 rasterisk r
32245 ?        Ss     0:00 sshd: ernied [priv]
32249 ?        S      0:00 sshd: ernied@pts/2
32250 pts/2    Ss     0:02 -bash
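
A listing like the one above is easier to compare between runs when it is reduced to a count of defunct processes per command. The following is a minimal sketch, assuming `ps ax` output in the five-column layout shown (PID, TTY, STAT, TIME, COMMAND); the function name `count_zombies` and the `sample` text are illustrative and not part of Asterisk or procps.

```python
def count_zombies(ps_output):
    """Count processes whose STAT column starts with 'Z', grouped by command.

    Assumes the `ps ax` column layout: PID TTY STAT TIME COMMAND.
    """
    counts = {}
    for line in ps_output.splitlines():
        # Split into at most five fields so COMMAND keeps its spaces.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header row and malformed lines
        stat, command = fields[2], fields[4].strip()
        if stat.startswith("Z"):
            counts[command] = counts.get(command, 0) + 1
    return counts


# A few lines copied from the listing above, used as sample input.
sample = """\
  PID TTY      STAT   TIME COMMAND
11602 ?        S      0:00 /usr/bin/php -q /var/lib/asterisk/agi-bin/a2billing.p
11605 ?        Z      0:00 [asterisk] <defunct>
11606 ?        Z      0:00 [asterisk] <defunct>
17672 ?        Sl   1037:42 /usr/sbin/asterisk -f -vvvg -c
"""

print(count_zombies(sample))  # {'[asterisk] <defunct>': 2}
```

Running this against successive snapshots would show whether the zombie count keeps growing or levels off, as described in the comments that follow.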

Asterisk WARNING messages (from tail -f /var/log/asterisk/messages |grep WARNING):

[Dec 30 12:41:23] WARNING[1082] chan_sip.c: '' is not a valid RTP hold time at line 0.  Using default.
[Dec 30 12:41:23] WARNING[1082] chan_sip.c: '' is not a valid RTP hold time at line 0.  Using default.

DAHDI Channels in use:

voip2*CLI> dahdi show channels
  Chan Extension  Context         Language   MOH Interpret        Blocked    State    
pseudo            default                    default                         In Service
     1            local                      default                         In Service
     2 6043193773 local                      default                         In Service
     3            local                      default                         In Service
     4 1866607630 local                      default                         In Service
     5 6045158681 local                      default                         In Service
     6            local                      default                         In Service
     7            local                      default                         In Service
     8            local                      default                         In Service
     9            local                      default                         In Service
    10            local                      default                         In Service
    11            local                      default                         In Service
    12            local                      default                         In Service
    13            local                      default                         In Service
    14            local                      default                         In Service
    15            local                      default                         In Service
    16            local                      default                         In Service
    17            local                      default                         In Service
    18            local                      default                         In Service
    19            local                      default                         In Service
    20            local                      default                         In Service
    21            local                      default                         In Service
    22            local                      default                         In Service
    23            local                      default                         In Service

By: Ernie Dunbar (ernied) 2010-12-30 14:55:36.000-0600

If anything, zombies start to accumulate when these RTP hold time warning messages appear. Over the course of 1 minute, the zombie processes ramp up to about 300 under light load. I haven't seen it get any worse than that quite yet.



By: Ernie Dunbar (ernied) 2010-12-30 15:36:38.000-0600

Load is ramping up now as the new server is going live, and zombie processes are basically being generated continuously, but get destroyed when their number reaches 300.

This is better, but may not be optimal. Otherwise, we aren't getting a lot of complaints about quality of service with the customers that are connected to the new server, so I don't think there's anything to truly complain about right now.

By: Tilghman Lesher (tilghman) 2010-12-30 15:43:44.000-0600

Your zombie output suggests one of the following:
1) You are calling "asterisk -rx" with a command in the dialplan.  This shouldn't be a problem, but if this is the case, you should typically be investigating better ways to do this.
2) One of the commands that you are attempting to execute either in the dialplan or in one of the assorted "external" locations (e.g. externnotify in voicemail) does not actually exist, and so the attempt to execute that command fails.

I just verified that zombie processes show the last executable, not the name of the parent.  The existence of so many "asterisk" processes points to another problem, of which you are seeing merely a symptom.
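
The second case above can be illustrated with a small sketch: a forked child that fails to exec its target never changes its process name, so if the parent does not wait for it, it would show up as a zombie carrying the parent's name. This is a generic POSIX illustration in Python, not Asterisk code; the helper `exec_or_exit` and the script path are hypothetical, and the exit code 127 follows the common shell convention for "command not found".

```python
import os
import sys


def exec_or_exit(argv):
    """Fork, try to exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: until exec succeeds, this process still has the parent's name.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (e.g. the file does not exist); exit immediately.
            os._exit(127)
    # Parent: waiting here is what removes the child's process table entry.
    # Without this call the exited child would remain as a zombie.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


# Hypothetical path that does not exist, standing in for a missing script.
print(exec_or_exit(["/nonexistent/hypothetical-notify-script"]))
# A command that does exist, for comparison.
print(exec_or_exit([sys.executable, "-c", "raise SystemExit(3)"]))
```

In this sketch the parent always calls `waitpid`, so no zombie is left behind; the point is only that a failed exec leaves a short-lived child under the parent's name.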

By: Digium Subversion (svnbot) 2011-01-19 14:13:26.000-0600

Repository: asterisk
Revision: 302599

U   branches/1.6.2/main/app.c

------------------------------------------------------------------------
r302599 | tilghman | 2011-01-19 14:13:25 -0600 (Wed, 19 Jan 2011) | 15 lines

Kill zombies.

When we ast_safe_fork() with a non-zero argument, we're expected to reap our
own zombies.  On a zero argument, however, the zombies are only reaped when
there aren't any non-zero forked children alive.  At other times, we
accumulate zombies.  This code is forward ported from res_agi in 1.4, so that
forked children are always reaped, thus preventing an accumulation of zombie
processes.

(closes issue ASTERISK-17147)
Reported by: ernied
Patches:
     20101221__issue18515.diff.txt uploaded by tilghman (license 14)
Tested by: ernied

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302599
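
The commit message above describes making sure forked children are always reaped. As a rough sketch of that general technique (not the actual change to main/app.c, which is C), the loop below collects every child that has already exited without blocking the caller. The function names `spawn_children`, `reap_children`, and `reap_all` are illustrative assumptions.

```python
import os
import time


def spawn_children(n):
    """Fork n children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, becoming a zombie until reaped
        pids.append(pid)
    return pids


def reap_children():
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def reap_all(pids, timeout=5.0):
    """Poll reap_children() until all given PIDs are collected or time runs out.

    Returns the set of PIDs that were not reaped.
    """
    pending = set(pids)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        pending -= set(reap_children())
        if pending:
            time.sleep(0.01)
    return pending


children = spawn_children(3)
leftover = reap_all(children)
print(len(children), "spawned,", len(leftover), "left unreaped")
```

Calling a non-blocking reap like this on a regular schedule, rather than only under certain conditions, is one way to keep exited children from accumulating.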

By: Digium Subversion (svnbot) 2011-01-19 14:24:58.000-0600

Repository: asterisk
Revision: 302634

_U  branches/1.8/
U   branches/1.8/main/app.c

------------------------------------------------------------------------
r302634 | tilghman | 2011-01-19 14:24:57 -0600 (Wed, 19 Jan 2011) | 22 lines

Merged revisions 302599 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

........
 r302599 | tilghman | 2011-01-19 14:13:24 -0600 (Wed, 19 Jan 2011) | 15 lines
 
 Kill zombies.
 
 When we ast_safe_fork() with a non-zero argument, we're expected to reap our
 own zombies.  On a zero argument, however, the zombies are only reaped when
 there aren't any non-zero forked children alive.  At other times, we
 accumulate zombies.  This code is forward ported from res_agi in 1.4, so that
 forked children are always reaped, thus preventing an accumulation of zombie
 processes.
 
 (closes issue ASTERISK-17147)
 Reported by: ernied
 Patches:
       20101221__issue18515.diff.txt uploaded by tilghman (license 14)
 Tested by: ernied
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302634

By: Digium Subversion (svnbot) 2011-01-19 14:33:31.000-0600

Repository: asterisk
Revision: 302644

_U  trunk/
U   trunk/main/app.c

------------------------------------------------------------------------
r302644 | tilghman | 2011-01-19 14:33:30 -0600 (Wed, 19 Jan 2011) | 29 lines

Merged revisions 302634 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

................
 r302634 | tilghman | 2011-01-19 14:24:57 -0600 (Wed, 19 Jan 2011) | 22 lines
 
 Merged revisions 302599 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.6.2
 
 ........
   r302599 | tilghman | 2011-01-19 14:13:24 -0600 (Wed, 19 Jan 2011) | 15 lines
   
   Kill zombies.
   
   When we ast_safe_fork() with a non-zero argument, we're expected to reap our
   own zombies.  On a zero argument, however, the zombies are only reaped when
   there aren't any non-zero forked children alive.  At other times, we
   accumulate zombies.  This code is forward ported from res_agi in 1.4, so that
   forked children are always reaped, thus preventing an accumulation of zombie
   processes.
   
   (closes issue ASTERISK-17147)
   Reported by: ernied
   Patches:
         20101221__issue18515.diff.txt uploaded by tilghman (license 14)
   Tested by: ernied
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=302644