[Home]

Summary:ASTERISK-16023: [patch] UDP ports not freed/ports leaking
Reporter:Kenneth Van Velthoven (kvveltho)Labels:
Date Opened:2010-04-28 06:23:53Date Closed:2011-05-11 13:51:04
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
( 1) branch-1.6.1-r308370-sip-dos-mem-leak-fix.diff
( 2) branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
( 3) branch-1.6.2-r308370-sip-dos-mem-leak-fix.diff
( 4) branch-1.8-r307669-sip-dos-mem-leak-fix.diff
( 5) branch-1.8-r308202-sip-dos-mem-leak-fix.diff
( 6) branch-1.8-r308370-sip-dos-mem-leak-fix.diff
( 7) crash20110207.txt
( 8) issue-17255-branch-1.8-r318281.diff
( 9) issue-17255-trunk-r318281.diff
(10) rtp_leak_wild_guess_1.6.2_rev267441.diff
(11) rtp.conf
(12) sip-dos-mem-leak-fix-readme-r307181.txt
(13) tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
(14) tag-1.6.2.16.1-r308370-sip-dos-mem-leak-fix.diff
(15) tag-1.6.2.16.2-r308525-sip-dos-mem-leak-fix.diff
(16) tag-1.6.2.17-r309252-sip-dos-mem-leak-fix.diff
(17) trunk-r307669-sip-dos-mem-leak-fix.diff
(18) trunk-r308202-sip-dos-mem-leak-fix.diff
(19) trunk-r308370-sip-dos-mem-leak-fix.diff
(20) UDP.txt
Description:The following is not fixed:

These release candidates resolve an issue with UDP ports not being freed, per
issue ASTERISK-15581 (https://issues.asterisk.org/view.php?id=16774)
Comments:By: Paul Belanger (pabelanger) 2010-04-28 08:32:26

kvveltho: We need more information.  IE: outputs confirming this or logs.  At this point there is not enough information to do anything.

--
Thank you for taking the time to report this bug and helping to make Asterisk better.

Unfortunately, we cannot work on this bug because your description did not include enough information.

You may find it helpful to read the Asterisk Issue Guidelines http://www.asterisk.org/developers/bug-guidelines.

We\'d be grateful if you would then provide a more complete description of the problem.

At a minimum, we need:
1. the specific steps or actions you took that caused you to encounter the problem,
2. the behavior you expected, and
3. the behavior you actually encountered (in as much detail as possible).

This likely includes output from the console with debug level logging, a SIP trace (if this is SIP related), and configuration information such as dialplan (e.g. extensions.conf) and channel configuration (e.g. sip.conf).

Thanks!

By: Leif Madsen (lmadsen) 2010-04-30 10:50:03

Setting to feedback while awaiting information from the reporter.

By: Kenneth Van Velthoven (kvveltho) 2010-05-02 14:32:12

I can't imagine being te only one having that problem.  See UDP.txt for netstat UDP ports not being freed.  There's no activity (active calls) on the Asterisk server now.

With version 1.6.0.26 there's no problem.

By: Walter Doekes (wdoekes) 2010-05-03 04:29:44

kvveltho: I understand the fix in https://issues.asterisk.org/view.php?id=16774 is first applied in 1.6.2.8. You say you're running 1.6.2.7-rc2?

(I run 1.6.1.18 and I see the patch from http://svnview.digium.com/svn/asterisk/branches/1.6.1/channels/chan_sip.c?r1=257208&r2=257207&pathrev=257208 has not been applied there. I've applied the patch now and will monitor if it fixes the issue.)

By: Kenneth Van Velthoven (kvveltho) 2010-05-03 05:14:17

I must have confused both issues.   I'll wait for 1.6.2.8 then and get back with feedback.

By: Walter Doekes (wdoekes) 2010-05-03 06:26:10

Hah. Disregard my comments. I'm looking at 1.6.2.7-rc2 now and the patch is already applied there.

So, if you're still experiencing the issue with said version it is either not fixed or you are seeing the manifestation of a different bug.

By: Kenneth Van Velthoven (kvveltho) 2010-05-03 14:12:05

In 1.6.2.7-rc3 the problem seems to fixed.  Installed it yesterday,  had about 30000 calls and no UDP ports open.  I'm testing the next days and get back.

By: Leif Madsen (lmadsen) 2010-05-04 15:57:21

I'm going to close this issue as fixed for now -- if that is NOT the case, then please reopen this issue. Thanks!

By: Kenneth Van Velthoven (kvveltho) 2010-05-04 16:08:52

RC3 also leaking some ports:  netstat -a extract while no calls are active

udp        0      0 *:13610                 *:*
udp        0      0 *:14890                 *:*
udp        0      0 *:13611                 *:*
udp        0      0 *:14891                 *:*
udp        0      0 *:12980                 *:*
udp        0      0 *:15412                 *:*
udp        0      0 *:12981                 *:*
udp        0      0 *:15413                 *:*

By: Walter Doekes (wdoekes) 2010-05-05 14:59:45

Confirmed here, unfortunately.

# asterisk -rx 'core show version' ; netstat -apnAinet | grep udp.*asterisk ; asterisk -rx 'core show channels'
Asterisk 1.6.2.7-rc3 built by osso @ xyz on a x86_64 running Linux on 2010-05-03 13:39:52 UTC
udp        0      0 0.0.0.0:4566            0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:4590            0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:10228           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:10229           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:17452           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:17453           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:4536            0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:16442           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:16443           0.0.0.0:*                           25045/asterisk  
udp        0      0 0.0.0.0:5060            0.0.0.0:*                           25045/asterisk  
Channel              Location             State   Application(Data)            
0 active channels
0 active calls
12726 calls processed

By: frawd (frawd) 2010-05-05 17:46:53

Hello!

How many SIP peers does your system handle? Do you use qualify? Do you use outbound registrations?

By: Walter Doekes (wdoekes) 2010-05-06 01:40:12

Hi frawd,

SIP peers:

mysql> select count(*) from asterisk_sipreg where regseconds <> 0;
+----------+
| count(*) |
+----------+
|      244 |
+----------+


Qualify, for exactly two static peers:

mysql> select * from asterisk_staticconf_xyz where var_name = 'qualify' order by cat_metric, var_metric;
+------------+------------+----------+----------------+----------+---------+-----------+
| cat_metric | var_metric | filename | category       | var_name | var_val | commented |
+------------+------------+----------+----------------+----------+---------+-----------+
|          1 |         21 | sip.conf | general        | qualify  | no      |         0 |
|          2 |      32778 | sip.conf | xxxxxxx001     | qualify  | yes     |         0 |
|          3 |      32778 | sip.conf | xxxxxxx002     | qualify  | yes     |         0 |
|          4 |      32774 | sip.conf | xxxxxxxtelecom | qualify  | no      |         0 |
|          5 |      32774 | sip.conf | voxbone_be     | qualify  | no      |         0 |
|          6 |      32774 | sip.conf | voxbone_cn     | qualify  | no      |         0 |
|          7 |      32774 | sip.conf | voxbone_de     | qualify  | no      |         0 |
|          8 |      32774 | sip.conf | voxbone_usla   | qualify  | no      |         0 |
|          9 |      32774 | sip.conf | voxbone_usny   | qualify  | no      |         0 |
+------------+------------+----------+----------------+----------+---------+-----------+


Outbound registrations, the same two static peers that use qualify:

mysql> select * from asterisk_staticconf_xyz where var_name = 'register' order by cat_metric, var_metric;
+------------+------------+----------+----------+----------+------------------------------------------+-----------+
| cat_metric | var_metric | filename | category | var_name | var_val                                  | commented |
+------------+------------+----------+----------+----------+------------------------------------------+-----------+
|          1 |      32770 | sip.conf | general  | register | xxxxxxx001:zzzzzzzzzzzzzzz@xx.xxx.xxx.xx |         0 |
|          1 |      32771 | sip.conf | general  | register | xxxxxxx002:zzzzzzzzzzzzzzz@xx.xxx.xxx.xx |         0 |
+------------+------------+----------+----------+----------+------------------------------------------+-----------+

By: frawd (frawd) 2010-05-06 03:19:03

Sorry again, what are the values of rtpstart and rtpend in rtp.conf?

By: frawd (frawd) 2010-05-06 03:21:09

nevermind, just saw the attached rtp.conf

By: Walter Doekes (wdoekes) 2010-05-06 03:30:53

Mm.. that would be kvveltho's conf. But mine is similar:

mysql> select * from asterisk_staticconf_xyz where filename = 'rtp.conf' order by cat_metric, var_metric;
+------------+------------+----------+----------+-----------+---------+-----------+
| cat_metric | var_metric | filename | category | var_name  | var_val | commented |
+------------+------------+----------+----------+-----------+---------+-----------+
|          1 |          1 | rtp.conf | general  | rtpstart  | 10000   |         0 |
|          1 |          2 | rtp.conf | general  | rtpend    | 20000   |         0 |
|          1 |          3 | rtp.conf | general  | strictrtp | no      |         0 |
+------------+------------+----------+----------+-----------+---------+-----------+

By: Walter Doekes (wdoekes) 2010-05-07 04:51:24

Qualify and registrations do not appear to be the culprits.

I've disabled the registration (set static ip at provider) and set
qualify to no -- both without restarting asterisk -- yesterday, and the
open UDP sockets are still increasing:

# egrep 'UDP socket|active calls' Thu-0002.log
UDP sockets: 11
0 active calls
# egrep 'UDP socket|active calls' Fri-0002.log
UDP sockets: 26
0 active calls
# asterisk -rx 'core show channels' | grep processed
24469 calls processed

By: frawd (frawd) 2010-05-07 05:27:42

Seems so, now does any of your providers or peers send you qualify/keep-alive packets?

By: Walter Doekes (wdoekes) 2010-05-07 09:51:47

I've counted the following incoming keep-alive packets over the course of 6 minutes:
10 INFO
1486 NOTIFY
229 OPTIONS

So yes. And I cannot block these, as I would be cutting off my nated clients ;)

By: Kenneth Van Velthoven (kvveltho) 2010-05-08 13:45:05

Tested 1.6.2.8 RC1,  problem also presenbt.

Here's a sample of the 60 open ports after a couple of 1000 calls:
udp        0      0 *:15124                 *:*
udp        0      0 *:12308                 *:*
udp        0      0 *:11029                 *:*
udp        0      0 *:15125                 *:*
udp        0      0 *:12309                 *:*
udp        0      0 *:16662                 *:*
udp        0      0 *:16663                 *:*
udp        0      0 *:10394                 *:*
Meanwile I went back to 1.6.0.26 where there's no problem.

By: Walter Doekes (wdoekes) 2010-05-17 01:20:59

@frawd: the number of leaked sockets look like they may have stabilized at 67, so qualify and registrations are back in the picture:

# egrep 'UDP socket|active calls|process' *-0202.log
Tue-0202.log:UDP sockets: 56
Tue-0202.log:0 active calls
Tue-0202.log:45190 calls processed

Wed-0202.log:UDP sockets: 68
Wed-0202.log:2 active calls
Wed-0202.log:54323 calls processed

Thu-0202.log:UDP sockets: 68
Thu-0202.log:2 active calls
Thu-0202.log:63413 calls processed

Fri-0202.log:UDP sockets: 68
Fri-0202.log:2 active calls
Fri-0202.log:66765 calls processed

Sat-0202.log:UDP sockets: 68
Sat-0202.log:2 active calls
Sat-0202.log:73576 calls processed

Sun-0202.log:UDP sockets: 68
Sun-0202.log:2 active calls
Sun-0202.log:77739 calls processed

Mon-0202.log:UDP sockets: 68
Mon-0202.log:2 active calls
Mon-0202.log:80643 calls processed

(those two active calls are stalled calls that aren't doing udp traffic. probably an unrelated issue.)

I've scheduled an asterisk restart. I'll report back if I get new leaks.

By: frawd (frawd) 2010-05-17 03:10:57

I've seen the same type of output on an implementation I have with hundreds of users. What I could see is that it represents ports currently in use by open SIP dialogs (notify, subscribe, qualify, ...), probably towards peers that have network problems and take time to reply (or just time out).
In netstat I see ports released after less than a minute and new ports opened (in a time with no calls).

Could you count the number of dialogs from "sip show objects" and see if they correspond more or less to the number of UDP ports from netstat?

The strange thing would be that it does not happen to 1.6.0 versions.

By: Alessandro Bertoldo (beres) 2010-05-20 11:33:45

With 1.6.2.8 rc1 with only 1 peer, used to send fax with 0.0.6pre17.
Here is an image http://yfrog.com/4vsocketp that show a graph of the sockets used by asterisk, it does not crash or segfault but freeze the server.

If you need to know any kind of informations, fell free to ask

By: Walter Doekes (wdoekes) 2010-06-03 04:47:27

frawd: sip show objects does not correlate to the UDP socket count

After the latest restart:

* date: tue 03:02 / wed 03:02 / thu 03:02
* active calls: 0 / 0 / 0
* processed calls: 7500 / 17099 / 26503
* udp sockets: 1 / 7 / 13
* static peer objects: 9 (all refcount 2) / 9 (all refcount 2) / 9 (all refcount 2)
* dialog objects: 48 (all refcount 2) / 62 (all refcount 2) / 49 (all refcount 2)


The SIP dialog count (sip show channels) always corresponds with the dialog count in sip show objects.

This latest restart wasn't planned, so I lost a bit of history. The previous log stated that it was using 52 sockets (friday, 2 active calls) and 58 (saturday, 2 active calls). It does not seem to leak as heavily as before I disabled the registrations.


Output of lsof -p for the mentioned points in time:

root@voipgrid-live0:~/fdlog# grep asterisk.*UDP Tue-0302.log
asterisk 13620 asterisk   12u  IPv4           64350467      0t0      UDP *:sip
root@voipgrid-live0:~/fdlog# grep asterisk.*UDP Wed-0302.log
asterisk 13620 asterisk   12u  IPv4           64350467      0t0      UDP *:sip
asterisk 13620 asterisk   73w  IPv4           65129850      0t0      UDP *:17910
asterisk 13620 asterisk   74u  IPv4           65129851      0t0      UDP *:17911
asterisk 13620 asterisk   75u  IPv4           65129852      0t0      UDP *:4524
asterisk 13620 asterisk  163w  IPv4           65370332      0t0      UDP *:13206
asterisk 13620 asterisk  190r  IPv4           65370333      0t0      UDP *:13207
asterisk 13620 asterisk  201r  IPv4           65370334      0t0      UDP *:4565
root@voipgrid-live0:~/fdlog# grep asterisk.*UDP Thu-0302.log
asterisk 13620 asterisk   12u  IPv4           64350467      0t0      UDP *:sip
asterisk 13620 asterisk   30u  IPv4           66394165      0t0      UDP *:14418
asterisk 13620 asterisk   31u  IPv4           66394166      0t0      UDP *:14419
asterisk 13620 asterisk   32u  IPv4           66394167      0t0      UDP *:4552
asterisk 13620 asterisk   73w  IPv4           65129850      0t0      UDP *:17910
asterisk 13620 asterisk   74u  IPv4           65129851      0t0      UDP *:17911
asterisk 13620 asterisk   75u  IPv4           65129852      0t0      UDP *:4524
asterisk 13620 asterisk   98u  IPv4           66115116      0t0      UDP *:15310
asterisk 13620 asterisk   99u  IPv4           66115117      0t0      UDP *:15311
asterisk 13620 asterisk  100w  IPv4           66115118      0t0      UDP *:4593
asterisk 13620 asterisk  163w  IPv4           65370332      0t0      UDP *:13206
asterisk 13620 asterisk  190r  IPv4           65370333      0t0      UDP *:13207
asterisk 13620 asterisk  201r  IPv4           65370334      0t0      UDP *:4565


What else can I check?

By: Walter Doekes (wdoekes) 2010-06-03 05:01:44

I don't think it's relevant, but for completeness sake, the following non-standard can be said about my configuration:

* I'm using the Hoard allocator.
* I've (only) loaded the following modules:

CLI> module show
Module                         Description                              Use Count
res_odbc.so                    ODBC resource                            0        
res_config_odbc.so             Realtime ODBC configuration              0        
res_monitor.so                 Call Monitoring Resource                 0        
app_chanspy.so                 Listen to the audio of an active channel 0        
app_dial.so                    Dialing Application                      21        
app_exec.so                    Executes dialplan applications           0        
app_fax.so                     Simple FAX Application                   0        
app_forkcdr.so                 Fork The CDR into 2 separate entities    0        
app_macro.so                   Extension Macros                         0        
app_playback.so                Sound File Playback Application          0        
app_playtones.so               Playtones Application                    0        
app_stack.so                   Dialplan subroutines (Gosub, Return, etc 0        
app_transfer.so                Transfers a caller to another extension  0        
app_verbose.so                 Send verbose output                      0        
app_voicemail.so               Comedian Mail (Voicemail System)         0        
cdr_odbc.so                    ODBC CDR Backend                         0        
chan_local.so                  Local Proxy Channel (Note: used internal 0        
chan_sip.so                    Session Initiation Protocol (SIP)        152      
codec_alaw.so                  A-law Coder/Decoder                      0        
codec_gsm.so                   GSM Coder/Decoder                        0        
codec_ulaw.so                  mu-Law Coder/Decoder                     0        
format_gsm.so                  Raw GSM data                             0        
format_pcm.so                  Raw/Sun uLaw/ALaw 8KHz (PCM,PCMA,AU), G. 0        
format_sln.so                  Raw Signed Linear Audio support (SLN)    0        
format_sln16.so                Raw Signed Linear 16KHz Audio support (S 0        
format_wav.so                  Microsoft WAV format (8000Hz Signed Line -3        
format_wav_gsm.so              Microsoft WAV format (Proprietary GSM)   0        
func_callerid.so               Caller ID related dialplan functions     0        
func_cdr.so                    Call Detail Record (CDR) dialplan functi 0        
func_channel.so                Channel information dialplan functions   0        
func_cut.so                    Cut out information from a string        0        
func_global.so                 Variable dialplan functions              0        
func_logic.so                  Logical dialplan functions               0        
func_odbc.so                   ODBC lookups                             0        
func_rand.so                   Random number dialplan function          0        
func_realtime.so               Read/Write/Store/Destroy values from a R 0        
func_strings.so                String handling dialplan functions       0        
func_timeout.so                Channel timeout dialplan functions       0        
pbx_config.so                  Text Extension Configuration             0        
pbx_spool.so                   Outgoing Spool Support                   0        
res_musiconhold.so             Music On Hold Resource                   1        
res_realtime.so                Realtime Data Lookup/Rewrite             0        
42 modules loaded

By: frawd (frawd) 2010-06-03 05:43:52

wdoekes, in your case at least (and probably beres'), it looks like the leak happens only on T.38 sessions, as ports leaked seem to always be by 3 looking at the information you provided (I can't believe i missed that before!):

- an RTP port (10000-20000)
- an RTCP port (RTP+1)
- a UDPTL port (4500-4999)

Ex. from about everything you pasted, you can see it clearly things like:
asterisk 13620 asterisk 30u IPv4 66394165 0t0 UDP *:14418
asterisk 13620 asterisk 31u IPv4 66394166 0t0 UDP *:14419
asterisk 13620 asterisk 32u IPv4 66394167 0t0 UDP *:4552

You can try to confirm by disabling T.38 if you can. In that case the good news is it would narrow the problem down quite a bit.

The bad news is I'm no expert at all in all the REF/UNREF things in Asterisk, I will try to get a look and help, but probably a real Asterisk dev will have to correct the issue, and believe me this could take time (start learning C, that's what I did while waiting for answers on this bug tracker). :-)

By: frawd (frawd) 2010-06-03 05:56:38

Basing myself on the correction for issue 16774 changing some AST_SCHED_DEL to AST_SCHED_DEL_UNREF, and looking for something like that related to T.38, here is a small patch that might make it work (for Asterisk 1.6.2 branch, but should apply to 1.6.2.8).

If you have guts you can test it, or else wait for a dev to say how great or stupid this patch is, as I have really no idea how that works and what this means... I don't think it should break things badly if it's a wrong approach, but you never know.

By: Walter Doekes (wdoekes) 2010-06-07 04:24:09

I did a lot of T38 testing on friday, and the refcount on my static peers is indeed up there:

[friday 03:02]
name: xyz
type: peer
objflags: 0
refcount: 9

[saturday 03:02]
name: xyz
type: peer
objflags: 0
refcount: 35


I can't disable T.38 on production right now. I'm considering whether I dare to deploy your patch on production ;-)

By: frawd (frawd) 2010-06-07 04:36:51

Hehe, sorry for not being confident enough with asterisk's source code.

I might be able to test if it doesn't break anything (i don't have any t38 devices or providers for testing the actual problem), but only in a few weeks for if you prefer to wait for that confirmation.

You could prepare some quick restore procedure (if you do it from source: have a non-patched directory compiled and ready for 'make install') to restore everything in a couple of minutes in case you have any problems.

By: Walter Doekes (wdoekes) 2010-06-19 07:09:29

Okay, I've been running the patch for 2.5 days and I see the following:
(1) UDP leaks still occur :(
(2) The refcount for my peers is steady at 2 (probably better than the increase I witnessed before), but it may be a bit early to be sure.

By: Walter Doekes (wdoekes) 2010-07-06 04:37:46

The socket count seemed to be steady at 76 for a while, but today I found an increase again and we're up at 85 sockets.

Between monday 03:00 and tuesday 03:00 I find that 3 sets of sockets opened (3x rtp+rtcp+uptl):

+asterisk 13993 asterisk   73w  IPv4          104515623      0t0       UDP *:11720
+asterisk 13993 asterisk   77w  IPv4          104515624      0t0       UDP *:11721
+asterisk 13993 asterisk   79u  IPv4          104515625      0t0       UDP *:26262

+asterisk 13993 asterisk  360u  IPv4          104534109      0t0       UDP *:16282
+asterisk 13993 asterisk  361u  IPv4          104534110      0t0       UDP *:16283
+asterisk 13993 asterisk  365u  IPv4          104534111      0t0       UDP *:22772

+asterisk 13993 asterisk  440u  IPv4          104531083      0t0       UDP *:17730
+asterisk 13993 asterisk  441r  IPv4          104531084      0t0       UDP *:17731
+asterisk 13993 asterisk  479r  IPv4          104531085      0t0       UDP *:23004

What's interesting is that they all got added between 11:00 and 12:00 on monday:

# zgrep -E '104515623|104534109|104531083' Mon-1102.log | wc -l
0
# zgrep -E '104515623|104534109|104531083' Mon-1202.log | wc -l
3

In this period, I have only a few warnings that look like they could be related:

# grep '2010-07-05 11:' /var/log/asterisk/messages.log.1 | grep -vE 'NOTICE|func_strings|Unable to create channel|"Forbidden"'
[2010-07-05 11:11:49] WARNING[7131] app_fax.c: Error transmitting fax. result=49: The call dropped prematurely.
[2010-07-05 11:11:49] WARNING[7131] app_fax.c: Transmission error
[2010-07-05 11:29:26] WARNING[8925] app_dial.c: Unable to write frame
[2010-07-05 11:43:23] WARNING[14009] chan_sip.c: Maximum retries exceeded on transmission 0c9f7e062a42db646883585a746c2577@x.y.192.80 for seqno 102 (Non-critical Request) -- See doc/sip-retransmit.txt.

Unfortunately is none of these messages a guarantee for leaks as these messages occur more often than there are leaks.



In the mean time, the refcount on (all of) the static sip peers has gone up by one, twice, but that was later that day (between 2 and 4 P.M.):

# diff -uw Mon-1402.log Mon-1502.log | grep refcount | head -n4
-refcount: 22
+refcount: 23
-refcount: 22
+refcount: 23
# diff -uw Mon-1502.log Mon-1602.log | grep refcount | head -n4
-refcount: 23
+refcount: 24
-refcount: 23
+refcount: 24


Possibly relevant message are:

# grep ' 1[45]:' /var/log/asterisk/messages.log.1 | grep -vE 'func_strings|Unable to create channel|"Forbidden"|no rule .t.|override the T38FaxMaxDatagram|Registration from|stale nonce'
[2010-07-05 14:04:31] WARNING[23837] app_dial.c: Unable to write frame
[2010-07-05 14:31:50] WARNING[26659] app_dial.c: Unable to write frame
[2010-07-05 14:48:50] WARNING[14009] chan_sip.c: Maximum retries exceeded on transmission 5df7e75360478df07db0bcb563a0c301@x.y.225.78 for seqno 102 (Non-critical Response) -- See doc/sip-retransmit.txt.
[2010-07-05 14:58:43] ERROR[14009] rtp.c: RTCP RR transmission error, rtcp halted: Operation not permitted
[2010-07-05 15:08:08] NOTICE[29896] rtp.c: Unknown RTP codec 126 received from '62.180.226.26'
[2010-07-05 15:17:13] NOTICE[14009] chan_sip.c: Unable to create/find SIP channel for this INVITE
[2010-07-05 15:17:33] WARNING[14009] chan_sip.c: Maximum retries exceeded on transmission 9a324ba4e7ca7d0d161ffc3de2e0a3bc@192.168.1.240 for seqno 1051752821 (Critical Response) -- See doc/sip-retransmit.txt.
[2010-07-05 15:21:54] WARNING[14009] chan_sip.c: Maximum retries exceeded on transmission 25d42727-3d187798@192.168.178.38 for seqno 102 (Critical Response) -- See doc/sip-retransmit.txt.
[2010-07-05 15:55:20] WARNING[2862] app_dial.c: Unable to write frame
[2010-07-05 15:59:11] NOTICE[14009] chan_sip.c: Call from 's' to extension 'LC-201' rejected because extension not found.
[2010-07-05 15:59:11] NOTICE[14009] chan_sip.c: Call from 's' to extension 'LC-' rejected because extension not found.
[2010-07-05 15:59:13] NOTICE[14009] chan_sip.c: Call from 's' to extension 'LC-201' rejected because extension not found.
[2010-07-05 15:59:13] NOTICE[14009] chan_sip.c: Call from 's' to extension 'LC-' rejected because extension not found.
[2010-07-05 15:59:15] NOTICE[14009] chan_sip.c: Call from 's' to extension 'RG-601-0031610797807#' rejected because extension not found.



B.T.W. I'm still running frawds patch. Which doesn't seem to have positive or negative effects.

By: frawd (frawd) 2010-07-06 07:56:35

wdoekes, I guess you should remove the patch, if it doesn't have positive effects it must have negative ones (maybe they will occur as time passes).

By: leearcher (leearcher) 2010-07-23 14:44:55

Will this be fixed with 1.4?  I have several servers doing the same thing.

By: Leif Madsen (lmadsen) 2010-07-26 13:57:10

This issue has not yet crept up the stack to being assigned to a developer. It will be assigned once time and resources allow. For now, anyone from the community is welcome to take over this issue.

By: Loic Didelot (voipgate) 2010-10-14 05:22:19

I notice the issues on asterisk 1.8 RC3. Its leaking RTP and UDPTL ports quite heavily.

By: Elazar Broad (ebroad) 2010-10-14 11:09:48

See https://issues.asterisk.org/view.php?id=18121 for the UDPTL issue.

By: Rob Gagnon (rgagnon) 2011-01-25 12:02:28.000-0600

As an FYI, we are experiencing this bug under 1.6.2.16.1, and actively looking for a solution as well.  We put about 15,000 calls/hour on each of our servers, so we burn through UDP fast under this version.  Running 1.4.26.3 does not exhibit the behavior.

By: Rob Gagnon (rgagnon) 2011-01-26 23:37:19.000-0600

More info for anyone that wants it.  I've found under 1.6.2.16.1 that it only APPEARS that the ports are leaking when session-timers are in use.  This actually looks more like a potential DoS problem for high-use servers where AGI is in play, and the a UAC sends a large value for "Session-Expires" (IE: 3600).

If you wait long enough, the session refresh timer will eventually go off and close the ports.  My thinking is there is nothing removing the session timer from the scheduler when AGI (or some other force) tells a dialog to end, thus the timer/scheduler is holding a reference to the dialog (and in-turn the UDP ports), so they don't close.

Imagine this scenario:
sip.conf: (defaults)
session-timers=accept
session-expires=1800
session-minse=90
session-refresher=uas

Server receives a call, with headers requesting a 3600 second session timer (both min-se and session-expires set to 3600)

RTP channels are bound (along with RTCP, UDPTL, and maybe a video and chat UDP port as well)--- Up to 7 UDP ports are bound to the dialog.

Under RFC4028, asterisk properly picks 3600 as the session interval.

chan_sip will do an ast_sched_add from start_session_timer() with a "when" of 900,000 ms (3600 * 1000 / 2)

Now, the call is denied...  Maybe your AGI decides it doesn't like the calling party, or there is congestion on all routes.

UAC is told to drop, UAC stops sending SIP traffic.

All those UDP ports that were bound to the dialog earlier are NOT free'd.  Only after the 900,000ms (15 minutes) goes by, are the ports free'd.

So.  They are not leaking, but they aren't able to be used.

In my case, all available UDP ports are exhausted really quick if all our clients start using session timers, and we need to deny access to some of them via AGI.... but then they come back with another call request for some other destination, and another 3 ports are tied up (one RTP, one RTCP, and one UDPTL for T38).

Repeat this loop 15,000 times in an hour, and it doesn't take long at all to lose all ability to bind a UDP port.

I'm still investigating this on my side for a solution, but I wanted to leave this analysis here for anyone else doing the same.



By: Rob Gagnon (rgagnon) 2011-02-07 00:47:22.000-0600

Over the past week and a half or so, I have a working patch for this and maybe other issues.

As the changes in chan_sip are kind of big, or spread out, I've uploaded a readme file along with the four patch files for this fix.

See the https://issues.asterisk.org/file_download.php?file_id=28518&type=bug file for a full explanation of all the changes made in the patch files.

The patches related to this fix are now:
branch-1.6.1-r306540-sip-dos-mem-leak-fix.diff
branch-1.6.2-r306540-sip-dos-mem-leak-fix.diff
branch-1.8-r306540-sip-dos-mem-leak-fix.diff
trunk-r306540-sip-dos-mem-leak-fix.diff

They apply to the repos as named obviously.  They fix reference leaks which lead to memory leaks, and denial of service.

I suggest you also apply the patch from ASTERISK-17359 if running 1.8 or newer as it fixes the ability to get proper output from the refcounter utility which I used excessively to find the bugs fixed in this patch.

This might allow closing of ASTERISK-17311, ASTERISK-16794, ASTERISK-15581, ASTERISK-16698, and ASTERISK-16710, and maybe others.

Prior this this patch, our production system running 1.6.2.16.1 would run out of UDP ports within 20 minutes of operation, causing a massive denial of service.  With this patch, we've been up and running for an unknown number of hours without leaking ports, or memory, and without a denial of service.



By: Walter Doekes (wdoekes) 2011-02-07 02:43:04.000-0600

Thank you for your continued work on this.

I must however report failure for the patch (on 1.6.2 at least). I've tried it on 1.6.2.16.1 and can get it to crash quite quickly.

All I do is call a number (which does progress, playtones, and hangup) a couple of times, and it crashes near astobj2 unref stuff.

See crash20110207.txt

By: Rob Gagnon (rgagnon) 2011-02-07 04:34:47.000-0600

Thanks for the feedback, wdoekes.

I just tried the same thing with clean 1.6.2.16.1, applied the branch 1.6.2 patch, and then executed a test using a test extension of 9999 like this:

exten => 9999,1,Answer()
exten => 9999,2,Playtones(congestion)
exten => 9999,3,Congestion(4)
exten => 9999,4,Hangup()

To be clear, I did exactly this:
wget http://downloads.asterisk.org/pub/telephony/asterisk/asterisk-1.6.2.16.1.tar.gz
tar -xvzf asterisk-1.6.2.16.1.tar.gz
cd asterisk-1.6.2.16.1
patch -p0 < /tmp/branch-1.6.2-r306540-sip-dos-mem-leak-fix.diff
./configure
make install

I did not get a crash. Odd.  I am running 64bit FC8, and whether I have the compiler flags for DONT_OPTIMIZE and DEBUG_THREADS enabled or disabled, I get the same result.  The only time I crashed was when I would try to run 1.8 on the machine when I forgot to move my g729 and g723 codecs out of the way first.



By: Walter Doekes (wdoekes) 2011-02-07 05:55:27.000-0600

I'm running 64bit lenny. I did what you did, except for:

./configure --prefix=/usr/local --sysconfdir=/etc --localstatedir=/var --with-hoard=/usr/local/lib

(but I did no menuconfig and therefore did not tick the USE_HOARD)


However, I am only using realtime (uncached) accounts.

The dialplan I hit, looks basically like this, but I doubt that the dialplan here is relevant:

exten => _i-!,n,Progress()
exten => _i-!,n,Wait(0.5)
exten => _i-!,n,PlayTones(info) ; 3 octaves from low to high and then a pause
exten => _i-!,n,Wait(3.9)
exten => _i-!,n,Hangup(UNALLOCATED) ; (SIP/404)

By: Rob Gagnon (rgagnon) 2011-02-07 14:02:13.000-0600

I think this is completely related to realtime (obviously)... I am working on setting up a realtime environment so I can replicate this, but in the meantime, I did find a line of code in sip_destroy_peer() that should probably be changed.  This function is called whenever peer's are destroyed (as in realtime operation), and it affects the reference counter on the peer's related call (which is where you are crashing).

If you want to test before I get it my tests up and running, here's what you can change (without me posting a patch that might not solve it completely):

After applying the 1.6.2 patch to your 1.6.2.16.1 tarball, edit chan_sip.c.

At line 4726 (inside sip_destroy_peer), you should see:
  /* Delete it, it needs to disappear */
  if (peer->call) {
     dialog_unlink_all(peer->call, TRUE, TRUE);
     peer->call = dialog_unref(peer->call, "peer->call is being unset");
  }

Change the order of the two lines in the if-statement to be:
  /* Delete it, it needs to disappear */
  if (peer->call) {
     dialog_unref(peer->call, "peer->call is being unset");
     peer->call = dialog_unlink_all(peer->call, TRUE, TRUE);
  }

I cannot say for certain this is 100% your issue, but I've had problems that caused core dumps in two other areas with this same pattern of code.

By: Walter Doekes (wdoekes) 2011-02-08 01:51:42.000-0600

Alas,

[2011-02-08 08:48:00] WARNING[24268]: chan_sip.c:22850 stop_session_timer: Null stimer in stop_session_timer - 66ad2dfd030a81c814ec99197e628099@127.0.1.1

boom.

(gdb) back
#0  0x00007f05b2d9ee2c in ?? () from /lib/libc.so.6
#1  0x00007f05b2da07e8 in calloc () from /lib/libc.so.6
#2  0x00000000004a3f13 in ast_log (level=<value optimized out>, file=0x537684 "astobj2.c", line=321, function=0x537861 "__ao2_ref", fmt=0x5376ff "refcount %d on object %p\n") at /usr/src/asterisk-1.6.2.16.1/include/asterisk/utils.h:462
#3  0x0000000000439bf8 in __ao2_ref (user_data=0x1e7f108, delta=<value optimized out>) at astobj2.c:321
#4  0x00007f05ad1251f4 in __sip_autodestruct (data=<value optimized out>) at chan_sip.c:1770
ASTERISK-1  0x00000000004f5875 in ast_sched_runq (con=0x1e3b910) at sched.c:621
...

By: Walter Doekes (wdoekes) 2011-02-08 03:18:01.000-0600

Further info: it is actually ~ 27 seconds after the start of the call that it crashes. Only a single call was needed.

I undid the changes you made in __sip_autodestruct, and then it stops crashing.

I do still get these messages every now and then though:

[2011-02-08 10:11:40] WARNING[11023]: chan_sip.c:22850 stop_session_timer: Null stimer in stop_session_timer - 5db0f35878a8f352226a15264ff8ee87@127.0.1.1
Really destroying SIP dialog '5db0f35878a8f352226a15264ff8ee87@127.0.1.1' Method: REGISTER
[2011-02-08 10:11:40] WARNING[11023]: chan_sip.c:22850 stop_session_timer: Null stimer in stop_session_timer - 1bcaa1774337eedc4dc7c70149810f71@127.0.1.1
Really destroying SIP dialog '1bcaa1774337eedc4dc7c70149810f71@127.0.1.1' Method: REGISTER

Running with or without your suggested change at /* Delete it, it needs to disappear */ does not have a noticable effect.


(edit: tilde-number refers to message #n from waaay back)



By: Rob Gagnon (rgagnon) 2011-02-08 03:23:36.000-0600

In my chan_sip.c with the 1.6.2 patch applied, that WARNING log message would be generated on line 22848, not 22850.  I'm curious how the patch would not have applied to your source the same as it does to mine.

The message is most likely being printed out because __sip_destroy() free'd the memory for the session timer, and so dialog_unlink_all() is just trying to ensure the stopped timer.  This can be modified to avoid such a message, but it would not cause the core dump.

I have configured my test system to use realtime SIP peers and users, with rtcachefriends set to "no".  I am not using realtime extensions.  Is this how yours is setup?

Unfortunately I still cannot replicate the crash you are having.  How many core files occur in your /tmp/ directory?

With the simple sample dialplan you gave above, I can dial "i2" (for example) to hit it, and i just get tones, and a hangup as expected without any core file.

By: Walter Doekes (wdoekes) 2011-02-08 03:51:12.000-0600

(1) The line difference is probably because I commented out the original at /* Delete it, it needs to disappear */ instead of removing it. Besides, I was testing on my modified build again. But the changes there are really trivial and are not affecting anything related to this.

(2) Yes, the message is not causing a core dump. Like I said. The message is still there, but the crashing stops after reverting __sip_autodestruct.

(3) Indeed, only realtime SIP (and a couple of static peers) and a static dialplan (extensions).

As far as core dumps, I have them land in /var/spool/asterisk. I get them for every crash when I start asterisk via init.d.

I'll try to make some time to create a really simple realtime setup to see if I can replicate the crash. I'm running a rather complex setup which I cannot reduce easily.

By: Rob Gagnon (rgagnon) 2011-02-08 03:57:53.000-0600

Thanks for the help with this!  I will look into __sip_autodestruct() in the morning (4am here for me right now).

It appears that MAYBE the dialog object is not supposed to be completely removed from memory in that method, but the change I made would ensure it is.

If you look throughout the code, anywhere you see a call to dialog_unlink_all(), it is followed by one more dialog_unref()... This is to account for the single reference applied to the object during sip_alloc().  I had added the extra unref there in order to account for this pattern everywhere else in the code.

I will have to wait till tomorrow to see if reverting the change in that function will still NOT hang up memory and UDP ports, or maybe it will need to be conditional in order to work for static file configs as well as realtime.

By: Walter Doekes (wdoekes) 2011-02-08 04:31:44.000-0600

Okay. I was looking in the wrong place. It turned out that it wasn't the realtime peers. My ~ 27 seconds were wrong too. It was ~ 32 seconds after startup.

This config reproduces it (I used a clean 1.6.2.16.1 with your patch):

# for x in /etc/asterisk/*.conf ; do echo $x: ; echo ; cat $x ; echo ; done
/etc/asterisk/asterisk.conf:

[directories]
astetcdir => /etc/asterisk
astmoddir => /usr/local/lib/asterisk/modules
astvarlibdir => /var/lib/asterisk
astdbdir => /var/lib/asterisk
astkeydir => /var/lib/asterisk
astdatadir => /var/lib/asterisk
astagidir => /var/lib/asterisk/agi-bin
astspooldir => /var/spool/asterisk
astrundir => /var/run/asterisk
astlogdir => /var/log/asterisk

/etc/asterisk/modules.conf:

[modules]
autoload=no
load => chan_sip.so

/etc/asterisk/sip.conf:

[general]
register => <account>:<password>@sip.provider.com/<account>



Taking out the register line => no crash. With register line => crash.

By: Stefan Schmidt (schmidts) 2011-02-08 05:02:07.000-0600

you can see the refcounter of sip dialogs by using the cli command sip show objects. if you see a very high ref value than something is messy.

maybe you should try to enable the refdebug to see the real ref actions on the sip dialogs.

32 seconds sound like the typical T1x64 timeout.

By: Rob Gagnon (rgagnon) 2011-02-08 11:14:46.000-0600

wdoekes:  That is awesome information... I think I know right where to look.  Can you just comment here whether or not the response from your SIP provider for that register statement is ever received, or does it get dropped for bad password, or cannot reach host?  I'd imagine it's under perfect conditions now, but it would be handy to know just in case.

schmidts:  Exactly what I was thinking.

I used the refcounter and #define REF_DEBUG feature extensively in order to get the code patched up.  What I believe is happening is that the newly corrected reference balancing code is causing a bug that no one saw before to show up in the area where that 32 second timer is set/caught.  Should be easy to find now.



By: Walter Doekes (wdoekes) 2011-02-08 11:44:16.000-0600

As far as I know, the registration(s) went fine. I did a sip set debug on often after startup and I did not see any register packets (so I assume it went fine before the debug was enabled). I can double-check in a while.

(edit: double-checked with ngrep: REG, 401, REG, 200. sip show registry also thinks it's registered)



By: Rob Gagnon (rgagnon) 2011-02-08 18:23:44.000-0600

I believe I found and fixed the issue in the 1.6.2.16.1 branch.  I was able to easily replicate your crash condition with a "register" in sip.conf without regard to any realtime settings.

I am currently testing the change (and it directly related to fixing the balancing of the reference counters on the dialog object).... the "register" code was not a place that appeared out of balance, until it was really looked at.

Currently, I am just verifying that the register does not keep attempting registration forever, and that it will give up properly after the max attempts are reached.  It however no longer crashes at 32 seconds.



By: Rob Gagnon (rgagnon) 2011-02-09 01:15:44.000-0600

Testing went really well!  No crashes during register calls now.  After running over an hour on v1.6.2.16.1 patched with the above r307181 file, there were no crashes, no reference leaks (meaning no memory leaks), as well as no UDP port leakage (meaning no denial of service)

README file ( https://issues.asterisk.org/file_download.php?file_id=28552&type=bug ) updated to reflect the new fixes added.

The files that are now considered most recent fixes for this issue are:
 sip-dos-mem-leak-fix-readme-r307181.txt
 branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
 branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
 tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
 branch-1.8-r307181-sip-dos-mem-leak-fix.diff
 trunk-r307181-sip-dos-mem-leak-fix.diff

Special thanks for wdoekes for testing, and nailing down the conditions of that one bug during "register"



By: Ronald Chan (loloski) 2011-02-09 02:45:19.000-0600

I will ask the management tonight for their approval if i can let this patch in and will report it back asap, I'm going to apply this to one of our heavily production server with 8,000+ calls per day doing outbound call only, hope this will not break anything. prior to this patch i reboot the server on a daily basis because of this port leaks.

*Edit just to say thank you!!!!!*



By: Ronald Chan (loloski) 2011-02-09 02:52:11.000-0600

To all people affected with this issue, could you please join in so that we can finally put this to end. the more tester the better. thanks

By: Rob Gagnon (rgagnon) 2011-02-09 03:01:21.000-0600

Woot, loloski.  We've got 14 servers in 4 locations around the world running about 15,000 calls an hour each.  I put this on two of them for an hour to see how it performed.

Not a single leak under 1.6.2.16.1.  Next week we are deploying 32 more servers in Europe, and this will be what we're starting them up on.

By: Ronald Chan (loloski) 2011-02-09 06:09:12.000-0600

Wow!, that's a lot of calls, i'm too excited to push this to upper management. I'm gonna use your words as my leverage and proof :D, Hope this will fly
we are going to test your patch on SVN 1.8 and 1.6 branch respectively.

Be back in the next 24 hours if this has been approved, hope they will.

Regards,

Ronald

By: Walter Doekes (wdoekes) 2011-02-09 08:12:58.000-0600

Good job Rob. No crashes yet. Let's see if I dare deploy it on production :)

By: Ronald Chan (loloski) 2011-02-09 19:25:08.000-0600

rgagnon,

As per the management they decided to simmer your patch using sipp first with satisfying results.

root@pbx:~# sipp -sn uac -d 20000 -s 9999 192.168.0.101 -l 100 -r 25
Resolving remote host '192.168.0.101'... Done.
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
 Call-rate(length)   Port   Total-time  Total-calls  Remote-host
25.0(20000 ms)/1.000s   5061   30356.76 s       104108  192.168.0.101:5060(UDP)

 0 new calls during 28.603 s period     1 ms scheduler resolution
 100 calls (limit 100)                  Peak was 100 calls, after 4 s
 0 Running, 0 Paused, 0 Woken up
 313 dead call msg (discarded)          98 out-of-call msg (discarded)      
 3 open sockets                        

                                Messages  Retrans   Timeout   Unexpected-Msg
     INVITE ---------->         104108    397575    79498              
        100 <----------         24427     0         0         178      
        180 <----------         24287     0         0         0        
        183 <----------         24306     0         0         30      
        200 <----------  E-RTD1 24400     0         0         0        
        ACK ---------->         24400     0                            
      Pause [    20.0s]         24400                         3        
        BYE ---------->         24299     1616      100                
        200 <----------         24199     0         0         0        

------------------------------ Test Terminated --------------------------------


----------------------------- Statistics Screen ------- [1-9]: Change Screen --
 Start Time             | 2011-02-09 23:01:22:871 1297263682.871822            
 Last Reset Time        | 2011-02-10 07:26:51:058 1297294011.058730            
 Current Time           | 2011-02-10 07:27:19:663 1297294039.663554            
-------------------------+---------------------------+--------------------------
 Counter Name           | Periodic value            | Cumulative value
-------------------------+---------------------------+--------------------------
 Elapsed Time           | 00:00:28:604              | 08:25:56:791            
 Call Rate              |    0.000 cps              |    3.429 cps            
-------------------------+---------------------------+--------------------------
 Incoming call created  |        0                  |        0                
 OutGoing call created  |        0                  |   104108                
 Total Call created     |                           |   104108                
 Current Call           |      100                  |                          
-------------------------+---------------------------+--------------------------
 Successful call        |        0                  |    24199                
 Failed call            |        0                  |    79809                
-------------------------+---------------------------+--------------------------

On a non patch Asterisk

root@pbx:/usr/src/asterisk# sipp -sn uac -d 20000 -s 9999 192.168.0.101 -l 100 -r 25
Resolving remote host '192.168.0.101'... Done.
------------------------------ Scenario Screen -------- [1-9]: Change Screen --
 Call-rate(length)   Port   Total-time  Total-calls  Remote-host
25.0(20000 ms)/1.000s   5061     216.34 s          700  192.168.0.101:5060(UDP)

 0 new calls during 44.480 s period     1 ms scheduler resolution
 100 calls (limit 100)                  Peak was 100 calls, after 4 s
 0 Running, 0 Paused, 0 Woken up
 0 dead call msg (discarded)            0 out-of-call msg (discarded)        
 3 open sockets                        

                                Messages  Retrans   Timeout   Unexpected-Msg
     INVITE ---------->         700       400       0                  
        100 <----------         600       0         0         0        
        180 <----------         600       0         0         0        
        183 <----------         600       0         0         0        
        200 <----------  E-RTD1 600       62        0         0        
        ACK ---------->         600       62                          
      Pause [    20.0s]         600                           0        
        BYE ---------->         600       900       100                
        200 <----------         500       0         0         0        

------------------------------ Test Terminated --------------------------------


----------------------------- Statistics Screen ------- [1-9]: Change Screen --
 Start Time             | 2011-02-10 08:32:00:107 1297297920.107316            
 Last Reset Time        | 2011-02-10 08:34:52:004 1297298092.004321            
 Current Time           | 2011-02-10 08:35:36:485 1297298136.485391            
-------------------------+---------------------------+--------------------------
 Counter Name           | Periodic value            | Cumulative value
-------------------------+---------------------------+--------------------------
 Elapsed Time           | 00:00:44:481              | 00:03:36:378            
 Call Rate              |    0.000 cps              |    3.235 cps            
-------------------------+---------------------------+--------------------------
 Incoming call created  |        0                  |        0                
 OutGoing call created  |        0                  |      700                
 Total Call created     |                           |      700                
 Current Call           |      100                  |                          
-------------------------+---------------------------+--------------------------
 Successful call        |        0                  |      500                
 Failed call            |        0                  |      100                
-------------------------+---------------------------+--------------------------
 Response Time 1        | 00:00:00:000              | 00:00:01:506            
 Call Length            | 00:00:09:904              | 00:00:24:357            
------------------------------ Test Terminated --------------------------------

2011-02-10 08:34:52:696 1297298092.696356: Aborted call with Call-ID '700-11506@127.0.1.1'.
sipp: There were more errors, enable -trace_err to log them.

Therefore, we decided to apply your patch on our 4 major boxes. will let you know how is it going. Thank you!!!!!!


Regards,

Ronald

By: Rob Gagnon (rgagnon) 2011-02-10 10:40:42.000-0600

Good to know, Ronald.  Thanks for the testing!



By: Ronald Chan (loloski) 2011-02-11 02:31:45.000-0600

Rob, patches has now in place in 4 boxes, while we are waiting for our linux engineers to report the stability of the patch, do you think it's too early to post your patch into reviewboard so that digium and other developers can qualify your patch :D.

Anyway thank you so much again for you and others who are involve in making this to happen.

Regards,

Ronald

By: Rob Gagnon (rgagnon) 2011-02-11 03:06:53.000-0600

Review board link:  https://reviewboard.asterisk.org/r/1101/

Note new patches uploaded only for trunk, and branch-1.8:
trunk-r307669-sip-dos-mem-leak-fix.diff
branch-1.8-r307669-sip-dos-mem-leak-fix.diff

These are changed because part of the code included in my previous patch was accepted into the repo earlier, and these patches duplicated the changes, so I had to re-create the patch to avoid a conflict on astobj2.c (see ASTERISK-17359)

If a moderator could please remove the following 3 files please:
trunk-r307181-sip-dos-mem-leak-fix.diff
branch-1.8-r307181-sip-dos-mem-leak-fix.diff
asterisk-1.6.2.16.1-sip-udp-dos-fix.diff



By: Stefan Schmidt (schmidts) 2011-02-11 03:15:14.000-0600

your welcome

By: Ronald Chan (loloski) 2011-02-11 21:43:50.000-0600

Rob, Success!!! one of our engineer just called me on the phone as of this
writing and attest that they no longer have to reboot one of our major server, it was still responsive and NO lingering udp port after our operation, though we haven't check for memory leaks we are truly happy that your patch works for us also.

Thanks!!!!

By: Walter Doekes (wdoekes) 2011-02-15 01:54:13.000-0600

Success here as well! Three days of running. No leaks after 75K calls. Normally I would get at least a few leaked sockets every day.

Thanks a lot :)

By: Vitaliy Vitaliy (vetal) 2011-02-15 02:18:27.000-0600

Hi, I have same bug.
I have installed Asterisk 1.6.2.16.1
I tried to make sources of it with
branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
and
tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff

but after this asterisk do not start.
What I do wrong?

By: Theo Belder (tbelder) 2011-02-15 02:23:43.000-0600

Are the memory leaks or UDP port leaks also occuring in Asterisk 1.4?
In other words, should the patch also be created for 1.4?

By: Walter Doekes (wdoekes) 2011-02-15 02:43:15.000-0600

As far as I'm aware, there is no such leak in 1.4. A different customer of mine is running a similar setup with 1.4 (without T38 though), and I haven't noticed any leaks there.

By: Walter Doekes (wdoekes) 2011-02-15 02:46:39.000-0600

vetal:

> What I do wrong?

(1) This is not the right place to ask how to apply a patch.
(2) You're not writing what you're doing, so it is impossible to help you.

Try #asterisk on IRC (freenode.net), someone there might be able to help you if you ask the right questions.

By: Vitaliy Vitaliy (vetal) 2011-02-15 03:38:36.000-0600

wdoekes:

I apply patch correctly and make proccess success.

I think that patch do not suitable for my asterisk version.
So this question is to this issue.

Thanx

By: Rob Gagnon (rgagnon) 2011-02-15 04:03:54.000-0600

This note is to confirm no bug like these under 1.4 as there was no ast_obj2.c which was setting up and destroying objects.

vetal: Although this is not the place

Login to your linux machine as root, and then perform the following:

 cd /usr/src
 wget http://downloads.asterisk.org/pub/telephony/asterisk/asterisk-1.6.2.16.1.tar.gz
 wget https://issues.asterisk.org/file_download.php?file_id=28555&type=bug
 tar -xvzf asterisk-1.6.2.16.1.tar.gz
 cd asterisk-1.6.2.16.1
 patch -p0 < ../tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
 ./configure
 make

Now, if all goes well, you can then run "make install" to replace your existing asterisk with a working one.  In the above, you will have to be sure not to include the "[^]" characters that this ticket automatically puts on hyperlinks.

Of course, this should all be done in a non-production environment first.



By: Ronald Chan (loloski) 2011-02-15 20:53:52.000-0600

Updates: All of our production server has now compiled with Rob patches with very good Results!!!, no memory leak has been observed since then, as always thanks!!!

By: Rob Gagnon (rgagnon) 2011-02-20 01:53:52.000-0600

Complete set of new patches uploaded.  No functional changes.  Only changes to keep the files here in line with changes on reviewboard.

Current files are now:
 trunk-r308370-sip-dos-mem-leak-fix.diff
 branch-1.8-r308370-sip-dos-mem-leak-fix.diff
 branch-1.6.2-r308370-sip-dos-mem-leak-fix.diff
 branch-1.6.1-r308370-sip-dos-mem-leak-fix.diff
 tag-1.6.2.16.2-r308525-sip-dos-mem-leak-fix.diff

The following are now obsolete:
 branch-1.8-r307669-sip-dos-mem-leak-fix.diff
 branch-1.8-r308202-sip-dos-mem-leak-fix.diff
 trunk-r307669-sip-dos-mem-leak-fix.diff
 trunk-r308202-sip-dos-mem-leak-fix.diff
 branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
 branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
 tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
 tag-1.6.2.16.1-r308370-sip-dos-mem-leak-fix.diff



By: Rob Gagnon (rgagnon) 2011-02-22 00:32:20.000-0600

Part of me wonders if the bug that this patch fixes should be considered as a security vulnerability as it is easy to exploit the denial of service by simply placing a very large number of seconds in the SIP message when requesting that sessions be enabled.

For anyone using 1.6.2.16.1, or 1.6.2.16.2, this patch is the correct one for those versions:
 tag-1.6.2.16.2-r308525-sip-dos-mem-leak-fix.diff

The file "tag-1.6.2.16.1-r308370-sip-dos-mem-leak-fix.diff" is incorrect now.  When reviewing it again, I found it has more files modified in it than there should be--Nothing that should break anything, but it is not as pure as I wanted it.  

"tag-1.6.2.16.2-r308370-sip-dos-mem-leak-fix.diff" should be removed from this ticket.



By: Ronald Chan (loloski) 2011-02-23 22:33:31.000-0600

Good day guys, how can we move this forward? thanks!!

By: Walter Doekes (wdoekes) 2011-02-25 04:09:14.000-0600

Right now, I'm experiencing a rather large memory leak in chan_sip.c.

UDP leaks are gone, but a memory leak exists.

I have not tested if this patch is to blame or if the leak was already there (or caused by one of my other patches although unlikely due to their trivial nature). I was used to restarting asterisk every day before this patch, so I had never looked out for memory leaks.

I shall have more information about this in 48 hours.

By: Walter Doekes (wdoekes) 2011-02-25 04:27:26.000-0600

P.S.1. Memory show allocations shows about half the memory that /proc/PID/status does: "749376389 bytes allocated (17520 in caches) in 3058903 allocations", while status reports a VmSize of about 1.5GB.
(This could be causes by allocation alignment perhaps? Perhaps something to do with the hoard allocator, which I'm using.)

P.S.2. If I uniq -c the memory show allocations, these four allocations stand out:
1731240 add_var at line 23940 of chan_sip.c
434741 __ao2_link at line   549 of astobj2.c
432828 build_peer at line 24106 of chan_sip.c
432828 build_peer at line 24109 of chan_sip.c
(same list, with byte count)
432810 115 bytes allocated in add_var at line 23940 of chan_sip.c
432810 86 bytes allocated in add_var at line 23940 of chan_sip.c
432810 91 bytes allocated in add_var at line 23940 of chan_sip.c
432810 96 bytes allocated in add_var at line 23940 of chan_sip.c
432828 520 bytes allocated in build_peer at line 24109 of chan_sip.c
432828 760 bytes allocated in build_peer at line 24106 of chan_sip.c
434741 24 bytes allocated in __ao2_link at line 549 of astobj2.c
(and I suppose this one is attached to one or more of the peers)
8738 24 bytes allocated in ast_append_ha at line 287 of acl.c

By: Rob Gagnon (rgagnon) 2011-02-25 10:35:03.000-0600

Definitely something to look into, but it could be likely a totally different bug if so.

In all of my current tests, when compiled with REF_DEBUG defined in chan_sip.c... If I shutdown asterisk with "core stop gracefully", the refcounter utlity will show no problems at all.  So anything that was ao2_alloc()'d is being properly removed from memory.

This patch only messed with anything ao2_alloc'd or ao2_ref'd, so other plain ast_malloc() calls were not modified.

Would you be able to test the same conditions, and get the same reports on a version of the code without the patch I've created so that any differences in the reports can be seen?

By: Walter Doekes (wdoekes) 2011-02-27 01:51:42.000-0600

Ok :)
The bad news: I have a memory leak.
The good news: without this patch, I also have a memory leak. (*)

As you other guys seem unaffected, I suggest continuing the checkin of this patch.

I'll be filing a different bug report for my leak when I've collected enough information.


(*) I undid this patch on one server and undid ASTERISK-17400 on another server. Both are still leaking. The other good news is that it's leaking so badly that it should be easy to find/reproduce ;)

By: Rob Gagnon (rgagnon) 2011-03-02 12:56:21.000-0600

With the release of asterisk 1.6.2.17, I have uploaded the matching patch to go with it:

  tag-1.6.2.17-r309252-sip-dos-mem-leak-fix.diff

By: Walter Doekes (wdoekes) 2011-03-07 01:55:38.000-0600

The other good news was good. The memory leak I was having was for uncached realtime peers only. ASTERISK-17510 holds the fix.

By: Rob Gagnon (rgagnon) 2011-03-07 15:20:52.000-0600

Cool.  I guess its time to get more developers to look at this so this almost 1-year old bug can be put to rest.

By: Tilghman Lesher (tilghman) 2011-03-08 02:23:56.000-0600

If you want this to go in sooner rather than later, I'd suggest that you remove a bit of the extra cruft that has crept in, stuff like changing out ao2_t_ref with dialog_unref.  If you can distill the changeset down to the minimum needed for 1.6.2 and 1.8, I can take the time to look at it and get it committed.

By: Rob Gagnon (rgagnon) 2011-03-08 03:05:30.000-0600

I really don't see any "cruft".  The changes from ao2_t_ref() to dialog_unref() is simply to make the code standard when dealing with dialog referencing.   I think it should all be one way, or the other instead of a mix of both.  The changes made got there mostly because they aided in getting the complete problem debugged and resolved.

The text on reviewboard I think detailed each block of the patch well.  Is there more that can be said?

By: Rob Gagnon (rgagnon) 2011-03-08 12:38:35.000-0600

Another thing to add... by using dialog_ref/unref, and unref/ref_peer(), etc... then the output from the refcounter utility is actually useful since it will properly report the line numbers of the reference additions and removals.

By calling ao2_t_ref(), you don't get a proper line number reference in order to find problems when the REF_DEBUG macro is set on.

By: Ronald Chan (loloski) 2011-03-22 09:14:15

Can this be now be reviewed? thorougly, it's really hard from every release to patch the code from time to time :)

By: Stefan Tichy (st) 2011-04-22 14:33:27

Maybe I made a mistake, but here still some UDP ports not freed.

- Asterisk 1.6.2.17.3

- https://issues.asterisk.org/file_download.php?file_id=28810&type=bug

- Ringing; Wait(1); Answer; Wait(1); Playback(beep); Wait(12); Hangup

- sipp -trace_err -d 10000 -t un -rtp_echo -sn uac -r 10 ipaddr

Testet with 1100 calls and have 52 open UDP (RTP Range) ports.

By: Rob Gagnon (rgagnon) 2011-04-25 08:22:20

Lmadsen:

This affects version 1.8 as well as the trunk.  I mention this because the target version of this ticket changed but I don't see 1.8 included

By: Leif Madsen (lmadsen) 2011-04-26 11:18:21

rgagnon: that's because mantis doesn't let me select multiple versions. You can always assume that if it is assigned to a lower branch version it will be merged up to newer branches where appropriate.

By: Terry Wilson (twilson) 2011-04-26 11:20:47

I have reviewed the patch on reviewboard and found a couple of additional ref leaks (which I posted a review for at http://reviewboard.asterisk.org/r/1198). Looks good to me, but I'm having one more developer take a look at it since there are a lot of changes.

By: Ronald Chan (loloski) 2011-05-02 00:35:34

Can this be now safely committed?, thanks

By: Rob Gagnon (rgagnon) 2011-05-09 13:09:58

loloski:  There is a good discussion going back and forth on the reviewboard under review numbers 1101, 1198, and 1207

In the meantime, I've updated the patches here for trunk and branch-1.8 due to conflicts created as asterisk code changes while we work on this.  (Patches created against revision 318218)

issue-17255-branch-1.8-r318281.diff is the lastest for branch 1.8
issue-17255-trunk-r318281.diff is the latest for trunk.

By: Terry Wilson (twilson) 2011-05-09 13:17:05

And if anyone is using 1.6.2, the patch from review 1207 would love your testing. :)

By: leearcher (leearcher) 2011-05-10 11:17:33

Is this going to make it into 1.6.2 as I've a ticket open for something else and I've been told bug fixes aren't going in to this version anymore?

By: Rob Gagnon (rgagnon) 2011-05-10 11:30:52

From my point-of-view, this is not a bug fix, but a security/denial-of-service fix

By: Terry Wilson (twilson) 2011-05-10 11:32:00

leearcher: yes, this one is scheduled for inclusion in the last 1.6.2 release. Issues that are going in should have a Target Version of "1.4/1.6.2 Issues (Final Release)"

By: sanms (sanms) 2011-05-11 04:26:57

I do long time load test for this patch on 1.6.2.10 with sipp(20calll/s,hold 10s). Yesterday my test started, today asterisk core dump, it runs about 20 hours.

This is the second core dump on my asterisk with this patch;First it core dump at hash table handle(i lost compile environment,so can not get stack information), so strange.

core stack is here:
http://pastebin.com/vfDMvyLV



By: Terry Wilson (twilson) 2011-05-11 10:12:47

sanms: Which patch are you trying? One attached to this issue or the one at https://reviewboard.asterisk.org/r/1207 ?

By: Digium Subversion (svnbot) 2011-05-11 12:15:42

Repository: asterisk
Revision: 318548

U   branches/1.6.2/channels/chan_sip.c

------------------------------------------------------------------------
r318548 | twilson | 2011-05-11 12:15:41 -0500 (Wed, 11 May 2011) | 19 lines

Clean up several chan_sip reference leaks

Several situations in the code could lead to peers or sip_pvt references
being leaked. This would cause RTP ports to never be destroyed (leading
to exhaustion of all available RTP ports) and memory leaks.

The original patch for this issue from rgagnon was the result of an
obscene amount of testing and hard work, for which I am very grateful. I
did some cleanup and added a few additional refcount fixes that I found.

(closes issue ASTERISK-16023)
Reported by: kvveltho
Patches:
     tag-1.6.2.17-r309252-sip-dos-mem-leak-fix.diff uploaded by rgagnon (license 1202)
Tested by: rgagnon, twilson, wdoekes, loloski

Review: https://reviewboard.asterisk.org/r/1101/
Review: https://reviewboard.asterisk.org/r/1207/

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=318548

By: Digium Subversion (svnbot) 2011-05-11 13:39:50

Repository: asterisk
Revision: 318549

_U  branches/1.8/
U   branches/1.8/channels/chan_sip.c

------------------------------------------------------------------------
r318549 | twilson | 2011-05-11 13:39:49 -0500 (Wed, 11 May 2011) | 27 lines

Merged revisions 318548 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.6.2

........
 r318548 | twilson | 2011-05-11 12:15:39 -0500 (Wed, 11 May 2011) | 19 lines
 
 Clean up several chan_sip reference leaks
 
 Several situations in the code could lead to peers or sip_pvt references
 being leaked. This would cause RTP ports to never be destroyed (leading
 to exhaustion of all available RTP ports) and memory leaks.
 
 The original patch for this issue from rgagnon was the result of an
 obscene amount of testing and hard work, for which I am very grateful. I
 did some cleanup and added a few additional refcount fixes that I found.
 
 (closes issue ASTERISK-16023)
 Reported by: kvveltho
 Patches:
       tag-1.6.2.17-r309252-sip-dos-mem-leak-fix.diff uploaded by rgagnon (license 1202)
 Tested by: rgagnon, twilson, wdoekes, loloski
 
 Review: https://reviewboard.asterisk.org/r/1101/
 Review: https://reviewboard.asterisk.org/r/1207/
 Review: https://reviewboard.asterisk.org/r/1210/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=318549

By: Digium Subversion (svnbot) 2011-05-11 13:50:53

Repository: asterisk
Revision: 318551

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r318551 | twilson | 2011-05-11 13:50:53 -0500 (Wed, 11 May 2011) | 34 lines

Merged revisions 318549 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.8

................
 r318549 | twilson | 2011-05-11 13:39:48 -0500 (Wed, 11 May 2011) | 27 lines
 
 Merged revisions 318548 via svnmerge from
 https://origsvn.digium.com/svn/asterisk/branches/1.6.2
 
 ........
   r318548 | twilson | 2011-05-11 12:15:39 -0500 (Wed, 11 May 2011) | 19 lines
   
   Clean up several chan_sip reference leaks
   
   Several situations in the code could lead to peers or sip_pvt references
   being leaked. This would cause RTP ports to never be destroyed (leading
   to exhaustion of all available RTP ports) and memory leaks.
   
   The original patch for this issue from rgagnon was the result of an
   obscene amount of testing and hard work, for which I am very grateful. I
   did some cleanup and added a few additional refcount fixes that I found.
   
   (closes issue ASTERISK-16023)
   Reported by: kvveltho
   Patches:
         tag-1.6.2.17-r309252-sip-dos-mem-leak-fix.diff uploaded by rgagnon (license 1202)
   Tested by: rgagnon, twilson, wdoekes, loloski
   
   Review: https://reviewboard.asterisk.org/r/1101/
   Review: https://reviewboard.asterisk.org/r/1207/
   Review: https://reviewboard.asterisk.org/r/1210/
 ........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=318551

By: Rob Gagnon (rgagnon) 2011-06-28 16:18:56.239-0500

The above appears to show the patch committed to branch 1.6.2, branch 1.8, and the trunk.  However, I do not see the changes in any current tarball download.

I ask because we build our installations of asterisk as RPMs, which requires code patching before compilation, and I have yet to need to remove my current patch during RPM building in order to get a working build.  If this patch were included in the tarball downloads, I could remove my patch from my spec file, (and others would have a chan_sip that doesn't leak memory, etc)

By: leearcher (leearcher) 2011-07-05 09:21:08.141-0500

Is anyone else running 1.6.2.19 yet?  I still have a lot of open UDP ports.

By: David Woolley (davidw) 2014-11-12 06:25:57.240-0600

The scope of this fix is rather larger than the description implies.  It also fixes a leak of internal timers in the session timer logic, which actually caused a functional failure of the session timers.  If Asterisk issued a non-session timers re-invite, whilst session timers were running, a new timer would be started.  In the case actually seen, this resulted in multiple refreshes being initiated in the same millisecond, which caused general confusion with some of the refreshes being treated as lost.