ASTERISK-15721: [patch] Qualify frequency has big pauses. Asterisk stops sending SIP OPTIONS to keep NAT alive

[Home]

Summary: ASTERISK-15721: [patch] Qualify frequency has big pauses. Asterisk stops sending SIP OPTIONS to keep NAT alive

Reporter: Jesper (ib2) Labels:

Date Opened: 2010-03-01 13:54:09.000-0600 Date Closed: 2010-05-06 09:04:38

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) asterisk-1.6.2.7-rc1-buildheap.patch
( 1) qualifybug.txt
( 2) qualifypause.log

Description: We have several SIP phone peers that that becomes UNREACHABLE since upgrading to Asterisk 1.6.2.x

[10:08:44] chan_sip.c: Peer '202_117' is now UNREACHABLE! Last qualify: 100
[10:11:25] chan_sip.c: Peer '202_117' is now Reachable. (86ms / 2000ms)
[11:59:03] chan_sip.c: Peer '202_117' is now UNREACHABLE! Last qualify: 91
[12:11:27] chan_sip.c: Peer '202_117' is now Reachable. (85ms / 2000ms)
[13:17:21] chan_sip.c: Peer '202_117' is now UNREACHABLE! Last qualify: 90
[13:41:27] chan_sip.c: Peer '202_117' is now Reachable. (92ms / 2000ms)

The phone is UNREACHABLE until it registers again. The phone does not know that it is UNREACHABLE.
Asterisk reports the phone as UNREACHABLE after a big pause in sending SIP OPTIONS to keep NAT alive. Therefore NAT table is lost and asterisk cannot receive SIP OK reply from the phone.

The typical interval between the occurrence is shown above

****** ADDITIONAL INFORMATION ******

Dump with tcpdump/wireshark shows the big pause:

Delta Time Source Destination Info
0.100713 87.54.x.x 217.118.x.x Request: REGISTER sip:217.118.x.x
0.061308 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.000033 217.118.x.x 87.54.x.x Status: 200 OK (1 bindings)
0.086161 87.54.x.x 217.118.x.x Status: 200 OK
60.000145 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.090604 87.54.x.x 217.118.x.x Status: 200 OK
59.999505 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.091204 87.54.x.x 217.118.x.x Status: 200 OK
918.732258 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.999051 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
1.000103 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
1.000107 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
1.001110 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.999086 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
1.000103 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
22.000342 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
0.999037 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
1.000071 217.118.x.x 87.54.x.x Request: OPTIONS sip:202@192.168.x.x:2048
After this, the peer is reported UNREACHABLE

Comments: By: Leif Madsen (lmadsen) 2010-03-03 09:43:22.000-0600

Could you please provide the following information?

* sip.conf configuration (relevant parts)
* SIP console trace
* SIP history (configured per sip.conf)

If you can show the time prior to and after the UNREACHABLE that would be ideal.
By: Jesper (ib2) 2010-03-03 14:22:11.000-0600

sip.conf:

[general]
language=aa
port=5060
bindaddr=0.0.0.0
srvlookup=no
defaultexpirey=120
disallow=all
allow=alaw
allow=ulaw
allow=g726
allow=ilbc
allow=gsm
callevents=yes
musicclass=default
relaxdtmf=yes
rtpiholdtimeout=300
rtptimeout=60
tos=184
useragent=PBXIreland
nat=yes
dtmfmode=auto
registerattempts=0
registertimeout=40
localnet=192.168.0.0/255.255.0.0
localnet=10.0.0.0/255.0.0.0
localnet=172.16.0.0/12
localnet=169.254.0.0/255.255.0.0
externip=217.118.x.x
limitonpeer = yes

[202]
type=friend
host=dynamic
canreinvite=no
qualify=2000
call-limit=4
context=phones
callerid=202 <202>
defaultuser=202
fromdomain=217.118.x.x
secret=verysecret
nat=yes
By: Jesper (ib2) 2010-03-03 15:06:26.000-0600

I don't know what info else to give you.
The CLI with verbose can be quiet for a half a minute both before and after the message Peer '202' is now UNREACHABLE
We have more than 50 peers on the server all behind NAT. Some fails.

SIP history ??? do you mean dumphistory ?
How do I enable this filtered on a single peer ?
Note: I have to wait for hours for this to occur
By: John Todd (jtodd) 2010-03-18 14:57:10

How many SIP devices do you have associated with this Asterisk system? Just two? Or many more? [update: whoops, I see you have 50. Sorry, didn't read closely enough]

By: Leif Madsen (lmadsen) 2010-03-18 14:58:18

We still need the SIP trace along with the SIP history (dumphistory=yes) from the Asterisk console. You can write this to a separate log file using the logger.conf file.

For a single peer, you can do something like, "sip set debug ip 12.0.3.2"
By: Jesper (ib2) 2010-03-22 15:21:30

qualifypause.log is debug,verbose where sip qualify pauses, and when it continues it cannot receive SIP OK reply from the phone and is reported UNREACHABLE:
[Mar 21 04:49:56] NOTICE[3775] chan_sip.c: Peer '221_114' is now UNREACHABLE! Last qualify: 103

This continues every hour...

If you need more data, I will need exact descriptions on how to dump it.
It is reproducible, but I need to dump megabytes of data while waiting for it to occur.
By: rsw686 (rsw686) 2010-03-22 15:28:58

I might have experienced this issue. I installed Asterisk 1.6.2.2 for a day and had complaints about transfers and phones (Snom) going directly to voicemail. I didn't have much time to look into it and sip show peers was still showing OK. I ended up rolling back to my previous 1.6.1 version. Now that everything is stable I am hesitant to change it as we make a few thousand calls a day.
By: Jesper (ib2) 2010-03-22 15:35:15

Thanks rsw686. That's how the user sees it. A phone that is reported UNREACHABLE will be busy or go directly to voicemail.
By: Leif Madsen (lmadsen) 2010-03-24 10:33:19

Can anyone reproduce this in their test environment? I find it odd that people passing a few thousand calls a day don't have test plans they follow prior to upgrading :)
By: Leif Madsen (lmadsen) 2010-03-24 10:33:42

(Although I understand it is nie impossible to test for everything!)
By: SplatNIX (uxbod) 2010-03-26 11:12:49

Seeing the same issue, with Snom 360s and M3s, on Asterisk 1.6.1.14 and 1.6.2.6.

What type of phone is the OP using ?

What is very odd though is I have not see this issue with 870's and they are attached to both versions of Asterisk as-well.

By: Jesper (ib2) 2010-04-01 08:55:51

I'm using different snom phones primarily snom320, but have also seen this issue with snom 360s and M3s. Trace and dumps are generated using the snom320/7.3.30
By: xiaoyem (xiaoyem) 2010-04-19 22:59:19

I experienced this problem when I using 1.6.2.7-rc1 under Xen. I enabled the devel mode (--enable-dev-mode), selected the DEBUG_SCHEDULER and DUMP_SCHEDULER flags. It turns out that the time in Xen is a little faster, in my case, about 100 secs per hour. I have a crontab script that synchronizes time every hour. After synchronization, the time in Xen is turned back 100 secs, thus long pause occurs.
By: laszlovl (lvl) 2010-04-20 05:31:19

We are experiencing exactly the same situation on an 1.6.2.5 machine. At some point, Asterisk will just (randomly) stop sending its minutly OPTIONS packet to the phone, and by the time it resumes it, the NAT mapping has disappeared and the phone will be UNREACHABLE until the phone itself realises it has disconnected.

The setup on this machine is nothing out of the ordinary, except for that we make major use of notifications/hints with the Custom devstates

---- SIP.conf:

[general]
limitonpeers=yes

[phone](!)
type=friend
host=dynamic
nat=yes
qualify=yes
call-limit=100

----

This is a prime example of the problem. Asterisk sends an OPTION packet to this phone every :30, until 11:06:30. After that, it suddenly halts, just to resume at 11:13:40 as if nothing has ever happened; but by that time the NAT mapping has disappeared, and after 4 retransmits, asterisk considers the phone unreachable.

[Apr 20 11:01:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:02:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:03:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:04:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:05:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:06:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:13:40] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
[Apr 20 11:13:41] VERBOSE[1163] chan_sip.c: Retransmitting #1 (NAT) to a.a.a.a:2128:

Inbetween 11:06:30 and 11:13:40 there is no attempted communication with this host at all; neither is there any error message in the dialog at 11:06:30.

[Apr 20 11:06:30] VERBOSE[1163] chan_sip.c: Reliably Transmitting (NAT) to a.a.a.a:2128:
OPTIONS sip:phone3@o.o.o.o:5060 SIP/2.0
Via: SIP/2.0/UDP a.a.a.a:5060;branch=z9hG4bK2514da63;rport
Max-Forwards: 70
From: "asterisk" <sip:asterisk@a.a.a.a>;tag=as7cb0d683
To: <sip:phone3@o.o.o.o:5060>
Contact: <sip:asterisk@a.a.a.a>
Call-ID: 462573744d998b5b6ed97e272e2a7154@a.a.a.a
CSeq: 102 OPTIONS
User-Agent: Asterisk PBX 1.6.2.5-0ubuntu1
Date: Tue, 20 Apr 2010 09:06:30 GMT
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO
Supported: replaces, timer
Content-Length: 0

[Apr 20 11:06:30] VERBOSE[1163] chan_sip.c:
<--- SIP read from UDP:a.a.a.a:2128 --->
SIP/2.0 200 OK
To: <sip:phone3@o.o.o.o:5060>;tag=c9760b6edbc9f2b0i0
From: "asterisk" <sip:asterisk@a.a.a.a>;tag=as7cb0d683
Call-ID: 462573744d998b5b6ed97e272e2a7154@a.a.a.a
CSeq: 102 OPTIONS
Via: SIP/2.0/UDP a.a.a.a:5060;branch=z9hG4bK2514da63
Server: Linksys/SPA942-6.1.5(a)
Content-Length: 0
Allow: ACK, BYE, CANCEL, INFO, INVITE, NOTIFY, OPTIONS, REFER
Supported: replaces

The timing is never exactly the same; sometimes it only takes about 2 minutes for asterisk to resume its qualify messages, other times it will take up to 15. I've checked all logfiles and debug messages, but couldn't find any explanation for the problem.
By: Leif Madsen (lmadsen) 2010-04-20 08:03:57

Huh, I've seen something similar to this, and it seemed like Asterisk was sending the OPTIONS packets and the remote phone wouldn't respond.

I'm not sure what else to report here or to ask for. Perhaps when this is happening, if someone can gather a 'core show locks' and a backtrace from the running Asterisk process (see http://svn.asterisk.org/svn/asterisk/trunk/doc/backtrace.txt) using 'thread apply all bt' we can see if something is happening in terms of a deadlock.

That is just a guess as at this point I'm not entirely sure what to ask for.
By: cappucinoking (cappucinoking) 2010-04-30 11:04:40

I'm seeing the exact same behaviour, but with two asterisk machines talking to each other using SIP. Both are non-NAT however.

Only seems to happen when other calls are going through the system, rarely happens over night.

Can't see anything in core show locks around the times when this is happening.
But since it is pretty random when it is happening, it is very difficult to catch at the right time.

From packet trace, I can't see anything that appears out of the ordinary.

Seems to not be much memory free on the machine though, going to upgrade to see if it helps.

Is there some way to see the schedule/time of when the asterisk box is going to next schedule an options request? i.e. when this is happening, is there a timer that I can look at, that might indicate when asterisk next thinks it should be sending the options request?
By: cappucinoking (cappucinoking) 2010-04-30 11:16:32

p.s. this is on a 1.6.2.7-rc2 talking to 1.6.2.7rc-1
I've tried versions before these and I'm having the same problems.
Does anyone know for definate a version this doesn't happen on?

rsw686 - you mention not having problem on your previous 1.6.1 release - what release number are you using?

This is happening approx 3 - 8 times per day.
But not over weekends, when traffic is fairly minimal anyway.

I can see in the logs

[Apr 30 10:17:12] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (1ms / 4000ms)
[Apr 30 11:57:33] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (2ms / 4000ms)
[Apr 30 12:20:09] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (1ms / 4000ms)
[Apr 30 13:14:45] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (1ms / 4000ms)
[Apr 30 14:24:25] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (1ms / 4000ms)
[Apr 30 15:10:12] NOTICE[18576] chan_sip.c: Peer '0301000' is now Reachable. (1ms / 4000ms)

However, there is no corresponding UNREACHABLE message.
Doing sip show peers it sees it is offline - but why isn't this getting logged as an unreachable?
By: Leif Madsen (lmadsen) 2010-04-30 14:05:27

xiaoyem: please note that your license was rejected.
By: rsw686 (rsw686) 2010-05-01 15:25:06

I am using 1.6.1.18 which doesn't have the issue.
By: cappucinoking (cappucinoking) 2010-05-03 17:39:32

Apparently the 1.6.2 branch uses some different scheduling code. It is this scheduler that determines when to send out the qualify packets (peer_poke).
It now uses a maxheap structure, which turned out to have a problem when removing and re-adding items (something which the peer_poke does a lot)

See related issue id:17277

The heap-fix.diff patch appears to have fixed the scheduling problem for me, but I'm waiting for some traffic tomorrow, before I can see whether or not it fixes the qualify pauses.
By: cappucinoking (cappucinoking) 2010-05-04 06:29:29

The fix for issue 17277 appears to have resolved the issue for us.
Can anyone else confim?
By: Chris Walton (crjw) 2010-05-04 17:53:20

I had a similar problem:
Asterisk is supposed to re-register to my SIP provider every 75 seconds.
Logging always shows:
"Expiry for xxx.xxx.xxx is 90 sec (Scheduling reregistration in 75 s)".

With the latest trunk, most re-registrations took place after 75s, but once every hour or two, there was an unexplained delays of up to 15 minutes.
It appeared the the delays were often preceded by several of my Polycom phones updating their subscription information for "hints".

I installed the heap-fix.diff patch from issue ASTERISK-16044 about 12 hours ago.
I have not seen any of the delays since installing the patch. Every thing looks perfect now!
By: Digium Subversion (svnbot) 2010-05-06 08:58:08

Repository: asterisk
Revision: 261496

U trunk/main/heap.c

------------------------------------------------------------------------
r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines

Fix handling of removing nodes from the middle of a heap.

This bug surfaced in 1.6.2 and does not affect code in any other released
version of Asterisk. It manifested itself as SIP qualify not happening when
it should, causing peers to go unreachable. This was debugged down to scheduler
entries sometimes not getting executed when they were supposed to, which was in
turn caused by an error in the heap code.

The problem only sometimes occurs, and it is due to the logic for removing an entry
in the heap from an arbitrary location (not just popping off the top). The scheduler
performs this operation frequently when entries are removed before they run (when
ast_sched_del() is used).

In a normal pop off of the top of the heap, a node is taken off the bottom,
placed at the top, and then bubbled down until the max heap property is restored
(see max_heapify()). This same logic was used for removing an arbitrary node
from the middle of the heap. Unfortunately, that logic is full of fail. This
patch fixes that by fully restoring the max heap property when a node is thrown
into the middle of the heap. Instead of just pushing it down as appropriate, it
first pushes it up as high as it will go, and _then_ pushes it down.

Lastly, fix a minor problem in ast_heap_verify(), which is only used for
debugging. If a parent and child node have the same value, that is not an
error. The only error is if a parent's value is less than its children.

A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
and then producing an ast_heap test case that demonstrated the breakage. That
made it very easy for me to focus on the heap logic and produce a fix. Open source
projects are awesome.

(closes issue ASTERISK-15721)
Reported by: ib2
Tested by: cappucinoking, crjw

(closes issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
heap-fix.rev2.diff uploaded by russell (license 2)
Tested by: cappucinoking, russell

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261496
By: Digium Subversion (svnbot) 2010-05-06 09:02:22

Repository: asterisk
Revision: 261497

_U branches/1.6.1/
U branches/1.6.1/main/heap.c

------------------------------------------------------------------------
r261497 | russell | 2010-05-06 09:02:22 -0500 (Thu, 06 May 2010) | 47 lines

Merged revisions 261496 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines

Fix handling of removing nodes from the middle of a heap.

This bug surfaced in 1.6.2 and does not affect code in any other released
version of Asterisk. It manifested itself as SIP qualify not happening when
it should, causing peers to go unreachable. This was debugged down to scheduler
entries sometimes not getting executed when they were supposed to, which was in
turn caused by an error in the heap code.

The problem only sometimes occurs, and it is due to the logic for removing an entry
in the heap from an arbitrary location (not just popping off the top). The scheduler
performs this operation frequently when entries are removed before they run (when
ast_sched_del() is used).

In a normal pop off of the top of the heap, a node is taken off the bottom,
placed at the top, and then bubbled down until the max heap property is restored
(see max_heapify()). This same logic was used for removing an arbitrary node
from the middle of the heap. Unfortunately, that logic is full of fail. This
patch fixes that by fully restoring the max heap property when a node is thrown
into the middle of the heap. Instead of just pushing it down as appropriate, it
first pushes it up as high as it will go, and _then_ pushes it down.

Lastly, fix a minor problem in ast_heap_verify(), which is only used for
debugging. If a parent and child node have the same value, that is not an
error. The only error is if a parent's value is less than its children.

A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
and then producing an ast_heap test case that demonstrated the breakage. That
made it very easy for me to focus on the heap logic and produce a fix. Open source
projects are awesome.

(closes issue ASTERISK-15721)
Reported by: ib2
Tested by: cappucinoking, crjw

(closes issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
heap-fix.rev2.diff uploaded by russell (license 2)
Tested by: cappucinoking, russell
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261497
By: Digium Subversion (svnbot) 2010-05-06 09:04:36

Repository: asterisk
Revision: 261498

_U branches/1.6.2/
U branches/1.6.2/main/heap.c

------------------------------------------------------------------------
r261498 | russell | 2010-05-06 09:04:36 -0500 (Thu, 06 May 2010) | 47 lines

Merged revisions 261496 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines

Fix handling of removing nodes from the middle of a heap.

This bug surfaced in 1.6.2 and does not affect code in any other released
version of Asterisk. It manifested itself as SIP qualify not happening when
it should, causing peers to go unreachable. This was debugged down to scheduler
entries sometimes not getting executed when they were supposed to, which was in
turn caused by an error in the heap code.

The problem only sometimes occurs, and it is due to the logic for removing an entry
in the heap from an arbitrary location (not just popping off the top). The scheduler
performs this operation frequently when entries are removed before they run (when
ast_sched_del() is used).

In a normal pop off of the top of the heap, a node is taken off the bottom,
placed at the top, and then bubbled down until the max heap property is restored
(see max_heapify()). This same logic was used for removing an arbitrary node
from the middle of the heap. Unfortunately, that logic is full of fail. This
patch fixes that by fully restoring the max heap property when a node is thrown
into the middle of the heap. Instead of just pushing it down as appropriate, it
first pushes it up as high as it will go, and _then_ pushes it down.

Lastly, fix a minor problem in ast_heap_verify(), which is only used for
debugging. If a parent and child node have the same value, that is not an
error. The only error is if a parent's value is less than its children.

A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
and then producing an ast_heap test case that demonstrated the breakage. That
made it very easy for me to focus on the heap logic and produce a fix. Open source
projects are awesome.

(closes issue ASTERISK-15721)
Reported by: ib2
Tested by: cappucinoking, crjw

(closes issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
heap-fix.rev2.diff uploaded by russell (license 2)
Tested by: cappucinoking, russell
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261498