[Home]

Summary:ASTERISK-28161: Removal of Previous Patch Causes PJSIP Timer Issues
Reporter:Ross Beer (rossbeer)Labels:pjsip
Date Opened:2018-11-12 04:25:57.000-0600Date Closed:2020-06-15 17:13:01
Priority:MajorRegression?Yes
Status:Closed/CompleteComponents:Resources/res_pjsip
Versions:GIT Frequency of
Occurrence
Frequent
Related
Issues:
causesASTERISK-28490 Segfault in chan_pjsip in grp_lock_dec_ref
duplicatesASTERISK-28338 Asterisk crashes with ERROR *** /usr/sbin/asterisk': corrupted size vs. prev_size: 0x00007f77400cfcf0 *** when there are 48 SIP outbound calls.(almost 144 SIP Channels)
is related toASTERISK-28941 segfault in pjsip timer
Environment:CentOS 7.5Attachments:( 0) core-asterisk-114868-1543230597-thread1.txt
( 1) core-asterisk-145932-1554311130-thread1.txt
( 2) core-asterisk-174659-1559054674-thread1.txt
( 3) core-asterisk-175043-1553248984-thread1.txt
( 4) core-asterisk-180205-host-1542295191-thread1.txt
( 5) core-asterisk-197853-host-1542017635-thread1.txt
( 6) core-asterisk-3163-thread1.txt
( 7) core-asterisk-33635-1554479440-thread1.txt
( 8) core-asterisk-47221-1554826091-thread1.txt
( 9) core-asterisk-82410-1554735609-thread1.txt
(10) core-asterisk-9435-1554290719-thread1.txt
(11) core-asterisk-99802-1554218364-thread1.txt
Description:The removal of 'third-party/pjproject/patches/0010-timer-Clean-up-usage-of-timer-heap.patch' in commit https://gerrit.asterisk.org/#/c/asterisk/+/10487/ has started a segfault related to PJSIP Timers.

Please see the attached backtrace.
Comments:By: Asterisk Team (asteriskteam) 2018-11-12 04:26:00.022-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Kevin Harwell (kharwell) 2018-11-13 15:33:49.743-0600

[~rossbeer] Just curious, but have you added the patch back in and experienced no more segfaults? And/Or removed it again and they started up again?

By: Ross Beer (rossbeer) 2018-11-14 05:07:05.836-0600

I have been plagued with timer issues for quite some time, removing the patch has moved the issue to a different place. Previous segfaults before this change were observed here:

{noformat}
Thread 1 (Thread 0x7f9682b7d700 (LWP 150053)):
#0  0x00007f9997c2a85c in copy_node (ht=0x284f800, slot=5050, moved_node=0x7f96b55275b0) at ../src/pj/timer.c:137
#1  0x00007f9997c2ac95 in reheap_up (ht=0x284f800, moved_node=0x7f96b5d1e4e0, slot=5050, parent=2524) at ../src/pj/timer.c:208
#2  0x00007f9997c2aea3 in remove_node (ht=0x284f800, slot=5050) at ../src/pj/timer.c:254
       parent = 2524
       moved_node = 0x7f96b5d1e4e0
       removed_node = 0x7f96ee3d1b28
#3  0x00007f9997c2b18d in cancel (ht=0x284f800, entry=0x7f96ee3d1b28, flags=7) at ../src/pj/timer.c:353
       timer_node_slot = 5050
#4  0x00007f9997c2b6a2 in cancel_timer (ht=0x284f800, entry=0x7f96ee3d1b28, flags=6, id_val=0) at ../src/pj/timer.c:594
       count = 0
{noformat}

The reported issue for this relates to a SIP retransmission, probably related to a reliable connection such as TCP/TLS. Is this correct?

By: Kevin Harwell (kharwell) 2018-11-14 17:41:52.349-0600

I'm not entirely sure.

Do you have, or can you get, any more info that might shed some more light on this issue, or might help us to replicate the problem? For instance a full debug log with sip trace and/or a description of what's happening on the system at the time?

Also can you attach an entire backtrace? If lucky it might show another thread acting up.

Thanks!

By: Ross Beer (rossbeer) 2018-11-15 09:45:42.916-0600

I've emailed the full backtrace to you directly for privacy. However in another segfault, the backtrace looks the same as this issue but this time the CLI showed the following:

{noformat}
[2018-11-15 15:18:48] ERROR[26108]: res_pjsip_session.c:1881 ast_sip_session_defer_termination: FRACK!, Failed assertion !session->defer_terminate (0)
Got 31 backtrace records
#0: [0x607c9c] /usr/sbin/asterisk(__ast_assert_failed+0x84) [0x607c9c]
#1: [0x7fa2eef09dab] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0x4dab) [0x7fa2eef09dab]
#2: [0x7fa2eef0f0a2] /usr/lib64/asterisk/modules/res_pjsip_session.so(ast_sip_session_defer_termination+0x5c) [0x7fa2eef0f0a2]
#3: [0x7fa1b3f31aa6] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x4aa6) [0x7fa1b3f31aa6]
#4: [0x7fa1b3f33203] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x6203) [0x7fa1b3f33203]
#5: [0x7fa1b3f334a9] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x64a9) [0x7fa1b3f334a9]
#6: [0x7fa2eef10d1e] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xbd1e) [0x7fa2eef10d1e]
#7: [0x7fa2eef10f7f] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xbf7f) [0x7fa2eef10f7f]
#8: [0x7fa2eef119ff] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xc9ff) [0x7fa2eef119ff]
#9: [0x7fa4e25f090d] /usr/lib64/libasteriskpj.so.2(+0x6690d) [0x7fa4e25f090d]
#10: [0x7fa4e263bdf1] /usr/lib64/libasteriskpj.so.2(pjsip_dlg_on_tsx_state+0xac) [0x7fa4e263bdf1]
#11: [0x7fa4e263c65f] /usr/lib64/libasteriskpj.so.2(+0xb265f) [0x7fa4e263c65f]
#12: [0x7fa4e2634907] /usr/lib64/libasteriskpj.so.2(+0xaa907) [0x7fa4e2634907]
#13: [0x7fa4e26368c8] /usr/lib64/libasteriskpj.so.2(+0xac8c8) [0x7fa4e26368c8]
#14: [0x7fa4e2635860] /usr/lib64/libasteriskpj.so.2(pjsip_tsx_recv_msg+0xc1) [0x7fa4e2635860]
#15: [0x7fa4e263b544] /usr/lib64/libasteriskpj.so.2(pjsip_dlg_on_rx_request+0x3fd) [0x7fa4e263b544]
#16: [0x7fa4e263d145] /usr/lib64/libasteriskpj.so.2(+0xb3145) [0x7fa4e263d145]
#17: [0x7fa4e2619ea4] /usr/lib64/libasteriskpj.so.2(pjsip_endpt_process_rx_data+0x1ac) [0x7fa4e2619ea4]
#18: [0x7fa2ec044947] /usr/lib64/asterisk/modules/res_pjsip.so(+0x2d947) [0x7fa2ec044947]
#19: [0x5ef1d5] /usr/sbin/asterisk(ast_taskprocessor_execute+0x10d) [0x5ef1d5]
#20: [0x5f8c5a] /usr/sbin/asterisk() [0x5f8c5a]
#21: [0x5ef1d5] /usr/sbin/asterisk(ast_taskprocessor_execute+0x10d) [0x5ef1d5]
#22: [0x5f6ea9] /usr/sbin/asterisk() [0x5f6ea9]
#23: [0x5f8548] /usr/sbin/asterisk() [0x5f8548]
#24: [0x5f8301] /usr/sbin/asterisk() [0x5f8301]
#25: [0x604b40] /usr/sbin/asterisk() [0x604b40]
{noformat}

By: Jonathan Sullivan (flightnut5304) 2019-01-30 17:33:26.479-0600

I seem to be having the same issue here, can someone more experienced please confirm?
Asterisk 16.0.0 built by mockbuild @ jenkins7 on a x86_64 running Linux on 2018-10-10 17:15:57 UTC
Brief.txt - https://pastebin.freepbx.org/view/72b0d661
Full.txt - https://pastebin.freepbx.org/view/c131fc6a
Locks.txt - https://pastebin.freepbx.org/view/117ac566
Thread1.txt - https://pastebin.freepbx.org/view/10b0451d





By: Kevin Harwell (kharwell) 2019-01-31 09:55:15.560-0600

The backtrace appears to show a crash on the same call. I'd suspect it is a very similar, if not the same issue.

By: Jonathan Sullivan (flightnut5304) 2019-01-31 12:44:36.339-0600

Just upgraded to Asterisk 16.1.1 on FreePBX and the issues with the segfaults are still present,  as promised here are my updated backtraces from the two separate crashes I had this morning. Can somebody tell me if these are BOTH related to the same underlying issue?

Thread- https://pastebin.freepbx.org/view/002604d0
Brief- https://pastebin.freepbx.org/view/70937d82
Full- https://pastebin.freepbx.org/view/b543b6aa
Locks- https://pastebin.freepbx.org/view/54155437
________________________________________
Thread- https://pastebin.freepbx.org/view/bb93954e
Brief- https://pastebin.freepbx.org/view/1feac3cd
Full- https://pastebin.freepbx.org/view/96a32c4c
Locks- https://pastebin.freepbx.org/view/1323c484


By: Jonathan Sullivan (flightnut5304) 2019-01-31 13:53:49.977-0600

Updated traces above

By: Ross Beer (rossbeer) 2019-01-31 14:35:15.300-0600

There was recently a patch made for PJSIP and then also added to asterisk which may resolve this issue:

https://gerrit.asterisk.org/#/c/asterisk/+/10903/

I think this will be in the next release

By: Jonathan Sullivan (flightnut5304) 2019-01-31 14:39:35.160-0600

Any eta for the release?

By: Sean Bright (seanbright) 2019-03-25 15:04:01.416-0500

16.2.0 and 13.25.0 were released on 2019-02-15 - are you still having this problem with these new releases?

By: Ross Beer (rossbeer) 2019-03-26 04:33:46.985-0500

I am still getting crashes in the PJSIP timers, please see the latest backtrace.

By: Sean Bright (seanbright) 2019-03-26 13:02:21.993-0500

[~rossbeer], could you try manually applying this patch and see if it helps:

https://trac.pjsip.org/repos/changeset/5934

*Edit:* Actually, doing that may be a pain if you are using bundled pjproject. Let me whip up a quick gerrit review you can try:

13: https://gerrit.asterisk.org/c/asterisk/+/11182

If you need it for a different release version of Asterisk, let me know.

By: Ross Beer (rossbeer) 2019-03-26 14:24:05.906-0500

Great, thank you! I'll test and feedback if there are any issues

By: Sean Bright (seanbright) 2019-04-01 09:41:13.397-0500

[~rossbeer], have you had a chance to apply this patch and test to see if it resolves the crash you are experiencing?

By: Ross Beer (rossbeer) 2019-04-01 09:46:48.189-0500

I believe the patch will resolve the issue, there have been no further segfaults since applying the patch. However, the issue happens at random and therefore may happen at any time, from multiple times a day or after a few weeks.

By: Friendly Automation (friendly-automation) 2019-04-02 10:17:10.135-0500

Change 11190 merged by Friendly Automation:
pjproject: Add timer patch from pjproject r5934

[https://gerrit.asterisk.org/c/asterisk/+/11190|https://gerrit.asterisk.org/c/asterisk/+/11190]

By: Ross Beer (rossbeer) 2019-04-02 10:28:46.775-0500

I have just had a reoccurrence of this issue, please see the attached backtrace.

By: Asterisk Team (asteriskteam) 2019-04-02 10:28:47.401-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Friendly Automation (friendly-automation) 2019-04-02 10:30:02.582-0500

Change 11182 merged by George Joseph:
pjproject: Add timer patch from pjproject r5934

[https://gerrit.asterisk.org/c/asterisk/+/11182|https://gerrit.asterisk.org/c/asterisk/+/11182]

By: Friendly Automation (friendly-automation) 2019-04-02 10:30:31.377-0500

Change 11189 merged by George Joseph:
pjproject: Add timer patch from pjproject r5934

[https://gerrit.asterisk.org/c/asterisk/+/11189|https://gerrit.asterisk.org/c/asterisk/+/11189]

By: Sean Bright (seanbright) 2019-04-02 10:33:17.287-0500

[~rossbeer], what specific version of Asterisk 13 are you running?

By: Ross Beer (rossbeer) 2019-04-02 10:34:30.010-0500

The latest GIT showing: GIT-13-13.15.0-rc1-2191-g664f706M

Plus the patch from https://gerrit.asterisk.org/#/c/asterisk/+/11182/

By: Ross Beer (rossbeer) 2019-04-03 08:26:53.279-0500

This issue appears to be happening more since the latest patch has been applied

By: Ross Beer (rossbeer) 2019-04-03 08:42:11.590-0500

One thing that differs on our code base is that there is a patch that changes the following so that more timers can exist:

-typedef int pj_timer_id_t;
+typedef long int pj_timer_id_t;

The number shown in the backtrace appears to show this working as the ID is -140068368185622.




By: Sean Bright (seanbright) 2019-04-03 08:49:29.488-0500

[~rossbeer], we can't support code that you have modified.

By: Ross Beer (rossbeer) 2019-04-03 08:50:35.819-0500

One of the Digium devs gave me the patch, I will remove this and continue testing

By: Ross Beer (rossbeer) 2019-04-04 03:32:27.786-0500

A further segfault has occurred, please see lates backtrace.

By: Ross Beer (rossbeer) 2019-04-08 10:21:46.958-0500

Further timer crashes are happening within PJSIP

By: Ross Beer (rossbeer) 2019-04-23 04:20:27.573-0500

PJSIP patch https://trac.pjsip.org/repos/changeset/5971 deals with timer issues, could this help resolve this issue?

I note that removing the previous patch https://gerrit.asterisk.org/c/asterisk/+/11182 has lessened the frequency of the crashes.

By: Ross Beer (rossbeer) 2019-04-29 09:41:39.019-0500

I'm pretty sure this is related to the recent PJSIP patch above. If anyone could provide a patch to asterisk it would be appreciated.

I will happily test.

By: Friendly Automation (friendly-automation) 2019-05-06 05:45:28.616-0500

Change 11344 merged by Friendly Automation:
pjproject-bundled:  Add upstream timer fixes

[https://gerrit.asterisk.org/c/asterisk/+/11344|https://gerrit.asterisk.org/c/asterisk/+/11344]

By: Friendly Automation (friendly-automation) 2019-05-22 09:17:15.544-0500

Change 11396 merged by George Joseph:
pjproject-bundled:  Add upstream timer fixes

[https://gerrit.asterisk.org/c/asterisk/+/11396|https://gerrit.asterisk.org/c/asterisk/+/11396]

By: Friendly Automation (friendly-automation) 2019-05-22 11:39:52.614-0500

Change 11392 merged by Friendly Automation:
pjproject-bundled:  Add upstream timer fixes

[https://gerrit.asterisk.org/c/asterisk/+/11392|https://gerrit.asterisk.org/c/asterisk/+/11392]

By: Friendly Automation (friendly-automation) 2019-05-22 12:09:54.275-0500

Change 11393 merged by Friendly Automation:
pjproject-bundled:  Add upstream timer fixes

[https://gerrit.asterisk.org/c/asterisk/+/11393|https://gerrit.asterisk.org/c/asterisk/+/11393]

By: Friendly Automation (friendly-automation) 2019-05-22 12:10:45.873-0500

Change 11391 merged by Friendly Automation:
pjproject-bundled:  Add upstream timer fixes

[https://gerrit.asterisk.org/c/asterisk/+/11391|https://gerrit.asterisk.org/c/asterisk/+/11391]

By: Ross Beer (rossbeer) 2019-05-29 05:39:31.226-0500

The upstream patches do not resolve the issue.

By: Asterisk Team (asteriskteam) 2019-05-29 05:39:31.764-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.

By: Ross Beer (rossbeer) 2019-05-30 04:48:29.094-0500

Adding a further backtrace from another segfault

By: Kevin Harwell (kharwell) 2020-06-11 13:35:43.079-0500

[~rossbeer] Not sure if you are still having problems. If so you might want to try pjproject 2.10 if you have not already (I think they worked on some timer stuff). If you want a head start you can get our patch for bundled here:

https://gerrit.asterisk.org/c/asterisk/+/14413

Otherwise it'll go out in the next release in a few weeks.

By: Ross Beer (rossbeer) 2020-06-11 14:35:09.584-0500

I've been running with two pjproject timer patches for about a year which resolves the issue.

Thank you for including these in the code base, I will test these now.

By: Kevin Harwell (kharwell) 2020-06-15 17:12:50.544-0500

Ah okay from a previous comment I was under the impression you were still possibly having an issue. Glad to hear you are not.

Since that is the case I'm going to close this issue.