[Home]

Summary:ASTERISK-26307: res_pjsip_caller_id: Crash on outgoing change
Reporter:Bill Brigden (billbrigden)Labels:
Date Opened:2016-08-19 06:22:52Date Closed:2016-10-27 15:15:58
Priority:MajorRegression?
Status:Closed/CompleteComponents:Resources/res_pjsip_caller_id
Versions:13.10.0 Frequency of
Occurrence
Occasional
Related
Issues:
Environment:Linux xx 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux Attachments:( 0) bt-220816.txt
( 1) bt-full.txt
Description:I've had a couple of random crashes, shown in syslog as:

{quote}
Aug 19 11:50:04 pbx-02 kernel: [22213515.179507] asterisk[15754]: segfault at 40 ip 00007ff54b0b367a sp 00007ff55bffe820 error 4 in res_pjsip_caller_id.so[7ff54b0b2000+3000]
{quote}

Will attach BT (both normal and full). Oddly, I have compiled with: DONT_OPTIMIZE, DEBUG_THREADS, BETTER_BACKTRACES however it shows that some values are optimized out.
Comments:By: Asterisk Team (asteriskteam) 2016-08-19 06:22:52.425-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Bill Brigden (billbrigden) 2016-08-19 06:23:15.375-0500

BT (normal and full)

By: Rusty Newton (rnewton) 2016-08-19 12:32:23.647-0500

Bill, you said these were random so I suppose you can't reproduce them.

Did you happen to have any logging running when the crashes happened? Sharing those logs or at least the last few hundred or thousand lines before the crash could be helpful.

By: Bill Brigden (billbrigden) 2016-08-19 12:43:43.291-0500

Sadly not - I've now enabled debug logging so hopefully something picks up the issue. There was nothing in the normal log as well prior to the event either.

Looking at the source for the file mentioned (res_pjsip_caller_id.c:418) - it suggests that there was a malformed spam-ish message coming which couldnt something couldnt be copied out of for the response however there were no call initiations logged.

If needed, I can put some full SIP packet tracing in however hopefully its not needed

By: Rusty Newton (rnewton) 2016-08-19 12:49:03.766-0500

You might also see if you are running an older version of GDB and upgrade if possible. One developer that briefly looked at the issue pointed out that "Unhandled dwarf expression" may indicate that GDB is older and not understanding some of the data GCC provides.

By: Bill Brigden (billbrigden) 2016-08-19 12:54:27.930-0500

Thanks for the pointer - I'm upto date with packages though it is running Deb 7 - will look at taking it to Deb 8.

Is it likely the issue can be debugged without further info or is a re-production of the fault the only way?

Cheers

By: Bill Brigden (billbrigden) 2016-08-20 09:40:22.808-0500

FYI - for anyone that comes across this - Debian 7 has a broken gdb ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=701160 ).

For this bug, I've upgraded to Deb 8 so if it happens again, I can get a full backtrace.

By: Bill Brigden (billbrigden) 2016-08-22 09:23:42.210-0500

Crash occured again. This is now on Deb 8 however even with dont_optimize set, it's not showing the values near the crash.

Any ideas?

By: Bill Brigden (billbrigden) 2016-08-24 05:32:21.282-0500

Had another crash.. I've had full SIP packet logging to try and work out whats going on. I can see now that a call established exactly 30 mins prior to the crash and the last packet exchange between the PBX and the handset was a SIP UPDATE initiated from the server (which appears to be constructed correctly), however the stacktrace from the seg fault references the adding of the P-Asserted-Identity header which isn't present in the SIP UPDATE packet. I also note that the default timers_sess_expires value is being used of 1800 seconds. Coincidence?

This is the SIP Update incase it is of assistance..

{quote}

U 2016/08/24 10:30:38.873610 PBX-IP:5060 -> PHONE-IP:32376

UPDATE sip:7e256804-1847-4fdf-b307-cbb4cf6d64fe@PHONE-IP:32376;line=yhi3h002 SIP/2.0.
Via: SIP/2.0/UDP PBX-IP:5060;rport;branch=z9hG4bKPj059d99a1-79f2-4e10-b0be-b363f58618b6.
From: <sip:08002790000@voip-02.provider.net;user=phone>;tag=16e3b601-8f02-48b4-8ead-eaff9f9aa692.
To: "Handset" <sip:7e256804-1847-4fdf-b307-cbb4cf6d64fe@voip-02.provider.net>;tag=0oqjhb5tzz.
Contact: <sip:PBX-IP:5060>.
Call-ID: 57bd622b95cb-rbrxr0frj1ct.
CSeq: 32237 UPDATE.
Supported: timer, replaces, norefersub.
Session-Expires: 3600;refresher=uac.
Min-SE: 90.
Max-Forwards: 70.
User-Agent: VoIP.
Content-Length:  0.
.

#
U 2016/08/24 10:30:38.895381 PHONE-IP:32376 -> PBX-IP:5060

SIP/2.0 200 Ok.
Via: SIP/2.0/UDP PBX-IP;rport=5060;branch=z9hG4bKPj059d99a1-79f2-4e10-b0be-b363f58618b6.
From: <sip:08002790000@voip-02.provider.net;user=phone>;tag=16e3b601-8f02-48b4-8ead-eaff9f9aa692.
To: "Handset" <sip:7e256804-1847-4fdf-b307-cbb4cf6d64fe@voip-02.provider.net>;tag=0oqjhb5tzz.
Call-ID: 57bd622b95cb-rbrxr0frj1ct.
CSeq: 32237 UPDATE.
Contact: <sip:7e256804-1847-4fdf-b307-cbb4cf6d64fe@PHONE-IP:32376;line=yhi3h002>;reg-id=1.
User-Agent: snom710/8.7.3.25.9.
Supported: timer, 100rel, replaces, from-change.
Require: timer.
Session-Expires: 3600;refresher=uac.
Content-Length: 0.
.
{quote}
Happy to debug further given requests!

By: Rusty Newton (rnewton) 2016-08-24 20:12:05.407-0500

Hmm, can you also provide us a full log that shows VERBOSE and DEBUG both at level 5 or above up to the crash?  Even if it doesn't seem related we still need to look at it. It may be more helpful than you imagine.

If you can get the SIP trace integrated with that (the logger should do it as long as you have it running). That would be awesome.

By: Bill Brigden (billbrigden) 2016-08-25 05:54:37.095-0500

I'll sort that - could I perhaps share the log in a non-public way since it will include call logs and I'd rather not have to redact lots of info!

By: Rusty Newton (rnewton) 2016-08-25 09:40:20.694-0500

Sure. I'd still try to redact anything critical, but you could send it to asteriskteam at digium.com which goes to the core team at Digium.

By: Bill Brigden (billbrigden) 2016-08-30 07:44:56.784-0500

I've sent you logs from the most recently crash.

I also saw a report here - ASTERISK-25942 - which references a similar (but non crashing scenario) where a SIP update  / re-invite needs to determine values to set? Not sure if related..

By: Rusty Newton (rnewton) 2016-08-31 17:41:15.217-0500

Thanks for the logs Bill.

[~kharwell] took a look at the issue with me and he is curious to get:

* the pjsip configuration for the endpoint/s involved in the call flow where the crashes occur.
* In addition the dialplan involved would be very helpful.

The issue may be occurring when the URI is cloned, but we are not sure what happens to get it into a funky state. We'll likely have to duplicate it, or get more data to find out. Understanding more about the call flow and how the endpoints are configured may help out.

If you can build a test system where you have the same call-flow and session expiry and it reproduces the problem.. that would really help. Likely the real trigger is more complicated, but you never know.

As it is right now, with the current data, we aren't sure that any further investigation will lead to root of the issue.

By: Bill Brigden (billbrigden) 2016-09-01 06:56:21.684-0500

Hi Rusty / Kevin,

I've sent to your email (asteriskteam) an output from the realtime db (endpoints + aors).. including the rough dialplan that is being hit for this call.

I'll see about setting up a test system to duplicate however could you let me know if:

- I can disable SIP UPDATES's mid call?
- Which timer is causing the SIP UPDATE to trigger (ie. to decrease it so make it easier to reproduce)?
- Any other pointers to duplicate easily?

Cheers,

By: Kevin Harwell (kharwell) 2016-09-01 10:59:33.430-0500

[~billbrigden] Thanks for the information.

{quote}
   I can disable SIP UPDATES's mid call?
{quote}
Someone else can correct me if I am wrong, but I don't think that is possible to do mid call.
{quote}
   Which timer is causing the SIP UPDATE to trigger (ie. to decrease it so make it easier to reproduce)?
{quote}
I think your session timers guess is a good one. Try lowering the _timers_sess_expires_ value and see if it still crashes or crashes earlier.
{quote}
   Any other pointers to duplicate easily?
{quote}
Not really sure about other things to try to make it crash, but you could try a couple of things to see if it doesn't crash. For instance, try setting _send_pai_ and/or _send_rpid_ equal to no for the endpoint in question.

You said it was random, but do you know if it happens on the same endpoint each time of different ones? If the same it'd be worth comparing the differences between endpoints. Also looking at the debug log it appears it is missing some expected debug. If the debug level was previously set below 5 would it be possible to try to capture it again with debug set higher?

By: Bill Brigden (billbrigden) 2016-09-06 08:44:46.941-0500

Hi Kevin / Rusty,

I've just emailed you another crash log (with stack + logs). If you are able to take a look today that'd be great since I now have a test system ready for this.

Cheers,

By: Bill Brigden (billbrigden) 2016-09-06 10:51:56.127-0500

Also, by setting both send_rpid and send_pai to "no" - this is stopped the crashing so that works as a temporary fix, though this is now affecting some presentation issues with Connected Lines. I initially tried setting only send_pai to no, and that gave a similar crash, only referencing us Remote-Party-ID:

{quote}
Program terminated with signal SIGSEGV, Segmentation fault.
#0  create_new_id_hdr (hdr_name=0x7f79a97450b0 <pj_rpid_name>, id=0x7f79c7ffe990, tdata=<optimized out>, base=<optimized out>) at res_pjsip_caller_id.c:418
418             id_name_addr = pjsip_uri_clone(tdata->pool, base->uri);
#0  create_new_id_hdr (hdr_name=0x7f79a97450b0 <pj_rpid_name>, id=0x7f79c7ffe990, tdata=<optimized out>, base=<optimized out>) at res_pjsip_caller_id.c:418
       id_hdr = 0x7f79c00937a8
#1  0x00007f79a9543953 in add_rpid_header (session=<optimized out>, id=<optimized out>, tdata=<optimized out>) at res_pjsip_caller_id.c:617
       base = 0x7f79c00937c0
       rpid_hdr = 0x7f79f4026798
       old_rpid = 0x7f79f80095c8
       pj_rpid_name = ptr = 0x7f79a9544478 "Remote-Party-ID", slen = 15
#2  add_id_headers (session=0x7f79c0093830, tdata=0x7f79a9544478, id=0xf) at res_pjsip_caller_id.c:647
{quote}

By: Kevin Harwell (kharwell) 2016-09-06 14:53:07.996-0500

Bill,

Still not sure what may be going on, but it does seem to be an issue so I'm going to mark it as such (i.e. move it out of triage) as a more in-depth look is warranted. Once a developer can take a more detailed look at the code perhaps they will be able to find something or replicate the issue.

One thing to note, on a session refresh it appears that pjsip defaults to sending an UPDATE if "allowed". Meaning if UPDATE is found in the "Allow" header then pjsip uses an UPDATE instead of an INVITE. If it is possible to remove the UPDATE from the "Allow" header on the incoming invite I'd be curious if it still crashes.


By: Joshua C. Colp (jcolp) 2016-10-26 08:21:51.692-0500

I'm looking into this issue but was wondering what version of PJSIP is in use and how was it installed?

By: Joshua C. Colp (jcolp) 2016-10-26 11:41:52.846-0500

Actually I think I've figured out what's going on!

By: Friendly Automation (friendly-automation) 2016-10-27 08:30:11.529-0500

Change 4202 had a related patch set uploaded by Joshua Colp:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4202|https://gerrit.asterisk.org/4202]

By: Friendly Automation (friendly-automation) 2016-10-27 08:30:48.892-0500

Change 4203 had a related patch set uploaded by Joshua Colp:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4203|https://gerrit.asterisk.org/4203]

By: Friendly Automation (friendly-automation) 2016-10-27 08:30:57.885-0500

Change 4204 had a related patch set uploaded by Joshua Colp:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4204|https://gerrit.asterisk.org/4204]

By: Bill Brigden (billbrigden) 2016-10-27 09:43:17.576-0500

Ah excellent. Thanks for taking the time to work on it. I guess it'll take a bit of time to get released into a future release?

By: Joshua C. Colp (jcolp) 2016-10-27 09:45:05.960-0500

It's currently in review so it'll go in over the next few days and end up in the release, probably 4-6 weeks out.

By: Friendly Automation (friendly-automation) 2016-10-27 15:15:58.853-0500

Change 4204 merged by zuul:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4204|https://gerrit.asterisk.org/4204]

By: Friendly Automation (friendly-automation) 2016-10-27 15:32:59.912-0500

Change 4203 merged by Joshua Colp:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4203|https://gerrit.asterisk.org/4203]

By: Friendly Automation (friendly-automation) 2016-10-27 16:48:26.866-0500

Change 4202 merged by zuul:
res_pjsip_caller_id: Fix crash on session timers UPDATE on inbound calls.

[https://gerrit.asterisk.org/4202|https://gerrit.asterisk.org/4202]