
Summary: ASTERISK-28972: FRACK! + task processor queue issue
Reporter: Cyril Ramière (Cyril.r)
Labels: webrtc
Date Opened: 2020-06-30 09:14:12
Date Closed: 2020-07-15 12:00:02
Priority: Major
Regression?:
Status: Closed/Complete
Components: Resources/res_pjsip
Versions: 16.10.0
Frequency of Occurrence: One Time
Related Issues:
Environment: AWS Ubuntu 16.04
Attachments: (0) pjsip.conf
Description: Hello everyone,

Today I had a big issue with multiple Asterisk instances across multiple machines.

The Asterisk instances had not crashed, but they were in an unstable state (some calls working, some not), with an unusual 9-10 second delay on all calls.

For incoming calls, when our SBC sent the INVITE, Asterisk took 9-10 seconds to reply to that INVITE.
For outgoing calls (I even tried directly from the console using channel originate), the same delay was present between the time I hit enter and the time the INVITE was actually sent.

I saw multiple messages in the log related to the task processor queue; those messages kept appearing even after there were no active calls.
There seems to be no correlation between the issue and the number of calls the machines were handling; some had around 50 calls, others around 300...

Here are some samples of error messages:
----------

[Jun 30 08:44:20] ERROR[5871] res_pjsip_session.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f8ff11751b0 (0)
[Jun 30 08:44:20] ERROR[5871] : Got 13 backtrace records
# 0: /usr/sbin/asterisk() [0x45c977]
# 1: /usr/sbin/asterisk() [0x45ff1b]
# 2: /usr/sbin/asterisk(__ao2_find+0x28) [0x460108]
# 3: /usr/lib/asterisk/modules/res_pjsip_session.so(ast_sip_session_get_datastore+0x31) [0x7f8f8d0a2491]
# 4: /usr/lib/asterisk/modules/res_pjsip_header_funcs.so(+0x1f56) [0x7f8f8ba52f56]
# 5: /usr/lib/asterisk/modules/res_pjsip.so(+0x11730) [0x7f8f8f51f730]
# 6: /usr/sbin/asterisk(ast_taskprocessor_execute+0xce) [0x59b61e]
# 7: /usr/sbin/asterisk() [0x5a2d10]
# 8: /usr/sbin/asterisk(ast_taskprocessor_execute+0xce) [0x59b61e]
# 9: /usr/sbin/asterisk() [0x5a34b0]
#10: /usr/sbin/asterisk() [0x5ab34c]
#11: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f8ff96a76ba]
#12: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8ff8c8141d]

[Jun 30 08:56:09] WARNING[28633][C-000085dc] taskprocessor.c: The 'stasis/m:rtp:all-0000064a' task processor queue reached 500 scheduled tasks.

[Jun 30 09:01:48] WARNING[1977] taskprocessor.c: The 'stasis/m:rtp:all-0000064a' task processor queue reached 500 scheduled tasks again.

[...]

[Jun 30 09:41:38] ERROR[21458] res_pjsip_session.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f8ff0c6bc70 (0)
[Jun 30 09:41:38] ERROR[21458] : Got 13 backtrace records
# 0: /usr/sbin/asterisk() [0x45c977]
# 1: /usr/sbin/asterisk() [0x45ff1b]
# 2: /usr/sbin/asterisk(__ao2_find+0x28) [0x460108]
# 3: /usr/lib/asterisk/modules/res_pjsip_session.so(ast_sip_session_get_datastore+0x31) [0x7f8f8d0a2491]
# 4: /usr/lib/asterisk/modules/res_pjsip_header_funcs.so(+0x1f56) [0x7f8f8ba52f56]
# 5: /usr/lib/asterisk/modules/res_pjsip.so(+0x11730) [0x7f8f8f51f730]
# 6: /usr/sbin/asterisk(ast_taskprocessor_execute+0xce) [0x59b61e]
# 7: /usr/sbin/asterisk() [0x5a2d10]
# 8: /usr/sbin/asterisk(ast_taskprocessor_execute+0xce) [0x59b61e]
# 9: /usr/sbin/asterisk() [0x5a34b0]
#10: /usr/sbin/asterisk() [0x5ab34c]
#11: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f8ff96a76ba]
#12: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8ff8c8141d]
----------

Since this is running in production, I don't have a core dump, and I haven't been able to reproduce the issue in our development environment (yet).

I checked everything I could (instances, network, DNS resolution, ...) and found nothing unusual.
Restarting Asterisk fixed the issue; now I fear it will happen again.

Any thoughts? I don't know what I'm looking at; the FRACK messages mean nothing to me.

Thanks
Comments:

By: Asterisk Team (asteriskteam) 2020-06-30 09:14:13.063-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Joshua C. Colp (jcolp) 2020-06-30 10:25:31.032-0500

What is the usage of the system? How many calls are going on? How many calls per second? Do you have a dialplan that shows common usage? Has this happened only once? Are you using realtime?

By: Cyril Ramière (Cyril.r) 2020-06-30 10:45:48.811-0500

Hello Joshua,

Usage of the system is to handle calls from our customers.
Some of the impacted instances had very low traffic, like 20-50 calls @ ~1 CPS.
Others were busier, with 200-300 calls @ 2-3 CPS.
We use realtime, so the dialplan is a one-liner that goes to Stasis when a call enters.
It never happened during the testing of this version; it has been deployed for about 10 days.
On the impacted instances, even without any active calls, trying to make an outgoing call using "channel originate PJSIP [...]" (without Stasis) resulted in a 9-10 second delay between the time I pressed enter and the time our SBC received the INVITE.
We use WebRTC & recording.
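(For anyone trying to reproduce this, a full originate of that shape looks something like the sketch below; the endpoint and extension names are hypothetical, not from our setup.)

{code:title=CLI originate (hypothetical names)}
*CLI> channel originate PJSIP/myendpoint extension +1234@test
{code}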

After reviewing the logs, I found one instance on which we don't use WebRTC, and it has the issue too:

[Jun 30 09:00:47] WARNING[3924][C-00020378] taskprocessor.c: The 'stasis/m:rtp:all-00000646' task processor queue reached 500 scheduled tasks again.
[Jun 30 09:01:21] WARNING[2154] taskprocessor.c: The 'stasis/m:rtp:all-00000646' task processor queue reached 500 scheduled tasks again.
[Jun 30 09:01:41] ERROR[2889] chan_pjsip.c: Session already DISCONNECTED [reason=200 (OK)]
[Jun 30 09:01:56] WARNING[4521][C-000203a9] taskprocessor.c: The 'stasis/m:rtp:all-00000646' task processor queue reached 500 scheduled tasks again.
[Jun 30 09:02:39] WARNING[2154] taskprocessor.c: The 'stasis/m:rtp:all-00000646' task processor queue reached 500 scheduled tasks again.
[later .. FRACK/ bad magic error]

So, it seems that it is not related to WebRTC.


By: Joshua C. Colp (jcolp) 2020-06-30 15:55:51.351-0500

How is realtime used? Is it used for PJSIP itself? If so is the database okay at the time? Do CLI commands for listing endpoints and such work? Do you use caching?

By: Cyril Ramière (Cyril.r) 2020-07-01 03:33:31.293-0500

Realtime is used to match the caller number to a Stasis application, with the arguments stored in the database
(context + exten + priority + app + appdata). For example:

"test","+1234",1,"Stasis","stasisApp,999,3948,39"

That's it.
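For reference, the static extensions.conf equivalent of that row would be something like the sketch below (illustrative only; the real values come from the realtime table):

{code:title=static dialplan equivalent (sketch)}
[test]
; one realtime row maps to one extension: context/exten/priority/app/appdata
exten => +1234,1,Stasis(stasisApp,999,3948,39)
{code}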

The DBs were fine during the issue; I don't remember testing the CLI command "pjsip show endpoints".
We use caching; here is the config:

{code:title=caching configuration}
[res_pjsip]
endpoint/cache = memory_cache,maximum_objects=2000,expire_on_reload=yes,object_lifetime_maximum=3600,object_lifetime_stale=60
endpoint=config,pjsip.conf,criteria=type=endpoint
endpoint=realtime,ps_endpoints

auth/cache = memory_cache,maximum_objects=2000,expire_on_reload=yes,object_lifetime_maximum=3600,object_lifetime_stale=60
auth=config,pjsip.conf,criteria=type=auth
auth=realtime,ps_auths

aor/cache = memory_cache,maximum_objects=2000,expire_on_reload=yes,object_lifetime_maximum=3600,object_lifetime_stale=60
aor=config,pjsip.conf,criteria=type=aor
aor=realtime,ps_aors

domain_alias=config,pjsip.conf,criteria=type=domain_alias
global=config,pjsip.conf,criteria=type=global
system=config,pjsip.conf,criteria=type=system
transport=config,pjsip.conf,criteria=type=transport

contact/cache=memory_cache,maximum_objects=2000,expire_on_reload=yes,object_lifetime_maximum=900,object_lifetime_stale=60
contact=astdb,registrator

[res_pjsip_endpoint_identifier_ip]
identify=config,pjsip.conf,criteria=type=identify
{code}

Regards

By: Joshua C. Colp (jcolp) 2020-07-01 03:54:11.699-0500

And all instances are connected to the same database? Did the problem happen at the same time for the ones that were exhibiting the issue? Do you have sample PJSIP configuration that matches your environment?

By: Cyril Ramière (Cyril.r) 2020-07-01 04:08:53.772-0500

All instances are connected to the same cluster, except for our dev instance, which is on another cluster and had the same issue :/

I forgot to mention that while our dev instance had exactly the same issue, it was not spitting out messages about taskprocessor overload (but this instance has no active calls 90% of the time, and when used it has <5 calls max, so maybe it was struggling less).

Here is the pjsip.conf file that we use; I've redacted all the IPs/hostnames.

By: Joshua C. Colp (jcolp) 2020-07-01 04:12:08.716-0500

And it was at the same time that it occurred for them? (You weren't clear on that)

What I'm trying to determine are the common parts between things, because having most of them experience an issue at the same time strongly suggests that something common was the underlying trigger.

By: Joshua C. Colp (jcolp) 2020-07-01 04:15:41.609-0500

As well, I think we'd need a deadlock backtrace [1] while it is going on, to see the state of the system and what the threads are doing, and to confirm that Asterisk isn't getting blocked internally on things. The output of "core show taskprocessors" would also show which things are seeing a lot of load (RTP certainly would, given the messages being output). Are you using AMI at all? You could also try changing the PJSIP configuration so that its rate limiting is triggered only by PJSIP traffic [2]. It doesn't solve the problem, but it would allow SIP traffic to continue.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc.
[2] https://github.com/asterisk/asterisk/blob/master/configs/samples/pjsip.conf.sample#L1207
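In practice, those two suggestions look roughly like the sketches below (the coredumper invocation follows the wiki page in [1]; the overload option is the one documented in [2] and may require a release newer than 16.10.0):

{code:title=capturing state while the problem is occurring (sketch)}
# dump a core of the running Asterisk without stopping it [1]
ast_coredumper --running

# see which task processors are backed up
asterisk -rx "core show taskprocessors"
{code}

{code:title=pjsip.conf (sketch, per [2])}
[system]
type=system
; only PJSIP's own task processors trigger SIP rate limiting
taskprocessor_overload_trigger=pjsip
{code}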

By: Cyril Ramière (Cyril.r) 2020-07-01 04:25:39.035-0500

Thanks for your feedback,

I'm betting on a transient network issue; I don't know what went wrong (DNS, DB, ARI, SIP signaling, other...), so this is a very hard issue to reproduce (I'm still trying).

It happened in the morning at ~09:00, when offices were opening and people were calling us.

But it's possible that we only saw the issue in the morning while it had already started silently earlier (like during the night); again, I reviewed the logs and found nothing.

I will check with the rest of the team, but I'm pretty sure we don't use AMI at all.


By: Asterisk Team (asteriskteam) 2020-07-15 12:00:01.071-0500

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines