Summary: | ASTERISK-28469: AMI frozen on high load | ||
Reporter: | Stefan Repke (stefffan@gmx.de) | Labels: | |
Date Opened: | 2019-07-02 10:16:44 | Date Closed: | |
Priority: | Major | Regression? | |
Status: | Open/New | Components: | Core/ManagerInterface |
Versions: | 16.4.0 | Frequency of Occurrence: | Constant |
Related Issues: | |||
Environment: | Linux deacsrtsk12 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux | Attachments: | ( 0) AMIfrozen.txt ( 1) case2.zip ( 2) core-asterisk-running-2019-07-01T16-15-22+0200-brief.txt ( 3) core-asterisk-running-2019-07-01T16-15-22+0200-full.txt ( 4) core-asterisk-running-2019-07-01T16-15-22+0200-locks.txt ( 5) core-asterisk-running-2019-07-01T16-15-22+0200-thread1.txt ( 6) core_show_locks ( 7) core_show_threads ( 8) debug.deacsrtsk12 |
Description: | The AMI queue gets stuck when running Asterisk 16.4.0 with ~30 calls per minute:
After a few hundred calls have been processed, no further manager events are distributed and no new manager connections are accepted. The problem is reproducible and we've created a dump. I think the reason is that thread 0x7f6e6e56c700 is blocked in {{send_string}} at {{manager.c:3033}} while holding lock 0x7f6ee0003260, thereby blocking the other threads. Our version of libc6 is 2.19-18+deb8u10. The CLI command {{manager show eventq}} produces dozens of megabytes of output consisting of the queued events.

{noformat:title=core show locks}
=======================================================================
=== 16.4.0
=== Currently Held Locks
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x7f6e6ffff700 LWP:11439 (desc->accept_fn started at [ 770] tcptls.c ast_tcptls_server_start())
=== ---> Waiting for Lock #0 (manager.c): MUTEX 6896 purge_sessions session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5faaca] asterisk manager.c:2678 handle_showmanconn()
[0x4bf2e5] asterisk cli.c:2834 ast_cli_command_full()
[0x4bf454] asterisk cli.c:2861 ast_cli_command_multiple_full()
[0x45865f] asterisk asterisk.c:1420 netconsole()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== --- ---> Locked Here: manager.c line 6104 (process_events)
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f6e6e56c700 LWP:11491 (handle_tcptls_connection started at [ 309] tcptls.c ast_tcptls_server_root())
=== ---> Lock #0 (manager.c): MUTEX 6104 process_events s->session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5fff11] asterisk manager.c:6105 process_events()
[0x604be8] asterisk manager.c:6714 do_message()
[inlined] asterisk manager.c:6853 session_do()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f6e6dc38700 LWP:23563 (netconsole started at [ 1529] asterisk.c listener())
=== ---> Waiting for Lock #0 (manager.c): MUTEX 2673 handle_showmanconn session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5faaca] asterisk manager.c:2678 handle_showmanconn()
[0x4bf2e5] asterisk cli.c:2834 ast_cli_command_full()
[0x4bf454] asterisk cli.c:2861 ast_cli_command_multiple_full()
[0x45865f] asterisk asterisk.c:1420 netconsole()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== --- ---> Locked Here: manager.c line 6104 (process_events)
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f6e6d380700 LWP:23932 (netconsole started at [ 1529] asterisk.c listener())
=== ---> Waiting for Lock #0 (manager.c): MUTEX 2673 handle_showmanconn session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5faaca] asterisk manager.c:2678 handle_showmanconn()
[0x4bf2e5] asterisk cli.c:2834 ast_cli_command_full()
[0x4bf454] asterisk cli.c:2861 ast_cli_command_multiple_full()
[0x45865f] asterisk asterisk.c:1420 netconsole()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== --- ---> Locked Here: manager.c line 6104 (process_events)
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f6e6d3fc700 LWP:24956 (netconsole started at [ 1529] asterisk.c listener())
=== ---> Waiting for Lock #0 (manager.c): MUTEX 2673 handle_showmanconn session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5faaca] asterisk manager.c:2678 handle_showmanconn()
[0x4bf2e5] asterisk cli.c:2834 ast_cli_command_full()
[0x4bf454] asterisk cli.c:2861 ast_cli_command_multiple_full()
[0x45865f] asterisk asterisk.c:1420 netconsole()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== --- ---> Locked Here: manager.c line 6104 (process_events)
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f6efb470700 LWP:25271 (netconsole started at [ 1529] asterisk.c listener())
=== ---> Waiting for Lock #0 (manager.c): MUTEX 2673 handle_showmanconn session 0x7f6ee0003260 (1)
[0x50277a] asterisk lock.h:446 ast_reentrancy_lock()
[inlined] asterisk lock.c:272 __ast_pthread_mutex_lock()
[0x45bd6d] asterisk astobj2.c:243 __ao2_lock()
[0x5faaca] asterisk manager.c:2678 handle_showmanconn()
[0x4bf2e5] asterisk cli.c:2834 ast_cli_command_full()
[0x4bf454] asterisk cli.c:2861 ast_cli_command_multiple_full()
[0x45865f] asterisk asterisk.c:1420 netconsole()
[0x5a40c5] asterisk utils.c:1251 dummy_start()
[0x7f6efd2c5064] libpthread.so.0 pthread_create.c:309 start_thread()
[0x7f6efc26862d] libc.so.6 clone.S:113 clone()
=== --- ---> Locked Here: manager.c line 6104 (process_events)
=== -------------------------------------------------------------------
===
=======================================================================
{noformat}

{noformat:title=brief.txt}
...
Thread 52 (Thread 0x7f6e6e56c700 (LWP 11491)):
#0  0x00007f6efc25faed in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000005a6b78 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f6e6e56b2d0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ast_wait_for_output (fd=<optimized out>, ms=<optimized out>) at utils.c:1357
#3  0x00000000004f6f89 in ast_iostream_write (stream=0x7f6e680041e0, buffer=0x7f6e7018e4f8, size=5000, size@entry=487) at iostream.c:473
#4  0x00000000004f747d in ast_iostream_write (stream=stream@entry=0x7f6e680041e0, buffer=buffer@entry=0x7f6e7018e4f8, size=size@entry=487) at iostream.c:475
#5  0x0000000000600121 in send_string (string=0x7f6e7018e4f8 "Event: Newexten\r\nPrivilege: dialplan,all\r\nTimestamp: 1561990424.176441\r\nSystemName: deacsrtsk12\r\nChannel: SIP/999253-000002c8\r\nChannelState: 4\r\nChannelStateDesc: Ring\r\nCallerIDNum: 999253\r\nCallerIDNam"..., s=0x7f6e6e56b4d0) at manager.c:3033
#6  process_events (s=0x7f6e6e56b4d0) at manager.c:6117
#7  0x0000000000604be8 in do_message (s=<optimized out>) at manager.c:6714
#8  session_do (data=0x7f6e6e56b2d0) at manager.c:6853
#9  0x00000000005a40c5 in dummy_start (data=0x7f6e68003dc0) at utils.c:1249
#10 0x00007f6efd2c5064 in start_thread (arg=0x7f6e6e56c700) at pthread_create.c:309
#11 0x00007f6efc26862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
...
{noformat}

{noformat:title=full.txt}
...
Thread 52 (Thread 0x7f6e6e56c700 (LWP 11491)):
#0  0x00007f6efc25faed in poll () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00000000005a6b78 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f6e6e56b2d0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
No locals.
#2  ast_wait_for_output (fd=<optimized out>, ms=<optimized out>) at utils.c:1357
        pfd = {fd = 22, events = 4, revents = 0}
#3  0x00000000004f6f89 in ast_iostream_write (stream=0x7f6e680041e0, buffer=0x7f6e7018e4f8, size=5000, size@entry=487) at iostream.c:473
No locals.
#4  0x00000000004f747d in ast_iostream_write (stream=stream@entry=0x7f6e680041e0, buffer=buffer@entry=0x7f6e7018e4f8, size=size@entry=487) at iostream.c:475
No locals.
#5  0x0000000000600121 in send_string (string=0x7f6e7018e4f8 "Event: Newexten\r\nPrivilege: dialplan,all\r\nTimestamp: 1561990424.176441\r\nSystemName: deacsrtsk12\r\nChannel: SIP/999253-000002c8\r\nChannelState: 4\r\nChannelStateDesc: Ring\r\nCallerIDNum: 999253\r\nCallerIDNam"..., s=0x7f6e6e56b4d0) at manager.c:3033
        stream = 0x7f6e680041e0
        len = 487
        res = <optimized out>
        string = 0x7f6e7018e4f8 "Event: Newexten\r\nPrivilege: dialplan,all\r\nTimestamp: 1561990424.176441\r\nSystemName: deacsrtsk12\r\nChannel: SIP/999253-000002c8\r\nChannelState: 4\r\nChannelStateDesc: Ring\r\nCallerIDNum: 999253\r\nCallerIDNam"...
        s = 0x7f6e6e56b4d0
#6  process_events (s=0x7f6e6e56b4d0) at manager.c:6117
        eqe = <optimized out>
        ret = <optimized out>
        __PRETTY_FUNCTION__ = "process_events"
#7  0x0000000000604be8 in do_message (s=<optimized out>) at manager.c:6714
        m = {hdrcount = 6, headers = {0x7f6ee0004620 "action: Monitor", 0x7f6ee04085d0 "actionid: 537#355425276_673410#", 0x7f6ee0008fb0 "file: /tmp/U23TJQ5V23X4MZQB", 0x7f6ee0009b70 "format: wav", 0x7f6ee0009b20 "channel: IAX2/999005-3957", 0x7f6ee0008fe0 "mix: true", 0x0 <repeats 122 times>}}
        header_buf = "mix: true\000AX2/999005-3957\000B000061060#", '\000' <repeats 993 times>
        now = 1561988218
#8  session_do (data=0x7f6e6e56b2d0) at manager.c:6853
        s = {session = 0x7f6ee00032b8, stream = 0x7f6e680041e0, tcptls_session = 0x7f6e68004068, parsing = MESSAGE_OKAY, write_error = -1, hook = 0x0, lock = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, track = 0x7f6ee0005ee0, flags = {tracking = 1, setup = 1}}}
        res = 5
        arg = 1
        ser_remote_address_tmp = {ss = {ss_family = 10, __ss_align = 0, __ss_padding = "\000\000\377\377\177\000\000\001", '\000' <repeats 48 times>, "|\254\367\373n\177\000\000\000\000\000\000\000\000\000\000\370\377\377\377\377\377\377\377\220\324\367\373n\177\000\000\340\266Vnn\177\000\000\220I\036\377n\177\000\000\000\000\000\000\000\000\000"}, len = 28}
        __PRETTY_FUNCTION__ = "session_do"
#9  0x00000000005a40c5 in dummy_start (data=0x7f6e68003dc0) at utils.c:1249
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140114181235520, 57934947619282876, 140112167979552, 140114047020320, 21, 140112274310912, 57947742297497532, -139326592165109828}, __mask_was_saved = 0}}, __pad = {0x7f6e6e56bef0, 0x0, 0x7f6efc521878 <__elf_set___libc_thread_subfreeres_element___rpc_thread_destroy__>, 0x7f6efc521880 <internal_trans_names>}}
        __cancel_arg = 0x7f6e6e56c700
        __not_first_call = <optimized out>
        ret = <optimized out>
        a = {start_routine = 0x5955b0 <handle_tcptls_connection>, data = 0x7f6e68004068, name = 0x7f6e68004a20 "handle_tcptls_connection started at [ 309] tcptls.c ast_tcptls_server_root()"}
        mutex_attr = {__size = "\001\000\000", __align = 1}
        __PRETTY_FUNCTION__ = "dummy_start"
#10 0x00007f6efd2c5064 in start_thread (arg=0x7f6e6e56c700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7f6e6e56c700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140112274310912, -139326233515286596, 0, 140114047020320, 21, 140112274310912, 57947742289108924, 57695635703832508}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#11 0x00007f6efc26862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
...
{noformat} | ||
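Editor's note: the failure pattern in the traces above — {{process_events}} holds the session lock while {{send_string}} performs a blocking socket write, so every other thread needing that lock hangs — can be illustrated with a minimal, self-contained Python sketch. This is not Asterisk code; the lock, the event string, and all sizes are stand-ins chosen only to make the blocking write deterministic:

```python
import socket
import threading
import time

def demo():
    # srv/cli stand in for the AMI TCP connection; the client never reads at first.
    srv, cli = socket.socketpair()
    session_lock = threading.Lock()  # stands in for the ao2 session lock

    def process_events():
        # Like process_events()/send_string(): take the session lock, then
        # write events. Once the kernel socket buffer fills (the peer is not
        # reading), sendall() blocks -- with the lock still held.
        with session_lock:
            srv.sendall(b"Event: Newexten\r\n" * 400_000)  # ~6.8 MB, far beyond the buffer

    t = threading.Thread(target=process_events, daemon=True)
    t.start()
    time.sleep(0.5)  # give the writer time to fill the buffer and block

    # Any other thread that needs the session lock (CLI handlers,
    # purge_sessions, new logins) now hangs, as in 'core show locks'.
    frozen = not session_lock.acquire(timeout=0.5)

    # As soon as the client drains the socket, the write completes and the
    # lock is released again.
    remaining = 17 * 400_000
    while remaining > 0:
        remaining -= len(cli.recv(1 << 20))
    t.join(timeout=10)
    recovered = session_lock.acquire(timeout=10)
    return frozen, recovered
```

The point of the sketch is only that a blocking write performed under a mutex turns one stalled client into a stalled AMI; it says nothing about which Asterisk release fixed what.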
Comments: | By: Asterisk Team (asteriskteam) 2019-07-02 10:16:45.515-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Kevin Harwell (kharwell) 2019-07-03 11:59:09.611-0500

When the system is under heavy load and you see the problem, what do the system's stats look like? For instance, what's the output of the following:
{noformat}
$ top -bn1 -p `pidof asterisk`
{noformat}
Also, please attach the full backtrace, as well as the full output of 'core show locks' (please attach as *.txt files). Please also attach the Asterisk log at, or around, the time of an occurrence. Thanks!

By: Stefan Repke (stefffan@gmx.de) 2019-07-05 10:57:17

The load test had already been running for several minutes. The first undelivered AMI event has a timestamp of 16:13:42 (UTC+2); AMI was blocked/broken from then on. The core dump was taken at 16:15:22. At that point the system load was essentially idle. Btw., there was only a single AMI client connected.
I've attached the logs/backtraces.

By: Kevin Harwell (kharwell) 2019-07-08 15:20:20.105-0500

Thanks for the added info. Could you also please post the output of the following when the problem occurs:
{noformat}
*CLI> core show taskprocessors
{noformat}

By: Kevin Harwell (kharwell) 2019-07-08 15:21:51.085-0500

Also, that information (system stats) from _top_ could be handy when you notice the system under load and causing problems.

By: Stefan Repke (stefffan@gmx.de) 2019-07-11 02:18:08.607-0500

We've created another error case. The requested log files are attached.

By: Kevin Harwell (kharwell) 2019-07-11 10:33:58.748-0500

Hrm, still not sure what could be going on here. The log shows your system as extremely backed up. Multiple subsystems appear to be "falling behind". It's possible that, given enough time, they might recover. However, the data from _top_ doesn't show the system as particularly "stressed" (although in the top1 output lots of memory is being consumed, which could potentially cause something like this). It's still hard to tell whether it is the system itself that just can't keep up (as in, a hardware upgrade might fix this) or something software related. One thing you can do to maybe eke out a little more system performance is follow some of the suggestions about [what to do about task processor warnings|https://blogs.asterisk.org/2016/07/13/asterisk-task-processor-queue-size-warnings/].

By: Kevin Harwell (kharwell) 2019-07-11 10:36:59.138-0500

It's possible that the manager's event queue itself is backing up and then causing the rest of Asterisk's subsystems to slow down. What's your AMI client like? How is it implemented to handle incoming events? Does it receive an event and block until done? If it is blocking, and taking too long to process the events, it could be "backing up" the connection.

By: Stefan Repke (stefffan@gmx.de) 2019-07-22 02:30:02.983-0500

I finally managed to solve the mystery. It is an inter-process deadlock!
The server (Asterisk) cannot write because the client does not read. The client stopped reading because the server is too slow in reading. The server cannot read because it is blocked in writing.
----
The only client that is connected is a self-written Java implementation based on Netty (event-based, non-blocking). But in order to relieve high-load situations, I've implemented the following: whenever the TCP send buffer (from client to server) is full, we stop receiving TCP packets until the opposite side is receiving again. Asterisk seems to have a similar behavior: the AMI server stops reading when it cannot write!
----
Although this behavior seems reasonable for some protocols (request-response), it is a bad idea whenever both sides can send data spontaneously -- like AMI:
- the server can send an arbitrary number of events, and
- the client can send an arbitrary number of actions.

Therefore, I consider this behavior a bug in both my client and the AMI server. I've fixed my client, which solves the problem for me. Maybe you want to do reading and writing on different threads? Or switch to an event-based, non-blocking socket framework?

BTW: when blocked, the AMI server doesn't accept any new AMI connections.

By: Steve Sether (stevesether) 2019-12-02 14:20:25.889-0600

I'm having a very similar problem to this, but I don't think it's an inter-process deadlock. I simplified my AMI client down to a shell script in an attempt to isolate the problem. AMI stops sending data to any processes connected to it, and stops accepting incoming connections. I can telnet to the port, but I don't get the normal greeting from Asterisk of:
{noformat}
Asterisk Call Manager/5.0.1
{noformat}
Like the above problem, "manager show eventq" shows a long backlog of events being queued. It seems to happen at load, around maybe 30 simultaneous calls. This happens on the newest release of Asterisk 16, 16.6.1.

The shell script is simply (I switched to port 5039 so my normal AMI clients couldn't connect, so as to isolate):
{noformat}
#!/bin/sh
exec 3<>/dev/tcp/127.0.0.1/5039
cat /root/login.script >&3
timeout 1 cat <&3
sleep 1
while [ 1 ]; do
    cat /root/status.script >&3
    timeout 1 cat <&3 > /dev/null
    echo loop
    sleep .1
done
{noformat}
login.script looks like this:
{noformat}
Action: Login
Username: usernameHere
Secret: passwordHere
ActionId: 1234
{noformat}
status.script looks like this:
{noformat}
Action: Status
Variables: CDR(companyID),companyID,from_user,to_user,CALLERID(num),CALLERID(name),CDR(ds_type),CDR(ds_dnis),CDR(ds_ani),CDR(ds_from),CDR(ds_to),CDR(start),CDR(answer)
ActionId: 4321
{noformat}

By: Freetech Solutions (freetech) 2020-07-07 15:40:14.206-0500

Taskprocessors output during the issue.

By: Freetech Solutions (freetech) 2020-07-07 15:41:57.698-0500

Hi Gents, I was also able to reproduce the issue with Asterisk 16.9 under CentOS 7:
{noformat}
Asterisk 16.9.0 built by root @ build-example.com on a x86_64 running Linux on 2020-06-09 19:08:29 UTC
CentOS Linux release 7.8.2003 (Core)
4 cores-CPU
8 GB RAM
{noformat}
Scenario: pure Asterisk, 1 queue configured to receive incoming calls, 1 extension registered in the queue. The sipp program was used to test the scenario with 30 calls at CPS = 2 (calls per second).

1) Login/register the extension in the queue.
2) Pause the extension over AMI (pause successfully done).
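Editor's note: the client-side fix Stefan describes — do reading and writing on different threads, so the client keeps draining events no matter what it is sending — can be sketched in a few lines of Python. This is an illustration, not AMI client code: a {{socketpair}} stands in for the TCP connection, and the "server" below only mimics a loaded Asterisk by pushing a large event burst before it reads any actions:

```python
import socket
import threading

def server(sock):
    # Mimics a loaded Asterisk: write a large event burst and, like
    # process_events(), read no actions until the burst has been sent.
    sock.sendall(b"Event: Newexten\r\n" * 400_000)  # ~6.8 MB of events
    sock.recv(65536)                                # only now read the action
    sock.sendall(b"Response: Success\r\n")
    sock.close()

def client(sock):
    # The fix: a dedicated reader thread drains the socket continuously,
    # so our own write can never wedge against the server's write.
    chunks = []

    def reader():
        while True:
            data = sock.recv(1 << 20)
            if not data:  # server closed the connection
                break
            chunks.append(data)

    t = threading.Thread(target=reader)
    t.start()
    sock.sendall(b"Action: Status\r\nActionID: 1\r\n\r\n")
    t.join()  # reader exits at EOF
    return b"".join(chunks)

def demo():
    ast_side, client_side = socket.socketpair()
    srv = threading.Thread(target=server, args=(ast_side,))
    srv.start()
    data = client(client_side)
    srv.join()
    return len(data), data.endswith(b"Response: Success\r\n")
```

Because the reader thread always drains, the server's burst completes and the exchange finishes; a client that stops reading while it writes (as both sides did in the original report) can instead back up both directions at once.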
3) Issue the command: asterisk -rx 'manager show eventq' (just an event coming from my AMI user):
{noformat}
Usecount: 0
Category: 262144
Event:
Event: SuccessfulAuth
Privilege: security,all
EventTV: 2020-07-07T17:32:37.211-0300
Severity: Informational
Service: AMI
EventVersion: 1
AccountID: labuserami
SessionID: 0x7f2b780021a0
LocalAddress: IPV4/TCP/0.0.0.0/5038
RemoteAddress: IPV4/TCP/127.0.0.1/48266
UsingPassword: 0
SessionTV: 2020-07-07T17:32:37.211-0300
{noformat}
4) Issue the sipp command as follows: sipp -sn uac 172.16.20.19:5159 -s 445566 -r 2 -rp 1000 -d 10000 (5159 is my SIP UDP testing port, 172.16.20.19 is our Asterisk 16.9, 445566 is the DID destination, "r" and "rp" make 2 calls per second, "d" is the duration of the call in milliseconds).
5) After reaching 40 calls waiting in the queue (at a 2 CPS rate), issue again: asterisk -rx 'manager show eventq' (dozens of megabytes of queued events, as described at the very beginning of this ticket).
6) Issue the command: asterisk -rx 'core show taskprocessors'. Queued events can be seen below:
{noformat}
stasis/m:manager:core-00000006  161644  57334  57336  2700  3000
{noformat}
7) Any AMI command attempt to port 5038 (or the HTTP manager) gives a timeout.

Note: with CPS=1 the issue is not observed.

By: Steve Sether (stevesether) 2023-01-24 11:15:55.811-0600

I recently had some time to revisit this issue, as it's become a higher priority. I came up with a trivial way to reproduce it using netcat (described below) in hopes of having someone fix it. I'm happy to report that it seems to be fixed as of Asterisk 16.30. There are several fixes applied to the AMI from 16.25 through 16.30 and I'm not sure which one fixes it, but I happily can't reproduce the problem via the method below on 16.30, while I can on 16.24.

My steps to reproduce on Linux:
1) nc -I 30000 targetServerNameOrIPHere 5038 < amiLogin.txt
2) Hit ctrl-z to suspend this process.
3) Make a call on the target Asterisk server (critical).
4) Attempt to connect to the AMI via telnet.
Telnet connects, but Asterisk doesn't respond with:
{noformat}
Asterisk Call Manager/1.3
{noformat}
When the nc command is killed, Asterisk continues to accept connections on the AMI port.

amiLogin.txt consists of:
{noformat}
Action: Login
ActionID: 1
Username: yourUsernameHere
Events: on
Secret: yourPasswordHere
{noformat} |