[Home]

Summary:ASTERISK-27153: AMI Newexten event returns wrong event parameters
Reporter:John Lagonikas (zeleps)Labels:
Date Opened:2017-07-24 08:06:35Date Closed:
Priority:MajorRegression?
Status:Open/NewComponents:Core/ManagerInterface
Versions:13.17.0 14.6.0 17.6.0 Frequency of
Occurrence
Constant
Related
Issues:
Environment:CentOS 6 64BitAttachments:( 0) Debug.zip
( 1) full.txt
( 2) NewExtenTest.tar.gz
Description:We have detected that, in some rare cases, Newexten events contain wrong Application / AppData fields (i.e. Application / AppData do not correspond to what was actually executed, but rather display the values corresponding to the previous execution for that channel).

Our test involves originating a call to TestPhone1, transferring that call to another asterisk via IAX2, then transferring the initial channel to TestPhone2 (local) and then transferring the call to an AGI application.

The problematic Newexten events can be seen at lines 3162 and 5246 of the attached log file. The actual executions (with the correct app/appdata) can be seen at lines 3127 and 5199 respectively. You can see that the Newexten @5246 contains the app/appdata fields that should have been contained by the Newexten @3162 event.

The same behavior is exhibited in both asterisk versions 13 and 14 (latest builds). The problem does not appear in asterisk versions 11 and back.

The manager actions we executed (and which do not appear in the attached log file) are the following:
Step 1:
{noformat}
action: Originate
channel: Local/mboBCmtGYDJ5UWxcKxQSWDzQg2@mb_all_calls
context: mb_all_calls
exten: mbsLGrALV2_n0eoXJHXMsMZKQ1
priority: 1
timeout: 20000
callerid:  <5432>
account: TestPhone1
actionid: mboBCmtGYDJ5UWxcKxQSWDzQg2
async: 1
variable:
{noformat}
Step 2:
{noformat}
action: Redirect
channel: SIP/TestPhone1-00000000
extrachannel:
context: mb_all_calls
exten: TestPhone2
priority: 1
actionid: @000000C3
{noformat}
Step 3:
{noformat}
action: Redirect
channel: SIP/TestPhone1-00000000
extrachannel:
context: mb_all_calls
exten: mbuadBCmtGYDJ5UWxcKxQSWDzQg
priority: 1
actionid: @0000011C
{noformat}

All the extensions you see in the log file were registered by our software at the beginning of the test. Their full definition is as follows:
{noformat}
action: Command
command: dialplan add extension "_mohBCmtGYDJ5UW[x]cK[x]QSWD[z]Qg.","1","MusicOnHold","${EXTEN:26}" into "mb_all_calls" replace
actionid: @00000005
{noformat}
{noformat}
action: Command
command: dialplan add extension "_mboBCmtGYDJ5UW[x]cK[x]QSWD[z]Qg.","1","agi","agi://192.168.1.69:4573/4qqf0RiNsEezVRJatsHMLA" into "mb_all_calls" replace
actionid: @00000008
{noformat}
{noformat}
action: Command
command: dialplan add extension "_mbuadBCmtGYDJ5UW[x]cK[x]QSWD[z]Qg","1","agi","agi://192.168.1.69:4573/hGGhXk7zlkKP-fKZXLtZgg" into "mb_all_calls" replace
actionid: @0000000B
{noformat}
{noformat}
action: Command
command: dialplan add extension "_mbq1LQsysi-K0GJ0Pg6F2ctiA.","1","agi","agi://192.168.1.69:4573/1LQsysi-K0GJ0Pg6F2ctiA" into "mb_all_calls" replace
actionid: @0000000E
{noformat}
{noformat}
action: Command
command: dialplan add extension "_mbsLGrALV2_[n]0eo[X]JH[X]MsM[Z]KQ.","1","Dial","IAX2/192.168.1.13/mbq1LQsysi-K0GJ0Pg6F2ctiA${EXTEN:25}@mb_all_calls" into "mb_all_calls" replace
actionid: @00000011
{noformat}
{noformat}
action: Command
command: dialplan add extension "mbtestmoh","1","musiconhold","default" into "mb_all_calls" replace
actionid: @00000028
{noformat}
{noformat}
action: Command
command: dialplan add extension "_TestPho[n]eX","1","dial","SIP/${EXTEN}" into "mb_all_calls" replace
actionid: @0000002D
{noformat}
Comments:By: Asterisk Team (asteriskteam) 2017-07-24 08:06:35.832-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

By: Antonis Psaras (apsaras) 2017-07-24 08:07:33.108-0500

Log file with debug 9 enabled

By: Benjamin Keith Ford (bford) 2017-07-31 09:40:51.449-0500

[~apsaras], is it possible to switch to PJSIP and see if the error occurs there as well? If it does, attach the logs to the issue and provide exact steps taken to reproduce (if they have changed).
Following the steps to reproduce the issue, I did not see the same problem. Have you seen this behavior outside of this scenario?

By: John Lagonikas (zeleps) 2017-08-01 06:02:30.003-0500

[~bford] I'm Antonis' associate and responsible for the above tests, so I'll pick up the conversation from now on.

The test that exhibits the problem is part of a series of unit tests for our .Net asterisk libraries. We needed a mechanism to hook on channels created by the originate action before receiving the originate response event, hence the mangled extension numbers and the FastAGI call.

The detailed sequence of actions is this:

1. Originate a call from extension A (Local/A@context) to extension B (see the originate action in the original post).
2. Extension A is handled by a FastAGI application that simply executes a dial (Local/TestPhone1@mb_all_calls/n) command (on the Local/A;2 channel) to the originator's phone (TestPhone1).
3. Wait for the call to be bridged (and originateresponse to arrive with success)
4. Extension B is called (by the Local/A;1 channel), which performs the IAX2 dial to a designated extension on the other server.
5. After the call is bridged, any redirect action performed on the SIP/TestPhone1 channel will produce a newexten event with the wrong app/appdata fields.

Please note that we have changed step 2 to dial directly to SIP (instead of Local) and step 4 to dial a local SIP channel with the same results. It appears that the combination of originate action and FastAGI "EXEC DIAL" are the cause of the problem. Also, please note that the problem appears when a redirect action is performed on the originator's SIP channel (the one that was created by the originate action and FastAGI exec dial).

We'll try to perform the above scenario using PJSIP to call TestPhone1 and see what happens. This might take a while since our code does not handle PJSIP channels yet and we need to implement that.

Apart from that, if with the above clarifications you still can't reproduce the problem, I can provide you with an executable file that encapsulates the test and reproduces the problem.

By: Rusty Newton (rnewton) 2017-08-01 16:42:12.052-0500

John, I'm changing the issue reporter to you so that issue workflow and communication will function properly.

Thanks for your latest information. [~bford] was still unable to reproduce the issue with that information. We still saw expected event parameters despite following the steps.

bq. Apart from that, if with the above clarifications you still can't reproduce the problem, I can provide you with an executable file that encapsulates the test and reproduces the problem.

Can you send the script and whatever is needed along with instruction to reproduce the issue? That would be most helpful when a developer attempts to debug this problem.

By: John Lagonikas (zeleps) 2017-08-02 07:29:58.681-0500

Test executable, requires windows / .Net Framework 4.7

By: John Lagonikas (zeleps) 2017-08-02 07:47:26.050-0500

I have attached a test program that exhibits the behavior. It's a console program, that runs in windows and requires .net 4.7 to run.

The program uses PortSip to create the test devices, and unfortunately we couldn't find a way to stop portsip from outputting all its logs to the console.

In order to run this, you need to:

1. create sip accounts for 3 phones, TestPhone1, 2 and 3. You may use any secret you want, you'll have to declare it in the config file in the next step.
2. edit the .config file, and set your environment parameters. You'll need to provide the asterisk version (major only), the IP address of the asterisk host (it will be used for both connecting to AMI and phone registration, port 5038 is assumed for AMI and 5060 for SIP), the IP address of the windows machine you're running the program from (the IP address which will listen to FastAGI requests, port 4573 is assumed), the secrets for the SIP phone registrations and an existing dialplan context under which all the necessary extensions will be registered (that's done automatically by the program).

Once the program is finished, it will display "Error: NewExten fields are invalid" if the app/appdata fields are wrong, or "Test completed successfully" if everything worked as expected, and wait for return. In our tests, we always get the former as a response for asterisk versions 13 and 14, and the latter for 11 and back.

By: John Lagonikas (zeleps) 2017-08-02 07:59:24.097-0500

Please consider that the test program is not very sophisticated, I'm currently on vacation and my resources are limited, so I stiched a test as quickly as I could. If something goes wrong and it doesn't show the above messages, it probably means there's something wrong with the configuration. In that case, please edit the config file, set TextLogging to true, run it again and send me the log file which you'll find in C:\Users\[username]\AppData\Local\AsteriskNewExtenTest to see what went wrong.

I can make myself available for more immediate communication if you want to speed things up, just email me with the details.

By: Rusty Newton (rnewton) 2017-08-03 19:21:11.028-0500

Thanks for the additional information and debugging. We don't have windows machines in our labs right now to reproduce this using your windows exe, but you have provided enough information for a developer to investigate this when resources are available. I've opened up the issue so it is in the queue.

By: John Lagonikas (zeleps) 2017-09-01 10:07:42.103-0500

Hi again,

would it be convenient if we compiled our test binary for .Net Core 2.0 (which can run under linux)? This issue has become a blocker for us and we'd like to help as much as we can to see it resolved.

By: Joshua C. Colp (jcolp) 2017-09-01 10:11:26.471-0500

The issue has been accepted and is in queue. The only additional way to move it forward would be to find a community member who is interested in looking into it, or hiring someone.

By: John Lagonikas (zeleps) 2017-09-01 10:16:09.053-0500

I also tried to research the bug myself a bit. Is there a convenient way to log / dump stasis messages? I can detect the faulty message when it is being processed (the channel_snapshot_update function in manager_channels.c), but it's difficult to figure out where the message is sent out from.

By: John Lagonikas (zeleps) 2017-09-01 10:20:47.511-0500

[~jcolp] as I understand from the comments above, the reason the investigation was halted from your end was the inability to run the test executable I provided, because you don't have windows machines. Converting it to .Net core should alleviate this problem, that's why I proposed it.

By: Rusty Newton (rnewton) 2017-09-01 11:43:35.786-0500

[~zeleps] if you can create a binary that works under Linux that would be helpful to whoever takes on the issue.

As Josh mentioned the issue is open as it is pretty obviously a bug even though we were unable to reproduce it easily here. That means the issue is simply waiting for a developer to take on the issue. There are hundreds of other open issues with varying priorities.

Providing the additional binary could make it more likely that someone takes on the issue earlier and would speed up their process. However, no one in particular is waiting on it at the moment. Primarily when the issue is worked depends on when many other higher priority issues get resolved.

Also, I'm not sure about stasis logging, you might ask on the asterisk-dev mailing list.

By: John Lagonikas (zeleps) 2017-09-04 04:28:56.131-0500

Ok [~rnewton], we'll compile and post a linux binary asap. Obviously, we're not requesting an immediate resolution, we just want to facilitate things so that the bug is dealt with at some point. We'll try to investigate it from our end as well and if we find something we'll let you know. Thanks!

By: John Lagonikas (zeleps) 2017-09-19 04:01:04.903-0500

As promised, this is the test executable, compiled under .Net core 2.0 and it can run under linux.

1. Install .NET Core 2.0 on a Linux machine, following the instructions found here, according to the specific distro used: https://www.microsoft.com/net/download/linux Make sure to use version 2.0
2. Extract the application in a location of your preference
3. Edit the appsettings.json configuration file to match your environment. The following parameters need to be defined:
a. AsteriskAddress: The IP address of the machine running asterisk.
b. AsteriskVersion: Use Version13 for asterisk 13, Version14 for asterisk14
c. AMIUser / AMISecret: authentication details for connecting to AMI. The manager account used needs the following privileges (in manager.conf):
read = system,call,dialplan
write = system,call,agent,user,config,command,reporting,originate
d. LocalIP: The IP (must be static) of the machine on which the test application is running. Since the application is handling fastagi connections, the TCP port 4573 must be accessible by the asterisk machine.
e. TestPhone1,2,3: The extension numbers for 3 sip accounts registered on the asterisk machine. The test involves making a call from Telephone1 to Telephone2 and redirecting it to Telephone3.
f. DefaultContext: the context used for making the calls. We use local channels to initiate the calls, so this means that the dialplan must be configured to e.g. do a Dial(SIP/${EXTEN}) for the phones used in the test.
4. Use the command `dotnet NewExtenTest.dll` from your shell to run the test application.


By: Rusty Newton (rnewton) 2017-09-19 10:15:45.974-0500

[~zeleps] , Thanks for all your work, this will be very helpful for the developer that takes the issue on.

By: John Lagonikas (zeleps) 2020-09-02 11:48:20.134-0500

Hello,

As of version 17.6, the problem still exists.

By: Antonis Psaras (apsaras) 2020-09-02 11:49:00.700-0500

Hello.

We have the same issue for 17.6.

Is there any plan to fix it?


By: Joshua C. Colp (jcolp) 2020-09-02 11:51:04.897-0500

The issue is not currently being worked on by anyone, if that changes then it will be reflected here along with any updates.

By: Antonis Psaras (apsaras) 2020-09-02 11:55:38.437-0500

As this is a blocking factor for us to update from v11 at which there is no such issue, is there a way for donate/pay for the resolution?

By: Joshua C. Colp (jcolp) 2020-09-02 11:57:58.760-0500

There is a bug bounty process documented on the wiki[1]. There is no guarantee that anyone will take up the bounty however.

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Bug+Bounties

By: John Lagonikas (zeleps) 2020-09-02 12:18:03.544-0500

We offer a bounty of $500 for the resolution of this bug.

By: John Lagonikas (zeleps) 2021-03-17 08:06:41.953-0500

Hello everyone,
we finally decided to dedicate some time and effort to fix this issue, so we've come up with some results.

At this point, we have identified the problem and the code that's causing it. The solution requires some discussion about what can be done and what not.

The issue is caused when two channels are linked in a bridge, and one of the channels is redirected to an extension (via manager redirect action, but I guess the problem will be evident in other cases too). The ast_explicit_goto() function (pbx.c) makes changes to the channel's cep, but application is still unknown at the time, so no snapshot publishing occurs. But, when the bridge breaks and it's reconfigured, the set_bridge_peer_vars() function (bridge.c) leads to changes in the peer channels which trigger snapshot publishing, thus a NewExten event is triggered where the application / appdata fields are not yet updated.

The most appropriate solution seems to be to perform the goto operation after the bridge has been reconfigured and the channel has left the bridge, but I'm not really accustomed to the synchronization principles used throughout the project, so I thought it would be best to present the findings here and discuss it. I'm available to provide any details necessary.

Any ideas?

By: Joshua C. Colp (jcolp) 2021-03-17 08:17:15.403-0500

As you're asking a technical question your best bet would be to post the asterisk-dev mailing list[1]. Most people don't monitor individual JIRA issues.

[1] http://lists.digium.com/mailman/listinfo/asterisk-dev

By: John Lagonikas (zeleps) 2021-03-17 08:26:27.902-0500

Thanks, I'll do that.