[Home]

Summary:ASTERISK-28674: Asterisk becomes unstable after SS7 signalling link restarts
Reporter:Gregory Massel (gmza)Labels:
Date Opened:2020-01-01 11:52:57.000-0600Date Closed:
Priority:MinorRegression?No
Status:Open/NewComponents:Channels/chan_dahdi/SS7
Versions:13.29.2 16.7.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Asterisk 16.7.0, LibSS7 2.0.0, DAHDI 3.1.0, Digium TE820 + TE420 (5th gen), Ubuntu 18.04.3 LTS, Kernel 5.0.0-37-generic. Same issue also ocurrs using Asterisk 13.29.2.Attachments:( 0) chan_dahdi.conf.txt
( 1) core-asterisk-running-2020-01-02T23-14-27+0200-brief.txt
( 2) core-asterisk-running-2020-01-02T23-14-27+0200-full.txt
( 3) core-asterisk-running-2020-01-02T23-14-27+0200-locks.txt
( 4) core-asterisk-running-2020-01-02T23-14-27+0200-thread1.txt
( 5) core-asterisk-running-2020-01-03T21-43-37+0200-brief.txt
( 6) core-asterisk-running-2020-01-03T21-43-37+0200-full.txt
( 7) core-asterisk-running-2020-01-03T21-43-37+0200-locks.txt
( 8) core-asterisk-running-2020-01-03T21-43-37+0200-thread1.txt
( 9) core-asterisk-running-2020-01-03T21-44-45+0200-brief.txt
(10) core-asterisk-running-2020-01-03T21-44-45+0200-full.txt
(11) core-asterisk-running-2020-01-03T21-44-45+0200-locks.txt
(12) core-asterisk-running-2020-01-03T21-44-45+0200-thread1.txt
(13) core-brief.txt
(14) core-full.txt
(15) core-locks.txt
(16) core-thread1.txt
(17) ss7.timers.txt
(18) ss7-debug.txt
Description:When one of the SS7 signalling links goes down briefly and then restores, the SS7 subsystem goes into a corrupt state. This manifests in a numbers of ways:

1. The system starts using consistently 100% CPU on exactly one core, resulting in a load average of just over 1.0. This will continue indefinitely.
2. All outbound DAHDI calls thereafter hang during initiation reflecting a "Ring" state; a "core show channels" after a few hours shows thousands of hung channels. The corresponding PJSIP channels close and log "res_pjsip_sdp_rtp.c: Disconnecting channel 'PJSIP/xxx' for lack of RTP activity in 60 seconds" but the call and DAHDI channel remain hung.
3. No inbound calls are accepted via the DAHDI spans.
4. A "ss7 show linkset 1" seems to be missing signalling links that were there before, as if they were never configured.

The log entries before everything goes pear-shaped look like this:
{noformat}
[Jan  1 03:14:08] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 33)
[Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
[Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
[Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
[Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
[Jan  1 03:14:09] VERBOSE[24375] sig_ss7.c: MTP2 link up (SLC 34)
[Jan  1 03:14:09] VERBOSE[24375] chan_dahdi.c: [1] MTP3 T2 timer expired on link SLC: 3 ADJPC: 2200 changeover completed
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x2.  Dropping
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
[Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
[Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
[Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
{noformat}
This highlights a related issue: I have SLC numbers 0 to 3 only. The system is actually logging an incorrect SLC number (34) as having failed and/or restored. It appears that this goes beyond just the log entry. The entire system seems to get horribly confused because it thinks that an event has ocurred relating to an invalid SLC number.

Note that the "Received message for slc 0x3, but we are 0x2.  Dropping" messages are NOT the source of the issue; I get these when Asterisk starts up cleanly as well, because chan_dahdi/libss7 wrongly filters out STD Test responses received via a different SLC from the one they're sent on. This is fine because the applicable timer expires and it stops waiting for the STD Test response, however, it highlights another unrelated bug (i.e. that Asterisk shouldn't be filtering STD_TEST replies received on a different signalling link within a single linkset and should treat the received message as proper acknowledgement).

When Asterisk starts up for the first time, it always shows SLC 0 as up for all signalling links:
{noformat}
[Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
[Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
[Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
{noformat}
This is despite these being SLC 0, 1 and 2 (all part of one linkset).
The SLC numbers only reflect 33 or 34 (invalid) once one of the signalling links fails and immediately restores.

The configuration appears to be 100% because the system can run for days - even weeks - perfectly, processing tens of thousands of calls. Things only go awry when the SLC fails and immediately restores and they go awry every single time this happens.

My suspicion is that the issues here is that Asterisk is mishandling the situation where the signalling link fails by misindentifying which SLC number has failed. Thereafter, it seems to corrupt its own structures relating to the signalling links. Even when the signalling link restores, things are already corrupted beyond any hope of functioning.

This has happened countless times over many months, however, is becoming more frequent and difficult to manage. What has aggravated it has been enabling 4 signalling links (previously I had 2) within the linkset, as there is a higher probability of failure.

All drops have been for a second or less; I'm not sure if this has any bearing on the situation (e.g. if two timers are running concurrently, one relating to link failure and the other to link restoration).

Killing Asterisk completely and restarting immediately fixes the issue. It's not necessary to restart DAHDI.

The 100% CPU on a single core does seem to indicate a deadlocked thread, however, I cannot readily identify which thread.

The config is ITU SS7, 4x E1 circuits are connected to the TE820 card. The TE420 card in the system is unused at this stage (the intention was to provide for 12x E1 in the future). I've experienced similar issues on a system with 2x TE820 and, similarly, 4x E1 on one of the TE820's.
Comments:By: Asterisk Team (asteriskteam) 2020-01-01 11:52:59.614-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: Gregory Massel (gmza) 2020-01-01 11:57:55.304-0600

chan_dahdi.conf

By: Gregory Massel (gmza) 2020-01-01 11:58:55.599-0600

ss7.timers

By: Kevin Harwell (kharwell) 2020-01-02 12:23:21.018-0600

If there is a deadlock a backtrace might hopefully show it. Please obtain a backtrace[1] and attach it to this issue.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

By: Gregory Massel (gmza) 2020-01-02 15:19:18.303-0600

Core dump details attached.

By: Gregory Massel (gmza) 2020-01-02 15:24:13.617-0600

In addition to the core dump, I've realised that it is trivial to reproduce this.

Before reproducing:
{noformat}
samsbc2*CLI> ss7 show linkset 1
SS7 flags: 0x0
SS7 linkset 1 status: Up
SS7 calling nai: -1
SS7 called nai: -1
SS7 nationalprefix: 0
SS7 internationalprefix: +
SS7 unknownprefix:
SS7 networkroutedprefix:
SS7 subscriberprefix:
Switch type: ITU
Our point code: 776
SLS shift: 0
numlinks: 3
numsps: 1
 ---------------------------------
 Adjacent SP PC: 2200 STATE: UP
 TRA:  GOT SENT    T19: not running T21: not running
 Routes:
   DPC       State        T6       T10
3856          TFP         -   running
976          TFP         -   running
968          TFP         -   running
 Link ADJ_PC:SLC: 2200:0 NetMngSLS: 0
   State:      INSERVICE,  UP
   STD Test:  passed
   Got, sent :
   Inhibit:
   Changeover: NO
   Tx buffer:  0
   Tx queue:   0
   Retrans pos 0
   CO buffer:  0
   CB buffer:  0
   Last FSN:   108
   MTP3timers:
 Link ADJ_PC:SLC: 2200:1 NetMngSLS: 1
   State:      INSERVICE,  UP
   STD Test:  passed
   Got, sent :
   Inhibit:
   Changeover: NO
   Tx buffer:  0
   Tx queue:   0
   Retrans pos 0
   CO buffer:  0
   CB buffer:  0
   Last FSN:   79
   MTP3timers:
 Link ADJ_PC:SLC: 2200:2 NetMngSLS: 2
   State:      INSERVICE,  UP
   STD Test:  passed
   Got, sent :
   Inhibit:
   Changeover: NO
   Tx buffer:  0
   Tx queue:   0
   Retrans pos 0
   CO buffer:  0
   CB buffer:  0
   Last FSN:   92
   MTP3timers:
{noformat}
Then, I used dahd_maint to loop one of the E1's, causing the associated signalling link to fail.
The following are largely irrelevent but just demonstrate the loop:

{noformat}
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 63: Loopback
[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:3104 my_handle_link_exception: SS7 got event: Alarm(4) on span 1/2
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 64: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 65: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 66: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 67: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 68: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 69: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 70: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 71: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 72: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 73: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 74: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 75: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 76: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 77: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 79: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 80: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 81: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 82: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 83: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 84: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 85: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 86: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 87: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 88: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 89: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 90: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 91: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 92: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 93: Loopback
{noformat}
And after that:
{noformat}
[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x2, but we are 0x0.  Dropping
[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x2, but we are 0x0.  Dropping
[1] MTP3 T2 timer expired on link SLC: 2 ADJPC: 2200 changeover completed
[Jan  2 23:11:35] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] T7 expired on link SLC: 0 ADJPC: 2200
[Jan  2 23:11:35] WARNING[24690]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 153)
[Jan  2 23:11:35] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x0, but we are 0x1.  Dropping
MTP2 link up (SLC 158)
[1] MTP3 T2 timer expired on link SLC: 0 ADJPC: 2200 changeover completed
{noformat}
This shows the issue with the SLC number.
Immediately thereafter:
{noformat}
samsbc2*CLI> ss7 show linkset 1
SS7 flags: 0x0
SS7 linkset 1 status: Up
SS7 calling nai: -1
SS7 called nai: -1
SS7 nationalprefix: 0
SS7 internationalprefix: +
SS7 unknownprefix:
SS7 networkroutedprefix:
SS7 subscriberprefix:
Switch type: ITU
Our point code: 776
SLS shift: 0
numlinks: 3
numsps: 1
 ---------------------------------
 Adjacent SP PC: 2200 STATE: UP
 TRA:  GOT SENT    T19: not running T21: not running
 Routes:
   DPC       State        T6       T10
3856          TFP         -   running
976          TFP         -   running
968          TFP         -   running
 Link ADJ_PC:SLC: 2200:0 NetMngSLS: 0
   State:      INSERVICE,  UP
   STD Test:  passed
   Got, sent : sentCBD
   Inhibit:
   Changeover: CHANGEBACK INITIATED
{noformat}
Note how now only one signalling link shows; both the failed link (SLC 1) and the one after it (SLC 2) just vanished.

After that, I took the core dump on the running Asterisk process.

By: Kevin Harwell (kharwell) 2020-01-02 17:35:06.380-0600

I don't know this section of the code too well, so following the locks appears non trivial. However, at a quick glance at the backtrace ([^core-asterisk-running-2020-01-02T23-14-27+0200-brief.txt]) it does appear threads are being blocked. Specifically "Thread 58" seems to be blocking "Thread 59", and "Thread 58" is being blocked by "Thread 28", or "Thread 56". At least I think that's what's happening.

The odd thing though is "Thread 28" is attempting to "update_txbuf", while "Thread 57" (the "ss7 show linkset 1" CLI command) appears to be trying to read from/get the length of a tx_buf (no locking). I wonder if this is a coincidence (meaning got a backtrace snapshot at just the "right" time), or if the buffer being accessed is corrupted, or something.

Could you get, and post another backtrace so we can compare them and see if things appear to get "stuck" in the same place?

By: Kevin Harwell (kharwell) 2020-01-02 17:39:32.679-0600

The missing output from the CLI commands seems to support what is happening with the "stuck threads" in the backtrace. Did you get a CLI prompt again after issuing the command?

By: Gregory Massel (gmza) 2020-01-03 13:53:57.120-0600

This time I took two dumps. The one attached ([^core-asterisk-running-2020-01-03T21-43-37+0200-brief.txt]) was taken directly after the link failed and restarted, but BEFORE issuing a "ss7 show linkset 1" command on the CLI.

By: Gregory Massel (gmza) 2020-01-03 13:57:33.558-0600

Attached is the second dump ([^core-asterisk-running-2020-01-03T21-44-45+0200-brief.txt]) that was taken AFTER issuing a "ss7 show linkset 1" command on the CLI (i.e. to replicate exactly the backtrace obtained yesterday).

I can confirm that the Asterisk CLI did crash after issuing that command.

By: Gregory Massel (gmza) 2020-01-03 14:08:05.190-0600

In addition, the attached SS7 trace may also assist in showing what messages were sent/received.

By: Kevin Harwell (kharwell) 2020-01-06 14:56:00.060-0600

This is just a theory, but I'm thinking this issue might be result of an infinite loop.

Note something like the following in both the post "ss7 show linkset 1" CLI command backtraces:
{noformat}
#0  len_buf (buf=<optimized out>) at mtp2.c:50
#1  0x00007f0ffee71f10 in ss7_show_linkset (ss7=0x562f906ed0c0, cust_printf=0x562f8df73e76 <ast_cli>, fd=146) at ss7.c:668
#2  0x00007f0fff2ef7d9 in handle_ss7_show_linkset (e=0x7f0fff52b8a0 <dahdi_ss7_cli+1600>, cmd=-4, a=0x7f0f276b7450) at chan_dahdi.c:17265
#3  0x0000562f8df7d8dd in ast_cli_command_full (uid=0, gid=0, fd=146, s=0x7f0f276b7720 "ss7 show linkset 1") at cli.c:2833
#4  0x0000562f8df7dacf in ast_cli_command_multiple_full (uid=0, gid=0, fd=146, size=19, s=0x7f0f276b79d0 "ss7 show linkset 1") at cli.c:2860
#5  0x0000562f8df030be in netconsole (vconsole=0x562f8e3ee180 <consoles>) at asterisk.c:1424
#6  0x0000562f8e094e21 in dummy_start (data=0x7f102c004090) at utils.c:1249
#7  0x00007f1051a756db in start_thread (arg=0x7f0f276b8700) at pthread_create.c:463
#8  0x00007f1050c5e88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}
Now if we take a look at what is at, or around line number 50 in [mtp2.c|https://github.com/asterisk/libss7/blob/master/mtp2.c]:
{noformat}
int len_buf(struct ss7_msg *buf)
{
int res = 0;
struct ss7_msg *cur = buf;

while (cur) {
res++;
cur = cur->next;
}
return res;
}
{noformat}
In this branch at least line 50 is incrementing "res". I think it's a coincidence the threads in both backtraces are on the same line. However, there are only a couple of lines of code in the loop itself so the chances of capturing a thread at that line increases especially if caught in an infinite loop here. Really that's probably the only way the code could get "stuck" in this spot. It would also explain no further output on the CLI, and the 100% CPU.


By: Gregory Massel (gmza) 2020-01-07 10:00:50.105-0600

Please keep in mind that the problem ocurrs BEFORE doing a "ss7 show linkset 1". In fact, the CLI command is largely irrelevent (I would have had no need to even look at the linkset status if the system hadn't failed already) and I included it merely because the output is a very clear and visual demonstration that the data structures relating to the signalling links have been corrupted.

The issues ocurrs from the very moment that the signalling link goes down and tries to come up again. It will happen even if you don't issue any CLI commands at all (or never even log into the CLI).

Looking at the above code snippet, I see no safe-check to test whether cur->next is NULL or res is irrationally high. That, in itself, is not a problem if the underlying 'buf' is valid. Putting in place some sort of limit or check may *MITIGATE* against an infinite loop and more gracefully deal with broken code elsewhere, but it wouldn't solve the underlying problem (that is more likely caused by the data in *buf being corrupt).

I'm no expert at reading a backtrace, however, my inclination was to look at channels/sig_ss7.c, particularly ss7_linkset, row 1493 and 1496, because the logged data is WRONG.

                       case MTP2_LINK_UP:
                               ast_verb(1, "MTP2 link up (SLC %d)\n", e->gen.data);
                               break;
                       case MTP2_LINK_DOWN:
                               ast_log(LOG_WARNING, "MTP2 link down (SLC %d)\n", e->gen.data);
                               break;

Note that it logged:
Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  3 21:42:38] WARNING[17089]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 36)

But we know that we only have SLC numbers 0 to 3 on this system, so numbers as high as 36 are wrong.

This means that e->gen.data is probably NOT the correct value of the SLC.

That being said, I'm trying to follow the code from there and, as far as I can see, that issue applies solely to the logging and e->gen.data isn't used thereafter. So, notwithstanding that the logging is buggy (e->gen.data is NOT the correct SLC number), it's not clear that this is the same bug.

I've also followed this code through to libss7-2.0.0/mtp3.c, row 1869, and the following code is also imperfect:

1860:        if (ss7->pc != rl.dpc) {
1861:                ss7_error(ss7, "Received message destined for point code 0x%x, but we are 0x%x.  Dropping\n", rl.dpc, ss7->pc);
1862:                return -1;
1863:        } else if (userpart == SIG_STD_TEST || userpart == SIG_SPEC_TEST || userpart == SIG_NET_MNG) {
1864:                if (link->dpc != rl.opc) {
1865:                        ss7_error(ss7, "Received message from point code 0x%x but we are connected to DPC 0x%x.  Dropping\n", rl.opc, link->dpc);
1866:                        return -1;
1867:                }
1868:
1869:                if (link->slc != rl.sls) {
1870:                        ss7_error(ss7, "Received message for slc 0x%x, but we are 0x%x.  Dropping\n", rl.sls, link->slc);
1871:                        return -1;
1872:                }
1873:        }

I believe that lines 1869 to 1872 are incorrect and should be ommitted. They are basically saying that, in respect of the applicable messages (SIG_STD_TEST, SIG_SPEC_TEST, SIG_NET_MNG), that these *MUST* be received on the same signalling link (link->slc) to which they relate (rl.sls). However, that is simply not correct. This code is having the effect of filtering valid messages. The message must be in respect of the same linkset and, in fact, this is tested in lines 1864-1867, but 1869-1872 go beyond what is reasonable, hence I'm logging lots of messages like:

[Jan  3 21:42:38] ERROR[17089]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x0, but we are 0x1.  Dropping

Looking at the logs, this filtering caused it to think that signalling links 0 and 2 (slc 0 and slc 2) failed (when, in fact, slc1 failed):

[1] MTP3 T2 timer expired on link SLC: 2 ADJPC: 2200 changeover completed
[Jan  3 21:42:38] ERROR[17089]: chan_dahdi.c:13851 dahdi_ss7_error: [1] T7 expired on link SLC: 0 ADJPC: 2200
[1] Link state change: INSERVICE -> IDLE
[1] Link state change: IDLE -> NOTALIGNED
[1] MTP3 T2 timer started on link SLC: 0 ADJPC: 2200
Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  3 21:42:38] WARNING[17089]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 36)

This has happened because it filtered out NET_MNG messages incorrectly (libss7-2.0.0/mtp3.c lines 1869-1872)

When timers expire, SLC 0 seems to come up, however, SLC 2 doesn't (maybe because NET_MNG for 2 was received on SLC 1).

So, at this point, SLC 0 should be regarded as up and, in theory, signalling should be fine (as at least one signalling link is up). However, it very clearly isn't. Something is locked up in a loop. Something related to SLC 1 and/or SLC 2 (or both) appears to be left in a corrupt state.

When issuing a "ss7 show linkset 1" on the CLI, it's not that that command is breaking something; it's more likely that it's trying to access a data structure relating to SLC 1 or SLC 2 that, somewhere in the above, got corrupted.

By: Kevin Harwell (kharwell) 2020-01-07 10:48:55.853-0600

Ah yes to be clear I misstated when I said, "I'm thinking _this_ issue might be result of an infinite loop."

The possible infinite loop is a side issue, and potential result of the underlying problem. An infinite loop might only explain the affects after issuing the CLI command, for instance only part of the output showing, etc...

It does point out though that there is a probable problem with the buffer, and/or a buffer object.

By: Gregory Massel (gmza) 2020-01-07 10:52:01.672-0600

Regarding channels/sig_ss7.c, ss7_linkset, row 1493 and 1496:

case MTP2_LINK_UP:
ast_verb(1, "MTP2 link up (SLC %d)\n", e->gen.data);
break;
case MTP2_LINK_DOWN:
ast_log(LOG_WARNING, "MTP2 link down (SLC %d)\n", e->gen.data);
break;

I need someone more familiar with this code to review, however, to me it looks like the logging here can/should be corrected by replacing "e->gen.data" with "e->link.link->slc".

I don't really just want to experiment randomly, however, if this makes sense to you, then I'll try to test it.

While it won't fix the issue of the SS7 signalling hanging, it may at least correct the logging to show the correct SLC number on link failure/restoration (i.e. fixing a related bug).

By: Kevin Harwell (kharwell) 2020-01-07 12:06:20.519-0600

Unfortunately, I am not familiar with this part of the code at all. However, after glancing at the code I do think your assumption is reasonable.

The event structure is a union, so in order for the data within to make sense you'd have to ensure it contains the expected data. That being said if you take a look at where the "state" is being set to either MTP2_LINK_UP or MTP2_LINK_DOWN ([mtp2.c|https://github.com/asterisk/libss7/blob/master/mtp2.c#L628] - only place I found it being set) you'll note the "link" structure is set there too.

So I think it is reasonable to assume the event in the code you mention would be holding information with regards to the link. So de-referencing it as such should be okay.

By: Gregory Massel (gmza) 2020-01-07 14:32:43.656-0600

I've done some more testing and have some more feedback.

1. LOGGING FIX
In terms of channels/sig_ss7.c, lines 1493 ro 1496, replacing e->gen.data with e->link.link->slc corrects the logging, however, /usr/include/libss7.h defines ss7_event and ss7_event_link, but NOT struct mtp2. As a result, in addition to replacing those two references, one also needs to #include "mtp2.h". mtp2.h, in turn, includes ss7_internal.h. Neither of these files are installed to /usr/include by libss7's make install, so I had to create a symlink within channels/ pointing back to libss7-2.0.0/mtp2.h and libss7-2.0.0/ss7_internal.h and then #include "mtp2.h" within channels/sig_ss7.c (in line 48, inserted just above #include "sig_ss7.h"). Once I'd done that, it had a definition for struct mtp2 and was able to compile with the references to e->link.link->slc.

I've tested this repeatedly and, in all of the situations, including failure and restoration, it now logs the correct signalling link code (SLC).

2. INVALID MESSAGE FILTERING
I commented out lines 1869-1872 in mtp3.c (in libss7-2.0.0). There does NOT appear to be any negative consequence and, in fact, it is processing NET_MNG (COO) messages correctly now I can set ss7 timer mtp3_timer.t21 back to its default (previously I had to drop it to 1).

3. SOURCE OF THE DEADLOCK
In testing further, I found an interesting situation. If I enable four of the signalling links, depending on which one I then fail, it will either get caught in a loop (as we'd previously witnessed), or coredump entirely.

When it coredumps entirely, we get a significantly more useful dump that, I hope, will lead us directly to the source of the problem. I will attach the new core dump details now.

By: Gregory Massel (gmza) 2020-01-07 14:45:59.147-0600

Note that mtp3_free_co is causing malloc to abort the process with: "double free or corruption (out)"

This is the interesting part:

#6  0x00007f36ccaaacf3 in mtp3_free_co (link=link@entry=0x558bd0060bf0) at mtp3.c:653
       cur = <optimized out>
#7  0x00007f36ccaada66 in mtp3_changeover (link=0x558bd0060bf0, fsn=<optimized out>) at mtp3.c:884
       tmp = 0x0

What I can add to this, is that if I start Asterisk with the link already failed, it starts up cleanly and, when I unfail the link, it cleanly activates it and performs changeover.

This also tells us that the problem relates specifically to when the link FAILS, and not to when it restores.

By: Kevin Harwell (kharwell) 2020-01-07 15:35:50.750-0600

Hrm well this is very odd indeed. Besides the logging did you alter any other parts of the code? The backtrace with the crash looks unusual vs. the actual code. For instance, here is the stack trace of the crash:
{noformat}
Thread 1 (Thread 0x7f36a8278700 (LWP 485)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f371a8a8801 in __GI_abort () at abort.c:79
#2  0x00007f371a8f1897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f371aa1eb9a "%sn") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f371a8f890a in malloc_printerr (str=str@entry=0x7f371aa20870 "double free or corruption (out)") at malloc.c:5350
#4  0x00007f371a8ffe75 in _int_free (have_lock=0, p=0x7f3650013cd0, av=0x7f371ac53c40 <main_arena>) at malloc.c:4278
#5  __GI___libc_free (mem=0x7f3650013ce0) at malloc.c:3124
#6  0x00007f36ccaaacf3 in mtp3_free_co (link=link@entry=0x558bd0060bf0) at mtp3.c:653
#7  0x00007f36ccaada66 in mtp3_changeover (link=0x558bd0060bf0, fsn=<optimized out>) at mtp3.c:884
#8  0x00007f36ccaae5ee in net_mng_receive (rl=0x7f36a8277680, rl=0x7f36a8277680, rl=0x7f36a8277680, len=<optimized out>, buf=<optimized out>, mtp2=0x558bd00f29e0, ss7=0x558bd006bdc0) at mtp3.c:1271
#9  mtp3_receive (ss7=0x558bd006bdc0, link=0x558bd00f29e0, msg=<optimized out>, len=<optimized out>) at mtp3.c:1891
#10 0x00007f36ccaaf86e in ss7_write (ss7=0x558bd006bdd8, fd=<optimized out>) at ss7.c:475
#11 0x02260308c007909e in ?? ()
#12 0x0000000000673321 in ?? ()
#13 0x0000000000000000 in ?? ()
{noformat}
I'm not seeing where _ss7_write_ directly calls the _mtp3_receive_ function. In fact [line 475|https://github.com/asterisk/libss7/blob/2.0/ss7.c#L475] calls _mtp2_transmit_ not receive. Something appears quite "off" with the call stack.

Valgrind might help to track down the memory corruption. Barring that you'd have to find where all the memory gets freed for the object in question, and double check that is it freeing when it should, and it's getting appropriately removed from any lists and such. Sounds like you've narrowed it down a bit at least where the corruption may start.

By: Gregory Massel (gmza) 2020-01-07 16:55:56.644-0600

No, I haven't altered the code beyond the logging and commenting out of lines 1869-1872 in mtp3.c But the problem ocurred even before then; I've only made those changes since logging this.
I'm really out of my depth here. I'm not a programmer by trade and the last time I wrote code of my own in C/C++ was ~20 years ago.
Is there any chance that Matthew Fredrickson could take a look, as he wrote most of this code?

By: Gregory Massel (gmza) 2020-01-07 18:16:57.530-0600

So the change that I made to remove filtering will impact how link changeover is handled.

In the original code, mtp3_receive is filtering messages where link->slc != rl.sls. However, a SIG_NET_MNG with COO/COA message will always have link->slc != rl.sls because the very nature of that message is that it's signalling a changeover to avoid using the failed link. To expect the COO/COA message via the failed link is to expect the impossible. This is why lines 1869-1872 in mtp3.c are just illogical.

With that original code, it seems impossible that net_mng_receive will ever deal with "case NET_MNG_COA:". So, prior to me commenting lines 1869-1872 in mtp3.c out, what would happen is that COO message would be sent, however, COA (acknowledgement of COO) would be received but immediately dropped. Processing would then proceed in terms of a timer expiration within mtp3_t2_expired. After I removed the offending filter lines, the COA message would be dealt with within net_mng_receive, as it really should be (i.e. properly dealt with rather than treated as timed out).

The difference in behavior relates to the two different functions called:
{noformat}
static void mtp3_t2_expired(void * data)
{
       struct mtp2 *link = data;
       struct ss7_msg *tmp = NULL;

       link->mtp3_timer[MTP3_TIMER_T2] = -1;
       link->got_sent_netmsg &= ~(SENT_COO | SENT_ECO);
       mtp3_move_buffer(link->master, link, &link->co_tx_q, &tmp, -1, -1);
       mtp3_move_buffer(link->master, link, &link->co_buf, &tmp, -1, -1);
       mtp3_transmit_buffer(link->master, &tmp);
       link->changeover = CHANGEOVER_COMPLETED;
       mtp3_free_co(link);
       mtp3_check(link->adj_sp);
       ss7_message(link->master, "MTP3 T2 timer expired on link SLC: %i ADJPC: %i changeover completed\n",
                       link->slc, link->dpc);
}

static void mtp3_changeover(struct mtp2 *link, unsigned char fsn)
{
       struct ss7_msg *tmp = NULL;

       if (link->changeover == CHANGEBACK || link->changeover == CHANGEBACK_INITIATED) {
               mtp3_cancel_changeback(link);
       }
       if (link->changeover == NO_CHANGEOVER ||
                       link->changeover == CHANGEOVER_INITIATED) {
               mtp3_move_buffer(link->master, link, &link->co_tx_buf, &tmp, -1, fsn);
               mtp3_move_buffer(link->master, link, &link->co_tx_q, &tmp, -1, -1);
               mtp3_move_buffer(link->master, link, &link->co_buf, &tmp, -1, -1);
               mtp3_transmit_buffer(link->master, &tmp);
               link->changeover = CHANGEOVER_COMPLETED;
               ss7_message (link->master, "Changeover completed on link SLC: %i PC: %i FSN: %i\n", link->slc, link->dpc, fsn);
               mtp3_free_co(link);
               mtp3_check(link->adj_sp);
       }
}
{noformat}

From what I am reading, the key difference between the two is an additional:
{noformat}
               mtp3_move_buffer(link->master, link, &link->co_tx_buf, &tmp, -1, fsn);
{noformat}
in mtp3_changeover.

Even still, neither of these two functions seems to adequately handle changeover and, in both cases, it seems that some sort of corruption arises.

By: Gregory Massel (gmza) 2020-01-07 18:28:11.289-0600

BTW, I haven't the faintest clue what is trying to be achieved with this code:
{noformat}
mtp3_move_buffer(link->master, link, &link->co_tx_buf, &tmp, -1, fsn);
mtp3_move_buffer(link->master, link, &link->co_tx_q, &tmp, -1, -1);
mtp3_move_buffer(link->master, link, &link->co_buf, &tmp, -1, -1);
mtp3_transmit_buffer(link->master, &tmp);
{noformat}

But it's interesting that they're writing to a buffer; it seems to be the same buffer that you identified as infinitely looping the len_buf function. So in the one instance, it corrupts the buffer sufficiently to cause anything accessing it to get caught in a loop, while in the other it corrupts it such that mtp3_free_co tries to free it and causes the process to abort. The corruption likely happens before mtp3_free_co, and I'm willing to bet within the above lines of code.

Of course it's possible that I'm just being too quick to blame the section of code I'm really struggling to wrap my head around...

By: Gregory Massel (gmza) 2020-01-09 11:07:57.447-0600

More insight from an actual crash early this morning (not one that reproduced).

Four E1's with 4 signalling links (SLC 0 to 3) and the fourth E1 went into alarm.
Immediately it logs:
{noformat}
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 3 ADJPC: 2200
{noformat}
Then correctly sends COO via SLC's 0, 1 and 2 in respect of 3:
{noformat}
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ 57 b7 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 55 FIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 87 BIB 0
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:0] MSU
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 57 b7 0b ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      OPC 776 DPC 2200 SLS 3
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 98 08 c2 30 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ e9 fd 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 125 FIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 105 BIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:1] MSU
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ e9 fd 0b ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      OPC 776 DPC 2200 SLS 3
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 98 08 c2 30 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ 7a 8f 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 15 FIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 8f 0b ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      OPC 776 DPC 2200 SLS 3
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 98 08 c2 30 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 0d 00 00 00 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ 85 8d 01 03 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 13 FIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 5 BIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:3] LSSU SIOS
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
{noformat}

Then it receives COA via SLC 0 in respect of SLC3:
{noformat}
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ b7 58 07 c0 08 03 26 32 21 05 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 88 FIB 0
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 55 BIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:0] MSU
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ b7 58 07 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      OPC 2200 DPC 776 SLS 3
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 08 03 26 32 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 2
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COA
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 21 05 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer stopped on link SLC: 3 ADJPC: 2200
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Changeover completed on link SLC: 3 PC: 2200 FSN: 5
{noformat}

The above looks good, however, things go bad from here.
We receive another COA in respect of our second COO, however, it's received on SLC 0 and, owing to logic error in libss7, it's expecting it to only be received via the SLC that it was originally sent out of (SLC's 1 and 2):

{noformat}
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ b7 59 07 c0 08 03 26 32 21 05 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 89 FIB 0
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 55 BIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:0] MSU
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ b7 59 07 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      OPC 2200 DPC 776 SLS 3
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 08 03 26 32 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 2
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COA
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]      [ 21 05 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:47] ERROR[22441] chan_dahdi.c: [1] Got COA on SLC 3 ADJPC 2200 but we haven't sent COO or ECO
{noformat}

Note the error logged!

But here is where it gets even worse:
{noformat}
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:3] LSSU SIO
[Jan  9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:49] ERROR[22441] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: INSERVICE -> IDLE
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: IDLE -> NOTALIGNED
[Jan  9 04:08:49] ERROR[22441] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: INSERVICE -> IDLE
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: IDLE -> NOTALIGNED
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 0 ADJPC: 2200
[Jan  9 04:08:49] VERBOSE[22441] sig_ss7.c: Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  9 04:08:49] WARNING[22441] sig_ss7.c: MTP2 link down (SLC 0)
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 1 ADJPC: 2200
[Jan  9 04:08:49] VERBOSE[22441] sig_ss7.c: Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  9 04:08:49] WARNING[22441] sig_ss7.c: MTP2 link down (SLC 1)
{noformat}

libss7 thinks that SLC 0 and SLC 1 are down because of a timer T7 expiry; i.e. it thinks that there was excessive delay in it receiving a COA response to its COO message, however, it received the COA's but disregarded them because it erroneously thought they were not valid if not received on the same SLC on which they were sent!

Now it tries to send COO via SLC 2 in respect of SLC 0 and SLC 1 (which are actually up):
{noformat}
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:0] LSSU SIO
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:1] LSSU SIO
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 12 [ 7a 90 09 c0 98 08 c2 00 11 59 00 00 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 16 FIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 90 09 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      OPC 776 DPC 2200 SLS 0
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ 98 08 c2 00 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 59 00 00 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ 7a 91 07 c0 98 08 c2 10 11 69 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 17 FIB 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 91 07 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ c0 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      OPC 776 DPC 2200 SLS 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ 98 08 c2 10 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      H0: 1 H1: 1
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      Message type: COO
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]      [ 11 69 ]
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
{noformat}

At this stage, only SLC 2 is thought to be up and, as a message arrives on SLC 2:
{noformat}
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] <[2200:2] MSU
[Jan  9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 91 7b 07 ]
{noformat}

Asterisk then crashes.
From the core dump, we can see that we received a COA (as expected) and tried to process changeover:

{noformat}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fb1cd25d801 in __GI_abort () at abort.c:79
#2  0x00007fb1cd2a6897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fb1cd3d3b9a "%sn") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fb1cd2ad90a in malloc_printerr (str=str@entry=0x7fb1cd3d5870 "double free or corruption (out)") at malloc.c:5350
#4  0x00007fb1cd2b4e75 in _int_free (have_lock=0, p=0x7fb1040058b0, av=0x7fb1cd608c40 <main_arena>) at malloc.c:4278
#5  __GI___libc_free (mem=mem@entry=0x7fb1040058c0) at malloc.c:3124
#6  0x00007fb17b480077 in mtp3_transmit (ss7=ss7@entry=0x560a784062e0, userpart=<optimized out>, rl=..., priority=<optimized out>, m=0x7fb1040058c0, link=0x0) at mtp3.c:1796
#7  0x00007fb17b480103 in mtp3_transmit_buffer (ss7=0x560a784062e0, buf=buf@entry=0x7fb12e474640) at mtp3.c:640
#8  0x00007fb17b482a33 in mtp3_changeover (link=link@entry=0x560a783f59c0, fsn=<optimized out>) at mtp3.c:881
#9  0x00007fb17b4835ee in net_mng_receive (rl=0x7fb12e474670, rl=0x7fb12e474670, rl=0x7fb12e474670, len=776423125, buf=<optimized out>, mtp2=0x560a78468da0, ss7=0x560a784062e0) at mtp3.c:1207
#10 mtp3_receive (ss7=0x560a784062e0, link=0x560a78468da0, msg=msg@entry=0x7fb12e4746db, len=<optimized out>) at mtp3.c:1891
#11 0x00007fb17b47f67e in msu_rx (len=<optimized out>, h=<optimized out>, link=<optimized out>) at mtp2.c:765
#12 mtp2_receive (link=<optimized out>, buf=buf@entry=0x7fb12e4746e0 "221{a300b003&022!}024377", len=<optimized out>) at mtp2.c:956
#13 0x00007fb17b48486e in ss7_read (ss7=<optimized out>, fd=<optimized out>) at ss7.c:495
#14 0x00007fb17b920713 in ss7_linkset (data=0x7fb17bb3ff60 <linksets>) at sig_ss7.c:1452
#15 0x0000560a75090e21 in dummy_start (data=0x560a783fb830) at utils.c:1249
#16 0x00007fb1ce1556db in start_thread (arg=0x7fb12e475700) at pthread_create.c:463
#17 0x00007fb1cd33e88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
{noformat}

It's basically doing something that is causing it to try and remove its last remaining signalling link and then trying to transmit a message out the signalling link that it's just removed.

So two MORE bugs can be isolated:
1. It should never try and transmit a message out a link structure that it's just free()'ed;
2. It is erroneously disregarding COA messages received if they're received via a different SLC to the one the COO message was transmitted out of, notwithstanding that the message was correct and valid in respect of the SLC specified (SLS parameter).


By: Gregory Massel (gmza) 2020-01-09 13:41:00.136-0600

Looking at above, COO should NOT have been sent across all 3 remaining links in respect of the failed one. One COO would have been correct.
It seems that it is actually disregarding the COA messages because it only tracks one COO at a time. So, had it not transmitted three of them, it wouldn't have complained about receiving the additional unexpected COA's.

It seems that in mtp3.c, after line 1603 (within net_mng_send), perhaps we need a "break;"

{noformat}
       if (link->std_test_passed) {
               return mtp3_transmit(ss7, SIG_NET_MNG, rl, 3, m, link);
       } else {
               /* we may use another link to the same adjacent sp */
               res = -1;
               for (i = 0; i < link->adj_sp->numlinks; i++) {
                       if (link->adj_sp->links[i]->std_test_passed) {
                               res = mtp3_transmit(ss7, SIG_NET_MNG, rl, 3, m, link->adj_sp->links[i]);
+ break;
                       }
               }
{noformat}

This is the only code that I can find that I believe could, potentially, retransmit the same COO message over each of the remaining links (instead of just one).

I'm going to try add this break in and re-test because I suspect this could address the issue.

By: Gregory Massel (gmza) 2020-01-09 14:23:12.008-0600

It worked!!!

With the 'break;' in place, I'm now able to repeatedly fail any of the four links and Asterisk correcly handles it, sending COO, receiving COA, marking the link as FAILED and effecting changeover cleanly. Moreover, I can bring the failed link back up and it cleanly restores. I can successfully do a "ss7 show linkset 1" in between each.

So, what I've discovered through this is that there are TWO bugs that actually need addressing:
1. The original issue of "Asterisk becomes unstable after SS7 signalling link restarts"
2. A logging bug which can be described as "Wrong SLC number logged when MTP2 link state changes"

And, a more precise description of bug 1 is "Crash and/or instability during SS7 link changeover with linksets of more then two links"

Let's deal with 1 as it's what was originally logged here. The fix involves two changes to mtp3.c (libss7-2.0.0):
1. Lines 1869-1872 in mtp3.c should be removed or commented out:
{noformat}
/*              if (link->slc != rl.sls) {
                       ss7_error(ss7, "Received message for slc 0x%x, but we are 0x%x.  Dropping\n", rl.sls, link->slc);
                       return -1;
               } */
{noformat}
This code is entirely nonsensical because, by very nature of a link having failed, we will NEVER receive SIG_NET_MNG over the failed link and have to receive it over one of the other links.
2. Lines 1599 to 1606 require the addition of a "break;" so as to exit the for loop after the NET_MNG message has been successfully sent via an alternate link:
{noformat}
               /* we may use another link to the same adjacent sp */
               res = -1;
               for (i = 0; i < link->adj_sp->numlinks; i++) {
                       if (link->adj_sp->links[i]->std_test_passed) {
                               res = mtp3_transmit(ss7, SIG_NET_MNG, rl, 3, m, link->adj_sp->links[i]);
+++                             break;
                       }
               }
{noformat}

In terms of bug 2, the logging bug, the applicable code is in channels/sig_ss7.c (Asterisk itself), lines 1493 to 1498, within ss7_linkset.
I changed it as follows.
Before:
{noformat}
                      case MTP2_LINK_UP:
                               ast_verb(1, "MTP2 link up (SLC %d)\n", e->gen.data);
                               break;
                       case MTP2_LINK_DOWN:
                               ast_log(LOG_WARNING, "MTP2 link down (SLC %d)\n", e->gen.data);
                               break;
{noformat}
After:
{noformat}
                      case MTP2_LINK_UP:
                               ast_verb(1, "MTP2 link up (SLC %d)\n", e->link.link->slc);
                               break;
                       case MTP2_LINK_DOWN:
                               ast_log(LOG_WARNING, "MTP2 link down (SLC %d)\n", e->link.link->slc);
                               break;
{noformat}
In essence, e->gen.data is random garbage and the correct way to determine the SLC number is e->link.link->slc.
However, where things get more complex is that sig_ss7.c won't compile then because the reference to "slc" requires struct mtp2 to be defined and that is defined in mtp2.h which is in the libss7 source base. I added a #include "mtp2.h" into sig_ss7.c and symlinked mtp2.h from libss7 into the asterisk/channels directory as well as ss7_internal.h (because mtp2.h includes it), however, there is more than likely a cleaner solution to this (e.g. making libss7 install mtp2.h and ss7_internal.h to /usr/include as part of a "make install").

I would greatly appreciate it if you could submit the first fix for inclusion in the next version of libss7 and finesse the second fix and then include it in Asterisk.

Thank you for all the assistance in helping me understand the applicable code and backtraces so as to get to this point of being able to propose fixes.

By: Gregory Massel (gmza) 2020-01-09 14:27:21.113-0600

BTW, the reason I suggested that this be renamed to "Crash and/or instability during SS7 link changeover with linksets of more then two links" is because, having looked at the above, it is technically impossible for this situation to ocurr when a linkset has either one or two links. With one link, changeover cannot happen and, with two links, there is only ever ONE remaining link for COO to be sent out. It's only if you have three or more than COO can be sent two or more times over.

By: Joshua C. Colp (jcolp) 2020-01-09 15:58:15.210-0600

The libss7 project is currently unmaintained however we do our best to still review changes where we are able. It follows the same review process as Asterisk, documented on the wiki[1] except from the "libss7" project instead. If you put the change up using that process it will be reviewed and once through review will then be merged. I've also created an internal issue to perform a libss7 release, but I have no timeframe on when that would occur. If you don't wish to go through the process then this issue will remain open with the information and if another individual wishes they can go through the process.

Cheers,

[1] https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

By: Kevin Harwell (kharwell) 2020-01-09 16:08:26.701-0600

[~gmza] Excellent news! I'm glad to hear you were able to figure it out.

What [~jcolp] said, but wanted to add a few more links in case you decided create some patches for submission.

If you would like to submit a patch, or patches, you'll first need to sign a contributor's license [1] if you have not previously done so. The just follow the Code Review [2] instructions on the wiki. Also be sure to:

* Verify that your patch conforms to the Coding Guidelines [3]
* Review the Code Review Checklist [4] for common items reviewers will look for

When ready, submit your patch and any tests to Gerrit [5] for code review.

Thanks!

[1] https://issues.asterisk.org/jira/secure/DigiumLicense.jspa
[2] https://wiki.asterisk.org/wiki/display/AST/Code+Review
[3] https://wiki.asterisk.org/wiki/display/AST/Coding+Guidelines
[4] https://wiki.asterisk.org/wiki/display/AST/Code+Review+Checklist
[5] https://wiki.asterisk.org/wiki/display/AST/Gerrit+Usage

By: Joshua C. Colp (jcolp) 2020-02-03 10:09:27.852-0600

[~gmza] Do you plan on putting the change up for review on Gerrit so we can then do a release?