### 4.1 Synthetic data

#### 4.1.1 Visualization of JLNMF

The performance of the proposed JLNMF is visualized and compared with standard NMF through a synthetic example. Two non-stationary signals, *x*(*m*) and *y*(*m*), are generated by combining a set of non-stationary components of the form *α* *g*(*μ*,*σ*) sin(2*π*(*f*+*θ* *m*)*m*), where *g*(*μ*,*σ*) is a Gaussian function with mean *μ* and variance *σ*, and the set (*α*,*μ*,*σ*,*f*,*θ*) holds the parameters of each component (a) to (g). The parameters are chosen so that components (a) through (d) (see Fig. 3) form the structure shared by the two signals, components (e) and (f) form the distinct structure of signal *x*(*m*), and the chirp component (g) forms the distinct structure of signal *y*(*m*). The parameters (*α*,*μ*,*σ*,*f*,*θ*) for components (a) to (g) are, respectively, (1, 1.2, 0.05, 0.325, 0), (1, 1.2, 0.05, 0.075, 0), (1, 3.4, 0.04, 0.125, 0), (1, 5.6, 0.03, 0.325, 0), (3, 0.9, 0.001, 0.450, 0), (3, 3.5, 0.001, 0.213, 0), and (1, 5.3, 0.05, 0.005, 0.245). A spectrogram with an FFT size of 1024 points and a Kaiser window with a parameter of 5, a length of 256 samples, and an overlap of 220 samples is used to construct the TF representations.

Each TF data matrix (i.e., **V**_{1} and **V**_{2}) was decomposed using the standard NMF method with *r*=6, and the decomposed TF basis and coefficient matrices are displayed in Fig. 3c, d. It can be seen that NMF adaptively breaks down the non-stationary signal into TF basis and coefficient matrices representing the spectral and temporal information of the signal, respectively. For example, the first TF basis in **W**_{1} represents the spectral characteristics of components (f) and (c) in the TF signal **V**_{1}, and the corresponding TF coefficient in **H**_{1} represents the temporal locations of those components in the TF signal. Ideally, the decomposed TF bases would represent the discriminant structures of each TF signal separately from the shared ones, so that the discriminant TF bases would be powerful in differentiating between the two signals. However, there is no separation between the TF bases that represent the shared structure (i.e., components (a)–(d)) and the distinct structure of each signal; for example, TF basis 1 of *x*(*m*) contains the frequency structure of components (e) and (b), and similarly TF basis 2 models components (a), (b), and (e). A similar behavior can be seen in the TF bases of signal *y*(*m*).
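As an illustration of how the TF matrices in this example can be constructed, the following Python sketch implements the component model *α* *g*(*μ*,*σ*) sin(2*π*(*f*+*θ* *m*)*m*) and a Kaiser-window spectrogram with the settings quoted above. It is a minimal sketch rather than the code used for Fig. 3: the function names `component` and `spectrogram` are hypothetical, the use of the squared magnitude is an assumption, and, because the units of *m* are not restated here, the demonstration uses toy parameter values on a sample-index axis instead of the values listed for components (a) to (g).

```python
import numpy as np

def component(m, alpha, mu, sigma, f, theta):
    """alpha * g(mu, sigma) * sin(2*pi*(f + theta*m)*m), with sigma used as the Gaussian variance."""
    envelope = np.exp(-((m - mu) ** 2) / (2.0 * sigma))
    return alpha * envelope * np.sin(2.0 * np.pi * (f + theta * m) * m)

def spectrogram(x, nfft=1024, win_len=256, overlap=220, beta=5.0):
    """Power spectrogram with a Kaiser window; rows are frequency bins, columns are frames."""
    window = np.kaiser(win_len, beta)
    hop = win_len - overlap
    n_frames = (len(x) - win_len) // hop + 1
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    # rfft zero-pads each 256-sample frame to nfft points along axis 0
    return np.abs(np.fft.rfft(frames, n=nfft, axis=0)) ** 2

# Toy values (assumptions, not the paper's parameters): m is the sample index.
m = np.arange(1000, dtype=float)
x = component(m, alpha=1.0, mu=500.0, sigma=2000.0, f=0.1, theta=0.0)
V = spectrogram(x)
peak_bin = int(np.argmax(V.sum(axis=1)))  # frequency bin with the most energy
```

With 1000 samples, a 256-sample window, and a hop of 256−220=36 samples, `V` has 513 frequency bins and 21 frames, and the energy peak lies near bin 0.1×1024≈102.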

We repeated the process using the proposed JLNMF with *r*=6 and *r*_{j}=3. The values of *σ* and *β* were set to 0.01 and 0.1, respectively, as commonly used in the literature [15], and a value of 0.05 was selected for both *λ* and *γ*. As can be seen in Fig. 4a, the first three (i.e., *r*_{d}=3) components of the decomposed TF bases **W**_{1d} and **W**_{2d} represent the distinct structure of signals *x*(*m*) and *y*(*m*), respectively, and the last three (*r*_{j}) are zero (see Eq. (7) for the structures of **W**_{d} and **W**_{j} with respect to **W**). The shared structure is represented in the last three (*r*_{j}=3) components of the shared TF bases **W**_{j}. Interestingly, although *r*_{d} was set to three, the method adaptively identifies that only two distinct TF bases are sufficient to model the discriminant structure in *x*(*m*), and returns an empty (all-zero) vector for one of the TF basis vectors in **W**_{2d}. Additionally, unlike the standard NMF TF bases, none of the JLNMF TF bases represents both the shared and the distinct TF structure of the original signals *x*(*m*) and *y*(*m*).

Comparing the decomposed TF basis and coefficient matrices of the proposed JLNMF and standard NMF (Figs. 3c, d and 4a, b), it can be observed that the JLNMF coefficients are noticeably more localized. Furthermore, the discriminant TF structures of signals *x*(*m*) and *y*(*m*) are successfully reconstructed as **W**_{1d}**H**_{1} and **W**_{2d}**H**_{2}, respectively, as shown in Fig. 4c. The shared structures are also successfully reconstructed as **W**_{j}**H**_{1} and **W**_{j}**H**_{2} for each signal, as shown in Fig. 4d. This example demonstrates that the proposed JLNMF successfully separates the shared components and the distinct class-specific structures of each TF data matrix. The only way to achieve this separation with standard NMF is to identify the correlated TF bases of **W**_{1} and **W**_{2} as the shared TF bases, and the uncorrelated ones as the class-specific TF bases. However, as can be seen in Fig. 3c, the standard NMF TF bases are a mixture of shared and distinct structures and the TF bases of the two classes are correlated; therefore, standard NMF cannot achieve the results shown in Fig. 4.
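The reconstruction of the distinct and shared structures from a decomposed basis matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the basis matrix is taken to be ordered with the *r*_{d}=*r*−*r*_{j} distinct bases first and the *r*_{j} shared bases last, as described above, and the matrices here are random placeholders rather than JLNMF outputs. The function name `split_reconstruction` is hypothetical.

```python
import numpy as np

def split_reconstruction(W, H, r_d):
    """Split W @ H into the part carried by the first r_d (distinct) bases
    and the part carried by the remaining (shared) bases."""
    V_distinct = W[:, :r_d] @ H[:r_d, :]
    V_shared = W[:, r_d:] @ H[r_d:, :]
    return V_distinct, V_shared

rng = np.random.default_rng(0)
r, r_j = 6, 3                 # values used in this example
r_d = r - r_j
W1 = rng.random((513, r))     # placeholder for the bases of signal x(m)
H1 = rng.random((r, 40))      # placeholder for the coefficients of signal x(m)
V1d_hat, V1j_hat = split_reconstruction(W1, H1, r_d)
```

By construction the two parts add up to the full reconstruction, so inspecting them separately, as in Fig. 4c, d, loses no information.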

#### 4.1.2 Properties of JLNMF

It is desirable that the extracted discriminant bases are robust to signal processing operations, which are not expected to affect the classification task. This section examines the properties of JLNMF for amplitude scale and time shift.

##### 4.1.2.1 Amplitude scaling:

If *y*(*m*) is an amplitude-scaled version of signal *x*(*m*), such that *y*(*m*)=*a* *x*(*m*), the TFM of *y*(*m*), **V**_{2}, can be written as **V**_{2}=*a*^{2}**V**_{1}, where **V**_{1} is the TFM of signal *x*(*m*) and *a* is the amplitude scale. Since amplitude scaling does not affect the TF structure of the signal, JLNMF is not expected to identify any discriminant TF bases. To test this, an amplified version of signal *x*(*m*) in Fig. 3a was generated and JLNMF was applied to the original and scaled TF data. As expected, both discriminant matrices, **W**_{1d} and **W**_{2d}, were empty, demonstrating that JLNMF correctly identified that there were no distinct differences between the two data sets, and verifying that JLNMF is invariant to amplitude scaling.
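The relation **V**_{2}=*a*^{2}**V**_{1} can be checked numerically. The sketch below assumes that the TFM is a squared-magnitude spectrogram (which is what makes the factor *a*^{2} rather than *a*); `power_spectrogram` is a hypothetical helper using the spectrogram settings of Section 4.1.1, and the test signal is arbitrary noise.

```python
import numpy as np

def power_spectrogram(x, nfft=1024, win_len=256, overlap=220, beta=5.0):
    """Squared-magnitude spectrogram with a Kaiser window."""
    window = np.kaiser(win_len, beta)
    hop = win_len - overlap
    n_frames = (len(x) - win_len) // hop + 1
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    return np.abs(np.fft.rfft(frames, n=nfft, axis=0)) ** 2

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)     # arbitrary test signal
a = 3.0                           # amplitude scale
V1 = power_spectrogram(x)
V2 = power_spectrogram(a * x)
scaling_holds = np.allclose(V2, a ** 2 * V1)
```

Because the two matrices differ only by a positive scalar, any basis that explains **V**_{1} also explains **V**_{2} with rescaled coefficients, which is consistent with the empty discriminant matrices reported above.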

##### 4.1.2.2 Time shift:

The behavior of JLNMF under a temporal shift is examined using the following example. Consider two signals *x*(*m*) and *y*(*m*)=*x*(*m*−*τ*). The temporal shift only shifts the TF structure of the TF data in time and does not change the TF components of the data (i.e., **V**_{2}(*n*,*m*)=**V**_{1}(*n*,*m*−*τ*)). Hence, JLNMF is not expected to identify any distinct structural differences between *x*(*m*) and *y*(*m*). A signal *x*(*m*) and its shifted version *x*(*m*−*τ*) with a time shift of *τ*=1.5 s are generated as shown in Fig. 5a, and the TF data are displayed in Fig. 5b.

JLNMF with *r*=9 and *r*_{j}=6 was performed on the TF data, and the decomposed TF bases and coefficients are displayed in Fig. 5c. The method did not identify any discriminant TF matrices (i.e., **W**_{1d} and **W**_{2d} are both zero). Hence, \(\hat {\textbf {V}}_{1d}\) and \(\hat {\textbf {V}}_{2d}\) are empty. **W**_{j} contained the spectral vectors that are shared between *x*(*m*) and *x*(*m*−*τ*), and **H**_{2} was equal to **H**_{1}(*m*−*τ*). As can be seen, the JLNMF algorithm is invariant to time shifting.
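The shift relation **V**_{2}(*n*,*m*)=**V**_{1}(*n*,*m*−*τ*) can be illustrated with a short numerical check. The sketch below is an assumption-laden toy: the signal, its length, and the shift are hypothetical, and the shift is chosen as an integer number of spectrogram hops so that the frames align exactly (for other shifts the relation holds only approximately).

```python
import numpy as np

def power_spectrogram(x, nfft=1024, win_len=256, overlap=220, beta=5.0):
    """Squared-magnitude spectrogram with a Kaiser window."""
    window = np.kaiser(win_len, beta)
    hop = win_len - overlap
    n_frames = (len(x) - win_len) // hop + 1
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    return np.abs(np.fft.rfft(frames, n=nfft, axis=0)) ** 2

hop = 256 - 220                   # 36 samples between consecutive frames
k = 10                            # shift expressed in frames (toy value)
tau = k * hop                     # shift in samples

m = np.arange(4000, dtype=float)
x = np.exp(-((m - 1500.0) ** 2) / (2.0 * 200.0 ** 2)) * np.sin(2.0 * np.pi * 0.1 * m)
y = np.concatenate([np.zeros(tau), x[:-tau]])   # y(m) = x(m - tau)

V1 = power_spectrogram(x)
V2 = power_spectrogram(y)
shift_holds = np.allclose(V2[:, k:], V1[:, :-k])  # columns move by k frames
```

Since the columns of the TF matrix are only displaced, the spectral content (the set of columns that a basis matrix must explain) is unchanged, and only the coefficient matrix needs to shift, matching the observation that **H**_{2} equals a shifted **H**_{1}.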

#### 4.1.3 Data retrieval application using JLNMF algorithm

The developed JLNMF is applied for a data retrieval case, where there is an overlap of information in the test data. In order to evaluate the results, a synthetic non-stationary dataset is generated (see Fig. 6) as inspired by the previous work in the literature [24, 25]. In this data, the triangles in the TF representation of classes *x* and *y* could be considered as the distinct structure in each class, and the box represents the shared structures between the two. Hence, each class consists of a class-specific distinct structure and one shared structure. Each signal is defined as the sum of two components as defined below:

$$ {{}\begin{aligned} x_{\text{train}}(m) &= g(0.5,0.18) {\sin}\left[2\ast \pi\left(a_{0}+a_{1}m\right)\right] \\ &\quad + g(0.5,0.18){\sin}\left[2\ast \pi\left(b_{0}+b_{1}m + b_{2} m^{2}\right)\right], \end{aligned}} $$

(30)

where the parameters of the shared component (i.e., *a*_{0}, *b*_{0}) belong to a uniform distribution U(0,1), *a*_{1}=0.25, *b*_{1}=0.40, and *N*=1,000 is the signal length in samples with a sampling frequency of 1 kHz. Two classes are generated by selecting *b*_{2} from one of the following uniform distributions:

$$ \text{Class 1: U}\left(\frac{-0.30}{2(N-1)},\frac{-0.20}{2(N-1)}\right) $$

(31a)

and

$$ \text{Class 2: U}\left(\frac{-0.15}{2(N-1)},\frac{-0.05}{2(N-1)}\right) $$

(31b)

The TF representation for signals in each class is plotted in Fig. 6a, b. For training purposes, a total of 300 signals are generated in each class. A test data set is generated with 100 % overlap between the two classes, i.e., each test signal contains the distinct structures of both classes, as follows:

$${} \begin{aligned} x_{\text{test}}(m) &= g(0.5,0.18) {\sin}\left[2\ast\pi\left(a_{0}+a_{1}m\right)\right]\\ &\quad + g(0.5,0.18) {\sin}\left[2\ast \pi\left(b_{0}+b_{1}m + b_{21} m^{2}\right)\right]\\ &\quad + g(0.5,0.18) {\sin}\left[2\ast \pi\left(b_{0}+b_{1}m + b_{22} m^{2}\right)\right] \end{aligned} $$

(32)

where *b*_{21} and *b*_{22} are selected using Eqs. (31a) and (31b), respectively. The TF representation of a test signal is plotted in Fig. 6c.
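The data generation of Eqs. (30)–(32) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes that *m* is the sample index 0,…,*N*−1 inside the sinusoids, that the Gaussian *g*(0.5,0.18) is evaluated on a time axis in seconds with 0.18 taken as the variance (following the definition of *g* in Section 4.1.1), and that *a*_{0} and *b*_{0} are redrawn for every signal. All function and constant names are hypothetical.

```python
import numpy as np

N = 1000        # signal length in samples (from the text)
FS = 1000.0     # sampling frequency in Hz (from the text)
B2_NUMERATORS = {1: (-0.30, -0.20), 2: (-0.15, -0.05)}  # Eqs. (31a) and (31b)

def envelope(mu=0.5, sigma=0.18):
    """Gaussian g(mu, sigma) on a time axis in seconds (sigma as variance: assumption)."""
    t = np.arange(N) / FS
    return np.exp(-((t - mu) ** 2) / (2.0 * sigma))

def draw_b2(cls, rng):
    """Draw the chirp coefficient b2 for class 1 or 2."""
    low, high = B2_NUMERATORS[cls]
    return rng.uniform(low, high) / (2.0 * (N - 1))

def make_train_signal(cls, rng, a1=0.25, b1=0.40):
    """Eq. (30): one shared tone plus one class-specific chirp."""
    m = np.arange(N, dtype=float)
    a0, b0 = rng.uniform(0.0, 1.0, size=2)
    shared = np.sin(2.0 * np.pi * (a0 + a1 * m))
    distinct = np.sin(2.0 * np.pi * (b0 + b1 * m + draw_b2(cls, rng) * m ** 2))
    return envelope() * (shared + distinct)

def make_test_signal(rng, a1=0.25, b1=0.40):
    """Eq. (32): the shared tone plus the chirps of both classes."""
    m = np.arange(N, dtype=float)
    a0, b0 = rng.uniform(0.0, 1.0, size=2)
    total = np.sin(2.0 * np.pi * (a0 + a1 * m))
    for cls in (1, 2):
        total = total + np.sin(2.0 * np.pi * (b0 + b1 * m + draw_b2(cls, rng) * m ** 2))
    return envelope() * total

rng = np.random.default_rng(0)
train_1 = np.stack([make_train_signal(1, rng) for _ in range(300)], axis=1)
test_set = np.stack([make_test_signal(rng) for _ in range(40)], axis=1)
```

Under these assumptions the instantaneous frequency of the chirp, *b*_{1}+2*b*_{2}*m*, starts at 0.40 and ends between 0.10 and 0.20 for class 1 and between 0.25 and 0.35 for class 2, which is what separates the two classes in the TF plane.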

A total of 40 test signals are generated, and the training/classification tasks are performed as follows:

(1) A spectrogram with an FFT size of 128 points and a Kaiser window with a parameter of 5, a length of 128 samples, and an overlap of 125 samples is used to construct the TF matrix of each signal. The dimension of the TF matrix is 65×291.

(2) Each TF matrix is collapsed into a vector, *v*_{18915×1}, and the training data sets are created as \(\textbf {V}^{(18915\times 300)}_{1} = \left [v_{1}(1) \ v_{1}(2) \ \cdots \ v_{1}(300) \right ]\) and \(\textbf {V}^{(18915\times 300)}_{2} = \left [v_{2}(1) \ v_{2}(2) \ \cdots \ v_{2}(300) \right ]\).

(3) JLNMF is applied to the TF data **V**_{1} and **V**_{2}, and the class-specific and shared TF bases are obtained. The parameters *r* and *r*_{j} are set to 40 and 20, respectively, and *λ* and *γ* are varied over {0.05,0.1,0.15,0.2}.

(4) For classification purposes, each test vector, *v*_{test}(*i*)_{i=1:40}, is projected onto the JLNMF class-specific and shared TF bases obtained in the previous step. To do so, an overall TF basis matrix, **W**, is constructed in the form [*w*_{1d}(1) ⋯ *w*_{1d}(*r*_{d}) *w*_{2d}(1) ⋯ *w*_{2d}(*r*_{d}) *w*_{j}(*r*−*r*_{j}+1) ⋯ *w*_{j}(*r*)], and the coefficient vector \(h^{T}_{\text {test}}(i)_{i=1:40}\) is computed using the updating rule in Eq. (3). The first *r*_{d}=20 elements of \(h^{T}_{\text {test}}(i)\) contain the coefficients of the class-specific TF bases of class 1, the next *r*_{d}=20 elements contain the coefficients of the class-specific TF bases of class 2, and the last *r*_{j}=20 elements contain the coefficients of the shared TF bases. Hence, the distinct structures of class 1 and class 2 in test data *i* are reconstructed as follows:

$$\begin{array}{@{}rcl@{}} &&\hat{v}_{1d}(i) = \left[w(1) \ \cdots \ w(r_{d})\right] h^{T}(1:r_{d}) \text{ and } \\ &&\hat{v}_{2d}(i) = \left[w(r_{d}+1) \ \cdots \ w(2r_{d})\right] h^{T}\left(r_{d}+1:2r_{d}\right) \end{array} $$

(33)

(5) The correlation coefficients between \(\hat {v}_{1d}(i)\) and the original *v*_{1d}(*i*), and between \(\hat {v}_{2d}(i)\) and the original *v*_{2d}(*i*), are computed. The average of the correlation values, which we denote as the class-specific recovery percentage (CRP), is computed as shown in the following equations to assess the method's success in identifying and recovering the distinct structures of each class in the presence of 100 % overlap (see Fig. 6c for the test data).

$$\begin{array}{@{}rcl@{}} \text{CRP}_{1}(\%) = \sum\limits_{i=1}^{i=300}{\frac{\text{Cor}\left(\hat{v}_{1d}(i), v_{1d}(i)\right) }{300}}\times 100 \end{array} $$

(34)

$$\begin{array}{@{}rcl@{}} \text{CRP}_{2}(\%) = \sum\limits_{i=1}^{i=300}{\frac{\text{Cor}\left(\hat{v}_{2d}(i), v_{2d}(i)\right) }{300}}\times 100 \end{array} $$

(35)

(6) Steps (1)–(5) are repeated 20 times and the average CRP is reported for each pair of *λ* and *γ* over {0.05,0.1,0.15,0.2}. The average CRP_{1} and CRP_{2} values are shown in Fig. 7.
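Steps (4) and (5) can be sketched as follows. The sketch rests on explicit assumptions: Eq. (3) is not reproduced in this section, so the coefficient update is taken to be the standard multiplicative NMF rule for the coefficients with the bases held fixed; the correlation is taken to be the Pearson coefficient; and the bases are toy vectors with disjoint supports rather than JLNMF outputs. The names `project`, `distinct_parts`, and `crp` are hypothetical.

```python
import numpy as np

def project(v, W, n_iter=500, eps=1e-12, seed=0):
    """Nonnegative coefficients h for fixed W, using the multiplicative rule
    h <- h * (W^T v) / (W^T W h) (assumed form of the update in Eq. (3))."""
    rng = np.random.default_rng(seed)
    h = rng.random(W.shape[1]) + eps
    WtV = W.T @ v
    WtW = W.T @ W
    for _ in range(n_iter):
        h *= WtV / (WtW @ h + eps)
    return h

def distinct_parts(v, W, r_d):
    """Eq. (33): distinct structures of class 1 and class 2 in one test vector."""
    h = project(v, W)
    v1d = W[:, :r_d] @ h[:r_d]
    v2d = W[:, r_d:2 * r_d] @ h[r_d:2 * r_d]
    return v1d, v2d

def crp(estimates, references):
    """Mean correlation between paired vectors, in percent."""
    cors = [np.corrcoef(e, ref)[0, 1] for e, ref in zip(estimates, references)]
    return 100.0 * float(np.mean(cors))

# Toy bases (assumption): W = [W_1d | W_2d | W_j], each basis on its own rows.
rng = np.random.default_rng(0)
r_d, r_j, rows = 2, 2, 10
n_bases = 2 * r_d + r_j
W = np.zeros((rows * n_bases, n_bases))
for k in range(n_bases):
    W[k * rows:(k + 1) * rows, k] = rng.uniform(0.5, 1.5, rows)

est1, ref1, est2, ref2 = [], [], [], []
for _ in range(5):
    h_true = rng.uniform(0.5, 2.0, n_bases)
    v = W @ h_true                       # test vector containing both classes
    v1d, v2d = distinct_parts(v, W, r_d)
    est1.append(v1d)
    ref1.append(W[:, :r_d] @ h_true[:r_d])
    est2.append(v2d)
    ref2.append(W[:, r_d:2 * r_d] @ h_true[r_d:2 * r_d])

crp1, crp2 = crp(est1, ref1), crp(est2, ref2)
```

On this idealized toy data the recovery is essentially perfect; with real bases that overlap in the TF plane, the multiplicative update converges more slowly and the recovered parts are only approximate, which is what the CRP is meant to quantify.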

As can be seen, there is a high correlation between the reconstructed and the original distinct structures of each class, meaning that the proposed JLNMF successfully separated the distinct structure of each class from the shared structure. The performance varied slightly with the selected values of *λ* and *γ*, with (0.10, 0.05) giving the best average CRP value (97 %) for the given example. Figure 8 displays the reconstructed class-specific structures for two examples with (CRP_{1}, CRP_{2}) values of (96 %, 98 %) and (72 %, 99 %) for the examples at the top and bottom, respectively. As can be seen in this plot, although there is a relatively low correlation between \(\hat {v}_{1d}(i)\) and *v*_{1d}(*i*) in the bottom example, the distinct structure of class 1 has been mostly recovered. To compare JLNMF with NMF, the experiment is repeated using standard NMF, where NMF with *r*=30 is applied separately to each TF data set and the shared TF bases are identified as the TF bases with a correlation value greater than 0.9. An average CRP value of 88 % is obtained, which is notably lower than the CRP value of 97 % obtained using the proposed JLNMF method.
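The correlation-based identification of shared bases used for the standard NMF baseline can be sketched as below. The text specifies only the threshold of 0.9; how bases are paired is not stated, so the sketch assumes that each basis of the first class is compared with its best-matching basis of the second class. The function name `shared_bases_by_correlation` is hypothetical and the matrices are random placeholders.

```python
import numpy as np

def shared_bases_by_correlation(W1, W2, threshold=0.9):
    """Return (i, j, corr) for every column i of W1 whose best-matching
    column j of W2 has a correlation above the threshold."""
    r1 = W1.shape[1]
    # np.corrcoef treats rows as variables; keep the W1-vs-W2 block only
    C = np.corrcoef(W1.T, W2.T)[:r1, r1:]
    best = C.argmax(axis=1)
    return [(i, int(j), float(C[i, j])) for i, j in enumerate(best) if C[i, j] > threshold]

rng = np.random.default_rng(2)
W1 = rng.random((50, 4))          # placeholder NMF bases of class 1
W2 = rng.random((50, 4))          # placeholder NMF bases of class 2
W2[:, 2] = 2.0 * W1[:, 0]         # plant one shared basis (scaled copy)
matches = shared_bases_by_correlation(W1, W2)
```

Bases that pass the threshold are treated as shared and the remaining ones as class-specific; when individual NMF bases mix shared and distinct structure, as in Fig. 3c, this split cannot be clean, which is consistent with the lower CRP reported for the baseline.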

### 4.2 Real data

#### 4.2.1 Localization of epileptic spikes in EEGs

The application of the proposed JLNMF method to localize the epileptic discharges associated with infantile spasms in hypsarrhythmia (HYPS) is explored. Infantile spasms refer to a catastrophic form of epilepsy occurring in infancy that is diagnosed based on the findings of HYPS in EEG recordings combined with epileptic spasms [26]. HYPS is characterized by a chaotic and high-voltage background with multifocal discharges [27]. However, identifying this pattern of activity in a conventional EEG recording is challenging in the presence of HYPS due to the abundance of epileptiform discharges with varying focality and morphology [28]. An experienced electroencephalographer interprets an EEG by inspecting and approximating the characteristics of HYPS subjectively rather than through objective quantification. Due to the complex nature of these signals, even experienced EEG readers tend to interpret HYPS differently, which can have serious implications for the treatment of the infant [28]. Several algorithms have been developed to detect epileptic discharges during epilepsy. Some of those algorithms include temporal analysis based on template matching and mimetic analysis methods, or TF methods based on wavelet analysis. However, those methods have been developed for epileptic discharge detection in EEG signals associated with other types of epilepsy and not in the presence of HYPS. The existing methods are either based on supervised classifiers, in which true spikes have to be readily available and identifiable to train the algorithm during a learning phase, or they are template based, relying on pre-specified spike characteristics such as the amplitude and duration of the discharges. Given the chaotic appearance of EEG during HYPS, the manual localization of true spikes is not always possible. The spikes of interest are characteristically non-uniform, which presents a challenge for temporal-based methods. Hence, there is a need to develop a semi-supervised feature extraction and classification method to assist in the localization of spikes with multiple foci and varying morphologies during HYPS.

A 5-min section of an awake EEG recording from an infant with infantile spasms is used to explore the application of the JLNMF method for semi-supervised localization of epileptic spikes. Subject consent was obtained through the Infantile Spasms Registry and Genetic Studies via a protocol approved by the University of Rochester’s Research Subjects Review Board. The EEG signals were recorded based on the international standard 10–20 system with a sampling frequency of 512 Hz. The recorded EEGs were imported into Persyst EEG software (Persyst, San Diego, CA) for artifact reduction and then imported into MATLAB and bandpass filtered (0.5–30 Hz) for further analysis. All the epileptiform discharges were manually marked by an epileptologist.

The data are divided into 10-s windows, and an epileptologist labels each 10-s EEG window as non-spike (NS) if there are no epileptiform spikes in that window; otherwise, the window is marked as possibly-spike (PS). The objective is to characterize the structure of the epileptiform spikes and localize them within each 10-s window. The localization task is formulated as a classification task, where the objective is to decompose the class-specific bases (i.e., epileptic spikes) and the shared bases (i.e., the common EEG baseline), which can be used to reconstruct the class-specific data in each class. The distinct structure of the PS recordings is expected to indicate the spike locations in the EEG recordings. There are two main challenges. The first is that the EEG recordings are strongly non-stationary, which can be addressed by using the TF data of the EEG recordings to improve the representation of the non-stationary information. The second is the substantial amount of similarity (i.e., overlap) between the NS and PS classes. This is mainly because the exact locations of the epileptiform spikes are not specified, and the only available information is that the PS class contains several spikes at unknown locations. The proposed JLNMF method is used to address this challenge, as it is able to separate the spike-specific TF bases from the common EEG baseline in a semi-supervised fashion. The details of the application of JLNMF to the EEG recording are as follows:

(1) The TF data are constructed using the matching-pursuit TF (MP-TF) method. MP-TF has a much higher TF resolution than the spectrogram [2, 8], and can better represent the spike-related transients and non-stationarity of the EEG recordings. The resolution of the MP-TF is selected to be 0.15 Hz in frequency and 2 ms in time. Since there is no meaningful physiological information beyond 30 Hz, the frequency domain is limited to that value. The dimension of the TF matrix is 200×5120 for each 10-s segment.

(2) Each 10-s TF matrix is divided into 64 sections and each section is collapsed into a vector, *v*_{16,000×1}. The training data set is created by collecting the collapsed TF vectors of half of the NS and PS data.

(3) JLNMF is applied to the TF data, and the class-specific and shared TF bases are obtained. The parameters *r* and *r*_{j} are set to 40 and 20, respectively, and the values of 0.10 and 0.05 are selected for *λ* and *γ*, respectively. Figure 9a shows the shared (top) and PS-specific (bottom) TF bases. To visualize the TF bases, each decomposed basis is reshaped back to the original section size. The NS-specific TF matrix is found to be empty, which means that JLNMF does not identify any structure that is distinct to the NS class.

(4) For classification purposes, each test vector is projected onto the JLNMF PS-specific and shared TF bases obtained in the previous step, the coefficient vector is computed using the updating rule in Eq. (3), and the distinct structures corresponding to the PS-specific TF bases are reconstructed for all the test data.
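The sectioning in step (2) and the reshaping used for visualization in step (3) can be sketched as follows. The dimensions follow the text (a 200×5120 TF matrix split into 64 sections of 200×80, i.e., 16,000 elements each); the row-major flattening order and the function names are assumptions, and the input here is a placeholder array rather than an MP-TF matrix.

```python
import numpy as np

def sections_to_vectors(tf_matrix, n_sections=64):
    """Split a TF matrix along time into equal sections and flatten each
    section into one column of the output."""
    n_freq, n_time = tf_matrix.shape
    if n_time % n_sections:
        raise ValueError("time axis must divide evenly into sections")
    width = n_time // n_sections
    blocks = tf_matrix.reshape(n_freq, n_sections, width)   # (freq, section, time)
    return blocks.transpose(0, 2, 1).reshape(n_freq * width, n_sections)

def vector_to_section(v, n_freq=200):
    """Inverse of the flattening: restore one column to its (freq, time) shape."""
    return v.reshape(n_freq, -1)

tf = np.arange(200 * 5120, dtype=float).reshape(200, 5120)  # placeholder TF matrix
X = sections_to_vectors(tf)            # one column per 10/64-s section
section_3 = vector_to_section(X[:, 3])
```

Each column of `X` corresponds to about 156 ms of EEG (10 s divided by 64), and any decomposed basis of length 16,000 can be displayed with `vector_to_section`.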

Figure 9c shows the PS-specific reconstructed TF data for the EEG signals shown in Fig. 9b. The figure shows two examples. The top one is a case where two distinct spikes are located in the PS-specific reconstructed TF data (see the two dark vertical lines in Fig. 9c) and are marked by two red arrows in Fig. 9b. An epileptologist confirmed that the identified locations indicate epileptic spikes. The bottom plots of Fig. 9b, c show a case where the PS-specific reconstructed TF data do not indicate any spikes, which was also confirmed by an epileptologist. Comparing the two plots in Fig. 9b, it can be seen that the two EEG signals look very similar; however, the proposed JLNMF is able to successfully locate the epileptic spikes. To compare the JLNMF algorithm with standard NMF, the TF data of the NS and PS classes are decomposed separately using NMF with *r*=40. The TF bases of each class are shown in Fig. 10. The NS and PS TF bases are compared to separate the spike-specific TF bases from the shared ones, but none had a correlation value greater than 0.9 and only three had a value greater than 0.8. Hence, standard NMF is unable to locate any spike-related TF bases without further analysis, while JLNMF was successful in identifying the epileptic spikes. Such a method is necessary for reliable evaluation of features associated with HYPS and could potentially improve the assessment of infantile spasms, which is of significant importance for therapy, management, and ultimately the success of the prescribed treatment.

#### 4.2.2 Discrimination of pathological voice disorder

Dysphonia, or pathological voice disorder, refers to speech problems resulting from damage to or malformation of the speech organs. Pathological voice disorder is more common in people who use their voice professionally, for example, teachers, lawyers, salespeople, actors, and singers, and it dramatically affects the lives of these professional groups both financially and psychosocially. The purpose of discriminating pathological voice disorder is to help patients monitor their progress over the course of voice therapy. We applied the developed JLNMF method to the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database, distributed by Kay Elemetrics Corporation [29]. The database consists of 51 normal and 161 pathological speakers whose disorders spanned a variety of organic, neurological, traumatic, and psychogenic factors. The speech signal is sampled at 25 kHz and quantized at a resolution of 16 bits/sample. In this exploratory experiment, one speech signal from a normal subject and one from a pathological subject were used as two classes: normal and pathological. The TF signal of each class was constructed by computing the spectrogram with an FFT size of 1024 points and a Kaiser window with a parameter of 5, a length of 256 samples, and an overlap of 220 samples. The two TF signals were then fed to the JLNMF algorithm to generate three sets of TF bases: normal-specific TF bases, pathological-specific TF bases, and shared TF bases. Figure 11 shows this procedure along with the decomposed TF bases. The speech samples and their corresponding TF data are shown in Fig. 11a–d. The decomposed normal-specific, pathological-specific, and shared TF bases are shown in Fig. 11e–g, respectively. For a successful decomposition, it is expected that the normal discriminant bases represent stronger formants than the pathological discriminant bases and that the shared bases represent the natural structure of a speech signal [9]. The success of JLNMF in discriminating pathological voice disorder was assessed visually from the decomposed TF bases shown in Fig. 11. As expected, the pathological discriminant bases (Fig. 11f) present weak formants, while the normal discriminant bases (Fig. 11e) have more periodicity at low frequencies and exhibit stronger formants. The shared TF bases (Fig. 11g) represent the low-frequency TF structures that are common to any natural speech, whether normal or pathological.