Time-frequency feature extraction for classification of episodic memory

This paper investigates the extraction of time-frequency (TF) features for the classification of electroencephalography (EEG) signals related to episodic memory. We propose a model based on the definition of locally stationary processes (LSPs), estimate the model parameters, and derive a mean square error (MSE) optimal Wigner-Ville spectrum (WVS) estimator for the signals. The estimator is compared with state-of-the-art TF representations: the spectrogram, the Welch method, the classically estimated WVS, and the Morlet wavelet scalogram. First, we evaluate the MSE of each spectrum estimate with respect to the true WVS for simulated data, where it is shown that the LSP-inference MSE optimal estimator clearly outperforms the other methods. Then, we use the different TF representations to extract the features that feed a neural network classifier and compare the classification accuracies for simulated datasets. Finally, we provide a real data example on EEG signals measured during a visual memory encoding task, where the classification accuracy is evaluated as in the simulation study. The results show a consistent improvement in classification accuracy when using the features extracted from the proposed LSP-inference MSE optimal estimator, compared to the state-of-the-art methods, both for the simulated datasets and for the real data example.


Introduction
The analysis of electroencephalography (EEG) signals is one of the main methodological tools for understanding how the electrical activity of the brain supports cognitive functions [1]. One of the main reasons for the importance of the EEG is the outstanding temporal resolution. Additionally, the procedure of measuring EEG on the scalp surface is non-invasive and low-cost, which makes it excellent both for clinical practice and for cognition studies.
Extensive literature supports the possibility of investigating cognitive functions, such as episodic memory, through neural activity recordings [2][3][4]. In cognitive psychology, event-related potentials are indicators of cognitive processes. For the transient responses that are not specifically time-locked, time-frequency (TF) images of EEG signals have become one of the most popular analysis techniques in today's research. The TF images are often used for extracting the features that feed a neural network classifier, and most TF methods are based on the short-time Fourier transform (spectrogram) and the wavelet transform (scalogram). The quality of the TF representation is crucial for the extraction of robust and relevant features [1,[5][6][7], leading to the demand for high-performing spectral estimators. The wavelet transform, using the Morlet wavelet, is currently the most popular TF method [8][9][10][11].
Recently, it has been argued that the sine waves and the wavelets of the spectrogram and scalogram should be replaced with physiologically defined waveform shapes which could be related to an individual-based physical model [12,13]. With more sophisticated techniques and advanced applications, the increase in the individual-based tailoring of the methods becomes even more critical. We consider an individual-based stochastic model for the signals, based on the definition of locally stationary processes (LSPs), introduced by Silverman in [14]. LSPs are characterized by a covariance function that is the modulation in time of an ordinary stationary covariance function.
The optimal kernel for the estimation of the Wigner-Ville spectrum (WVS) for a certain class of LSPs is obtained in [15]. Based on this result, we derive the mean square error (MSE) optimal WVS estimator for our model covariance. The use of a proper time-frequency kernel offers a valuable approach to the fundamental problem in stochastic time-frequency analysis, which is the estimation of a reliable time-frequency spectrum from only a few realizations of a stochastic process. The derivation of optimal time-frequency kernels for stochastic models has long been a research field of signal processing interest [16,17]. An efficient implementation is based on an equivalent optimal multitaper spectrogram [18][19][20]. The kernels are parameter-dependent, and thanks to the HAnkel-Toeplitz Separation (HATS) inference method introduced in [21], we are able to estimate the MSE optimal WVS from measured data. The derivation of the corresponding multitapers for the kernel is of great interest in the signal processing community, as their main advantages include computational efficiency and suitability for real-time use.
In a simulation study, we compare the MSE optimal WVS with state-of-the-art TF representations. Some preliminary results are presented in a previous conference paper [22]. Here, we extend the study by including the classical estimator of the WVS and the wavelet transform among the TF representations compared. Since the wavelet transform is one of the most popular representations used for EEG signal classification, its inclusion among the state-of-the-art methods considered in the comparison is especially relevant for evaluating the advantages offered by the proposed approach. Additionally, we extend the quantitative comparison in terms of MSE with the evaluation of the classification accuracy of simulated signals based on the TF features extracted with the different methods. The controlled setting of a simulation study allows evaluating how the quality of the TF estimate affects the classification accuracy of classes of signals indistinguishable in the time-domain but having distinct TF spectra. The classifier implemented is a multilayer perceptron neural network, which has been previously used for the classification of EEG signals based on TF features, e.g., [5,23,24]. The real data example considered consists of three categories of EEG signals measured during a memory encoding task [8,25,26].
The paper is structured as follows. In Section 2, we present the different TF representations, the model of LSPs used both in the simulation study and in the real data case study, the expression for the MSE optimal WVS, and the neural network used for classification. The results for the simulation studies and for the real data case are presented in Section 3. In Section 3.1, the performance of the derived spectral estimator is evaluated in terms of MSE, whereas in Section 3.2, the performance is evaluated as classification accuracy in a simulated study. In Section 3.3, we present the results of the classification of EEG signals collected within a study on human memory encoding and retrieval. Final comments and directions for further research are given in Section 4.

Methods
This section is dedicated to the exposition of the theory and methods. First, we present the different TF representations considered. Then, the definition of LSP [14] is recalled, and the inference problem discussed. We proceed by presenting the MSE optimal kernel for the general LSP case, based on [15,27], together with the MSE optimal kernel derived for the parametric model used in this paper. Finally, we specify the neural network classification approach.

Time-frequency representations
The spectrogram is a natural extension of the periodogram for non-stationary signals.
The spectrogram of a zero-mean signal x(t), t ∈ R, is calculated as:

S_x(t, f) = |∫ x(s) h(s − t) e^{−i2πfs} ds|²,    (1)

where h(t) is a window function, commonly a Hanning window or a Gaussian window centered around zero. The Welch method [28], first introduced for stationary spectral estimation, aims at reducing the variance of the spectral estimate by averaging the spectrum estimates obtained on shorter, possibly overlapping, segments of the data. In TF analysis, it can be applied to compute the spectral estimate of each shorter sub-sequence of the spectrogram. Any TF representation belonging to Cohen's class, including the spectrogram, can be written as:

C_x(t, f) = ∫∫ A_x(θ, τ) φ(θ, τ) e^{i2π(θt − fτ)} dθ dτ,    (2)

where all integrals run from −∞ to ∞, A_x(θ, τ) = ∫ x(s + τ/2) x*(s − τ/2) e^{−i2πθs} ds is the ambiguity function, and φ(θ, τ) is an ambiguity kernel [29]. With φ(θ, τ) = 1 for all values of θ and τ, the classical Wigner-Ville distribution is obtained as:

W_x(t, f) = ∫ x(t + τ/2) x*(t − τ/2) e^{−i2πfτ} dτ.    (3)

An extension of the uncertainty principle holds in TF analysis, implying a trade-off between frequency and temporal resolution. The wavelet transform addresses the resolution limits posed by the uncertainty principle by prioritizing time resolution for high-frequency events and frequency resolution for low-frequency events. The continuous wavelet transform is based on the correlation between the signal and the wavelet translated to time τ and scaled by a. The scalogram is defined as:

SC_x(τ, a) = |(1/√a) ∫ x(t) ψ*((t − τ)/a) dt|²,    (4)

where the unit energy function ψ should fulfill the admissibility condition:

∫ |Fψ(ω)|² / |ω| dω < ∞,    (5)

with F representing the Fourier transform. Larger scale values offer higher frequency resolution and are used for determining the frequency of long-term events, such as baseline oscillations. Smaller scales provide the high temporal resolution necessary for determining short-time events, such as spikes and transients.
A scale-to-frequency conversion allows recovering a time-frequency representation from the scalogram, with the frequency corresponding to scale a given by f = β/a, where the constant β depends on the wavelet.
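As an illustration of the representations above, the following sketch computes a single-window spectrogram and a Morlet scalogram with plain NumPy. The window length, hop size, and the Morlet center frequency β = 0.8125 are assumed illustrative values, not the settings used in the paper.

```python
import numpy as np

def spectrogram(x, fs, n_win=64, hop=8):
    """Single Hanning window spectrogram: squared magnitude of the windowed FFT."""
    h = np.hanning(n_win)
    starts = np.arange(0, len(x) - n_win + 1, hop)
    S = np.array([np.abs(np.fft.rfft(x[i:i + n_win] * h))**2 for i in starts])
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)
    times = (starts + n_win / 2) / fs
    return times, freqs, S.T          # S indexed [frequency, time]

def morlet_scalogram(x, fs, freqs, beta=0.8125):
    """Scalogram by correlating with scaled Morlet wavelets; the scale for
    frequency f is a = beta / f, with beta the wavelet center frequency
    (0.8125 is an assumed typical value)."""
    t = (np.arange(len(x)) - len(x) / 2) / fs
    S = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        a = beta / f
        psi = np.exp(2j * np.pi * f * t - 0.5 * (t / a)**2)  # analytic Morlet
        psi /= np.sqrt(np.sum(np.abs(psi)**2))               # unit energy
        S[k] = np.abs(np.convolve(x, np.conj(psi[::-1]), mode='same'))**2
    return S

# usage: a 16 Hz tone shows a ridge at 16 Hz in both representations
fs = 512
tt = np.arange(256) / fs
x = np.cos(2 * np.pi * 16.0 * tt)
times, fr, S = spectrogram(x, fs)
peak_f = fr[np.argmax(S.sum(axis=1))]
```

The trade-off discussed above is visible in the parameters: a longer `n_win` sharpens frequency resolution at the cost of time resolution, while the Morlet scale a = β/f adapts the window length to each analyzed frequency.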

Locally stationary processes
The stochastic model proposed for episodic memory data and used for the simulation study belongs to the class of locally stationary processes in Silverman's sense [14]. A zero-mean stochastic process X(t) is a locally stationary process (LSP) in the wide sense if its covariance:

r_X(s, t) = E[X(s) X*(t)]    (6)

can be written as:

r_X(s, t) = q((s + t)/2) · r(s − t),    (7)

where q(η) is a non-negative function and r(τ) is a stationary covariance function. When q(η) is a constant, (7) reduces to a stationary covariance. The proposed model is determined by the functions q(η) and r(τ) as:

q(η) = L + a_q e^{−c_q (η − b_q)²}  and  r(τ) = e^{−c_r τ²},    (8)

with L, a_q, c_q > 0 and c_r ≥ c_q/4. The latter assumption is necessary to assure that the resulting covariance is positive semi-definite. This choice of functions is motivated by the case study presented in Section 3.3. In Fig. 1, we illustrate how different parameter settings affect the LSP realizations obtained from the model covariance (8). Each set of realizations presented has power centered at time b_q = 0.20 s, but different parameters determining the functions q and r result in a slowly varying behavior in "a" and "b," compared to a much faster variation in "c," given by a larger value of c_r. The models generating the realizations shown in "a" and "b" differ only in the parameter L, representing the minimum energy level. A higher level L, as in "b" and "c," is more realistic and illustrates the possibility of using L to include the additional ongoing spontaneous EEG activity.
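The model covariance can be illustrated numerically. The sketch below builds the covariance matrix from Gaussian factors q and r (as reconstructed above, with the parameter values of the simulation study), checks positive semi-definiteness, and draws realizations; it is a minimal illustration under these assumptions, not the simulation code used in the paper.

```python
import numpy as np

# Assumed Gaussian factors of the LSP covariance, with the simulation-study values:
# q(eta) = L + a_q * exp(-c_q (eta - b_q)^2),   r(tau) = exp(-c_r tau^2)
L, a_q, b_q, c_q, c_r = 120.0, 500.0, 0.20, 800.0, 15000.0
fs, n = 512, 256
t = np.arange(n) / fs

q = lambda eta: L + a_q * np.exp(-c_q * (eta - b_q)**2)
r = lambda tau: np.exp(-c_r * tau**2)

# Covariance matrix C[s, t] = q((s + t)/2) * r(s - t)
Sg, Tg = np.meshgrid(t, t, indexing='ij')
C = q((Sg + Tg) / 2) * r(Sg - Tg)

# Draw realizations via the eigendecomposition (C is PSD when c_r >= c_q / 4)
w, V = np.linalg.eigh(C)
w = np.clip(w, 0.0, None)                 # guard against tiny negative round-off
rng = np.random.default_rng(0)
X = rng.standard_normal((100, n)) @ (V * np.sqrt(w)).T   # 100 realizations
```

Since r(0) = 1, the variance of X(t) equals q(t), so the sample variance over realizations should peak near b_q = 0.20 s, matching the description of Fig. 1.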
For the practical use of the LSP model, the parameters defining the covariance need to be estimated. A maximum likelihood approach is not feasible since a closed-form expression of the process distribution is not known. A natural approach is the least squares fitting of the model covariance to the classical sample covariance matrix (SCM). Unfortunately, the SCM is not a reliable estimator when the sample size is smaller than the number of elements to be estimated [30]. Additionally, even when the sample size is sufficient to produce reliable estimates, the initialization of the starting point for the optimization problem is computationally expensive. The inference method HATS [21], based on the separation of the two factors defining the LSP covariance function in order to take advantage of their individual structures, overcomes the mentioned disadvantages of the least squares fitting to the SCM.
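A small numerical experiment illustrates why fitting to the SCM is problematic: the sketch below, using an arbitrary toy covariance (an assumption for illustration), compares the relative error of the SCM for a small and a large number of realizations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
idx = np.arange(n)
C = np.exp(-0.01 * (idx[:, None] - idx[None, :])**2)  # smooth toy covariance

def scm_error(M):
    """Relative Frobenius error of the sample covariance from M realizations."""
    w, V = np.linalg.eigh(C)
    X = rng.standard_normal((M, n)) @ (V * np.sqrt(np.clip(w, 0.0, None))).T
    scm = X.T @ X / M
    return np.linalg.norm(scm - C) / np.linalg.norm(C)

err_small, err_large = scm_error(10), scm_error(1000)
```

With only 10 realizations of a 64-sample process, the SCM error is large, which is the situation the HATS method is designed to handle.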

Mean square error optimal kernel
The Wigner-Ville spectrum (WVS) is the stochastic version of the classical Wigner-Ville distribution in (3) [29], defined as:

W_X(t, f) = ∫ E[X(t + τ/2) X*(t − τ/2)] e^{−i2πfτ} dτ.    (9)

The traditional estimate of the WVS based on one realization of the process is equivalent to estimating the Wigner-Ville distribution (3). For LSPs, the WVS assumes the advantageous expression:

W_X(t, f) = q(t) · R(f),    (10)

where R(f) is the Fourier transform of r(τ), and the ambiguity spectrum, defined as:

A_X(θ, τ) = ∫ E[X(s + τ/2) X*(s − τ/2)] e^{−i2πθs} ds,    (11)

is given by:

A_X(θ, τ) = Q(θ) · r(τ),    (12)

where Q(θ) is the Fourier transform of q(η) [14,15]. Any TF representation member of Cohen's class can be expressed in terms of the ambiguity spectrum and kernel as:

Ŵ_X(t, f) = ∫∫ A_X(θ, τ) φ(θ, τ) e^{i2π(θt − fτ)} dθ dτ.    (13)

The MSE optimal kernel (14) for LSPs having Gaussian functions as factors of the covariance is derived in [15]. Thanks to (14), we are able to compute the parameter-dependent optimal kernel φ_0(θ, τ) for the introduced model (8), given in (15), where δ_0 denotes the Dirac delta function. The efficient implementation and estimation are based on multitapers, i.e., a weighted sum of windowed spectrograms:

Ŵ_X(t, f) = Σ_{k=1}^{K} α_k |∫ x(s) h_k(s − t) e^{−i2πfs} ds|²,    (16)

with weights α_k and windows h_k(t), k = 1, …, K [18][19][20]. The weights and windows are derived from the solution of the eigenvalue problem:

∫ ρ(s, t) h_k(t) dt = α_k h_k(s),    (17)

where the rotated time-lag kernel ρ is Hermitian and defined as:

ρ(s, t) = Φ((s + t)/2, s − t),    (18)

with:

Φ(t, τ) = ∫ φ(θ, τ) e^{i2πθt} dθ.    (19)

The multitaper spectrogram is an efficient solution from an implementation point of view, since only a few α_k differ significantly from zero. In the implementation for real signals, the corresponding analytic signal is used.
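The multitaper construction above (a weighted sum of windowed spectrograms, with tapers and weights obtained from the eigenvalue problem of the rotated time-lag kernel) can be sketched as follows. The kernel used here is a toy Gaussian matrix chosen only so that the eigendecomposition is well behaved; it is an illustrative assumption, not the MSE optimal kernel of the paper.

```python
import numpy as np

def shifted(h, k):
    """Shift h by k samples with zero fill (no wrap-around)."""
    out = np.zeros_like(h)
    if k >= 0:
        out[k:] = h[:len(h) - k]
    else:
        out[:k] = h[-k:]
    return out

def multitaper_spectrogram(x, rho, K=4):
    """Cohen's-class estimate as a weighted sum of windowed spectrograms.
    Tapers h_k and weights alpha_k are the dominant eigenvectors/eigenvalues
    of the rotated time-lag kernel rho (a Hermitian matrix)."""
    n = len(x)
    alpha, H = np.linalg.eigh(rho)
    idx = np.argsort(np.abs(alpha))[::-1][:K]    # keep K dominant eigenpairs
    S = np.zeros((n // 2 + 1, n))
    for a_k, h_k in zip(alpha[idx], H[:, idx].T):
        for i in range(n):
            seg = x * shifted(h_k, i - n // 2)   # taper re-centered at time i
            S[:, i] += a_k * np.abs(np.fft.rfft(seg))**2
    return S

# Toy rotated kernel: Gaussian in both midpoint and lag (symmetric, PSD here)
n = 256
s = np.arange(n)
Sg, Tg = np.meshgrid(s, s, indexing='ij')
rho = (np.exp(-((Sg - Tg)**2) / (2 * 16.0**2))
       * np.exp(-(((Sg + Tg) / 2 - n / 2)**2) / (2 * 32.0**2)))

x = np.cos(2 * np.pi * 0.125 * s)   # tone at normalized frequency 0.125
S = multitaper_spectrogram(x, rho)
```

Only a few eigenvalues of a smooth kernel are significantly non-zero, which is what makes the truncation to K tapers computationally attractive.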

Pattern recognition neural networks
Pattern recognition networks are feed-forward networks, i.e., networks in which the information flow is only directed forward, which can be trained to classify inputs according to target classes [31]. The input and target vectors are divided into three separate sets for training, validation, and testing. After the training of the neural network, the validation phase is necessary to ensure that the network generalizes and does not overfit, and the testing phase consists of a completely independent test of the network. As in standard networks used for pattern recognition, in this study, we consider a multilayer perceptron, with an input layer where the TF features are inserted, two hidden layers each consisting of 20 neurons, and an output layer for classification of the signals into three categories. The network structure is exemplified in Fig. 2. The activation function of the nodes in the hidden layers is the logistic function, also called the sigmoid function, defined as:

σ(z) = 1 / (1 + e^{−z}).    (20)

As usual for multiclass classification, the output node j converts its total input z_j into a class probability p_j by using the softmax function, defined as:

p_j = e^{z_j} / Σ_{i=1}^{N_c} e^{z_i},    (21)

where N_c is the total number of classes. The cost function is the categorical cross-entropy between the target probabilities p_k and the outputs of the softmax q_k:

C = −Σ_{k=1}^{N_c} p_k log q_k,    (22)

and the learning technique is back-propagation [32,33].
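A minimal NumPy version of such a network (two sigmoid hidden layers of 20 neurons, softmax output, cross-entropy loss trained with back-propagation) can be sketched as follows. The toy two-dimensional input clusters stand in for TF feature vectors and are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Architecture from the text: input -> two hidden layers of 20 sigmoid units
# -> softmax output over 3 classes (input size 2 here, for a toy problem)
sizes = [2, 20, 20, 3]
Ws = [rng.standard_normal((a, b)) * 0.5 for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(X):
    acts = [X]
    for W, b in zip(Ws[:-1], bs[:-1]):
        acts.append(sigmoid(acts[-1] @ W + b))
    acts.append(softmax(acts[-1] @ Ws[-1] + bs[-1]))
    return acts

def train_step(X, Y, lr=0.5):
    """One full-batch back-propagation step on the categorical cross-entropy."""
    acts = forward(X)
    loss = -np.mean(np.sum(Y * np.log(acts[-1] + 1e-12), axis=1))
    delta = (acts[-1] - Y) / len(X)          # gradient at the softmax input
    for i in range(len(Ws) - 1, -1, -1):
        gW, gb = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:                            # propagate through the sigmoid layer
            delta = (delta @ Ws[i].T) * acts[i] * (1 - acts[i])
        Ws[i] -= lr * gW
        bs[i] -= lr * gb
    return loss

# Toy data: three well-separated Gaussian clusters standing in for TF features
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + 0.4 * rng.standard_normal((60, 2)) for c in centers])
Y = np.repeat(np.eye(3), 60, axis=0)

losses = [train_step(X, Y) for _ in range(3000)]
acc = np.mean(np.argmax(forward(X)[-1], axis=1) == np.argmax(Y, axis=1))
```

The combination of softmax output and cross-entropy loss is what makes the output-layer gradient reduce to the simple difference p − y used in `train_step`.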

Results and discussion
In this section, first we present the results of the evaluation of the proposed method in a simulation study where the true WVS, as given in (9), is known, and the MSE of the estimators can be calculated. The second part of the simulation study is the classification of simulated datasets based on features extracted from the different TF estimators. The same approach is then used in a real data application to classify the EEG signals measured during a memory encoding task.

Mean square error evaluation
The first part of this simulation study focuses on evaluating the performance of the proposed method in terms of MSE, in comparison with state-of-the-art estimators. We simulate M = 100 realizations of an LSP with the covariance function defined by (8), sampled with f_s = 512 Hz at 256 equidistant points in the time interval [T_0, T_f] = [0, 0.5] seconds. The model parameters for simulating the data are (L, a_q, b_q, c_q, c_r) = (120, 500, 0.20, 800, 15,000), and a few realizations from this parameter setting can be seen in Fig. 1b. The parameters (L, a_q, b_q, c_q, c_r) are estimated with the inference method HATS. Based on the parameter estimates, the MSE optimal ambiguity kernel and the corresponding multitapers according to (15) are calculated. In Fig. 3, the MSE optimal multitapers and weights for this parameter set are shown.
Anderson and Sandsten, EURASIP Journal on Advances in Signal Processing (2020) 2020:19

The state-of-the-art estimators considered for comparison are the single Hanning window spectrogram (HANN), calculated as in (1); the Welch 50% overlapped segment averaging with Hanning windows (WOSA); the classical Wigner-Ville spectrum estimate (WV); and finally the continuous wavelet transform using the Morlet wavelet (CWT). To allow the comparison with the other TF methods, the scales for evaluating the scalogram (4) are computed as a = β/f, where f denotes the frequency in hertz and the constant β is the center frequency of the Morlet wavelet [34].
Each method is optimized to evaluate it at its best performance in terms of MSE. For HANN, the window length N w ∈ {16, 32, 64, 128, 256} is optimized, while for WOSA, the optimized parameter is the number of windows K ∈ {2, 4, 8, 12, 16}, where the total length of all included windows is the total data length of 256 samples. For WV and CWT, a spectral magnitude scaling parameter is used to adjust the estimate magnitude to the true spectrum magnitude.
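The evaluation metric can be made concrete with a short sketch: the mMSE as an average over realizations, and a variant applying the least-squares magnitude scaling mentioned above for WV and CWT. The function names are ours, for illustration.

```python
import numpy as np

def mmse(estimates, W_true):
    """Average over realizations of the MSE between estimate and true WVS."""
    return float(np.mean([np.mean((W - W_true)**2) for W in estimates]))

def scaled_mmse(estimates, W_true):
    """mMSE after a least-squares magnitude scaling of each estimate,
    as applied to methods whose overall level is arbitrary (e.g., WV, CWT)."""
    out = []
    for W in estimates:
        c = np.sum(W * W_true) / np.sum(W * W)   # optimal scale factor
        out.append(np.mean((c * W - W_true)**2))
    return float(np.mean(out))
```

The scale factor c minimizes the squared error for each realization, so a method is not penalized merely for a wrong overall magnitude.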
The expected value of the MSE, or mean MSE (mMSE), is approximated as the average over 100 independent realizations. The boxplots of the MSE achieved with the different methods in the 100 simulations are presented in Fig. 4. The mMSE value for the MSE optimal estimator with the true parameters (LSP) is 1.606, and with parameters estimated from the 100 realizations with HATS (LSP-HATS), it is 1.655. Not only does the spectral estimate obtained using the LSP-based MSE optimal kernels achieve the best mMSE, as expected, but using the true parameters or those estimated with HATS also leads to almost the same result.
The optimal mMSE for HANN and WOSA is 3.438 and 2.061, respectively, obtained with N_w = 32 for HANN and K = 8 for WOSA. The worst mMSE of 4.434 is, as expected, obtained with WV. The CWT performance is in between HANN and WOSA, with an mMSE of 2.431. The mMSE values obtained with parameters estimated from different numbers of realizations M are reported in Table 1. The mMSE remains low even with only M = 10 realizations. However, the number of realizations required for a reliable estimate in real data cases might be higher, especially in the case of a low signal-to-noise ratio.

Classification of simulated stochastic signals
Three classes of stochastic signals are simulated from the model (8), using slightly different parameters and center frequencies f_0, reported in Table 2. The parameters L and c_r are the same for the three classes, whereas different parameters define the function q describing the instantaneous power, plotted in Fig. 6. The center frequency f_0 of each class is uniformly distributed around a mean f̄_0, with a maximum jitter of 2 Hz from the mean, i.e., f_0 ∼ U(f̄_0 − 2, f̄_0 + 2). A few random realizations from the three classes and their true WVS are also presented. The total classification accuracies of the methods are summarized in Table 3, from which it is evident that LSP-HATS outperforms the state-of-the-art estimators with an accuracy of 86%. Similar performance is achieved with HANN and CWT, with accuracies of 65.0% and 69.7%, respectively. The methods with the worst classification performance are WOSA and WV, with accuracies just above 50%. The confusion matrices for each method are reported in Tables 4, 5, 6, 7, and 8. The signals correctly classified appear on the diagonal of each confusion matrix; therefore, an ideal classifier would have 200 on each diagonal entry and 0 otherwise.
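A simplified sketch of the class construction and feature extraction is given below: the center frequency is jittered uniformly around the class mean, and the features are the spectrogram powers on a TF grid restricted to low frequencies (here up to 40 Hz, as in the real data example). The surrogate signal, a deterministic tone with a Gaussian envelope and random phase, is an assumption for illustration, not a draw from the LSP model.

```python
import numpy as np

rng = np.random.default_rng(2)
fs, n = 512, 256
t = np.arange(n) / fs

def draw_signal(f0_mean, b_q=0.20, c_q=800.0):
    """One realization with jittered center frequency f0 ~ U(mean-2, mean+2)
    and a Gaussian power envelope centered at b_q (a simplified surrogate
    for drawing from the LSP model)."""
    f0 = rng.uniform(f0_mean - 2.0, f0_mean + 2.0)
    env = np.exp(-0.5 * c_q * (t - b_q)**2)      # amplitude envelope
    return env * np.cos(2 * np.pi * f0 * t + rng.uniform(0.0, 2.0 * np.pi))

def tf_features(x, n_win=64, hop=16, f_max=40.0):
    """Feature vector: spectrogram power at the TF grid points up to f_max Hz."""
    h = np.hanning(n_win)
    frames = np.array([x[i:i + n_win] * h
                       for i in range(0, n - n_win + 1, hop)])
    S = np.abs(np.fft.rfft(frames, axis=1))**2
    freqs = np.fft.rfftfreq(n_win, 1.0 / fs)
    return S[:, freqs <= f_max].ravel()

feat = tf_features(draw_signal(10.0))
```

Each feature vector is then fed to the classifier; with this grid (13 time frames, 6 frequency bins), the input dimension is 78.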

Application: classification of EEG signals
The data considered were collected within a study on human memory retrieval, conducted at the Department of Psychology of Lund University, Sweden, during the spring of 2015. The EEG signals were measured from one subject participating in the experiment. The encoding task consisted of associating a non-related word with a target picture belonging to one of three categories ("Faces," "Landmarks," "Objects"). A total of 180 trials were performed, 60 for each class. The EEG measurements were recorded from channel O1 according to the International 10-20 system, as the primary visual areas are located in the occipital lobes, and sampled with f_s = 512 Hz. Analogously to the simulation study, each time series has 256 equidistant samples in the time interval [0, 0.5] seconds. A few signals from each class are shown in Fig. 9.
Similarly to the simulation study, the LSP model parameters L, a_q, b_q, c_q, and c_r were inferred from 40 randomly selected realizations out of the total 60 for each class. The estimated parameters are used to compute the LSP-HATS kernels and the corresponding optimal multitapers, which are then used for computing the TF spectra of the training data. The multitapers and weights for the three classes, estimated from all available trials, are illustrated in Fig. 10. The spectral estimates obtained with LSP-HATS, HANN (N_w = 128), WOSA (K = 8), WV, and CWT are used to extract the TF features, where each feature is the spectral power at a TF point in the time interval [0, 0.5] seconds and at frequencies up to 40 Hz. The remaining 20 realizations are used for independent testing of the network. The LSP parameters are re-estimated from these realizations and used in the computation of their LSP-HATS spectral estimates. The random partition into 40 realizations for training and 20 for testing is repeated 10 times, and the test is repeated with the different random sets of testing realizations. The classification accuracy is based on the 10 independent splits of the testing data.

Fig. 9 Three random EEG signals, after 70 Hz low-pass filtering, corresponding to three different trials of a memory task, from each category: a "Faces," b "Objects," and c "Landmarks"

Fig. 10 Example of LSP-based MSE optimal multitapers and weights for a random set of realizations from the three categories: a multitapers and b weights for "Faces," c multitapers and d weights for "Objects," and e multitapers and f weights for "Landmarks"
The total classification accuracy of each method is reported in Table 9. The use of the TF features obtained with the proposed MSE optimal LSP-HATS estimator has resulted in significantly higher classification accuracy compared to the use of the other estimators. As a note of caution, although the proposed approach is, in theory, optimal in terms of MSE, the performance in real applications depends both on how appropriate the model is for the data and on the purpose of the application. Nevertheless, in our case study, the higher quality of the TF representation has improved the classification accuracy.

Conclusion
The purpose of this paper is to show how the MSE optimal WVS offers a significant improvement in practical applications, leading to higher classification accuracy thanks to the greater quality of the TF features extracted with the proposed approach. The estimation of the model parameters with a novel inference method allows the explicit computation of the corresponding MSE optimal kernel. The kernel is transformed into a robust and computationally efficient multitaper spectrogram, giving a complete procedure for LSP-inference MSE optimal spectral estimation from data.
In a simulation study evaluating the performance of the method in terms of MSE, the spectral estimate obtained using the optimal kernel achieves, as expected, the best average MSE compared to state-of-the-art TF estimators, such as the Hanning spectrogram, the Welch method, the classical Wigner-Ville spectrum estimator, and the Morlet wavelet-based scalogram. More importantly, computing the MSE optimal kernel from the parameters estimated with the novel HAnkel-Toeplitz Separation (HATS) inference method gives a similar result, which holds even for parameter estimation from as few as 10 realizations.
The performance of the proposed estimator is also evaluated in a classification simulation study, where it outperforms state-of-the-art estimators in terms of classification accuracy. The higher classification accuracy is also achieved in the episodic memory case study, where the parameter estimation on a suitable LSP model for EEG signals allows the extraction of improved features for classification.
Extensions of this research could consider a multidimensional model for extracting features from multichannel measurements.