 Research
 Open Access
Time–frequency based feature selection for discrimination of nonstationary biosignals
Juan D. Martínez-Vargas^{1},
Juan I. Godino-Llorente^{2} and
Germán Castellanos-Dominguez^{1}
https://doi.org/10.1186/1687-6180-2012-219
© Martínez-Vargas et al.; licensee Springer. 2012
 Received: 1 December 2011
 Accepted: 28 August 2012
 Published: 9 October 2012
Abstract
This research proposes a generic methodology for dimensionality reduction of time–frequency representations applied to the classification of different types of biosignals. The methodology directly deals with the highly redundant and irrelevant data contained in these representations, combining a first stage of irrelevant data removal by variable selection with a second stage of redundancy reduction based on linear transformations. The study addresses two techniques that provided a similar performance: the first is based on the selection of a set of the most relevant time–frequency points, whereas the second selects the most relevant frequency bands. The first technique needs fewer components, leading to a smaller feature space; the second better captures the time-varying dynamics of the signal, and therefore provides a more stable performance. In order to evaluate the generalization capabilities of the proposed methodology, it has been applied to two types of biosignals with different kinds of nonstationary behavior: electroencephalographic and phonocardiographic recordings. Even though these two databases contain samples with different degrees of complexity and a wide variety of characterizing patterns, the results demonstrate accuracies for the detection of pathologies above 98%. These results open the possibility of extrapolating the methodology to the study of other biosignals.
Keywords
 Partial Least Squares
 Relevance Measure
 Short Time Fourier Transform
 Relevance Threshold
 Probability Distribution Function
Introduction
Biosignal recordings are useful to extract information about the functional state of the human organism. For this reason, such recordings are widely used to support diagnosis, making automatic decision systems important tools to improve pathology detection and evaluation. Nonetheless, since the underlying biological systems tend to have a time-dependent response to environmental excitations, nonstationarity can be considered an inherent property of biosignals [1, 2]. Moreover, changes in physiological or pathological conditions may produce significant variations along time. For instance, the normal blood flow inside the heart is mainly laminar and therefore silent; but when the flow becomes turbulent, it causes vibration of the surrounding tissues and hence becomes noisy, giving rise to murmurs, which can be detected by analyzing phonocardiographic (PCG) recordings. So, PCG recordings are nonstationary signals that exhibit sudden frequency changes and transients [3]. In another example, the electroencephalographic (EEG) signals represent the clinical signs of the synchronous activity of the neurons in the brain; but in the case of epileptic seizures, there is a sudden and recurrent malfunction of the brain that exhibits considerable short-term nonstationarities [4] that can be detected by analyzing these recordings.
However, in the aforementioned examples, conventional analysis in the time or frequency domains does not provide sufficiently relevant information for feature extraction and classification, limiting automatic analysis for diagnostic purposes. Nonetheless, the main difficulty in automatically detecting physiological or pathological conditions lies in the wide variety of patterns that tend to appear under nonstationary conditions. Thus, for example, the possibility of automatically detecting epileptic seizures from EEG signals is limited by the wide variety of frequencies, amplitudes, spikes, and waves that tend to appear [5] along time with no precise localization. Likewise, in PCG signals, murmurs appear overlapped with the cardiac beat, and sometimes cannot be easily distinguished even by the human ear [3]. Thereby, the performance of automatic decision support systems strongly depends on an adequate choice of the features that accurately parameterize the nonstationary behaviors present. Thus, a current challenging problem is to detect a variety of nonstationary biosignal activities with a low computational complexity, to provide tools for efficient biosignal database management and annotation.
As commented before, it is well known that nonstationary conditions give rise to temporal changes in the spectral content of the biosignals [2]. In this sense, the literature reports different features for examining the dynamic properties during transient physiological or pathological episodes. These features are usually extracted from the time–frequency (t–f) representations [1, 3, 4] of the signals under analysis. In order to estimate such t–f representations, both parametric and nonparametric estimations are generally employed. Among the most popular nonparametric approaches are the short-time Fourier transform (STFT), the wavelet transform (WT), matching pursuit (MP), the Choi-Williams distribution (CWD), and the Wigner-Ville distribution (WVD) [2, 6]; and among the parametric models, time-variant autoregressive models and adaptive filtering [3, 4].
The features extracted from t–f representations are expected to characterize abnormal behaviors [7]. Previous studies on EEG and PCG have shown that techniques such as matching pursuit are efficient for describing the t–f representations with a reduced number of atoms [8, 9]. Nonetheless, a signal decomposition grounded on matching pursuit does not necessarily provide the same number of t–f atoms for each recording; hence, dimensionality reduction arises as an additional issue, to handle dynamic features of different lengths. Additionally, two-dimensional time–frequency/scale approaches, such as the t–f distributions (linear or quadratic) or wavelet analysis, have also been widely used in biosignal processing, in particular for EEG [5, 10] and PCG [6, 11]. In this sense, an approach to create optimized quadratic t–f representations is proposed in [12] by designing kernels that lead to the maximum separability among classes. Moreover, recent approaches allow an EEG data representation with adaptive and sparse decompositions [13].
However, despite the flexibility provided by two-dimensional t–f representations, and regarding their use for classification purposes, some issues still remain open. For instance, the intrinsic dimensionality of t–f representations is huge, and thus the extraction of relevant and nonredundant features becomes essential for classification. For this purpose, [5] proposes a straightforward approach to compute a set of t–f tiles that represent the fractional energy of the biosignal in a specific frequency band and time window; thus the energy can be evaluated by a simple measure, like the mean energy in each tile. Nonetheless, there is a noteworthy unsolved issue associated with local-based analysis in the tiling approach, namely the selection of the size of the local relevant regions [2]. As a result, the choice of features over the t–f representations is highly dependent on the final application. In this sense, linear decomposition methods have also been considered to extract features over t–f planes [1, 14], by arranging the t–f matrix in a single feature vector; however, in this case, it is highly convenient to previously fix a confined area of relevance over the t–f representations [3]. Thus, in [15], a t–f region is selected by a two-dimensional weighting function based on a mutual information criterion developed to obtain the maximum separability among classes, so the weighted space is mapped to a set of one-dimensional features, although the methodology is restricted to a specific class of t–f representations.
Therefore, the extraction of relevant information from bi-dimensional t–f features has been discussed in the past as a means to improve performance during and after training in the learning process. Namely, as pointed out in [16], two main issues have to be solved to obtain an effective feature selection algorithm: the estimation of the measure associated with a given relevance function (i.e., a measure of distance among t–f planes), and the calculation of the multivariate transformation that may maximize the differences among classes pointed out by the relevance measures, projecting the features onto a new space [1].
This research proposes a new methodology for dimensionality reduction of t–f based representations. The proposed methodology consecutively carries out a stage of feature selection followed by a stage of linear decomposition of the time–frequency planes. At the beginning, the most relevant features (best localized points, or frequency bands over the t–f representations) are selected by means of a relevance measure. As a result, both the irrelevant information and the computational burden of a later transformation and/or classification stage are significantly decreased. Then, data are projected into a lower dimensional subspace using orthogonal transformations. For the sake of comparison, techniques based on principal component analysis (PCA) and partial least squares (PLS) were considered throughout this study as nonsupervised and supervised transformations, respectively.
In order to evaluate the generalization capabilities of the proposed methodology, it has been evaluated using two different databases under three classification scenarios: the first uses a database of PCG recordings to detect heart murmurs; the second uses EEG recordings to detect epilepsy; and the third differentiates between five different types of EEG segments.
The article is organized as follows: The first section is dedicated to an overview of linear decomposition methods with extension to matrix data; second, the concepts of relevance in terms of relevant mappings and the selection of t–f based features by means of relevance measures, are described. Then, comparative results against other t–f based methods are provided [3, 5, 6].
Methods
The methodology introduced throughout this article relies on a prior segmentation of the signals, followed by their characterization by means of a t–f representation. Later, the t–f planes are significantly reduced by means of a feature selection procedure followed by a linear decomposition. The considered stages are described next.
For the sake of simplicity, the time–frequency analysis carried out in this study has been estimated using spectrograms based on the classical STFT [5]. A t–f representation of a segment of a nonstationary signal can be seen as a matrix set of features with column- and row-wise relationships, holding discriminant information about the underlying process:

$${X}^{(k)}=\left[{x}_{\mathit{ij}}^{(k)}\right]\in {\mathbb{R}}^{F\times T},\phantom{\rule{1em}{0ex}}k=1,\dots ,K$$

where each column vector ${\mathit{x}}_{\mathit{cj}}^{\left(k\right)}$ represents the power content at the F frequencies at time instant j=1,…,T, while each row vector ${\mathit{x}}_{\mathit{ri}}^{\left(k\right)}$ represents the power change along the T time instants for frequency band i=1,…,F. The real-valued ${x}_{\mathit{ij}}^{\left(k\right)}$ is the power content at frequency i and time j.
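As an illustration, the construction of such a t–f feature matrix can be sketched in Python (a minimal example on a synthetic segment; the sampling rate and STFT window settings are illustrative choices, not the study's actual configuration):

```python
import numpy as np
from scipy.signal import stft

# Build an F x T t-f feature matrix from one signal segment via the STFT.
# The segment is synthetic noise; fs and window sizes are arbitrary choices.
rng = np.random.default_rng(0)
fs = 4000                            # e.g., the PCG sampling rate
segment = rng.standard_normal(4800)  # one pre-segmented beat (synthetic)

f, t, Zxx = stft(segment, fs=fs, nperseg=256, noverlap=192)
X = np.abs(Zxx) ** 2                 # spectrogram: power at frequency i, time j

F, T = X.shape                       # columns: one spectrum per time instant;
                                     # rows: power evolution per frequency band
```

Each recording k then yields one such matrix X^{(k)}, and the reduction stages below operate on the set of these matrices.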
Nonetheless, the main drawbacks of these arranged features are their large size and the huge quantity of redundant data they contain. Thereby, data reduction methods are required to accurately parameterize the activity of time-varying features, while preserving the information contained in the column- and row-wise relationships of the matrix data [14].
Dimensionality reduction of t–f representations using linear decomposition approaches
Thus, to reduce the dimensionality of the input data, each t–f matrix is first rearranged into a single row vector,

$${\chi}^{(k)}=vec\left({X}^{(k)}\right)=\left[{\left({\mathit{x}}_{c1}^{(k)}\right)}^{\top},{\left({\mathit{x}}_{c2}^{(k)}\right)}^{\top},\dots ,{\left({\mathit{x}}_{\mathit{cT}}^{(k)}\right)}^{\top}\right]\in {\mathbb{R}}^{1\times \mathit{FT}},\phantom{\rule{2em}{0ex}}(1)$$

and a transformation matrix $\mathit{W}\in {\mathbb{R}}^{\mathit{FT}\times p}$, with p≪FT, can be defined to map the original feature space ${\mathbb{R}}^{1\times \mathit{FT}}$ into a reduced feature space ${\mathbb{R}}^{1\times p}$ by means of the linear operation z^{(k)}=χ^{(k)}W, where z^{(k)} is the transformed feature vector. The transformation matrix W can be obtained using a nonsupervised approach such as PCA, or a supervised approach such as PLS [17]. The vectorization approach in Equation (1) will be referred to next as vectorized PCA/PLS, depending on the specific transformation used.
Alternatively, the t–f matrices can be directly decomposed without a previous vectorization, by means of two transformation matrices U and V acting on the row and column spaces,

$${\mathit{Z}}^{(k)}=\mathit{U}{X}^{(k)}\mathit{V}.\phantom{\rule{2em}{0ex}}(2)$$

Finally, the feature vector z^{(k)} is obtained by stacking the columns of Z^{(k)} into a new single feature vector. As described in [3], these approaches are termed 2D–PCA or 2D–PLS [18, 19], depending on the orthogonal transformation used to compute the matrices U and V in Equation (2).
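The bilinear alternative can be sketched as follows (synthetic data again; here U and V are taken as leading eigenvectors of row- and column-wise scatter matrices, one common way to realize a 2D decomposition, not necessarily the exact construction of [18, 19]):

```python
import numpy as np

# Minimal 2D-PCA-style sketch: the plane X^(k) is compressed on both sides,
# Z^(k) = U^T X^(k) V. Component counts n_r, n_c are arbitrary choices.
rng = np.random.default_rng(2)
K, F, T, n_r, n_c = 40, 16, 12, 4, 3
planes = rng.standard_normal((K, F, T))
centered = planes - planes.mean(axis=0)

S_row = sum(P @ P.T for P in centered) / K   # F x F scatter (frequency side)
S_col = sum(P.T @ P for P in centered) / K   # T x T scatter (time side)

# eigh returns eigenvalues in ascending order; keep the trailing columns.
U = np.linalg.eigh(S_row)[1][:, -n_r:]       # F x n_r
V = np.linalg.eigh(S_col)[1][:, -n_c:]       # T x n_c

Z = np.stack([U.T @ P @ V for P in centered])   # K reduced matrices, n_r x n_c
z = Z.reshape(K, n_r * n_c)                     # vec(Z^(k)) feature vectors
```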
Relevance analysis over t–f based features
According to some measure of evaluation, a relevance analysis distinguishes those variables which effectively represent the underlying physiological phenomena. Such variables are named relevant features, whereas the measure of evaluation is known as the relevance measure. In this sense, variable selection tries to reject those variables whose contribution to representing a target is null or negligible (irrelevant features), as well as those that carry repeated information (redundant features).

In general, a relevance measure $\rho (\mathcal{X},\mathcal{C},{x}_{\mathit{ij}})$, quantifying the contribution of the feature x_{ij} to the discrimination of the classes in $\mathcal{C}$, must satisfy the following properties:

– Nonnegativity: i.e., $\rho (\mathcal{X},\mathcal{C},{x}_{\mathit{ij}})\ge 0.$

– Nullity: the function $\rho (\mathcal{X},\mathcal{C},{x}_{\mathit{ij}})$ is null if the feature x_{ij} has no relevance at all.

– Nonredundancy: if ${x}_{\mathit{ij}}^{\prime}=\alpha {x}_{\mathit{ij}}+\varsigma $, where the real-valued α≠0 and ς is some zero-mean, unit-variance noise, then $\left|\rho (\mathcal{X},\mathcal{C},{x}_{\mathit{ij}})-\rho (\mathcal{X},\mathcal{C},{x}_{\mathit{ij}}^{\prime})\right|\to 0.$

In this study, two relevance measures are considered:
 a. Linear correlation, given by:
$${\rho}_{\text{lc}}({x}_{\mathit{ij}}|\mathit{c})=\left|\frac{E\left\{({x}_{\mathit{ij}}^{(k)}-\overline{{x}_{\mathit{ij}}})({c}^{(k)}-\overline{c}):\forall k\right\}}{\sqrt{E\left\{{({x}_{\mathit{ij}}^{(k)}-\overline{{x}_{\mathit{ij}}})}^{2}:\forall k\right\}\phantom{\rule{0.3em}{0ex}}E\left\{{({c}^{(k)}-\overline{c})}^{2}:\forall k\right\}}}\right|\phantom{\rule{2em}{0ex}}(4)$$
 b. Symmetrical uncertainty, which is a measure of uncertainty of a random variable based on the information-theoretic concept of entropy, given by:
$${\rho}_{\text{su}}({x}_{\mathit{ij}}|\mathit{c})=2\phantom{\rule{0.3em}{0ex}}\frac{H\{{x}_{\mathit{ij}}^{(k)}:\forall k\}-H\{{x}_{\mathit{ij}}^{(k)}|{c}^{(k)}:\forall k\}}{H\{{x}_{\mathit{ij}}^{(k)}:\forall k\}+H\{{c}^{(k)}:\forall k\}}\phantom{\rule{2em}{0ex}}(5)$$
where the entropy and conditional entropy are computed as

$$H\{{x}_{\mathit{ij}}^{(k)}:\forall k\}=-\sum P({x}_{\mathit{ij}}^{(k)})\phantom{\rule{0.3em}{0ex}}\mathrm{log}\phantom{\rule{0.3em}{0ex}}P({x}_{\mathit{ij}}^{(k)})\phantom{\rule{2em}{0ex}}(6)$$

$$H\{{x}_{\mathit{ij}}^{(k)}|{c}^{(k)}:\forall k\}=-\sum P({c}^{(k)})\sum P({x}_{\mathit{ij}}^{(k)}|{c}^{(k)})\phantom{\rule{0.3em}{0ex}}\mathrm{log}\phantom{\rule{0.3em}{0ex}}P({x}_{\mathit{ij}}^{(k)}|{c}^{(k)})\phantom{\rule{2em}{0ex}}(7)$$

where $P({x}_{\mathit{ij}}^{(k)})$ and P(c^{(k)}) are the probability distribution functions (PDF) of the features of interest and the labels, respectively, and $P({x}_{\mathit{ij}}^{(k)}|{c}^{(k)})$ is the conditional PDF. To compute these functions, histogram-driven estimators were used, and the sums in Equations (6) and (7) were carried out along the histogram bins. However, if the number of recordings is lower than a certain threshold, other kinds of estimators, such as kernel-based ones, could be used.
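Both measures can be sketched as follows (an illustrative implementation: the correlation follows Equation (4), and the symmetrical uncertainty is computed from histogram estimates, with the bin count an arbitrary choice):

```python
import numpy as np

def rho_lc(x, c):
    """|Pearson correlation| between one t-f point across recordings and labels."""
    xm, cm = x - x.mean(), c - c.mean()
    return abs((xm * cm).mean() / np.sqrt((xm ** 2).mean() * (cm ** 2).mean()))

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def rho_su(x, c, bins=8):
    """Symmetrical uncertainty from a joint histogram of x and the labels c."""
    joint, _, _ = np.histogram2d(x, c, bins=[bins, len(np.unique(c))])
    joint /= joint.sum()
    h_x, h_c = entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0))
    mi = h_x + h_c - entropy(joint.ravel())  # I(x;c) = H(x) - H(x|c)
    return 2.0 * mi / (h_x + h_c)

# Synthetic check: one label-dependent feature, one irrelevant feature.
rng = np.random.default_rng(3)
c = rng.integers(0, 2, size=200)
x_rel = c + 0.1 * rng.standard_normal(200)   # informative feature
x_irr = rng.standard_normal(200)             # irrelevant feature
```

On such data, both measures score the informative feature well above the irrelevant one, which is the property the thresholding below relies on.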
Selection of the most informative areas from t–f representations
 i)
The first one consists of evaluating the relevance of each point of the t–f representation, and then selecting the set of most relevant points to build a reduced feature vector that will later be transformed by conventional one-dimensional (1–D) linear decomposition methods. This approach is described in Algorithm 1 and will be referred to later as 1D–PCA or 1D–PLS (depending on the transformation technique used).
 ii)
The second consists of evaluating the relevance of the time-varying spectral components of the t–f representation, and then selecting the most relevant frequency bands to build a t–f based feature matrix, which is further reduced using a two-dimensional (2–D) matrix-based approach. This approach is described in Algorithm 2 and will be referred to later as 2D–PCA or 2D–PLS (depending on the transformation technique used).
Algorithm 1 Selection of t–f based features using relevance measures and dimensionality reduction (1–D approach)
Input: t–f matrix dataset {X^{(1)},X^{(2)},…,X^{(K)}}, relevance threshold η.
 1.
Estimate the relevance measure ρ(x_{ij}|c) of the t–f points, using one of the relevance measures defined in Equations (4) or (5).
 2.
Select the most relevant t–f variables
for k=1 to K do
${\widehat{X}}^{\left(k\right)}=\left\{{x}_{\mathit{ij}}^{\left(k\right)}\phantom{\rule{1em}{0ex}}\forall i,j:\rho \left({x}_{\mathit{ij}}|\mathit{c}\right)\ge \eta \right\}$
end for
 3.
Convert t–f matrices into vectors
for k=1 to K do
${\chi}^{\left(k\right)}=vec\left({\widehat{X}}^{\left(k\right)}\right)=\left[{\left({\widehat{x}}_{c1}^{\left(k\right)}\right)}^{\top},{\left({\widehat{x}}_{c2}^{\left(k\right)}\right)}^{\top},\dots ,{\left({\widehat{x}}_{\mathit{cT}}^{\left(k\right)}\right)}^{\top}\right]$
end for
 4.
Compute the transformation matrix V of 1D–PCA or 1D–PLS using the relevant feature vector set {χ ^{(1)},χ ^{(2)},…,χ ^{(K)}}.
 5.
Transform the feature vectors χ ^{(k)}into the reduced feature vector z ^{(k)}, as
for k=1 to K do
z^{(k)}=χ^{(k)}V
end for
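A compact sketch of Algorithm 1 on synthetic planes (the relevance measure here is the plain absolute correlation, and the threshold η and the component count are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

# Score every t-f point, keep points above eta, vectorize the survivors,
# and reduce them with PCA (the 1D-PCA variant). Data are synthetic.
rng = np.random.default_rng(4)
K, F, T = 80, 10, 8
c = rng.integers(0, 2, size=K).astype(float)
planes = rng.standard_normal((K, F, T))
planes[:, 2:4, 1:5] += 2.0 * c[:, None, None]   # a few informative points

flat = planes.reshape(K, F * T)
xm, cm = flat - flat.mean(axis=0), c - c.mean()
rho = np.abs(xm.T @ cm) / np.sqrt((xm ** 2).sum(axis=0) * (cm ** 2).sum())

eta = 0.5                          # relevance threshold (arbitrary)
mask = rho >= eta
chi = flat[:, mask]                # reduced relevant feature vectors
z = PCA(n_components=3).fit_transform(chi)
```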
Algorithm 2 Frequency band selection from t–f representations using relevance measures and dimensionality reduction by matricial approach (2–D approach)
Input: t–f matrix dataset {X^{(1)},X^{(2)},…,X^{(K)}}, relevance threshold η.
 1. Estimate the relevance measure ρ(x_{ij}|c) of the t–f points, building the relevance map R:$\begin{array}{l}\mathit{R}=\left[\begin{array}{llll}\rho \left({x}_{11}|\mathit{c}\right)& \rho \left({x}_{12}|\mathit{c}\right)& \dots & \rho \left({x}_{1T}|\mathit{c}\right)\\ \rho \left({x}_{21}|\mathit{c}\right)& \rho \left({x}_{22}|\mathit{c}\right)& \dots & \rho \left({x}_{2T}|\mathit{c}\right)\\ \vdots & \vdots & \ddots & \vdots \\ \rho \left({x}_{F1}|\mathit{c}\right)& \rho \left({x}_{F2}|\mathit{c}\right)& \dots & \rho \left({x}_{\mathit{FT}}|\mathit{c}\right)\end{array}\right]\end{array}$
 2. Compute the average relevance value over the time axis for each frequency band r, as $\begin{array}{l}{\rho}_{\mathit{rF}}=\mathit{E}\{{\rho}_{\mathit{rj}}:\forall j\}\end{array}$
 3.
Select the most relevant frequency bands
for k=1 to K do
${\widehat{X}}^{\left(k\right)}=\left\{{\mathit{x}}_{\mathit{rF}}^{\left(k\right)}\phantom{\rule{1em}{0ex}}\forall r:{\rho}_{\mathit{rF}}\ge \eta \right\}$
end for
 4.
Compute the transformation matrices U and V of 2D–PCA (or 2D–PLS, respectively), using the reduced t–f matrices set $\{{\widehat{X}}^{\left(1\right)},{\widehat{X}}^{\left(2\right)},\dots ,{\widehat{X}}^{\left(K\right)}\}$.
 5.
Transform the reduced t–f matrices ${\widehat{X}}^{\left(k\right)}$ into the reduced feature vector z ^{(k)}, as
for k=1 to K do
${\mathit{Z}}^{\left(k\right)}=\mathit{U}{\widehat{\mathit{X}}}^{\left(k\right)}\mathit{V}$
${\mathit{z}}^{\left(k\right)}=vec\left({\mathit{Z}}^{\left(k\right)}\right)$
end for
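Algorithm 2 can be sketched analogously (synthetic data; the bilinear compression below is a generic eigenvector-based stand-in for the 2D–PCA/2D–PLS step, with arbitrary component counts):

```python
import numpy as np

# Average a point-wise relevance map over time to score whole frequency
# bands, keep bands above eta, then compress the reduced planes on both sides.
rng = np.random.default_rng(5)
K, F, T = 60, 12, 10
c = rng.integers(0, 2, size=K).astype(float)
planes = rng.standard_normal((K, F, T))
planes[:, 3:6, :] += 1.5 * c[:, None, None]   # three informative bands

flat = planes.reshape(K, F * T)
xm, cm = flat - flat.mean(axis=0), c - c.mean()
rho = np.abs(xm.T @ cm) / np.sqrt((xm ** 2).sum(axis=0) * (cm ** 2).sum())
R = rho.reshape(F, T)                         # relevance map
rho_band = R.mean(axis=1)                     # average over the time axis

eta = 0.4
bands = rho_band >= eta
reduced = planes[:, bands, :]                 # keep relevant frequency bands

# Bilinear compression of the reduced planes, Z^(k) = U^T X^(k) V.
centered = reduced - reduced.mean(axis=0)
U = np.linalg.eigh(sum(P @ P.T for P in centered) / K)[1][:, -2:]
V = np.linalg.eigh(sum(P.T @ P for P in centered) / K)[1][:, -2:]
z = np.stack([(U.T @ P @ V).ravel() for P in centered])
```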
Summary of the proposed approaches
Method  Algorithm  Transformation  Relevance measure

Method 1  Algorithm 1  1D–PCA  ρ_{lc}
Method 2  Algorithm 1  1D–PCA  ρ_{su}
Method 3  Algorithm 1  1D–PLS  ρ_{lc}
Method 4  Algorithm 1  1D–PLS  ρ_{su}
Method 5  Algorithm 2  2D–PCA  ρ_{lc}
Method 6  Algorithm 2  2D–PCA  ρ_{su}
Method 7  Algorithm 2  2D–PLS  ρ_{lc}
Method 8  Algorithm 2  2D–PLS  ρ_{su}
Experimental set–up
Database acquisition and preprocessing
Summary of the characteristics of the database
Database  K  Classes  K_{C} (class)  f_{s} [Hz]  N_{bits}  l

PCG  548  2  274 (normal), 274 (murmur)  4000  16  4800
EEG  500  5  100 (Z), 100 (O), 100 (N), 100 (F), 100 (S)  173.61  12  4096
PCG database
This collection is made up of 45 adult subjects, who gave their informed consent approved by an ethics committee and underwent a medical examination. A diagnosis was carried out for each patient, and the severity of the valve lesion was evaluated by cardiologists according to clinical routine. A set of 26 patients were labeled as normal, while 19 were tagged as pathological with evidence of systolic and diastolic murmurs caused by valve disorders. Furthermore, eight phonocardiographic recordings corresponding to the four traditional focuses of auscultation were taken per patient in the phases of post-expiratory and post-inspiratory apnea. Every recording lasted approximately 12 s and was obtained with the patient lying in dorsal decubitus position. Next, after visual and audio inspection by cardiologists, some of the eight signals were removed because of artifacts and undesired noise. An electronic stethoscope (Welch-Allyn Meditron model) was used to acquire the PCG simultaneously with a standard 3-lead electrocardiogram (EKG); since the QRS complex is clearly determined, the DII derivation is synchronized as a time reference. Both signals were sampled at a 44.1 kHz rate with an amplitude resolution of 16 bits. Preprocessing was carried out, including downsampling to 4000 Hz, amplitude normalization, and inter-beat segmentation, as described in [3]. Finally, after the segmentation process, the database holds 548 heartbeats in total: 274 with murmurs and 274 labeled as normal. The selection of the 548 beats used for training and validation was carried out by expert cardiologists, choosing the most representative beats of normal and pathological (with murmurs) patients, without taking into account the number of heartbeats provided by each patient. The database belongs to both Universidad Nacional de Colombia and Universidad de Caldas.
Recording was carried out taking into account the rules fixed by the Research Ethics Committee of the Universidad de Caldas which provides guidelines and supervision during those procedures involving human beings.
EEG database
The EEG signals correspond to 29 patients with medically intractable focal epilepsies. They were recorded by the Department of Epileptology of the University of Bonn by means of intracranially implanted electrodes [21]. All EEG signals were recorded with a 128-channel acquisition system, using an average common reference. Data were digitized at 173.61 Hz with 12 bits of resolution. The database comprises five sets (denoted Z, O, N, F, S), each composed of 100 single-channel EEG segments of 23.6 s and 4096 time points, which were selected and extracted after visual inspection from continuous multichannel EEG to avoid artifacts (e.g., muscular activity or eye movements). Datasets Z and O consist of segments taken from scalp EEG recordings of five healthy subjects using the standard 10–20 electrode placement; volunteers were awake and relaxed with their eyes open (Z) and eyes closed (O), respectively. Datasets N, F, and S were selected from presurgical diagnostic EEG recordings. The signals were selected from five patients who achieved complete control of the epileptic episodes after the resection of one of the hippocampal formations, which had been correctly diagnosed as the epileptogenic zone. Segments of set F were recorded in the epileptogenic zone, and segments of set N in the hippocampal formation on the opposite side of the brain. While sets N and F contain only activity measured during inter-ictal intervals, set S contains only recordings with ictal activity. In this set, all segments were selected from every recording site exhibiting ictal activity.
Estimation of the t–f representations
Evaluation of the classification performance
 i.
Scenario 1. Murmur detection of PCG signals. The PCG recordings were arranged into two classes (normal and pathological).
 ii.
Scenario 2. Classification of EEG signals into three categories. The EEG segments were sorted into three different classes: types Z and O were combined into a single class; types N and F were combined into another; and type S formed the third class. This scenario with only three categories is closer to real medical applications. Following this criterion, the database was split into: normal (types Z and O) containing 200 recordings, seizure-free (types N and F) with 200 recordings, and seizure (type S) with 100 recordings.
 iii.
Scenario 3. Classification of EEG signals into five different categories. In this scenario, each type of EEG segment (Z, O, N, F, S) was considered a separate class, each containing 100 recordings.
The evaluation of the classification accuracy of each method was carried out using a simple k–NN classifier, evaluated following a cross-validation scheme [22]. Several reasons justify the use of this classifier: it is straightforward to implement; it generally leads to good recognition performance thanks to the nonlinearity of its decision boundaries; and its complexity is independent of the number of classes. The cross-validation approach used to evaluate the performance of the methodology consists of the division of each dataset into 10 folds containing different recordings and an even quantity of records from each class. Nine of these folds were used for training and the remaining one for validation purposes. The methods enumerated in Table 1 were applied to the training folds, and the resulting feature spaces were used to train the k–NN classifier. Then, the relevance measures, the transformation matrices, and the classifier obtained during the training phase were used to categorize the recordings of the validation fold. This procedure was repeated, changing the training and validation folds, until all 10 folds had been used.
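The evaluation protocol can be sketched as follows (synthetic, well-separated data; the transformation here is a PCA inside a pipeline so that, as in the procedure above, it is fitted on the training folds only and reused on the validation fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# 10 stratified folds; fit transform + k-NN on 9 folds, score on the 10th.
rng = np.random.default_rng(6)
K, D = 200, 50
y = rng.integers(0, 2, size=K)
X = rng.standard_normal((K, D)) + 1.5 * y[:, None]   # separable classes

accs = []
for tr, va in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=0).split(X, y):
    model = make_pipeline(PCA(n_components=5),
                          KNeighborsClassifier(n_neighbors=3))
    model.fit(X[tr], y[tr])
    accs.append(model.score(X[va], y[va]))

mean_acc, std_acc = float(np.mean(accs)), float(np.std(accs))
```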
The performance was assessed in terms of the accuracy, sensitivity, and specificity:

$$\mathit{Acc}=\frac{{n}_{C}}{{n}_{T}},\phantom{\rule{2em}{0ex}}\mathit{Sen}=\frac{{n}_{\mathit{TP}}}{{n}_{\mathit{TP}}+{n}_{\mathit{FN}}},\phantom{\rule{2em}{0ex}}\mathit{Spe}=\frac{{n}_{\mathit{TN}}}{{n}_{\mathit{TN}}+{n}_{\mathit{FP}}}$$

where n_{C} is the number of correctly classified patterns, n_{T} is the total number of patterns used to feed the classifier, n_{TP} is the number of true positives (objective class accurately classified), n_{FN} is the number of false negatives (objective class classified as control class), n_{TN} is the number of true negatives (control class accurately classified), and n_{FP} is the number of false positives (control class classified as objective class). In this study, the pathological classes correspond to the objective class, while the normal classes correspond to the control class. The accuracy, sensitivity, and specificity were calculated for each validation fold, and their mean and standard deviation were used as figures of merit.
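The figures of merit reduce to simple ratios of confusion counts; for instance (the counts below are made-up example values, not results from the study):

```python
# Figures of merit from raw confusion counts; names mirror the text.
n_TP, n_FN, n_TN, n_FP = 270, 4, 269, 5     # hypothetical example counts
n_C = n_TP + n_TN                           # correctly classified patterns
n_T = n_TP + n_FN + n_TN + n_FP             # total patterns fed to the classifier

accuracy = 100.0 * n_C / n_T
sensitivity = 100.0 * n_TP / (n_TP + n_FN)
specificity = 100.0 * n_TN / (n_TN + n_FP)
```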
For the multi–class classification problems (scenarios 2 and 3), the sensitivity and specificity were computed taking each class as the target and the remaining ones as the control classes.
Results
This section analyzes the tuning of the parameters that characterize the proposed methods: the number of neighbors of the k–NN classifier, the number of components used by the linear decomposition approaches, and the relevance threshold. For the sake of comparison, the mean and the standard deviation of the accuracy obtained by the different methods were computed. For the configurations that provided the best accuracy, the sensitivity and the specificity were also computed.
The tuning of the proposed methods was carried out for the PCG database using scenario 1, whereas for the EEG database the procedure was carried out using scenario 2; finally, with the best configurations obtained for scenario 2, scenario 3 was tested.
Tuning of the k–NN based classifier
By stepwise increasing the number of neighbors, k, the optimal value was determined as the one providing the highest accuracy. The procedure was carried out for each algorithm, using all the t–f features available in each scenario, with the relevance threshold set to η=100% (i.e., no relevance criterion was introduced). Additionally, the number of components for PCA/PLS (n for the 1D methods, n_{c} and n_{r} for the 2D methods) was selected as the number of components that describes 90% of the total variability of the dataset.
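The 90%-variability rule for choosing the number of components can be sketched as follows (synthetic data with decaying variances; the data and shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Keep the smallest number of PCA components whose cumulative explained
# variance reaches 90%. Feature variances decay linearly here by construction.
rng = np.random.default_rng(7)
X = rng.standard_normal((300, 40)) * np.linspace(5.0, 0.1, 40)

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum, 0.90) + 1)   # first index reaching 90%
```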
In the framework of scenario 1, Figure 5A shows that, applying Algorithm 1 to the PCG data, the accuracy of the classifier decreases as the number of neighbors increases. Moreover, the standard deviation is lower for intermediate values and becomes larger as the number of neighbors increases. In the context of scenario 2, Figure 5B shows similar trends for the EEG signals.
Similar trends appear using the method based on the selection of frequency bands (Algorithm 2). Figures 5C,D show that the performance decreases as the number of neighbors increases for the PCG and EEG databases (scenarios 1 and 2). Note that the results using Algorithm 2 are more stable than those using Algorithm 1. These results reflect the overall structure of the obtained feature spaces.
Accordingly, for both algorithms, the optimal number of neighbors was fixed at k=1 for the PCG database and k=3 for the EEG database. After the feature selection stage, the decision boundary among classes is expected to be clearer than when no relevance measures are used. Thus, after the relevance analysis, the number of neighbors k of the classifier could be tuned to a higher value; however, the initial estimation (no relevance) is an admissible approximation.
Selection of the relevant features
The variable selection was carried out by choosing the most relevant features according to the proposed relevance measures: linear correlation, ρ_{lc}, and symmetrical uncertainty, ρ_{su}. The training sets of the t–f representations of the PCG and EEG signals were used to compute point-wise relevance measures, yielding a relevance matrix that quantifies the dependence of each t–f point on its respective label. As a result, a global measure of the degree of dependence is accomplished. Therefore, the number of features is selected according to a universal threshold fixed a priori over the relevance map, which shows the t–f areas or frequency bands with a higher relation to the phenomena under study. Additionally, as explained above, the threshold is varied as a function of the total number of features at hand, i.e., the higher the number of selected features, the lower the relevance threshold.
In order to find the most relevant features, the estimated relevance measures were reshaped as vectors and sorted from highest to lowest values. For both databases, the relevance vectors sorted according to the considered relevance measures are shown at the bottom of each subfigure in Figure 6. In the case of the t–f point selection, and using the methodology described in Algorithm 1, the variables were selected according to their relevance, keeping those with a value above a certain threshold η. This threshold should be adjusted to optimize the accuracy of the classifier.
Regarding the selection of the frequency bands described in Algorithm 2, the relevance measures were averaged over the time axis. As a result, a vector with the relevance along the frequency axis is obtained. The relevance values, ${\rho}_{r{F}_{\text{lc}}}$ and ${\rho}_{r{F}_{\text{su}}}$, corresponding to the frequency axis for both databases, are shown in the left plots of each subfigure in Figure 6. Thus, the frequency bands were selected according to their relevance.
Data transformation by linear decomposition methods
After selecting the most relevant variables, the resulting data set (comprising the most relevant variables) was further reduced using the linear transformation methods commented on before. The number of latent components n for 1D–PCA (methods 1 and 2) and 1D–PLS (methods 3 and 4), as well as the number of time n_{c} and frequency n_{r} components for 2D–PCA (methods 5 and 6) and 2D–PLS (methods 7 and 8), was selected according to the maximum classification rate obtained.
Figures 9B,C,E,F show the performance of the classifier vs. the number of column and row components of the 2D methods used in Algorithm 2. Figures 9B,E show the classifier outcomes using the 2D–PCA methods, while Figures 9C,F show the results using the proposed 2D–PLS methods.
In scenario 1, the number of row and column components of the t–f representation of PCG signals must be increased to achieve a stable behavior. For the EEG signals used in scenario 2, both methods behaved stably as the number of row and column components increased. Furthermore, the accuracy increased with a small number of column components, whereas it stabilized as the number of row components grew. Since the column components are related to the temporal activity and the row components to the spectral variability, the behavior exhibited by the EEG signals can be interpreted as a smooth temporal activity with a higher spectral variability, while in PCG both the temporal and spectral activities present a large variability.
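A minimal sketch of the two-directional 2D decomposition underlying these methods is given below, here for the unsupervised (2D–PCA) case in the spirit of [19, 20]; the function name and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_2dpca(X, n_r, n_c):
    """Two-directional 2D-PCA on a stack of t-f matrices.

    X : array (K, F, T) -- K spectrograms of F frequency bins x T time frames.
    Returns projections U (F x n_r) and V (T x n_c), so that each spectrogram
    S maps to the small n_r x n_c feature matrix U.T @ S @ V.
    """
    Xc = X - X.mean(axis=0)                        # center each t-f point
    # Row (frequency) covariance: average of S S^T over the samples.
    G_row = np.einsum('kft,kgt->fg', Xc, Xc) / len(X)
    # Column (time) covariance: average of S^T S over the samples.
    G_col = np.einsum('kft,kfs->ts', Xc, Xc) / len(X)
    # Leading eigenvectors of each covariance give the two projections
    # (eigh returns ascending eigenvalues, hence the column reversal).
    _, U = np.linalg.eigh(G_row)
    _, V = np.linalg.eigh(G_col)
    return U[:, ::-1][:, :n_r], V[:, ::-1][:, :n_c]

# For one spectrogram S (F x T), the classifier input would be
# Z = U.T @ S @ V, of shape (n_r, n_c), flattened into a feature vector.
```

Because the two projections are fitted independently along rows and columns, n_r and n_c can be tuned separately, which is exactly the trade-off explored in Figure 9.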
Summary of results
Best performance obtained for the methodologies studied using the PCG database (scenario 1)
Original size of the t–f representation: 512 × 480 = 245760. Number of neighbors: 1

Methodology | ρ_min | n_rel | n = (n_c × n_r) | Accuracy (%) | Sensitivity (%) | Specificity (%)
PCA with tiling [5] | NA | NA | 18 = (9×2) | 92.52±2.32 | 92.70±6.21 | 92.30±3.69
PLS with tiling [5] | NA | NA | 18 = (9×2) | 93.80±2.85 | 94.52±5.48 | 93.02±4.78
Vectorized PCA [14] | NA | NA | NA | 91.22±2.76 | 90.50±2.60 | 91.88±6.51
Vectorized PLS [14] | NA | NA | NA | 94.89±2.24 | 94.55±3.05 | 95.25±4.24
Algorithm 2 + 2D–PCA (no relevance) | NA | NA | NA | 99.28±1.52 | 99.64±1.13 | 98.90±2.48
Algorithm 2 + 2D–PLS (no relevance) | NA | NA | NA | 99.28±1.52 | 99.64±1.13 | 98.90±2.48
Method 1 | 45% | 110592 | 27 | 93.07±3.50 | 92.72±4.18 | 93.43±4.10
Method 2 | 15% | 36864 | 12 | 96.72±2.06 | 95.62±3.76 | 97.78±3.98
Method 3 | 40% | 98304 | 26 | 98.18±1.49 | 98.20±2.54 | 98.16±2.61
Method 4 | 15% | 36864 | 21 | 98.72±1.23 | 98.90±1.77 | 98.53±1.90
Method 5 | 40% | 97920 | 21 = (7×3) | 97.09±2.12 | 97.43±3.00 | 96.72±2.04
Method 6 | 10% | 24576 | 70 = (10×7) | 99.28±1.25 | 99.64±1.13 | 98.92±1.75
Method 7 | 40% | 97920 | 60 = (12×5) | 97.46±1.94 | 98.17±2.56 | 96.72±2.69
Method 8 | 10% | 24576 | 70 = (10×7) | 99.64±0.76 | 99.64±1.13 | 99.63±1.17
Results for the EEG database (scenario 2)
Original size of the t–f representation: 512 × 500 = 256000. Number of neighbors: 3

Methodology | ρ_min | n_rel | n = (n_c × n_r) | Accuracy (%) | Class | Sensitivity (%) | Specificity (%)
PCA with tiling [5] | NA | NA | 15 = (3×5) | 94.00±3.65 | ZO | 94.50±4.38 | 95.67±4.17
 | | | | | NF | 98.50±2.42 | 95.67±3.17
 | | | | | S | 84.00±10.75 | 99.00±1.29
PLS with tiling [5] | NA | NA | 15 = (3×5) | 93.60±3.37 | ZO | 94.00±4.59 | 95.67±4.17
 | | | | | NF | 98.50±2.42 | 95.33±2.81
 | | | | | S | 83.00±10.59 | 98.75±1.32
Vectorized PCA [14] | NA | NA | NA | 93.40±4.12 | ZO | 97.00±4.22 | 94.00±2.63
 | | | | | NF | 93.50±4.12 | 97.00±3.31
 | | | | | S | 86.00±16.47 | 98.50±2.11
Vectorized PLS [14] | NA | NA | NA | 96.00±2.98 | ZO | 100.00±0.00 | 95.33±3.58
 | | | | | NF | 95.50±4.97 | 98.00±2.81
 | | | | | S | 89.00±11.01 | 100.00±0.00
Algorithm 2 + 2D–PCA (no relevance) | NA | NA | NA | 98.00±1.88 | ZO | 100.00±0.00 | 98.66±2.33
 | | | | | NF | 99.00±2.10 | 98.00±2.33
 | | | | | S | 92.00±7.88 | 100.00±0.00
Algorithm 2 + 2D–PLS (no relevance) | NA | NA | NA | 98.20±1.13 | ZO | 100.00±0.00 | 98.33±2.35
 | | | | | NF | 99.00±2.10 | 98.66±1.72
 | | | | | S | 93.00±4.83 | 100.00±0.00
Method 1 | 50% | 128000 | 20 | 94.40±2.95 | ZO | 99.00±2.11 | 94.00±4.10
 | | | | | NF | 91.50±6.69 | 97.00±4.29
 | | | | | S | 91.00±11.97 | 99.75±0.79
Method 2 | 15% | 38400 | 25 | 94.80±3.29 | ZO | 95.00±4.71 | 95.00±4.23
 | | | | | NF | 95.00±3.33 | 97.33±2.63
 | | | | | S | 94.00±8.43 | 99.25±1.21
Method 3 | 15% | 38400 | 9 | 97.80±1.14 | ZO | 98.00±2.58 | 99.33±1.41
 | | | | | NF | 98.00±2.58 | 98.00±1.72
 | | | | | S | 97.00±4.83 | 99.25±1.21
Method 4 | 10% | 25600 | 13 | 98.20±1.99 | ZO | 98.50±2.42 | 99.00±2.25
 | | | | | NF | 98.50±3.37 | 98.00±1.72
 | | | | | S | 97.00±4.83 | 100.00±0.00
Method 5 | 50% | 128000 | 425 = (17×25) | 98.80±1.03 | ZO | 100.00±0.00 | 99.00±1.61
 | | | | | NF | 99.00±2.11 | 99.33±1.49
 | | | | | S | 96.00±5.16 | 99.75±0.79
Method 6 | 45% | 115200 | 442 = (17×26) | 98.40±1.26 | ZO | 100.00±0.00 | 99.33±1.41
 | | | | | NF | 99.00±2.11 | 98.33±2.36
 | | | | | S | 94.00±6.99 | 99.75±0.79
Method 7 | 45% | 115200 | 42 = (3×14) | 98.60±0.97 | ZO | 100.00±0.00 | 99.00±1.61
 | | | | | NF | 98.00±2.58 | 99.33±1.41
 | | | | | S | 97.00±4.83 | 99.50±1.05
Method 8 | 40% | 102400 | 784 = (28×28) | 98.80±1.03 | ZO | 100.00±0.00 | 99.33±1.41
 | | | | | NF | 99.50±1.58 | 98.67±1.72
 | | | | | S | 95.00±5.27 | 100.00±0.00
Results for the five class problem with the EEG database (scenario 3)
Methodology | Accuracy (%) | Class | Sensitivity (%) | Specificity (%)
Tiling + PLS [5] | 79.40±7.00 | Z | 71.00±16.33 | 93.00±3.07
 | | O | 83.00±11.60 | 94.75±3.81
 | | N | 85.00±13.54 | 92.25±4.16
 | | F | 73.00±14.94 | 95.00±3.54
 | | S | 85.00±8.50 | 99.25±1.21
Method 4 | 91.00±1.94 | Z | 93.00±6.75 | 97.00±2.58
 | | O | 94.00±6.99 | 99.25±1.21
 | | N | 94.00±5.16 | 95.00±2.64
 | | F | 77.00±9.49 | 98.00±2.30
 | | S | 97.00±4.83 | 99.50±1.05
Method 8 | 94.40±3.75 | Z | 99.00±3.16 | 98.25±2.06
 | | O | 95.00±7.07 | 99.50±1.58
 | | N | 96.00±5.16 | 97.00±1.58
 | | F | 88.00±10.33 | 98.50±1.75
 | | S | 94.00±6.99 | 99.75±0.79
Discussion
Several tests were carried out to assess the behavior of the methodologies proposed in Algorithm 1 (selection of t–f based features using relevance measures and dimensionality reduction, 1D approach) and Algorithm 2 (frequency band selection from t–f representations using relevance measures and dimensionality reduction by a matrix-based approach, 2D approach). Two kinds of signals with different stochastic behaviors were tested: PCG signals, with a well-defined temporal structure and well-localized events; and EEG signals, whose stochastic structure is not fixed. The relevance measures clearly reflected the particular stochastic behavior of each kind of signal.
Figure 6 demonstrates that, for EEG signals, the information content is distributed along the time axis, whereas it is well localized in the case of PCG signals. The relevance analysis also demonstrates the presence of informative and non-informative frequency bands. The selectivity of each relevance measure is different and also depends on the specific signal, as shown in Figure 6.
In scenario 1, for PCG signals, the symmetrical uncertainty is the most selective relevance measure; the linear correlation provides some relevance peaks but is, in general, very dispersed. This is also reflected in a faster decrease of the performance for the linear correlation measure (Figure 7A) and a more sustained performance for the symmetrical uncertainty (Figure 7B). Since the values of the relevance measures are very low over a large span of the time–frequency plane of PCG signals, a large number of points can be interpreted as uninformative. Moreover, there is a zone of lower accuracy after a performance peak around 20 to 30% of the relevant features. Regarding the 2D methodology, the symmetrical uncertainty is the most stable measure, since its drop in performance is very small; nevertheless, the method based on the linear correlation shows a similar performance. The greater stability of the symmetrical uncertainty can be explained by the fact that it spans larger time–frequency areas, including high-frequency components. Therefore, according to the previous results using Algorithms 1 and 2, the best relevance measure is the symmetrical uncertainty, given its selectivity and effectiveness for feature selection, and the stability of the resulting accuracy.
In scenario 2, for EEG signals, the behavior of both relevance measures is quite similar. The relevance measure based on the symmetrical uncertainty is the most selective, which is reflected in a more sustained accuracy rate (Figure 7D) compared with the rapid decline of the performance shown in Figure 7C, where the linear correlation measure is considered. Regarding the 2D methods, both the linear correlation and the symmetrical uncertainty showed the highest relevance values for the lower frequency bands. Regarding the linear decomposition methods, the 1D–PLS and 2D–PLS methods show, in general, the best performance, with a margin of around 2 or 3 points of accuracy over PCA. However, PCA tends to stabilize the performance of the classifier with a lower number of components, in both the 1D and 2D versions. In any case, the performance of the 1D–PLS and 2D–PLS methods converges with a small number of components and remains stable.
For scenario 2, using the 1D methodology (methods 1 to 4), the EEG data needed a small number of features to achieve high performance rates. Likewise, the number of temporal components of the 2D methodology is very low, which means that the stochastic activity is easier to parameterize using the 2D approaches. On the other hand, PCG signals need more components in both the 1D and 2D approaches, since the local events and specific stochastic behaviors that these signals exhibit must be modeled.
Additionally, as shown in Tables 3 and 4, the proposed methodologies perform better (by about 3–4 points of accuracy) than recent approaches discussed in the literature [5, 14]. These results were expected, given the ability of both approaches to capture the most informative points or bands of the t–f planes, which additionally brings computational stability to the dimensionality reduction process. Algorithm 2 with no relevance criterion (η = 100%) provides a performance similar to that of the best approaches (methods 7 and 8); however, the feature selection stage based on relevance measures (linear correlation and symmetrical uncertainty) reduces the computational burden of the process, because the size of the matrices is reduced before the dimensionality reduction step.
Regarding scenario 3, the results obtained with methods 4 and 8 outperformed those of the algorithm in [5] (by up to 10 points of accuracy). Moreover, the band selection methodology described in Algorithm 2 (method 8) proved more suitable to discriminate among the different classes.
Summary of best performance rates for each database
Database | Methodology | t–f representation size | n_rel | % Reduc. 1 | n = (n_c × n_r) | % Reduc. 2 | Accuracy
PCG (Scenario 1) | Method 4 | 245760 (512×480) | 36864 | 85% | 21 | 94.30% | 98.20±1.99%
PCG (Scenario 1) | Method 8 | 245760 (512×480) | 24576 | 90% | 70 = (10×7) | 71.52% | 99.64±0.66%
EEG (Scenario 2) | Method 4 | 256000 (512×500) | 25600 | 90% | 13 | 99.99% | 98.20±1.99%
EEG (Scenario 2) | Method 8 | 256000 (512×500) | 102400 | 60% | 784 = (28×28) | 70% | 98.72±1.23%
EEG (Scenario 3) | Method 4 | 256000 (512×500) | 25600 | 90% | 13 | 99.99% | 91.00±1.94%
EEG (Scenario 3) | Method 8 | 256000 (512×500) | 102400 | 60% | 784 = (28×28) | 70% | 94.40±3.75%
The feature selection stage allows an effective selection of the most relevant features. According to the results shown in Table 6 for scenario 1, an accuracy of 99.64% was obtained with only 10% of the features extracted from the PCG signals; and for the EEG database, accuracies of 98.80 and 94.40% were obtained for scenarios 2 and 3, respectively, using 40% of the t–f features.
On the other hand, the methodology of Algorithm 1 needed a lower number of components, which translates into a lower feature space dimensionality, whereas Algorithm 2 handles larger matrices with almost the same performance. In the case of the 1D methodology (methods 1 to 4), for data matrices of size (F·T)×K it is necessary to compute a transformation matrix of size (F·T)×n, while the 2D methodology (methods 5 to 8) needs only two transformation matrices of sizes F×n_{ r } and T×n_{ c }, working with two data matrices of sizes T×(F·K) and F×(T·K).
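As a back-of-the-envelope illustration of this difference, using the PCG dimensions from the tables above (the component counts are taken from methods 4 and 8; this is an illustrative comparison, not from the paper):

```python
# Rough storage comparison of the two methodologies for the PCG setting
# (F = 512 frequency bins, T = 480 time frames).
F, T = 512, 480

n = 21                      # latent components used by the 1-D methodology
one_d = (F * T) * n         # one transformation matrix of size (F*T) x n

n_c, n_r = 10, 7            # time/frequency components, 2-D methodology
two_d = F * n_r + T * n_c   # two small matrices: F x n_r and T x n_c

print(one_d, two_d)         # prints 5160960 8384
```

That is, the 2D methodology stores orders of magnitude fewer transformation entries, even when it keeps more latent components overall.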
Conclusions
This research proposes a new and promising approach for feature selection over t–f based features that can be applied to nonstationary biosignal classification. The results obtained showed a high performance under different scenarios and demonstrated that the accuracy is stable for EEG and PCG signals, giving evidence of the generalization capabilities of the proposed methodology for different signals with diverse nonstationary behaviors. The results open the possibility to extrapolate the methodology to the study of other biosignals.
The method directly deals with the highly redundant and irrelevant data contained in the bidimensional t–f representations, combining a first stage of irrelevant data removal by variable selection using a relevance measure with a second stage of redundancy reduction by linear transformation methods. Under these premises, two methodologies have been derived: the first one aims to find the most relevant t–f points; the second one is devised to select the frequency bands with the highest relevance. Each methodology needs a particular linear decomposition approach: in the first case, PCA and PLS methods were used, whereas, in the second, a generalization of these methods to matrix data was used.
Although this work uses spectrograms, the proposed approaches can be applied to other kinds of real-valued t–f representations, such as time–frequency distributions, wavelet transforms, and matching pursuit, among others.
The relevance analysis was evaluated using two supervised measures: linear correlation and symmetrical uncertainty. Under the same premises, the application of these measures demonstrated a significant improvement over the case where no relevance measure was used. Moreover, the relevance measure based on the symmetrical uncertainty provided a better performance, allowing an effective selection of the most relevant variables and thus diminishing the computational burden of the linear decomposition methods and of the classifier. In addition, the relevance analysis serves as an interpretation tool, giving information about the t–f patterns most closely related to abnormalities and pathological behavior.
On the other hand, it was found that the use of a supervised method (such as PLS) clearly improved the performance of the classifier. Moreover, the performance of the 1D and 2D versions was very similar. Although the 1D methodology needs a lower number of components, which translates into a lower feature space dimensionality, the 2D methodology takes into account the dynamic information of each spectral component over the t–f planes, which was reflected in more stable results.
As a future study, the introduction of the relevance measure directly into the linear decomposition method should be evaluated, so that the relevance and redundancy analyses could be carried out in a single step, although probably at the expense of a larger computational burden and memory requirements. Additionally, the use of other linear or nonlinear decomposition techniques, such as linear discriminant analysis or locally linear embedding, should be evaluated. Moreover, the use of other relevance measures, such as mutual information, might also be considered, since it is an effective criterion for feature selection algorithms.
Declarations
Acknowledgements
This research was supported by the “Centro de Investigación e Innovación de Excelencia – ARTICA” and the “Programa Nacional de Formación de Investigadores GENERACIÓN DEL BICENTENARIO, 2011”, funded by COLCIENCIAS; by the project “Servicio de monitoreo remoto de actividad cardiaca para el tamizaje clínico en la red de telemedicina del departamento de Caldas”, funded by Universidad Nacional de Colombia and Universidad de Caldas; and through the project grant TEC2009-14123-C04-02, financed by the Spanish government.
References
 Sepulveda-Cano LM, Acosta-Medina CD, Castellanos-Dominguez G: Relevance analysis of stochastic biosignals for identification of pathologies. EURASIP J. Adv. Signal Process. 2011, 2011:10. doi:10.1186/1687-6180-2011-10
 Sejdic E, Djurovic I, Jiang J: Time–frequency feature representation using energy concentration: an overview of recent advances. Digital Signal Process. 2009, 19:153–183. doi:10.1016/j.dsp.2007.12.004
 Avendano-Valencia L, Godino-Llorente J, Blanco-Velasco M, Castellanos-Dominguez G: Feature extraction from parametric time–frequency representations for heart murmur detection. Ann. Biomed. Eng. 2010, 38(8):2716–2732. doi:10.1007/s10439-010-0077-4
 Tarvainen MP, Georgiadis S, Lipponen JA, Hakkarainen M, Karjalainen PA: Time-varying spectrum estimation of heart rate variability signals with Kalman smoother algorithm. 2009, 1–4.
 Tzallas A, Tsipouras M, Fotiadis D: Epileptic seizure detection in electroencephalograms using time–frequency analysis. IEEE Trans. Inf. Technol. Biomed. 2009, 13(5):703–710.
 Quiceno-Manrique AF, Godino-Llorente JI, Blanco-Velasco M, Castellanos-Dominguez G: Selection of dynamic features based on time–frequency representations for heart murmur detection from phonocardiographic signals. Ann. Biomed. Eng. 2010, 38:118–137. doi:10.1007/s10439-009-9838-3
 Debbal S, Bereksi-Reguig F: Time–frequency analysis of the first and the second heartbeat sounds. Appl. Math. Comput. 2007, 128(2):1041–1052.
 Jabbari S, Ghassemian H: Modeling of heart systolic murmurs based on multivariate matching pursuit for diagnosis of valvular disorders. Comput. Biol. Med. 2011, 41:802–811. doi:10.1016/j.compbiomed.2011.06.016
 Durka PJ, Matysiak A, Martínez-Montes E, Valdes-Sosa P, Blinowska KJ: Multichannel matching pursuit and EEG inverse solutions. J. Neurosci. Methods 2005, 148:49–59. doi:10.1016/j.jneumeth.2005.04.001
 Zandi AS, Javidan M, Dumont GA, Tafreshi R: Automated real-time epileptic seizure detection in scalp EEG recordings using an algorithm based on wavelet packet transform. IEEE Trans. Biomed. Eng. 2010, 57(7):1639–1651.
 Cvetkovic D, Übeyli ED, Cosic I: Wavelet transform feature extraction from human PPG, ECG, and EEG signal responses to ELF PEMF exposures: a pilot study. Digital Signal Process. 2008, 18(5):861–874. doi:10.1016/j.dsp.2007.05.009
 Gillespie B, Atlas L: Optimizing time–frequency kernels for classification. IEEE Trans. Signal Process. 2001, 49(3):485–496. doi:10.1109/78.905863
 Haufe S, Tomioka R, Dickhaus T, Sannelli C, Blankertz B, Nolte G, Müller KR: Large-scale EEG/MEG source localization with spatial flexibility. NeuroImage 2011, 54:851–859. doi:10.1016/j.neuroimage.2010.09.003
 Bernat E, Williams W, Gehring W: Decomposing ERP time–frequency energy using PCA. Clin. Neurophysiol. 2005, 116:1314–1334. doi:10.1016/j.clinph.2005.01.019
 Grall-Maes E, Beauseroy P: Mutual information-based feature extraction on the time–frequency plane. IEEE Trans. Signal Process. 2002, 50(4):779–790. doi:10.1109/78.992120
 Zhao Y, Zhang S: Generalized dimension-reduction framework for recent-biased time series analysis. IEEE Trans. Knowl. Data Eng. 2006, 18(2):231–244.
 Barker M, Rayens W: Partial least squares for discrimination. J. Chemom. 2003, 17(3):166–173. doi:10.1002/cem.785
 Yang J, Zhang D, Frangi A, Yang J: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26:131–137. doi:10.1109/TPAMI.2004.1261097
 Zhang D, Zhou ZH: (2D)²PCA: two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing 2005, 69(1–3):224–231.
 Yu L, Liu H: Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 2004, 5:1205–1224.
 Andrzejak R, Lehnertz K, Rieke C, Mormann F, David P, Elger C: Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. Phys. Rev. E 2001, 64:061907.
 Duda RO, Hart PE, Stork DG: Pattern Classification, 2nd edn. Wiley; 2001.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.