Discriminant nonstationary signal features’ clustering using hard and fuzzy cluster labeling
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 250 (2012)
Abstract
Current approaches to improving pattern recognition performance focus mainly on either extracting nonstationary and discriminant features of each class or employing complex, nonlinear feature classifiers. However, little attention has been paid to integrating these two approaches. By combining nonstationary feature analysis with complex feature classifiers, this article presents a novel direction for enhancing the discriminatory power of pattern recognition methods. The approach, which fuses nonstationary feature analysis with clustering techniques, proposes an algorithm that adaptively identifies the feature vectors according to their importance in representing the patterns of discrimination. Nonstationary feature vectors are extracted using a method based on time–frequency distribution and nonnegative matrix factorization. Clustering algorithms, including K-means and self-organizing tree maps, are used as unsupervised clustering methods followed by supervised labeling. Two labeling methods are introduced: hard and fuzzy labeling. The article covers in detail the formulation of the proposed discriminant feature clustering method. Experiments on pathological speech classification, T-wave alternans evaluation from the surface electrocardiogram, audio scene analysis, and telemonitoring of Parkinson’s disease produced desirable results. The outcome demonstrates the benefits of fusing nonstationary features with clustering methods for complex data analysis where existing approaches do not perform well.
1 Introduction
Advances in sensor technology have made it possible to gather huge amounts of data. On the one hand, this extends the applicability of signal analysis to a wide variety of fields, such as communications, security, biomedicine, biology, physics, finance, and geology. On the other hand, the volume of data demands advanced and automated pattern recognition techniques to process it effectively. In the context of pattern detection, the general purpose of any processing technique can be described as the analysis of a given dataset to make a decision based on the obtained information.
In a signal classification method, a feature extraction stage divides a signal into short-duration segments and maps the segments into features in an appropriate multidimensional space. Next, a classification scheme performs the actual task of classifying the signals based on the extracted features. In general, classification techniques can be divided into two groups: supervised learning and unsupervised learning. In supervised learning, the classification scheme usually relies on the availability of a set of signals that have already been classified or described. Learning can also be unsupervised, in the sense that the system is not given a prior labeling of patterns. Instead, it establishes the classes based on the statistical or structural regularities of the patterns.
Supervised learning approaches are developed on the assumption that the structures of signals from different classes are completely different. They find a discriminating pattern among signals by dividing the feature space into nonoverlapping subspaces, each representing a corresponding class. Although this approach may be satisfactory when the signals are separable in the feature space, it is too optimistic in applications where different classes overlap. This is a common issue in many real-world applications, especially in biomedical applications in which the aim is to distinguish abnormal signals from normal ones. In the majority of cases, the discriminative structure of an abnormal signal occurs over a short duration, so the signal is not abnormal in its entirety. Hence, feature vectors extracted from the normal portion of an abnormal signal will overlap with the features extracted from normal signals. In other words, natural similarities between different classes may result in some overlap in the feature space. For example, in pathological speech recognition, both normal and pathological signals are speech, and only a few high-frequency or transient components discriminate between the two classes. Therefore, the extracted features may not necessarily represent the discriminating structures in each class, causing an overlap in the feature space. In addition, nonstationarities in real-world signals cause variations in the signals’ properties, which may spread the obtained feature vectors over the feature space and make them overlap.
Because of this overlap, a supervised classifier may fail to identify a clear discrimination among the groups, which degrades the performance of the pattern recognition. Several directions have been taken in the literature to overcome the challenges of nonstationary and overlapping patterns, as briefly outlined in the following:

(i)
Employing complex classification algorithms: complex learning methods such as artificial neural networks (ANN) [1] have been developed in order to discriminate different classes in the presence of features’ overlapping.

(ii)
Applying feature selection methods: there have been previous attempts to select uncorrelated feature elements that are more closely related to the discriminative characteristics of each class in order to improve the classification accuracy. One of these approaches is the theory of rough sets, proposed by Pawlak [2, 3], a data analysis theory that accounts for overlaps between classes. In this theory, a rough membership function makes it possible to distinguish similar feature elements and measures the degree of overlap between a set of experimental values and a set of values representing a standard (e.g., a set of values typically associated with a biomedical abnormality). This approach has been applied in feature selection and extraction to reduce a large number of features and identify the representative ones [4]. It is worth mentioning that these feature selection approaches differ from the subject of our study: the former select the uncorrelated feature elements within a feature vector to increase the accuracy rate, while the latter keeps all the feature elements and identifies the clusters of feature vectors that are unrelated to the discrimination between classes.

(iii)
Extracting the discriminant features: some attempts have been made in the literature to obtain the discriminative features of the signals, namely local discriminant bases analysis [5] and time-width versus frequency band energy mapping [6]. While these analyses are active areas of research, the optimal choice of discriminant features depends strongly on the nature of the dataset and the dissimilarity measures used to distinguish between classes. Furthermore, these analyses can only be used with decomposition-based time–frequency (TF) analysis such as wavelet or matching pursuit, and are therefore restricted to those TF analysis approaches.
In an unsupervised classification method, a clustering method (e.g., Gaussian mixture model or K-means clustering) is used to obtain clusters of features for each class. This training stage is performed sequentially for each class; there is no interaction between feature vectors of different classes. In the test stage, the unknown-class data are tested with respect to the discriminant clusters of each class. The predicted class is the one associated with the clusters with the maximum probability. Unsupervised classification is a natural way to proceed towards automatic pattern recognition systems for real-world applications with overlapping features, as it allows for overlapping features and for clusters that share a common structure among different classes.
As our goal is to enhance the discriminatory power of nonstationary feature extraction, in this study we focus on developing a new scheme for a combined unsupervised and supervised classification approach. This framework, which we call ‘discriminant cluster selection’, aims to improve the classification accuracy in decision-making systems by providing an alternative solution to the feature overlapping problem mentioned above. We also demonstrate the fusion of nonstationary feature analysis with the proposed unsupervised classification methods to cluster the nonoverlapping feature vectors as the discriminative pattern.
In this study, we employ and refine existing clustering approaches to develop a classification technique that improves the classification accuracy rate. We adopt the notion of unsupervised clustering; however, unlike commonly used unsupervised clustering methods, we propose to perform the clustering stage on all the training feature vectors obtained from the different classes and to train one set of clusters for the entire set of training features. Next, we use the distribution of feature vectors in these clusters and their class labels to compute the presence of the discriminative pattern in each class. Two types of clusters are identified: discriminant clusters, which mainly consist of feature vectors from one specific class, and common clusters, which are a mixture of features from different classes. We propose that discriminant clusters identify the representative structures in each class, and common clusters represent the similarities between classes. The proposed scheme differs from feature selection techniques, which attempt to select the optimal feature elements in a feature vector to improve the classification performance. Our proposed work feeds all the elements of the feature vectors to the clustering stage, and then decides which feature clusters represent the discriminative structure between the classes. Feature selection techniques and the proposed method can be applied together to increase the classification accuracy, and a combination of the two can be investigated in a future study. Our proposed framework is expected to significantly improve the classification accuracy rate of signals. It will also improve our insight into the discrimination pattern in each class, which may be reconstructed or located using the feature vectors in the discriminant clusters.
The structure of this article is as follows: Section 2 explains the discriminant feature clustering methodology. Section 3 explains K-means clustering and the self-organizing tree map (SOTM) as the two unsupervised clustering techniques employed in this study. Two supervised cluster labeling techniques (hard and fuzzy labeling) are explained in Section 4. Section 5 explains the nonstationary signal features. Section 6 presents the application of the developed technique to three synthetic examples, and then investigates the proposed strategy for pathological speech detection, sudden cardiac death risk stratification, audio scene classification, and telemonitoring of Parkinson’s disease (PD), together with the results. The conclusion is provided in Section 7.
2 Methodology
Our goal is to identify the nonstationary feature clusters that represent the discriminative characteristics of each group. To proceed towards such a feature clustering approach, we need a nonstationary feature extraction and clustering technique that detects the discriminant features. Figure 1 shows an example that explains the proposed methodology and its advantage in determining such key clusters.
In this example, one normal and one abnormal signal are generated using the following equation:
where $g\left(\sigma ,\mu \right)$ is a Gaussian with mean μ and variance σ^{2}. The mean of this Gaussian function locates a component in time, and the variance specifies the duration of each component. The sine function localizes the component in the frequency domain. The normal signal is constructed to consist of seven frequency-modulated components.
To construct an abnormal synthetic signal, three of the components are transformed into transients. In many real-world applications such as biomedicine, transients are known to be the discriminative structures of abnormal signals, and are used in this example as one of the abnormality descriptors. Figure 1a–d displays the generated normal and abnormal signals in the time and TF domains. In this example, a spectrogram with an FFT size of 1024 points and a Kaiser window with a parameter of 5, a length of 256 samples, and an overlap of 220 samples was used to construct the TF representation of each signal. The TF domain provides the TF distribution (TFD), which is a three-dimensional TF representation with two dimensions representing the time and frequency domains, respectively. The third dimension (i.e., the intensity of the distribution) indicates the energy distribution of the signal at the corresponding time and frequency. While the time representation does not provide much information about the difference between these two synthetic signals, the TFD provides a better visual display of the discriminant structure, as indicated by the dashed circles. If the right quantification and classification algorithms are used, the TF representation may successfully be employed for automatic pattern recognition applications.
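The spectrogram settings quoted above determine the size of the TF matrix that later stages operate on. The following minimal sketch works through that arithmetic. The signal length of 4096 samples is a hypothetical value (the article does not state it), and the sketch assumes a one-sided spectrum with no padding at the signal edges.

```python
def spectrogram_shape(n_samples, nfft=1024, win_len=256, overlap=220):
    """Return (frequency_bins, time_frames) of a one-sided spectrogram.

    Assumes only complete windows are used (no edge padding).
    """
    hop = win_len - overlap                      # 256 - 220 = 36 samples per step
    bins = nfft // 2 + 1                         # one-sided spectrum of a real signal
    if n_samples < win_len:
        return bins, 0                           # signal too short for one window
    frames = 1 + (n_samples - win_len) // hop    # number of complete windows
    return bins, frames


# Hypothetical 4096-sample signal (length not given in the article).
print(spectrogram_shape(4096))
```

With a hop of only 36 samples the frames overlap heavily, which yields a smooth time axis at the cost of a larger TF matrix.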
Six joint TF feature vectors [7] are extracted from each signal while each vector consists of three features: S_{ h }, i.e., sparsity of the signal in time domain; S_{ w }, i.e., sparsity of the signal in frequency domain; and D_{ w }, i.e., abrupt changes in frequency domain. The applied TF feature extraction method is fully explained in Section 5. The extracted TF feature vectors are shown in Figure 1e. As can be seen in the feature space of Figure 1e, considering the relative location of the features in this feature space, two types of clusters can be detected: an overlapped cluster containing the frequencymodulated components which are common between two signals, and an abnormality cluster which consists of features corresponding to the transients in the abnormal signal. Our proposed feature classification method is successful if it can separate the abnormal cluster from the normal one (i.e., in this example, the transient and normal feature groups, respectively), and use the abnormality cluster for detection of any abnormality behavior in a test signal. The overlapped clusters do not play any role in any discrimination between the two classes. Therefore, once any feature vector is assigned to an overlapped cluster, it will be excluded from the classification of its corresponding signal, and will have no effect in labeling the signal as abnormal.
Figure 2 displays the schematic of our proposed discriminant feature clustering method. As can be seen in the block diagram of Figure 2, once the TF features are extracted, a discriminant feature clustering system is introduced to discriminate the abnormality clusters in the feature space. This system consists of two stages: unsupervised clustering and supervised cluster labeling. In the first stage, unsupervised learning is performed on all the features (i.e., both normal and abnormal) to detect all the possible feature clusters ($\left\{\overrightarrow{C}\right\}$) in the feature space. Applying this stage to the synthetic example of Figure 1 should result in two types of clusters, as indicated in Figure 1f.
In the second stage, each cluster is labeled ({α}) based on the feature arrangement in the feature domain, which determines whether the cluster consists of discriminant features or common features. The clusters whose features come mostly from abnormal signals are labeled as the discriminant structure corresponding to the abnormality pattern. The outcome of this stage in Figure 1 indicates the left cluster in Figure 1f as the abnormality cluster, since all the features it contains belong to the abnormal signal. Similarly, the right-hand cluster is labeled as the common cluster because it consists of a fairly equal number of features from the normal and abnormal signals.
Once the abnormal and normal clusters are labeled, the trained clusters along with their labels ({α}) are passed to the classification stage. In the test stage, each of the test feature vectors is assigned to one of the cluster centers based on the minimum Euclidean distance (ED) measure. Next, feature vectors that belong to the overlapped clusters are excluded, and finally, based on the membership of the remaining test feature vectors, the class label of the corresponding signal is determined. Two methods are proposed to define the class label of each signal: hard labeling, which is based on a majority vote, and fuzzy labeling, which is based on a majority vote weighted by the membership distribution of each cluster. These stages are described in the following sections.
3 Clustering methods
One of the most popular clustering algorithms is the K-means clustering algorithm. Another popular clustering algorithm is SOTM, which does not require any information about the number of clusters in the feature domain. This section explains the unsupervised clustering methods, and the supervised cluster labeling is explained in the next section.
3.1 Kmeans clustering
K-means clustering is one of the simplest and most popular unsupervised clustering algorithms. The algorithm is computationally efficient and performs well on a dataset that consists of compact and well-separated clusters [8]. Given a set of feature vectors, ${\left\{{\overrightarrow{f}}_{z}\right\}}_{z=1,\dots ,Z}$, the following phases are performed in the algorithm to identify K feature clusters [9]:

1.
The method starts with K initial random centroids, ${\left\{{\overrightarrow{C}}_{u}\right\}}_{u=1,\dots ,K}$.

2.
It classifies the feature samples into the nearest centroid according to the squared ED. To do so, it first calculates the squared ED of any given sample to all the centroids as given in the following equation:
$${e}_{z,u}^{2}={\left\Vert {\overrightarrow{f}}_{z}-{\overrightarrow{C}}_{u}\right\Vert }^{2},\phantom{\rule{1em}{0ex}}u=1,\dots ,K$$(2)
Then, the algorithm assigns the sample to the centroid with the minimum squared ED.

3.
The mean of the points in each cluster is computed as the new cluster centroids:
$${\overrightarrow{C}}_{u}=\frac{1}{{Z}_{u}}\sum _{z=1}^{{Z}_{u}}{\overrightarrow{f}}_{z}^{u}$$(3)
where Z_{ u } is the number of feature samples assigned to cluster u, and ${\left\{{\overrightarrow{f}}_{z}^{u}\right\}}_{z=1,\dots ,{Z}_{u}}$ are the samples assigned to cluster u.

4.
The algorithm iteratively repeats Steps 2 and 3 until the new cluster centers are the same as, or close enough to, the centroids of the previous iteration.
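The four phases above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the implementation used in the study: drawing the initial centroids from the data, the tolerance `tol`, the iteration cap `max_iter`, and keeping the old centroid when a cluster becomes empty are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: K initial centroids, here drawn at random from the data (assumption).
    centroids = [list(f) for f in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(max_iter):
        # Step 2: assign every sample to the centroid with minimum squared ED.
        assign = [min(range(k), key=lambda u: sq_dist(f, centroids[u]))
                  for f in features]
        # Step 3: recompute each centroid as the mean of its assigned samples.
        new_centroids = []
        for u in range(k):
            members = [f for f, a in zip(features, assign) if a == u]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[u])  # empty cluster: keep old centroid
        # Step 4: stop once no centroid moved by more than the tolerance.
        shift = max(sq_dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, assign


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign = kmeans(data, k=2)
print(assign)
```

In the proposed framework the input would be the pooled training feature vectors of all classes, so that a single set of centroids is trained for the whole feature space.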
3.2 SOTM
SOTM is a type of ANN that was first introduced in [10]. The algorithm maps the data from a high-dimensional Euclidean feature space onto a finite set of prototypes. Like most clustering algorithms, it attempts to organize unlabeled feature vectors into clusters in such a way that all the samples within a cluster are more similar to each other than to those of other clusters. Each cluster is then represented by one or more prototypes. Unlike classic clustering methods (such as K-means), where the number of clusters must be known beforehand, in SOTM the number of clusters is determined by the algorithm based on parameters that define the desired resolution of the clustering. The steps involved in the SOTM algorithm are briefly explained below:

1.
The weight vectors are initialized randomly ${\left\{{\overrightarrow{C}}_{u}\left(t\right)\right\}}_{u=1,\dots ,K}$, where K is the number of clusters. The random value is usually a vector from the training set.

2.
For a new input vector, the distance between the input vector and each of the existing nodes, d_{ u }, is calculated as
$${d}_{u}\left(\overrightarrow{f},{\overrightarrow{C}}_{u}\left(t\right)\right)={\left\{\sum _{z=1}^{Z}{\left[{\overrightarrow{f}}_{z}-{\overrightarrow{C}}_{u}\left(t\right)\right]}^{2}\right\}}^{1/2},\phantom{\rule{1em}{0ex}}u=1,\dots ,K$$(4)
where ${\overrightarrow{C}}_{u}\left(t\right)$ is the node of the cluster u at time t.

3.
Select the node with the minimum distance d_{ u } as the winning node, u^{∗}:
$${d}_{{u}^{\ast }}\left(\overrightarrow{f},{\overrightarrow{C}}_{u}\left(t\right)\right)=\underset{u}{min}\phantom{\rule{0.3em}{0ex}}{d}_{u}\left(\overrightarrow{f},{\overrightarrow{C}}_{u}\left(t\right)\right)$$(5)
4.
The minimum distance, ${d}_{{u}^{\ast }}\left(\overrightarrow{f},{\overrightarrow{C}}_{u}\left(t\right)\right)$, is then compared with H(t), the hierarchical control function, which decreases over time. If the input vector is within the threshold H(t) of the winning node, the weight vector is updated based on the following update rule:
$${\overrightarrow{C}}_{u}\left(t+1\right)={\overrightarrow{C}}_{u}\left(t\right)+\lambda \left(t\right)\left[\overrightarrow{f}-{\overrightarrow{C}}_{u}\left(t\right)\right]$$(6)
where λ(t) is the learning rate, which decreases with time. When the input vector is farther from the winning node than the threshold, a new subnode is generated from the winning node at $\overrightarrow{f}$.

5.
Check the terminating conditions. The algorithm stops if any of the following conditions is fulfilled:

Maximum number of iterations is reached.

Maximum number of clusters is reached.

No significant change occurs in the structure of the tree.


6.
Otherwise the algorithm is repeated from Step 2.
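The steps above can be sketched as follows. This is a simplified illustration, not the reference SOTM implementation: it starts from a single root node taken from the training set, presents samples in random order, uses exponential decays for both H(t) and λ(t), and omits the "no significant change" stopping test. The parameter names `h0`, `tau_h`, `lr0`, and `tau_lr` are hypothetical.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sotm(features, h0, tau_h, lr0=0.5, tau_lr=50.0,
         max_nodes=16, max_iter=1000, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the root node with a vector from the training set.
    nodes = [list(rng.choice(features))]
    parent = [None]                      # parent[i] is the index of node i's parent
    for t in range(max_iter):            # Step 5: stop at the iteration cap
        f = rng.choice(features)
        # Steps 2 and 3: distances to all nodes, winner has the minimum distance.
        dists = [euclid(f, c) for c in nodes]
        win = min(range(len(nodes)), key=dists.__getitem__)
        h_t = h0 * math.exp(-t / tau_h)      # assumed exponential decay of H(t)
        lr_t = lr0 * math.exp(-t / tau_lr)   # assumed exponential decay of the rate
        if dists[win] <= h_t:
            # Step 4: inside the threshold, move the winner towards the input.
            nodes[win] = [c + lr_t * (x - c) for c, x in zip(nodes[win], f)]
        else:
            # Step 4: outside the threshold, spawn a child of the winner at f.
            nodes.append(list(f))
            parent.append(win)
            if len(nodes) >= max_nodes:  # Step 5: stop at the cluster cap
                break
    return nodes, parent


data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
nodes, parent = sotm(data, h0=3.0, tau_h=100.0, max_iter=300)
print(len(nodes), parent)
```

Because the threshold starts large and shrinks, the two well-separated groups are split first and finer subdivisions appear only later, which is the coarse-to-fine behavior described in the text.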
The hierarchical control function acts as an ellipsoid of significant similarity. H(t) can be assumed as a global vigilance threshold that is used for measuring the proximity of a new input sample to the nearest existing node in the network. Samples that fall outside the scope of the nearest existing node result in generation of a new node as child of the winning node. By initializing H(t) to start from a large value, the clusters discovered at the early stages of the clustering will be far from each other. Decay of H(t) over time results in partitioning the feature space in low resolution at the early stages of the clustering, while favoring partitioning at higher resolutions later. There are two standard hierarchical control functions proposed for the original SOTM algorithm: linear and exponential decays.
where τH is a time constant, which is bound to the projected size of the input feature F, H(0) is the initial value, t is the number of iterations (or sample presentation), and ζ is the number of iterations over which the linear version of H(t) would decay to the same level as the exponential version. One benefit of initializing H(t) to a large value, possibly larger than the maximum variation within the data, is that all levels of resolution across the data can be explored.
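One pair of decay laws consistent with these definitions is an exponential decay, H(t) = H(0)·exp(−t/τH), and a linear decay whose slope is chosen so that both versions reach the same level after ζ iterations. The sketch below implements that pair. The exact expressions are an assumption inferred from the description of τH, H(0), and ζ, and clamping the linear version at zero is an added safeguard.

```python
import math


def h_exponential(t, h0, tau_h):
    """Assumed exponential control function: H(t) = H(0) * exp(-t / tau_h)."""
    return h0 * math.exp(-t / tau_h)


def h_linear(t, h0, tau_h, zeta):
    """Assumed linear control function that meets the exponential one at t = zeta."""
    slope = (1.0 - math.exp(-zeta / tau_h)) * h0 / zeta
    return max(h0 - slope * t, 0.0)      # clamp at zero (assumption)


# Hypothetical settings: H(0) = 10, tau_h = 50 iterations, zeta = 200 iterations.
for t in (0, 100, 200):
    print(t, h_exponential(t, 10.0, 50.0), h_linear(t, 10.0, 50.0, 200))
```

The exponential version lowers the threshold quickly at first, so new nodes appear early, whereas the linear version spends more iterations at coarse resolution before reaching the same final level.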
The learning rate in Equation (6), λ(t), is an important factor in organizing the network. λ(t) can operate in a number of different global or local modes. In global modes, a single learning rate is applied to all nodes, whereas in local modes an individual rate operates for each node or for a set of nodes. A few modalities have been proposed for the operation of the learning rate, and the details are discussed in [11, 12].
4 Cluster labeling
Assignment of the right label to each cluster is one of the critical concerns in our proposed discriminant cluster selection system. We propose two methods to label the obtained clusters and obtain the class label of the signals as explained in the following subsections.
4.1 Method 1: hard labeling
In an E-class classification problem, this method decides whether each cluster represents class 1, 2, …, or E.

First, the clusters are identified, say K clusters $\left\{{\overrightarrow{C}}_{1},{\overrightarrow{C}}_{2},\dots ,{\overrightarrow{C}}_{K}\right\}$. K≥E is the number of clusters and is not necessarily equal to the number of classes (i.e., E). The number depends on the application and the employed clustering method.

Next, we count the feature vectors of each class that are assigned to each cluster and denote these counts as NUM_{1}(u), NUM_{2}(u),…, NUM_{ E }(u), representing the number of class 1 to E feature vectors in the u th cluster, respectively.

Then, clusters with a fairly equal mix of feature vectors from different classes are identified as overlapped clusters and labeled as common clusters (i.e., K_{ c } clusters). The remaining clusters (i.e., K_{ d } clusters) are discriminant clusters and are labeled based on the membership distribution of their feature vectors. The class with the majority membership defines the label of each discriminant cluster. To quantify the significance of the overlap between different classes, clusters with more than 30% overlap are assigned to the common clusters, and the remaining clusters are identified as the discriminant clusters. The calculation proceeds as shown in the following equation:
$${\alpha }_{u}=\begin{cases}0, & u\in \left\{{K}_{c}\right\}\\ \mathrm{arg}\underset{e=1,\dots ,E}{\text{Max}}\left\{{\text{NUM}}_{e}\left(u\right)\right\}, & u\in \left\{{K}_{d}\right\}\end{cases}$$(8)
where α_{ u } is the label defined for the u th cluster, and α_{ u }=0 represents a common cluster.

Once the training stage is completed, the estimated clusters and the calculated labels denoted with $\left\{{\alpha}_{1},{\alpha}_{2},\dots ,{\alpha}_{K}\right\}$ are passed to the test stage.
In the testing stage, first a signal is decomposed to r feature vectors. Next, each feature vector is classified based on which cluster it belongs to. Finally, based on the label of the r feature vectors, we decide on the class label of the signal. To perform this calculation, for any new feature vector ${\overrightarrow{f}}_{\text{test}}$, the following procedure is performed:

${\text{Cluster}}_{{\overrightarrow{f}}_{\text{test}}}$, the cluster to which each test feature belongs, is found as the nearest cluster based on the ED criterion:
$${\text{Cluster}}_{{\overrightarrow{f}}_{\text{test}}}=\mathrm{arg}\underset{u=1,\dots ,K}{\text{Min}}\left\{\left\Vert {\overrightarrow{f}}_{\text{test}}-{\overrightarrow{C}}_{u}\right\Vert \right\}={u}_{f}$$(9)
where ${\overrightarrow{C}}_{u}$ is the center of each cluster constructed in the training stage.

The label of the above cluster is assigned to each test feature, and is used to determine the class ${\overrightarrow{f}}_{\mathrm{test}}$ belongs to:
$${\overrightarrow{f}}_{\text{test}}\in \text{Class e}\phantom{\rule{1em}{0ex}}\mathrm{if}\phantom{\rule{1em}{0ex}}{\alpha}_{{u}_{f}}=e,$$(10) 
Once all the feature vectors in a test signal are labeled, the feature vectors that are assigned to common clusters are excluded and the labels of the remaining feature vectors are used to classify the signal. A test signal is classified as a class e signal if the majority of its test feature vectors (i.e., excluding the feature vectors assigned to common clusters) belong to class e.
We call this procedure ‘hard labeling’ as each cluster is distinguished with one label.
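The training and test procedures of hard labeling can be sketched as below. The function names are hypothetical, class labels are assumed to be the integers 1,…,E so that 0 can mark a common cluster, and "overlap" is interpreted here as the fraction of a cluster's feature vectors that do not belong to its majority class. The article does not spell out that definition, so it is an assumption.

```python
from collections import Counter


def label_clusters(assignments, class_labels, n_clusters, overlap_threshold=0.30):
    """Return alpha[u]: 0 for a common cluster, else the majority class (Eq. 8)."""
    counts = [Counter() for _ in range(n_clusters)]   # counts[u][e] = NUM_e(u)
    for u, e in zip(assignments, class_labels):
        counts[u][e] += 1
    alpha = []
    for num in counts:
        total = sum(num.values())
        if total == 0:
            alpha.append(0)              # empty cluster treated as common (assumption)
            continue
        majority_class, majority_count = num.most_common(1)[0]
        overlap = 1.0 - majority_count / total        # assumed overlap measure
        alpha.append(0 if overlap > overlap_threshold else majority_class)
    return alpha


def classify_signal_hard(test_features, centroids, alpha):
    """Majority vote over the feature vectors that fall in discriminant clusters."""
    votes = Counter()
    for f in test_features:
        # Eq. 9: nearest cluster centre (squared ED gives the same argmin).
        u_f = min(range(len(centroids)),
                  key=lambda u: sum((x - c) ** 2 for x, c in zip(f, centroids[u])))
        if alpha[u_f] != 0:              # vectors in common clusters are excluded
            votes[alpha[u_f]] += 1       # Eq. 10: the vector inherits the cluster label
    return votes.most_common(1)[0][0] if votes else None


# Toy training set: cluster 0 holds only class-2 vectors, cluster 1 is an even mix.
alpha = label_clusters([0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                       [2, 2, 2, 2, 1, 1, 1, 2, 2, 2], n_clusters=2)
print(alpha)
print(classify_signal_hard([(0.1, 0.0), (5.0, 5.0), (4.9, 5.2)],
                           [(0.0, 0.0), (5.0, 5.0)], alpha))
```

Returning `None` when every test vector lands in a common cluster makes the undecided case explicit. How such signals should be handled is left open here.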
4.2 Method 2: fuzzy labeling
After all the feature vectors are clustered, clusters with large overlap (i.e., containing more than 30% overlapping feature vectors) are designated as common clusters (i.e., K_{ c }), and the remaining clusters (i.e., K_{ d }), which are the discriminant clusters, are used in the training stage as follows:
The proposed fuzzy cluster labeling calculates the labels as a membership matrix ${M}_{{K}_{d}\times E}$, where each entry in this membership matrix, m_{ ue } (which is called a membership coefficient), indicates the probability that a vector in cluster u belongs to class e.
where E is the number of classes and K_{ d }is the number of discriminant clusters.
The membership coefficients are calculated based on the distribution of each class in different clusters as shown in the following equation:
where NUM_{ e }(u) is the number of features belonging to class e that exist in cluster u, and m_{ u } is the total number of features in the u th cluster. These coefficients are used in the calculation of the membership degree for each of the test vectors.
In the test stage, a signal is first decomposed into r feature vectors. Each of the representing feature vectors is assigned to one of the cluster centers found in the previous stage based on the minimum ED criterion (Equation 9). The feature vectors located in a common cluster are excluded. Next, we simply count the number of feature vectors that are located in each discriminant cluster and record the numbers as a scatter vector S. The scatter vector is defined for the remaining feature vectors as follows:
where s_{ u }is the number of the representing vectors for a test signal that fall within the u th cluster and K_{ d } is the number of discriminant clusters.
Finally, the probability of a signal belonging to a class is calculated according to the distribution of its representing feature vectors in different clusters and can be described as
where M(:,e) is the e th vector of the membership matrix, M, and the signal is labeled to belong to the class associated with the maximum value of Φ(e).
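The fuzzy labeling stage can be sketched as below. The sketch follows the verbal definitions: m_ue is taken as NUM_e(u)/m_u, S collects the counts s_u over the discriminant clusters, and Φ(e) is taken as the product of S with the e th column of M. Normalizing Φ by the number of retained vectors is an assumption that does not change which class attains the maximum, and the function names are hypothetical.

```python
def membership_matrix(assignments, class_labels, discriminant_clusters, classes):
    """M[i][j] = NUM_e(u) / m_u for discriminant cluster u and class e."""
    M = []
    for u in discriminant_clusters:
        # Discriminant clusters are non-empty by construction.
        members = [e for a, e in zip(assignments, class_labels) if a == u]
        M.append([members.count(e) / len(members) for e in classes])
    return M


def classify_signal_fuzzy(cluster_ids, discriminant_clusters, M, classes):
    """Return (predicted class, Phi) for one test signal.

    cluster_ids holds the nearest-cluster index of each test feature vector.
    """
    # Scatter vector S: vectors in common clusters are simply never counted.
    S = [sum(1 for c in cluster_ids if c == u) for u in discriminant_clusters]
    total = sum(S)
    if total == 0:
        return None, []                  # every vector fell in a common cluster
    # Phi(e) = S . M(:, e), normalised by the retained count (assumption).
    phi = [sum(s * row[j] for s, row in zip(S, M)) / total
           for j in range(len(classes))]
    best = max(range(len(classes)), key=phi.__getitem__)
    return classes[best], phi


# Toy example: clusters 0 and 1 are discriminant, cluster 2 is common.
train_clusters = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
train_classes = [1, 1, 1, 2, 2, 2, 2, 2, 1, 2]
M = membership_matrix(train_clusters, train_classes, [0, 1], [1, 2])
print(M)
print(classify_signal_fuzzy([0, 0, 0, 1, 2], [0, 1], M, [1, 2]))
```

Compared with hard labeling, a cluster that is 75% class 1 contributes only 0.75 of a vote per vector to class 1, so mildly impure clusters influence the decision less than pure ones.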
Although hard and fuzzy labeling have the advantage of identifying the representative clusters of each class and discriminating them from the common clusters, the method requires that each class contribute the same number of feature vectors. Since the identification of representative clusters is based on comparing the membership of each class in the clusters, the number of normal and abnormal feature vectors should be the same in order to perform a fair comparison. When the classes are unbalanced, the proposed solution is to reduce the sample size of all the classes to that of the smallest class.
5 Nonstationary signal feature extraction
Figure 3 depicts the schematic of the feature extraction technique along with the proposed clustering method.
This approach captures the TF features by applying the nonnegative matrix factorization (NMF) [13] to the TFD of each signal. Spectrogram can be used as a simple TF representation. Seven features are extracted from the decomposed vectors including: ${\text{MO}}_{\overrightarrow{h}}^{\left(1\right)},{\text{MO}}_{\overrightarrow{w}}^{\left(1\right)},{\text{S}}_{\overrightarrow{h}},{\text{S}}_{\overrightarrow{w}},{\text{D}}_{\overrightarrow{h}},{\text{D}}_{\overrightarrow{w}}$, and ${\text{SH}}_{\overrightarrow{w}}$.
5.1 NMF
NMF was first introduced in the mid-1990s under the name positive matrix factorization (PMF) [14]. In 1999, Lee and Seung [13] introduced some simple algorithms for the factorization, and demonstrated the success of the technique on some classification applications. NMF decomposes a nonnegative matrix (V_{M×N}) as follows:
where r is the order of decomposition, and W and H are nonnegative matrices, which are called base and coefficient matrices, respectively. NMF algorithm starts with an initial estimate for W and H, and performs an iterative optimization to minimize a given cost function. Lee and Seung [15] introduce two updating algorithms using the least squares error and the Kullback–Leibler (KL) divergence as the cost functions:
In these equations, $\langle A.B\rangle$ and $\frac{\langle A\rangle }{\langle B\rangle }$ are term-by-term multiplication and division of two matrices, and 1 is a matrix of ones. The KL divergence formulation is not a bound-constrained problem, since such a problem requires the objective function to be well defined at every point of the bounded region [16]: the log function in the KL divergence formula is not well defined if any element of the matrix V or WH is zero. Hence, we do not consider the KL divergence formulation in this study. The least squares error approach is a standard bound-constrained optimization problem. Various minimization strategies have been proposed for the least squares error formulation. In this study, we use the projected gradient bound-constrained optimization method proposed by Lin [16].
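To make the least squares factorization concrete, the sketch below uses the multiplicative update rules of Lee and Seung, H ← H·(WᵀV)/(WᵀWH) and W ← W·(VHᵀ)/(WHHᵀ), with term-by-term multiplication and division. This is not the projected gradient method of Lin [16] used in the study. The multiplicative rule is swapped in only because it fits in a few lines. The small constant `eps` that guards against division by zero and the random initialization are assumptions.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH (the least squares cost)."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, r, n_iter=200, eps=1e-12, seed=0):
    """Factor the M x N matrix V into W (M x r) and H (r x N), all nonnegative."""
    rng = random.Random(seed)
    M, N = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(M)]
    H = [[rng.random() + 0.1 for _ in range(N)] for _ in range(r)]
    for _ in range(n_iter):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Toy nonnegative matrix standing in for a TFD (rows: frequency, columns: time).
V = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 0.0, 1.0, 0.0]]
W, H = nmf(V, r=2)
print(frob_error(V, W, H))
```

Each column of W then plays the role of a base (spectral) vector and each row of H the role of a coefficient (temporal) vector, from which the features of the next subsection are computed.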
5.2 Features
As shown in Figure 3, features are extracted from the decomposed W and H matrices. The obtained features are explained as follows:
5.2.1 Joint TF moments
Moments of the base and coefficient vectors (i.e., W and H, respectively) carry important information about the TF characteristics of a signal and can be used for classification of time-varying signals [17] and feature identification [18]. We denote the i-th temporal and spectral moments by ${\text{MO}}_{\overrightarrow{h}}^{\left(i\right)}$ and ${\text{MO}}_{\overrightarrow{w}}^{\left(i\right)}$, respectively, and compute them using the following equations:
where ${\mu}_{{\overrightarrow{h}}_{j}}$ and ${\mu}_{{\overrightarrow{w}}_{j}}$ are the first moments of the j-th coefficient and base vectors, computed as follows: ${\mu}_{{\overrightarrow{h}}_{j}}=\sum _{n=0}^{N}n{\overrightarrow{h}}_{j}^{T}\left(n\right)$ and ${\mu}_{{\overrightarrow{w}}_{j}}=\sum _{m=0}^{M}m{\overrightarrow{w}}_{j}\left(m\right)$.
5.2.2 Sparsity
${\text{S}}_{{\overrightarrow{h}}_{j}}$ and ${\text{S}}_{{\overrightarrow{w}}_{j}}$ are the sparsity of the coefficient and base vectors, respectively. These features help to distinguish between transient and continuous components. Several sparseness measures have been proposed and used in the literature. We use a sparsity function as follows:
The sparsity feature is zero if and only if a vector contains a single nonzero component (i.e., maximum sparsity), and is negative infinity if and only if all the components are equal (i.e., minimum sparsity).
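The two limiting cases stated above can be checked numerically. The sketch below assumes that the sparsity function is the logarithm of the widely used ratio of the l1 and l2 norms normalized by the vector length (Hoyer's sparseness); this form is an assumption chosen because it reproduces exactly the stated behavior (zero for a single nonzero component, negative infinity for equal components). The function name `log_sparsity` is hypothetical.

```python
import math

def log_sparsity(x):
    """Log10 of the normalized l1/l2 sparseness of a vector (assumed form).

    Returns 0.0 when x has a single nonzero component and
    -inf when all components of x are equal.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two components")
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if l2 == 0.0:
        raise ValueError("sparsity is undefined for an all-zero vector")
    sparseness = (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)
    # Guard against rounding that pushes the value slightly outside [0, 1]
    sparseness = min(max(sparseness, 0.0), 1.0)
    return math.log10(sparseness) if sparseness > 0.0 else float("-inf")
```

Under this assumption, a transient component concentrated in a few samples yields a value close to zero, whereas a flat, continuous component yields a large negative value.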
5.2.3 Discontinuity
${\text{D}}_{\overrightarrow{h}}$ and ${\text{D}}_{\overrightarrow{w}}$ represent the discontinuities and abrupt changes in each vector, respectively. These features are calculated as follows:
where ${\overrightarrow{h}}_{j}^{\prime}$ and ${\overrightarrow{w}}_{j}^{\prime}$ are the derivatives of coefficient and base vectors as defined in the following equations:
and
${\text{D}}_{\overrightarrow{h}}$ and ${\text{D}}_{\overrightarrow{w}}$ capture the discontinuities and abrupt changes in coefficient and base vectors, respectively. A vector with a smaller value of discontinuity feature is smoother compared to a vector with a larger discontinuity feature.
5.2.4 Sharpness
${\text{SH}}_{\overrightarrow{w}}$ measures the spread of the components in low frequencies. In addition, we need another feature to provide information on the energy distribution in frequency. For each base vector, we first calculate the Fourier transform as given below:
where M is the length of the base vector, and ${\overrightarrow{\mathit{\text{W}}}}_{i}\left(\nu \right)$ is the Fourier transform of the base vector ${\overrightarrow{w}}_{i}$. Next, we perform a second Fourier transform on the base vector and obtain ${\overrightarrow{\mathit{\text{W}}}}_{i}\left(\kappa \right)$ as follows:
Finally, we sum all the values of $\left|\overrightarrow{\mathit{\text{W}}}\left(\kappa \right)\right|$ for κ greater than m_{0}, where m_{0} is a small number:
In order to demonstrate the behavior of the feature ${\text{SH}}_{\overrightarrow{w}}$, we assume that the base vector ${\overrightarrow{w}}_{i}$ has two components at frequency samples m_{1} and m_{2} with energies of α and β, respectively:
$\left|\mathit{\text{W}}\left(\nu \right)\right|$ (Equation (24)) is calculated as below:
6 Results
6.1 Synthetic dataset
6.1.1 Example 1
This example demonstrates the application of the proposed discriminant cluster selection method to signal classification. We present a synthetic two-class problem to illustrate the identification process using TF feature extraction and the proposed cluster selection method. In this experiment, we apply the TF features to a classification problem introduced in [19, 20]. Test signals are defined as the sum of two linear chirps as defined below:
where a_{0} and b_{0} are drawn from the uniform distribution U(0,1), a_{1}=0.25, b_{1}=0.40, and N=1024 is the signal length. Two classes are generated by selecting b_{2} from one of the following uniform distributions:
The TF representation for signals in each class is plotted in Figure 4.
A total of 1,100 signals are generated in each class, and TF feature extraction and classification are performed as follows:

(i) TF representation (i.e., TF matrix) of each signal is constructed.

(ii) NMF matrix decomposition method is applied to the TF matrix, and 10 base and coefficient components (i.e., W and H, and r=10) are computed for each signal.

(iii) A feature vector is extracted from each component pair as explained in Section 5; i.e., there are ten feature vectors for each signal and each feature vector contains the following feature values: $\left\{{\text{MO}}_{\overrightarrow{h}}^{\left(1\right)},{\text{MO}}_{\overrightarrow{w}}^{\left(1\right)},{\text{MO}}_{\overrightarrow{h}}^{\left(2\right)},{\text{MO}}_{\overrightarrow{w}}^{\left(2\right)},{\text{S}}_{\overrightarrow{h}},{\text{S}}_{\overrightarrow{w}},{\text{D}}_{\overrightarrow{h}},{\text{D}}_{\overrightarrow{w}},{\text{SH}}_{\overrightarrow{w}}\right\}$.

(iv) SOTM clustering is used to train and then classify the signals in each class. The classifier is trained using 90% of the samples and then applied to all the signals. SOTM is applied simultaneously to the feature vectors of Classes 1 and 2 and computes 25 clusters in the feature space. The number of feature vectors associated with Class 1 or Class 2 is counted in each cluster, and the distribution of feature vectors over these 25 clusters is displayed in Figure 5. In both hard and fuzzy labeling, clusters with more than 30% overlap (i.e., clusters 1, 12, 13, 18, 20, 23, 24, and 25) are assigned to common clusters, and the remaining clusters are identified as discriminant clusters and are labeled according to the labeling method proposed in Section 4 (Figure 6). In hard labeling, clusters with more than 30% Class 1 feature vectors are labeled as Class 1 (i.e., clusters 3, 6, 9, 10, 11, 15, 16, and 17) and the ones with more than 30% Class 2 feature vectors are labeled as Class 2 (i.e., clusters 2, 4, 5, 7, 8, 14, 19, 21, and 22). However, in fuzzy labeling, a membership ratio is assigned to each cluster as follows:
$$M={\begin{array}{c}\left[\begin{array}{ccccccccccccccccc}{C}_{2}& {C}_{3}& {C}_{4}& {C}_{5}& {C}_{6}& {C}_{7}& {C}_{8}& {C}_{9}& {C}_{10}& {C}_{11}& {C}_{14}& {C}_{15}& {C}_{16}& {C}_{17}& {C}_{19}& {C}_{21}& {C}_{22}\\ 0.28& 0.99& 0.29& 0.33& 0.88& 0.26& 0.32& 0.96& 0.83& 0.99& 0.33& 0.74& 1.0& 1.0& 0.25& 0.29& 0.3193\\ 0.72& 0.01& 0.71& 0.67& 0.12& 0.74& 0.68& 0.04& 0.1& .011& 0.67& 0.26& 0.0& 0.0& 0.75& 0.71& 0.6807\end{array}\right]\end{array}}^{\text{T}}$$(31)

(v) All the signals are tested and labeled. Figure 6 displays the receiver operating characteristic (ROC) curve of the final classification.
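The counting and labeling in step (iv) can be sketched as follows. This is a minimal illustration and not the formulation of Section 4: the function name `label_clusters`, the dictionary layout, and in particular the rule that treats a cluster as common when its minority-class fraction exceeds `common_threshold` are assumptions made here, so the sketch need not reproduce the cluster assignment reported above. The membership ratios are taken to be the per-cluster class fractions, in the spirit of the columns of M.

```python
def label_clusters(counts, common_threshold=0.30):
    """Label clusters from training counts.

    counts: {cluster_id: (n_class1, n_class2)} feature vectors per cluster.
    Returns {cluster_id: {"kind", "hard", "membership"}}.
    A cluster is treated as common when its minority-class fraction
    exceeds common_threshold (a hypothetical stand-in for the overlap rule).
    """
    labels = {}
    for cid, (n1, n2) in counts.items():
        total = n1 + n2
        if total == 0:
            continue  # an empty cluster carries no class information
        m1, m2 = n1 / total, n2 / total
        if min(m1, m2) > common_threshold:
            labels[cid] = {"kind": "common", "hard": None, "membership": None}
        else:
            labels[cid] = {
                "kind": "discriminant",
                "hard": 1 if m1 >= m2 else 2,  # hard labeling: dominant class
                "membership": (m1, m2),        # fuzzy labeling: class fractions
            }
    return labels
```

Hard labeling keeps only the dominant class of each discriminant cluster, whereas fuzzy labeling keeps both fractions so that a cluster with a 0.72 share of Class 2 contributes less strongly than one with a 0.99 share.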
6.1.2 Example 2
The purpose of this example is to evaluate the ability of the proposed method to identify an unknown discrimination pattern between two signals. Two synthetic signals (y_{1} and y_{2}) were generated using Equation 1. Panels A and B in Figure 7 show the two synthetic signals in the time and TF domains, respectively. The signals were constructed so that all the components, except two of them, were similar. As can be seen from the TFD plots in Figure 7, the dissimilar components were created by transforming two of the frequency-modulated components (in the right panel signal: y_{1}) into linearly modulated components (in the left panel signal: y_{2}).
The TFD in panels C and D is constructed using the spectrogram method with an FFT size of 1,024 points and a Kaiser window with a parameter of 5, a length of 256 samples, and an overlap of 220 samples. Features were extracted as explained in Section 5: NMF with a decomposition order of 10 was applied to the spectrograms of y_{1} and y_{2}. The decomposed vectors were ${\left[{\overrightarrow{w}}_{1}\left(i\right)\quad {\overrightarrow{h}}_{1}^{\text{T}}\left(i\right)\right]}_{i=1:10}$ and ${\left[{\overrightarrow{w}}_{2}\left(i\right)\quad {\overrightarrow{h}}_{2}^{\text{T}}\left(i\right)\right]}_{i=1:10}$, respectively. Seven TF features were extracted from each decomposed vector. Three of these features are shown in panel C of Figure 7, where the asterisks and circles correspond to the y_{1} and y_{2} signals, respectively. K-means clustering with three clusters was applied to all the features. Each cluster with the majority membership of a signal was marked as the corresponding signal’s discriminant pattern.
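For concreteness, a spectrogram with the settings quoted above (1,024-point FFT, Kaiser window with parameter 5, 256-sample window, 220-sample overlap) can be computed along the following lines. This is a sketch using NumPy only; the function name `kaiser_spectrogram` and the choice to return the one-sided squared magnitude without any scaling are assumptions, and the original implementation may differ in normalization.

```python
import numpy as np

def kaiser_spectrogram(x, nfft=1024, win_len=256, overlap=220, beta=5.0):
    """Squared-magnitude short-time Fourier transform with a Kaiser window.

    Returns a nonnegative (nfft // 2 + 1) x n_frames matrix, i.e., a
    TF matrix that can be passed to NMF.
    """
    x = np.asarray(x, dtype=float)
    hop = win_len - overlap
    if len(x) < win_len:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(x) - win_len) // hop
    window = np.kaiser(win_len, beta)
    # Windowed, overlapping frames stacked as rows
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # Zero-padded one-sided FFT of each frame
    spectrum = np.fft.rfft(frames, n=nfft, axis=1)
    return (np.abs(spectrum) ** 2).T
```

The squared magnitude keeps every entry nonnegative, which is the only property NMF requires of its input.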
As can be seen in this feature plane, a group of features was clustered in the middle. Using the discriminant feature selection method, this cluster was selected as the discriminant pattern in signal y_{2}: D_{y2}. The same method identified the discriminant pattern in signal y_{1}: D_{y1}. The remaining features belonged to the commonalities between these two signals.
Panel D in Figure 7 displays the discriminant structures in the y_{1} and y_{2} signals. These TF structures were built using the decomposed vectors corresponding to the D_{y1} and D_{y2} feature points: $\sum _{i={D}_{y1}}{\overrightarrow{w}}_{1}\left(i\right){\overrightarrow{h}}_{1}^{\text{T}}\left(i\right)$ and $\sum _{i={D}_{y2}}{\overrightarrow{w}}_{2}\left(i\right){\overrightarrow{h}}_{2}^{\text{T}}\left(i\right)$. As demonstrated in this example, the proposed method was able to successfully identify the discriminant structures in each signal. Once the discriminant clusters are selected, these clusters along with the proposed cluster labeling methods can be used to classify a new signal. The above example used only one signal from each class to arrive at the differences between the TF structures. In practice, more signals from both classes are needed to arrive at a robust discriminant pattern.
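The reconstruction of a TF structure from a subset of components is a sum of rank-one terms and can be sketched as below. The function name `reconstruct_structure` and the convention that base vectors are the columns of W and coefficient vectors are the rows of H are assumptions of this illustration.

```python
import numpy as np

def reconstruct_structure(W, H, component_ids):
    """Sum of the rank-one terms w_i h_i^T over the selected components.

    W: (M, r) base matrix, H: (r, N) coefficient matrix,
    component_ids: indices of the components in a selected cluster.
    """
    ids = list(component_ids)
    # Selecting columns of W and rows of H and multiplying them is
    # equivalent to summing the outer products of the selected pairs.
    return W[:, ids] @ H[ids, :]
```

Since the selected and unselected components partition the decomposition, the discriminant and common structures obtained this way add up to the full NMF approximation WH.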
6.1.3 Example 3
This experiment introduces more challenges to the identification of the discriminant structures between two classes. In this example, the discriminant structure overlaps with the common structure; i.e., the abnormal components are mixed with the normal structure. As is demonstrated in this example, the proposed discriminant cluster selection method provides a successful separation between the normal and abnormal structures.
In Figure 8, panels A and B show the normal and abnormal synthetic signals in the time and TF domains, respectively. The signal on the left-hand side is generated using Equation 1, and the one on the right-hand side is formed by adding three linear FM chirp signals. The TF feature extraction and discriminant cluster labeling were applied as explained in the previous example. Figure 8e displays the extracted feature vectors along with the discriminant clusters identified by our proposed method. The features outside this cluster were selected as the commonality structure between the two signals. All the TF features corresponding to the above-selected cluster were chosen and used to reconstruct the TFD. The TFDs corresponding to the common and discriminant structures were plotted in panel D. Given the ease with which the proposed approach separated the synthetic and chirp-like signal features in this example, this method has the potential to be a powerful and useful tool in signal pattern recognition applications.
6.2 Real dataset
Pathological voice classification, T-wave alternans (TWA) evaluation from the surface electrocardiogram (ECG), environmental audio classification, and telemonitoring of PD are selected as the applications of the developed discriminant cluster selection method. The first is performed using the hard labeling clustering method, and the latter three are evaluated using the fuzzy labeling approach.
6.2.1 Hard labeling: pathological speech detection
Dysphonia or pathological voice refers to speech problems resulting from damage to or malformation of the speech organs. Currently, patients are required to routinely visit a specialist to follow up on their progress. Moreover, the traditional ways of diagnosing voice pathology are subjective, and evaluations can differ depending on the experience of the specialist. Developing an automated technique saves time for both the patients and the specialist, and can improve the accuracy of the assessments. In a previous study from our group [7], we introduced joint TF feature extraction and classification for pathological speech verification. In this study, we revisit this application with a focus on nonstationary TF feature analysis combined with hard cluster labeling, and compare its performance with traditional clustering methods.
The proposed methodology was applied to the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database, distributed by Kay Elemetrics Corporation [21]. The database consists of 51 normal and 161 pathological speakers whose disorders span a variety of organic, neurological, traumatic, and psychogenic factors. The speech signal is sampled at 25 kHz and quantized at a resolution of 16 bits/sample. In this study, 25 abnormal and 25 normal signals were used to train the classifier. Each signal is divided into 80-ms segments, and the TFD of each segment is constructed [22, 23]. Next, NMF with a base number of r=15 is applied to each TF representation, and 15 base and coefficient vectors are estimated as explained in Equation (15).
As explained in our previous study [7], abnormal speech behaves differently for voiced (i.e., vowel) and unvoiced (i.e., consonant) components. Therefore, prior to feature extraction, the base vectors are divided into two groups: (a) low frequency (LF): the bases with dominant energy in the frequencies lower than 4 kHz, and (b) high frequency (HF): the bases with major energy concentration in the higher frequencies. Four features (${S}_{\overrightarrow{h}},{D}_{\overrightarrow{h}},{S}_{\overrightarrow{w}},S{H}_{\overrightarrow{w}}$) are extracted from each LF base vector, and five features $\left\{{S}_{\overrightarrow{h}},{D}_{\overrightarrow{h}},M{O}_{\overrightarrow{w}}^{\left(1\right)},M{O}_{\overrightarrow{w}}^{\left(2\right)},M{O}_{\overrightarrow{w}}^{\left(3\right)}\right\}$ are obtained from each HF base vector.
The clustering and labeling are performed as explained in Sections 3.1 and 4.1, respectively. In the training stage, 100 and 20 clusters are experimentally found to be proper choices for the number of clusters (K) for the LF and HF features, respectively. Of all the clusters, 25% were assigned as common clusters, and the remaining clusters were labeled as normal or abnormal as explained in the hard labeling scheme.
In the test stage, for each speech sample, the nearest cluster to each of the TF features is identified using the ED criterion shown in Equation (9). Finally, signals with the majority of their feature vectors in the abnormal clusters are labeled as pathological speech, and the other signals are classified as normal. Figure 9 shows the ROC plot of the proposed TF feature extraction and discriminant cluster selection using hard labeling. A maximum classification accuracy of 98.6% is achieved, with 50 of the 51 normal signals and 159 of the 161 pathological signals correctly classified. The ROC curves obtained using linear discriminant analysis (LDA) and GMM classifiers are also displayed in this figure. It can be seen that the proposed discriminant clustering method provides a higher classification accuracy. In [24], the well-known Mel-frequency cepstral coefficients (MFCCs), along with the signal pitch, are used for pathological speech classification on the same database that we employed in this section. Dibazar et al. [24] achieved an accuracy rate of 98.3% using an HMM as the classifier.
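The test stage with hard labeling can be sketched as a nearest-center assignment followed by a vote. In the sketch below, the function name `classify_signal_hard`, the string labels, the decision to ignore feature vectors that fall in common clusters, and the tie-breaking toward the normal class are assumptions of this illustration rather than details taken from the method description.

```python
import numpy as np

def classify_signal_hard(features, centers, cluster_labels):
    """Label one signal from its TF feature vectors.

    features: (n_vectors, n_features) array for a single signal.
    centers: (n_clusters, n_features) array of trained cluster centers.
    cluster_labels: sequence of 'normal', 'abnormal', or 'common',
        one entry per cluster center.
    """
    features = np.atleast_2d(np.asarray(features, dtype=float))
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from every feature vector to every cluster center
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    votes = [cluster_labels[k] for k in nearest]
    # Vectors in common clusters do not vote (assumption of this sketch)
    n_abnormal = votes.count("abnormal")
    n_normal = votes.count("normal")
    return "abnormal" if n_abnormal > n_normal else "normal"
```

Sweeping the fraction of abnormal votes required for a pathological decision, instead of fixing it at a simple majority, would be one way to trace out an ROC curve for such a classifier.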
6.2.2 Fuzzy labeling: TWA evaluation from the surface ECG
Each year 400,000 North Americans die from sudden cardiac death (SCD). Identifying those patients at risk of SCD remains a formidable challenge. TWA evaluation is emerging as an important tool to risk-stratify patients with heart disease. TWA is a heart-rate-dependent phenomenon that manifests on the surface ECG as a change in the shape or amplitude of the T-wave every second heart beat. The presence of large-magnitude TWA often presages lethal ventricular arrhythmias. Because the TWA signal is typically in the microvolt range, accurate detection algorithms are required to control for confounding noise and changing physiological conditions (i.e., data nonstationarity). In our previous study [25], we proposed a novel technique called NMF-adaptive SM. In this method, after preprocessing the ECG recordings to correct baseline wander and remove nonuniform QRS beats [26], the T-wave of each beat is aligned as shown in Figure 10.
Next, the adaptive TFD [22, 23] of the aligned T-waves is constructed over each vertical sample (${\overrightarrow{A}}_{1}$, ${\overrightarrow{A}}_{2},\dots ,{\overrightarrow{A}}_{N}$). The adaptive TFD approach is a high-resolution TF representation capable of adaptively tracking nonstationary structures. The adaptive TFD uses the matching pursuit [22] method to decompose the signal over a dictionary of TF atoms. At each iteration, the signal is projected over a dictionary of TF functions, and the one that models the greatest fraction of the signal energy is selected. This TF function is then subtracted from the signal, and the residual signal is decomposed in further iterations until all or most of the signal energy is modeled. The matching pursuit decomposition with Gabor TF atoms has been chosen in this study because of its superior TF resolution [22], cross-term-free nature, adaptivity, and suitability for pattern recognition applications. The adaptive TFD for each vertical sample is computed. If V_{1}, V_{2}, …, V_{N} are the TFDs of the vertical samples, in the next stage, the TFD representative of the entire T-wave (denoted by V_{avg}) is calculated as the average of V_{1}, V_{2}, …, V_{N}. Once the average TFD is constructed, features are extracted as explained below:
NMF with a base number of r=3 is applied to each averaged TFD, and three base and coefficient vectors are estimated as explained in Equation (15). NMF is expected to separate the TF structure from the noise components that may mask the TWA signal. From each decomposed base component, 11 features are extracted. The first feature is determined as the estimate of the TWA magnitude [27] from each base:
where T is the energy of the decomposed base (W) at a frequency of 0.5 cpb (cycles per beat), and μ_{noise} estimates the noise energy. Assuming white Gaussian noise, the noise has a constant spectral density over the entire spectral bandwidth. Since the T-wave alternation and respiratory activities do not have any spectral content in the band from 0.36 to 0.49 cpb, this band is used to estimate the noise energy. The last ten elements in each base component represent the spectral content of the T-wave signal. Basically, any information about the T-wave, including noise and the TWA value, exists in the spectral content of the last ten elements of the base vector. Therefore, the other ten features are chosen to be the last ten elements in each base component. As a result of this feature extraction, three feature vectors are extracted for each ECG segment, where each vector includes 11 feature values.
As explained in our previous study [25], real-world ECGs with inherent noise were obtained from 26 normal subjects who underwent 2-channel ambulatory ECG recordings (GE Healthcare, Inc.) for 24–48 continuous hours at our institution. The ambulatory ECGs were recorded at a sampling rate of 125 Hz and then exported for custom analysis. The mean heart rate of these recordings was 78 ± 17 bpm (beats per minute) and the mean noise level was 40 ± 67 μV. Each ECG channel was included as a separate record.
Two groups of ambulatory ECGs were generated, one without simulated TWA (TWA magnitude = 0 μV) and the other with simulated TWA (TWA magnitude = 5 μV). The ECG signals are recorded from normal subjects and are therefore assumed to have zero TWA. A simulated TWA signal with an amplitude of 5 μV is added to the ECG signals by uniformly increasing the T-wave amplitude of even beats and decreasing the T-wave amplitude of odd beats across the T-wave. The use of a known TWA signal permits TWA quantification to be compared between the different methods. A TWA detection threshold of 5 μV was prespecified, as this cutpoint approximates the TWA magnitude measured by Klingenheben et al. [28] in patients with heart disease using a definition of TWA similar to that of our study. The features extracted by NMF-adaptive SM were fed into two classifiers (SOTM clustering with fuzzy labeling, and LDA) to train and classify the ECG segments as TWA present or absent.
Half of the dataset is used for the training stage and the other half is used to test the accuracy of the TWA detection. SOTM is applied to the training dataset and the number of valid clusters is calculated for the classification. Clusters with a small number of samples are eliminated. We experimentally decided that clusters containing less than 1% of all the feature vectors are not valid. The clusters are formed as the data are presented to the network, and the number and size of the clusters are determined by parameters such as the hierarchical control function (H(t)) and the learning rate (λ(t)). The initial values of these functions are set according to the dataset. In the next stage, the membership coefficients are calculated for each cluster based on the distribution of the training signals. In the test stage, each of the test signals is assigned to one of these cluster centers based on the minimum ED measure. Finally, the class label of each signal is determined by the weighted sum of the feature vectors falling within each cluster multiplied by the membership coefficients. Another point to be discussed here is that, since the data are presented to the SOTM in a random order, the number, shape, and size of the clusters might vary each time the clustering algorithm is run on the data. However, since there is no one-to-one correspondence between the clusters and the two groups, this has no considerable impact on the overall performance of the classifier. In addition, the results of several runs are averaged to further reduce this effect.
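The cluster validity check and the fuzzy decision described above can be sketched as follows. The function names `prune_small_clusters` and `classify_signal_fuzzy`, the two-class tuple layout of the membership coefficients, and the tie-breaking toward the first class are hypothetical choices; the sketch assumes that each feature vector of a test signal has already been assigned to its nearest cluster.

```python
def prune_small_clusters(cluster_sizes, min_fraction=0.01):
    """Return the ids of clusters holding at least min_fraction of all vectors."""
    total = sum(cluster_sizes.values())
    if total == 0:
        return set()
    return {cid for cid, n in cluster_sizes.items() if n / total >= min_fraction}

def classify_signal_fuzzy(assigned_clusters, membership):
    """Fuzzy decision for one signal.

    assigned_clusters: cluster id of each feature vector of the signal.
    membership: {cluster_id: (m_class1, m_class2)} for valid discriminant
        clusters; vectors in any other cluster are ignored.
    Returns (label, (score_class1, score_class2)).
    """
    score1 = score2 = 0.0
    for cid in assigned_clusters:
        if cid in membership:
            m1, m2 = membership[cid]
            # Each vector contributes its cluster's membership coefficients
            score1 += m1
            score2 += m2
    label = 1 if score1 >= score2 else 2
    return label, (score1, score2)
```

Because the decision depends only on accumulated membership coefficients and not on the identity of individual clusters, a different cluster layout from another SOTM run can still lead to the same class scores.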
Table 1 summarizes the results. The proposed TF features and fuzzy labeling classifier (NMF-adaptive SM and fuzzy labeling) were more accurate in detecting the TWA signal than the classic LDA classifier. In Table 1, the TWA detection accuracy of our method is further compared with two well-known T-wave analysis methods (the spectral method (SM) [27] and the modified moving average (MMA) [26]) and two previously described wavelet-based methods (Wavelet 1 [29] and Wavelet 2 [30]). The significant improvement in the sensitivity and specificity of the proposed feature and classifier supports the effectiveness of this approach.
6.2.3 Fuzzy labeling: audio classification
Audio signals are an important source of information for understanding the content of multimedia. Therefore, developing audio classification techniques that better characterize audio signals plays an essential role in many multimedia applications, such as multimedia indexing and retrieval, and auditory scene analysis. With approximately 10% of the world population suffering from some sort of hearing loss, one of the important applications of audio classification is in hearing aids (HA) for hearing-impaired people. In order to prevent noise signals from being magnified by the hearing aid, the HA is required to detect the audio classes to which the incoming signals belong, and then change its parameters accordingly. A recent article from our group [31] presented the benefits of joint TF feature extraction in environmental audio classification. This section presents the performance of fuzzy cluster labeling combined with nonstationary joint TF features for audio scene analysis, and compares it with supervised classification.
In this study, we use an environmental audio dataset that was compiled in our signal analysis research group at Ryerson University [31]. The dataset consists of 192 audio signals of 5-s duration each, with a sampling rate of 22.05 kHz and a resolution of 16 bits/sample. It is designed to have 10 different classes: 20 aircraft, 17 helicopters, 20 drums, 15 flutes, 20 pianos, 20 animals, 20 birds, and 20 insects, and the speech of 20 males and 20 females. Most of the music samples were collected from the Internet and suitably processed to have uniform sampling frequency and duration.
Three-second audio signals are transformed into the TF domain. Next, NMF with a decomposition order of 15 (r=15) decomposes each TFD into 15 base and coefficient vectors. In this study, r=15 is experimentally found to be a suitable choice for the application. Seven features (Section 5) are extracted from each base and coefficient vector. Finally, SOTM clustering and fuzzy labeling are employed to train and classify the signals.
One of the most important classification tasks for a hearing aid system is to discriminate human speech from environmental noise. Therefore, in the first scenario the dataset consists of signals from human speech and environmental sounds. The human category includes 20 signals from male speakers and 20 signals from female speakers, and the environmental sounds include 10 bird, 10 aircraft, 10 piano, and 10 animal signals. Table 2 shows the results for this classification task, where an accuracy of 96% has been achieved. As can be seen from the confusion matrices, the system demonstrates high accuracy in discriminating human voice from other audio signals. The achieved true positive rate shows that all human voice signals have been classified correctly. In addition, the overall accuracy rate for classification scenarios that include discrimination of human voice is very high.
The human versus nonhuman sound discrimination is also performed using GMM as a successful traditional clustering method for audio signals. This classification resulted in a lower performance, with an overall accuracy rate of 86%. Fifteen mixtures are experimentally found to be sufficient and are used for the GMM classification. We also compared the accuracy of the TF feature extraction and clustering method with the well-known MFCC features. MFCCs are short-term spectral features and are widely used in the area of audio and speech processing.
In this application, a signal is divided into 32-ms segments, and the first 13 MFCCs are computed for all the segments over the entire length of the audio signal and used as feature vectors. Using GMM with 15 mixtures, the MFCC features resulted in an overall classification accuracy rate of 79%. It can be seen that the MFCC features and GMM system are able to successfully classify human signals; however, the method is not very effective for classifying the nonhuman signals (i.e., 57.5% accuracy rate). This behavior can be explained by the fact that MFCC features and GMM clustering are useful for human speech analysis, but they are challenged when dealing with natural sounds from nonhuman sources. In contrast, the combination of the TF feature vectors and the proposed discriminant cluster labeling performs well on both categories.
Furthermore, in order to evaluate the efficiency of the system in discriminating human voice in particular environments, two other classification tasks have been defined. In the first case, an accuracy of 98% has been achieved in discriminating human voice from musical instruments. This capability could be useful in recognizing and separating human voice from the background music in a song or at a concert. The second classification task was defined as the discrimination of human voice from natural sounds, where an accuracy of 96% has been achieved. The proposed method was also applied to other classification scenarios, such as natural versus artificial sounds and musical instruments versus aircraft. Table 3 shows the overall average accuracy rate obtained and the dataset used for each classification scenario. The classification accuracy rates using the GMM clustering method and the MFCC features are also presented in Table 3.
6.2.4 Fuzzy labeling: telemonitoring of PD
In this application, we present an assessment of the proposed discriminant clustering method for discriminating healthy people from people with PD by detecting dysphonia. The data for this application were obtained from Little et al. [32]. The dataset consists of 195 sustained vowel phonations from 31 male and female subjects, of whom 23 were diagnosed with Parkinson’s disease. The time since diagnosis ranged from 0 to 28 years, and the ages of the subjects ranged from 46 to 85 years (mean 65.8, standard deviation 9.8). An average of six phonations was recorded from each subject, ranging from 1 to 36 s in length. See [32] for subject details. Little et al. [32] selected ten highly uncorrelated measures, and an exhaustive search of all possible combinations of these measures found four that in combination lead to an overall correct classification performance of 91.4%, using a kernel support vector machine (SVM).
In this section, we use the ten features proposed in [32] and apply the proposed discriminant clustering method with the soft labeling strategy to discriminate between people with PD and healthy subjects. It is worth mentioning that, since this database provides only the extracted attributes and not the original signals, we could only use the given features. This way, we could evaluate the proposed discriminant cluster selection method and investigate its efficiency in comparison to the exhaustive search and SVM classification used in [32].
Using the proposed discriminant cluster selection and soft labeling method, two common and four discriminant clusters are obtained. This method achieved an overall classification performance of 97% which was higher than 91.4% that was reported in [32]. GMM is also applied for the classification of the PD features. Five and four mixtures are obtained for PD and normal classes, respectively. A poor classification performance with an overall accuracy rate of 69% is obtained using GMM. ROCs for the classification using discriminant clustering and GMM are shown in Figure 11 with the area under the curve of 0.995 and 0.65, respectively.
7 Conclusion
The objective of this article was to improve the performance of pattern recognition systems when feature vectors overlap due to the nonstationarity of the signals or the commonality that exists among different classes. To this end, the article introduced a different strategy for clustering techniques based on a fusion of unsupervised and supervised learning approaches. This method applied unsupervised clustering to the feature vectors from all the different classes, and then used a supervised labeling method to select two types of clusters: discriminant and common clusters. The supervised cluster labeling approach separated the discriminant clusters from the common ones according to their importance for representing each corresponding class. The obtained discriminant clusters represented the differentiating patterns that exist among signals from different classes. Therefore, in the classification stage, only the feature vectors located in the discriminant clusters were considered for the classification of a given signal. These feature vectors were better representatives of the signals’ characteristics, and resulted in a significantly higher classification accuracy rate.
In order to identify the discriminant clusters, two cluster labeling methods were proposed: hard and fuzzy labeling. In hard labeling, discriminant clusters were assigned to one of the possible classes, whereas in fuzzy labeling, they were associated with each class through a relative membership value ranging from 0 to 1 (with 0 being the least contribution, and 1 being the most). Both proposed methods enhanced the commonly used supervised learning and clustering approaches. K-means and SOTM clustering methods were explained for the applications studied in this article. An advantage of SOTM over the K-means method was that the number of clusters, which must be known beforehand in K-means, was determined adaptively by the SOTM algorithm.
In conclusion, experiments performed with synthetic signals as well as pathological speech, surface ECG, telemonitoring of PD, and environmental audio signals demonstrated the potential of the proposed discriminant feature clustering framework to become a powerful pattern recognition tool.
References
 1.
Freeman G, Dony R, Areibi S: Audio environment classification for hearing aids using artificial neural networks with windowed input. In Proceedings of the IEEE Symposium on Computational Intelligence in Image and Signal Processing, vol. 2846. Honolulu, HI; April 2007:183–188.
 2.
Pawlak Z: Rough sets. Int. J. Comput. Inf. Sci 1982, 11: 341–356. 10.1007/BF01001956
 3.
Pawlak Z: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Norwell; 1992.
 4.
Jensen R, Shen Q: New approaches to fuzzy-rough feature selection. IEEE Trans. Fuzzy Syst 2009, 17: 824–838.
 5.
Saito N, Coifman R: Local discriminant bases and their applications. J. Math. Imag. Vis 1995, 5(4):337–358. 10.1007/BF01250288
 6.
Umapathy K, Krishnan S: Time-width versus frequency band mapping of energy distributions. IEEE Trans. Signal Process 2007, 55: 978–989.
 7.
Ghoraani B, Krishnan S: A joint time–frequency and matrix decomposition feature extraction methodology for pathological voice classification. EURASIP J. Adv. Signal Process 2009, 2009(ID 928974):11. [http://dx.doi.org/10.1155/2009/928974]
 8.
Jain A, Duin R, Mao J: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell 2000, 22(1):4–37. 10.1109/34.824819
 9.
Duda R, Hart P, Stork D: Pattern Classification. Wiley, New York; 2001.
 10.
Kong H, Guan L: Detection and removal of impulse noise by a neural network guided adaptive median filter. In Proceedings of the IEEE International Conference on Neural Networks, vol. 2. Perth, WA; November 1995:845–849.
 11.
Kyan M: Unsupervised learning through dynamic self-organization: implications for microbiological image analysis. PhD thesis, School of Electrical and Information Engineering, University of Sydney; 2007.
 12.
Kyan M, Jarrah J, Muneesawang P, Guan L: Strategies for unsupervised multimedia processing: self-organizing trees and forests. IEEE Comput. Intell. Mag 2006, 1: 27–40. 10.1109/MCI.2006.1626492
 13.
Lee D, Seung H: Learning the parts of objects by nonnegative matrix factorization. Nature 1999, 401(6755):788–791. 10.1038/44565
 14.
Paatero P, Tapper U: Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics 1994, 5: 111–126. 10.1002/env.3170050203
 15.
Lee D, Seung H: Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, MA; 556–562.
 16.
Lin CJ: Projected gradient methods for nonnegative matrix factorization. Neural Comput 2007, 19(10):2756–2779. 10.1162/neco.2007.19.10.2756
 17.
Tacer B, Loughlin P: Time–frequency based classification. In Proceedings of the International Society for Optical Engineering (SPIE), vol. 2846. Denver, CO; August 1996:186–192.
 18.
Groutage D, Bennink D: Feature sets for nonstationary signals derived from moments of the singular value decomposition of Cohen-Posch (positive time–frequency) distributions. IEEE Trans. Signal Process 2000, 48(5):1498–1503. 10.1109/78.840002
 19.
Davy M, Doncarli C, Boudreaux-Bartels GF: Improved optimization of time–frequency based signal classifiers. IEEE Signal Process. Lett 2001, 8: 52–57.
 20.
Davy M, Gretton A, Doucet A, Rayner P: Optimized support vector machines for nonstationary signal classification. IEEE Signal Process. Lett 2002, 9(12):442–445.
 21.
Massachusetts Eye and Ear Infirmary: Voice Disorders Database, Version 1.03. Kay Elemetrics Corporation, Lincoln Park; 1994.
 22.
Mallat SG, Zhang Z: Matching pursuits with time–frequency dictionaries. IEEE Trans. Signal Process 1993, 41(12):3397–3415. 10.1109/78.258082
 23.
Krishnan S, Rangayyan R, Bell G, Frank C: Adaptive time–frequency analysis of knee joint vibroarthrographic signals for noninvasive screening of articular cartilage pathology. IEEE Trans. Biomed. Eng 2000, 47(6):773–783. 10.1109/10.844228
 24.
Dibazar A, Narayanan S, Berger T: Feature analysis for automatic detection of pathological speech. In Proceedings of the Second Joint EMBS/BMES Conference, vol. 1. Houston, TX, USA; October 2002:182–183.
 25.
Ghoraani B, Krishnan S, Selvaraj RJ, Chauhan VS: T wave alternans evaluation using adaptive time–frequency signal analysis and nonnegative matrix factorization. Med. Eng. Phys 2011, 33(6):700–711. 10.1016/j.medengphy.2011.01.007
 26.
Nearing BD, Verrier RL: Modified moving average analysis of T-wave alternans to predict ventricular fibrillation with high accuracy. J. Appl. Physiol 2002, 92: 541–549.
 27.
Smith JM, Clancy EA, Valeri CR, Ruskin JN, Cohen RJ: Electrical alternans and cardiac electrical instability. Circulation 1988, 77(1):110–121. 10.1161/01.CIR.77.1.110
 28.
Klingenheben T, Ptaszynski P, Hohnloser S: Quantitative assessment of microvolt T-wave alternans in patients with congestive heart failure. J. Cardiovasc. Electrophysiol 2005, 16: 620–624. 10.1111/j.1540-8167.2005.40708.x
 29.
Romero I, Grubb N, Clegg G, Robertson C, Addison P, Watson J: T-wave alternans found in preventricular tachyarrhythmias in CCU patients using a wavelet transform-based methodology. IEEE Trans. Biomed. Eng 2008, 55: 2658–2665.
 30.
Boix M, Cantó B, Cuesta D, Micó P: Using the wavelet transform for T-wave alternans detection. Math. Comput. Model 2009, 50: 738–742. 10.1016/j.mcm.2009.05.002
 31.
Ghoraani B, Krishnan S: Time–frequency matrix feature extraction and classification of environmental audio signals. IEEE Trans. Audio Speech Lang. Process 2011, 19(7):2197–2209.
 32.
Little M, McSharry P, Roberts S, Costello D, Moroz I: Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMed. Eng. OnLine 2007, 6(23).
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Ghoraani, B., Krishnan, S. Discriminant nonstationary signal features’ clustering using hard and fuzzy cluster labeling. EURASIP J. Adv. Signal Process. 2012, 250 (2012). https://doi.org/10.1186/1687-6180-2012-250
Keywords
 K-means clustering
 The self-organizing tree map (SOTM)
 Time–frequency feature analysis
 Supervised classification
 Unsupervised clustering
 Discriminant cluster selection