A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices

A two-stage classifier is used to improve the classification performance between normal and pathological voices. A primary classification between normal and pathological voices is achieved by the Gaussian mixture model (GMM) log-likelihood scores. For samples that do not meet the thresholds for normal or disordered voice in the GMM, the final decision is made by a higher-order statistics (HOS)-based parameter. The normalized skewness and kurtosis, and the means of the normalized skewness and kurtosis, were estimated using a sustained vowel /a/ from 53 normal and 173 pathological voices taken from the Disordered Voice Database. A Mel-frequency cepstral coefficient (MFCC)-based GMM, the HOS methods, and a two-stage classifier based on the GMM-HOS were applied to each voice signal. A Mann–Whitney rank sum test was used to detect differences in the means of the HOS-based parameters. A fivefold cross-validation scheme was used to test the classification method. With 16 Gaussian mixtures, the MFCC-based GMM algorithm achieved 92.0% accuracy. When the means of the normalized skewness and kurtosis were used, accuracies of 82.31 and 83.67% were obtained, respectively. The two-stage classifier with 16 Gaussian mixtures and the mean of the normalized kurtosis classified samples with 96.96% accuracy. The proposed two-stage classifier is more accurate than the MFCC-based GMM and HOS methods alone and shows potential for the classification of voices in the clinic.


Introduction
Speech is integral to day-to-day communication. Speech impediments negatively impact social interactions, which has led to interest in the early detection and treatment of voice disorders. Many researchers have worked towards the goal of automatic and objective classification between normal and pathological voices using minimally invasive methods. A large amount of research has focused on the automatic detection of voice pathologies by means of acoustic analysis, parametric and non-parametric feature extraction, pattern recognition algorithms, and statistical methods [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15].
Sáenz-Lechón et al. [5] presented an overview of previous classification schemes applied to the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database [16]. They described some methodological concerns to be considered when designing automatic systems for pathological voice detection. They recommended the use of a commercially well-known database, a cross-validation strategy based on several partitions to obtain averaged classification performances with confidence intervals, the reporting of results by means of detection error trade-off (DET) curves, and an investigation of the area under the receiver operating characteristic (ROC) curve.
Pattern classification algorithms such as the Gaussian mixture model (GMM), neural network (NN), and hidden Markov model have received attention as a potential means to discriminate between normal and pathological voices [6][7][8][9][10][11][12][13]. The GMM in particular has been reported as a very successful classification method [10][11][12][13]. Characteristic parameters, such as Mel-frequency cepstral coefficients (MFCC), have also become more popular for voice pathology detection [6,8,10,11,12]. Recently, Wang et al. [12] proposed a GMM supervector kernel-support vector machine (GMM-SVM) classifier, which was compared with the GMM classifier as a baseline algorithm. The GMM supervectors were largely effective parameters for the discrimination of normal and pathological voices. A classification accuracy of 96.1% was achieved by SVM classification of the 16-Gaussian GMM supervectors [12].
As an acoustic analysis method, higher-order statistics (HOS) have shown promising results in a number of signal processing applications, and are of particular value when dealing with a mixture of Gaussian and non-Gaussian processes and system nonlinearity [14,15,17,18]. The application of HOS to speech processing has primarily been motivated by its inherent Gaussian suppression and phase preservation properties [17,18]. Research in the disordered voice field has been based on the assumption that speech has non-zero HOS that is distinct from that of Gaussian noise [18,19]. Alonso et al. [19] proposed seven new HOS-based parameters that were obtained directly or indirectly from the bispectrum of a voice frame. A success rate of 98.3% was obtained by using both the conventional and the HOS-based parameters with an NN classifier, demonstrating the possibility of automatically discriminating pathological from healthy voices using HOS parameters [20]. Further study of how well each HOS parameter can detect pathological voices with the methodological designs recommended by Sáenz-Lechón et al. [5] is merited.
In this article, we first propose new HOS-based parameters implemented in the time domain: the means of the normalized skewness and kurtosis, which are calculated for each frame and averaged over a sentence. HOS-based parameters estimated in the time domain can easily be applied in a real-time environment, in contrast to the Fourier series representation of the HOS parameters in the frequency domain applied by Alonso et al. [19]. Second, we propose a two-stage approach to further improve the accuracy of the classification between normal and pathological voices. The classification system consists of an MFCC-based GMM algorithm, which provides the primary classification from the GMM log-likelihood scores, and an HOS-based parameter as a post-processor.

Material
Vocal signals were collected from the MEEI Voice Disorders Database [16]. Fifty-three normal and one hundred seventy-three pathological speakers with a wide range of organic, neurological, traumatic, and psychogenic voice disorders were selected. The extracted subset is the same as the one described in the study of Wang et al. [12], so that the results can be compared with that study. Voice samples were collected in a controlled environment and sampled with a 50- or 25-kHz sampling rate and 16 bits of resolution. Patients phonated a sustained /a/ (1-3 s). All voice data were down-sampled to 25 kHz and grouped into training (70% of the data) and test (30%) sets to implement all methods. Each set for the fivefold cross-validation scheme was randomly selected from the subset [10,12].
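The partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `random_splits`, the integer speaker IDs, the fixed seed, and the per-class (stratified) shuffling are hypothetical choices, since the text says only that the sets were randomly selected.

```python
import random

def random_splits(normal_ids, pathological_ids, n_rounds=5, train_frac=0.7, seed=0):
    """Yield (train, test) lists of (speaker_id, label) for repeated random splits.

    Assumption: each class is shuffled and cut separately so that both
    classes appear in the training and test sets of every round.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        train, test = [], []
        for label, ids in (("N", normal_ids), ("P", pathological_ids)):
            shuffled = list(ids)
            rng.shuffle(shuffled)
            cut = round(train_frac * len(shuffled))
            train += [(i, label) for i in shuffled[:cut]]
            test += [(i, label) for i in shuffled[cut:]]
        yield train, test

# Hypothetical IDs: 0-52 for the 53 normal speakers, 53-225 for the 173 pathological ones.
splits = list(random_splits(range(53), range(53, 226)))
```

Each round draws a fresh 70/30 partition, and the performance figures would then be averaged over the five rounds.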

Statistical analysis
Statistical analysis was conducted using Sigma Stat 3.0 (Jandel Scientific, San Rafael, CA, USA). The Mann-Whitney rank sum test was performed to test the differences between normal and pathological voices for the normalized skewness, the normalized kurtosis, and the means of the normalized skewness and kurtosis. A significance level of 0.05 was used for all measures.
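For readers without access to the statistics package, the rank sum test can be illustrated with a minimal pure-Python sketch. It is not the procedure of the package used in the study: it assumes the large-sample normal approximation without continuity or tie correction, and the function name and toy samples are hypothetical.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney rank sum test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value

# Toy samples (hypothetical values, not data from the study).
u_stat, p_val = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

A difference would be declared significant when `p_val` falls below 0.05.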

MFCC-based GMM method
The voice samples were analyzed in 40-ms frames, each overlapping the previous frame by 20 ms and multiplied by a Hamming window, as in previous studies [12]. MFCC parameters were extracted and fed into a GMM-based detector that makes the final decision about the absence or presence of pathology. The number of filter banks was 38, and for every frame a 36-dimensional feature vector was calculated, consisting of 18 Mel-cepstral coefficients and their 18 first-order derivatives (Delta-MFCC), as in Wang et al. [12]. Cepstral mean subtraction was also used during the extraction to reduce the cepstral bias of the recording channel. The Linde-Buzo-Gray algorithm was used for the GMM initialization, and GMMs with 8, 16, and 32 mixtures were trained with the expectation-maximization (EM) algorithm to determine the model parameters: mean vectors, covariance matrices, and mixture weights. For an utterance $X = \{x_1, x_2, \ldots, x_T\}$, where $T$ is the number of frames, the log-likelihood ratio (LLR) obtained by applying Bayes' rule and disregarding the constant prior probabilities in the log domain is presented in Equation (1).

$$\Lambda(X) = \log p(X \mid \lambda_N) - \log p(X \mid \lambda_P) \qquad (1)$$

where $\Lambda(X)$ is the LLR, and $\lambda_N$ and $\lambda_P$ are the GMMs for normal and pathological voices, respectively. The subscripts N and P indicate normal and pathological voices, respectively. Figure 1 shows the histogram of the LLR estimated from normal and pathological voices in the training procedure. The decision threshold, $\Lambda_{NP}$, is then set to adjust the tradeoff between rejecting pathological voices (false rejection) and accepting normal voices (false acceptance). In the test procedure, the LLR, $\Lambda(X)$, is compared with the threshold $\Lambda_{NP}$, and the voice is said to be normal if $\Lambda(X) \geq \Lambda_{NP}$ and pathological otherwise.
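The LLR-based decision can be sketched as below. Several points are assumptions rather than details given in the text: diagonal covariance matrices, an LLR defined as the normal-model score minus the pathological-model score (so that large values indicate a normal voice), and the helper names `gmm_log_likelihood`, `llr`, and `classify`. The toy models are hypothetical.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Return the sum over frames of log p(x_t | GMM) for a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_pdf = sum(
                -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mu, var)
            )
            comp_logs.append(math.log(w) + log_pdf)
        peak = max(comp_logs)  # log-sum-exp keeps the mixture sum numerically stable
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total

def llr(frames, gmm_normal, gmm_pathological):
    """Log-likelihood ratio: normal-model score minus pathological-model score."""
    return (gmm_log_likelihood(frames, *gmm_normal)
            - gmm_log_likelihood(frames, *gmm_pathological))

def classify(frames, gmm_normal, gmm_pathological, threshold=0.0):
    """Label an utterance by comparing its LLR with a decision threshold."""
    score = llr(frames, gmm_normal, gmm_pathological)
    return "normal" if score >= threshold else "pathological"

# Toy one-dimensional, single-component models: (weights, means, variances).
toy_normal = ([1.0], [[0.0]], [[1.0]])
toy_pathological = ([1.0], [[3.0]], [[1.0]])
label_a = classify([[0.1], [-0.2], [0.3]], toy_normal, toy_pathological)
label_b = classify([[2.9], [3.2]], toy_normal, toy_pathological)
```

In practice the frames would be the 36-dimensional MFCC vectors and the models would be the EM-trained GMMs with 8, 16, or 32 mixtures.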

HOS method
A speech signal, $x(n)$, which may be normal or pathological, can be expressed as given in Equation (2) [17].

$$x(n) = s(n) + w(n) \qquad (2)$$

where $s(n)$ is a non-Gaussian signal generated by the oscillation of the vocal folds and $w(n)$ is Gaussian noise, which can be assumed to be zero in normal voices and non-zero in pathological voices. Pathological voices are corrupted by noise, $w(n)$, which is directly related to the perceived roughness of the voice [1][2][3][20]. If $s(n)$ and $w(n)$ are statistically independent, then the energy of $x(n)$ is the sum of the speech and noise energies: $E_x = E_s + E_w$. Second-order statistics are thus directly affected in an additive way by the presence of noise [17,19]. However, when HOS analysis is applied to pathological voices, unstable and discontinuous statistics of $x(n)$ may be estimated because HOS analysis is blind to Gaussian processes. On the other hand, in a normal voice, the HOS of only non-Gaussian measurements may be extracted because the Gaussian noise can be assumed to be zero. The variation of the non-Gaussian signal produced by vibration of the vocal folds can be an important clue for the classification of pathological and normal voices.
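The additivity of the energies can be checked in one step. The expansion below is a sketch that adds one assumption not stated in the text, namely that the noise has zero mean; the notation $c_3$ for the third-order cumulant is likewise introduced here for illustration.

```latex
\begin{aligned}
E_x = E\{x^2(n)\} &= E\{[s(n) + w(n)]^2\} \\
                  &= E\{s^2(n)\} + 2\,E\{s(n)\}\,E\{w(n)\} + E\{w^2(n)\} \\
                  &= E_s + E_w \qquad \text{if } E\{w(n)\} = 0 .
\end{aligned}
```

The cross term factorizes because $s(n)$ and $w(n)$ are independent. By contrast, cumulants of independent signals add and all cumulants of order three and higher vanish for a Gaussian process, so that, for example, $c_3^x = c_3^s + c_3^w = c_3^s$. This is the sense in which HOS are blind to the Gaussian component.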
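If $x(n)$, where $n = 0, \pm 1, \pm 2, \ldots$, is a real stationary discrete-time signal and its moments up to order $p$ exist, then its $p$th-order moment function is given by Equation (3).

$$m_p^x(\tau_1, \tau_2, \ldots, \tau_{p-1}) = E\{x(n)\,x(n+\tau_1) \cdots x(n+\tau_{p-1})\} \qquad (3)$$

It depends only on the time differences $\tau_1, \tau_2, \ldots, \tau_{p-1}$, with $\tau_i = 0, \pm 1, \pm 2, \ldots$ for all $i$. Here, $E\{\cdot\}$ denotes statistical expectation; for a deterministic signal, it is replaced by a time summation over all time samples or by time averaging. In addition, if the signal has zero mean, then its cumulant functions (up to order four) are given by Equation (4) [17].

$$\begin{aligned}
\text{second-order cumulant:}\quad & c_2^x(\tau_1) = m_2^x(\tau_1) \\
\text{third-order cumulant:}\quad & c_3^x(\tau_1, \tau_2) = m_3^x(\tau_1, \tau_2) \\
\text{fourth-order cumulant:}\quad & c_4^x(\tau_1, \tau_2, \tau_3) = m_4^x(\tau_1, \tau_2, \tau_3) - m_2^x(\tau_1)\,m_2^x(\tau_3 - \tau_2) - m_2^x(\tau_2)\,m_2^x(\tau_3 - \tau_1) - m_2^x(\tau_3)\,m_2^x(\tau_2 - \tau_1)
\end{aligned} \qquad (4)$$

By setting all the lags to zero in the above cumulant expressions, we obtain the variance, skewness, and kurtosis, as in Equation (5).

$$\begin{aligned}
\text{Variance:}\quad & c_2^x(0) = E\{x^2(n)\} \\
\text{Skewness:}\quad & c_3^x(0, 0) = E\{x^3(n)\} \\
\text{Kurtosis:}\quad & c_4^x(0, 0, 0) = E\{x^4(n)\} - 3\,[E\{x^2(n)\}]^2
\end{aligned} \qquad (5)$$

When estimating HOS from finite data records, the variance of the estimators is reduced by normalizing the input data to have unit variance prior to computing the estimators. Equivalently, the third- and fourth-order statistics are normalized by the appropriate powers of the data variance; thus we define the normalized skewness and kurtosis as shown in Equations (6) and (7) [17].

$$\text{Normalized skewness:}\quad \gamma_3 = \frac{E\{x^3(n)\}}{[E\{x^2(n)\}]^{3/2}} \qquad (6)$$

$$\text{Normalized kurtosis:}\quad \gamma_4 = \frac{E\{x^4(n)\} - 3\,[E\{x^2(n)\}]^2}{[E\{x^2(n)\}]^2} \qquad (7)$$

In this article, the normalized skewness and kurtosis are extracted frame by frame as shown in Equation (8), where $x_t$ is the speech sample value of the $t$th frame and $N$ is the number of samples. The proposed HOS-based parameters are the means of the normalized skewness and kurtosis, $\bar{\gamma}_3$ and $\bar{\gamma}_4$. They are estimated over a sentence and are derived from $\gamma_3$ and $\gamma_4$ as described in Equation (9).

$$\bar{\gamma}_3 = \frac{1}{T}\sum_{t=1}^{T} \gamma_{3t}, \qquad \bar{\gamma}_4 = \frac{1}{T}\sum_{t=1}^{T} \gamma_{4t} \qquad (9)$$

where $\gamma_{3t}$ and $\gamma_{4t}$ are $\gamma_3$ and $\gamma_4$ extracted in the $t$th frame, respectively, and $T$ is the number of frames. As in the MFCC procedure, voice samples were cut into 40-ms overlapping frames, shifted by 20 ms and multiplied by a Hamming window, to extract the HOS-based parameters.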
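The frame-wise extraction and averaging can be sketched as follows. The estimator is an assumption: the sketch removes the frame mean and uses plain standardized third and fourth sample moments divided by N, with no subtraction of 3 from the kurtosis, which matches the comparison against 3 made when the results are reported. The exact estimator used in the study is not recoverable from the text, and the function names are hypothetical.

```python
import math

def frames_of(signal, frame_len, hop):
    """Cut a signal into overlapping frames and apply a Hamming window to each."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [
        [s * w for s, w in zip(signal[start:start + frame_len], window)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def normalized_skew_kurt(frame):
    """Standardized third and fourth sample moments of one frame (assumed estimator)."""
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m3 = sum((x - mean) ** 3 for x in frame) / n
    m4 = sum((x - mean) ** 4 for x in frame) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def hos_means(signal, fs=25000, frame_ms=40, hop_ms=20):
    """Average the frame-wise normalized skewness and kurtosis over all T frames."""
    frame_len = fs * frame_ms // 1000
    hop = fs * hop_ms // 1000
    stats = [normalized_skew_kurt(f) for f in frames_of(signal, frame_len, hop)]
    n_frames = len(stats)
    mean_skew = sum(s for s, _ in stats) / n_frames
    mean_kurt = sum(k for _, k in stats) / n_frames
    return mean_skew, mean_kurt
```

At 25 kHz, a 40-ms frame holds 1000 samples and a 20-ms shift corresponds to 500 samples, so a 2-s vowel yields 99 frames.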

Two-stage classifier based on GMM-HOS
The block diagram of the proposed algorithm is shown in Figure 2. In the training phase, the MFCCs are extracted from the voice samples for each analysis frame. $\Lambda_N$ and $\Lambda_P$ indicate the LLR thresholds estimated by each GMM for normal and pathological voices. An example is shown in Figure 3 with false acceptance and false rejection plots versus LLR thresholds. The two curves intersect at the equal error rate (EER).
$\Lambda_N$ and $\Lambda_P$ are the LLR values determined when the EER is 25.0%. The thresholds, $\bar{\gamma}_3^{\mathrm{thre}}$ and $\bar{\gamma}_4^{\mathrm{thre}}$, are determined in advance as the values that produce the best results when $\bar{\gamma}_3$ or $\bar{\gamma}_4$ is used alone for pathological voice detection. That is, they are computed on the training data set. In the test phase, the LLR, $\Lambda$, is estimated with the feature vector and the pre-trained GMMs. The primary decision is executed by the MFCC-based GMM algorithm. If $\Lambda_P \leq \Lambda \leq \Lambda_N$, the voice samples are processed using the HOS operator. The final decision is realized after calculating $\bar{\gamma}_3$ and $\bar{\gamma}_4$. The values of $\bar{\gamma}_3$ and $\bar{\gamma}_4$ are independently used to classify normal and pathological voices. When $\bar{\gamma}_3$ is used, the voice is said to be pathological if $\bar{\gamma}_3 < \bar{\gamma}_3^{\mathrm{thre}}$ and normal if $\bar{\gamma}_3 \geq \bar{\gamma}_3^{\mathrm{thre}}$. When $\bar{\gamma}_4$ is used, the voice is said to be normal if $\bar{\gamma}_4 < \bar{\gamma}_4^{\mathrm{thre}}$ and pathological if $\bar{\gamma}_4 \geq \bar{\gamma}_4^{\mathrm{thre}}$.

Figure 2 Overall procedure of the two-stage classifier. $\Lambda_N$ and $\Lambda_P$ are the thresholds of the LLR estimated by each GMM for normal and pathological voices in the training procedure. $\bar{\gamma}_3$ and $\bar{\gamma}_4$ are the means of the normalized skewness and kurtosis, respectively. $\bar{\gamma}_3^{\mathrm{thre}}$ and $\bar{\gamma}_4^{\mathrm{thre}}$ are the thresholds optimized from $\bar{\gamma}_3$ and $\bar{\gamma}_4$ in the training procedure. $\Lambda$ is the LLR estimated from the pre-trained GMMs in the test procedure.
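The decision logic of the two stages can be written compactly as below. The handling of the uncertain region and the HOS threshold rules follow the description above. The labels assigned outside that region (normal above the upper threshold, pathological below the lower one) are an inferred assumption, and the function and argument names are hypothetical.

```python
def two_stage_decision(llr, lam_n, lam_p, hos_value, hos_threshold, use_kurtosis=True):
    """Return 'normal' or 'pathological' for one utterance.

    llr           : LLR from the pre-trained GMMs
    lam_n, lam_p  : upper and lower LLR thresholds learned in training
    hos_value     : mean normalized kurtosis (or skewness) of the utterance
    hos_threshold : threshold for that HOS parameter, learned in training
    """
    # Stage 1: confident GMM decisions (assumed direction: high LLR means normal).
    if llr > lam_n:
        return "normal"
    if llr < lam_p:
        return "pathological"
    # Stage 2: uncertain region lam_p <= llr <= lam_n, decided by the HOS parameter.
    if use_kurtosis:
        return "pathological" if hos_value >= hos_threshold else "normal"
    return "pathological" if hos_value < hos_threshold else "normal"
```

For example, with hypothetical thresholds `lam_n=2.0` and `lam_p=-2.0`, an utterance with an LLR of 0.0 falls into the uncertain region and is labeled by comparing its mean normalized kurtosis with the kurtosis threshold.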

MFCC-based GMM method
The performance was assessed by averaging the results obtained from the fivefold cross-validation scheme [10,12]. Table 1 shows the confusion matrix, accuracy (%) including 95% confidence intervals (CIs), specificity (%), sensitivity (%), and area under the curve (AUC) according to the number of Gaussian mixtures. Specificity and sensitivity refer to the test's ability to identify negative and positive results, respectively. The accuracy is the proportion of true results (both true positives and true negatives) in the population. The GMMs were trained using 8, 16, and 32 mixtures. The average accuracy was 92.00% when the number of Gaussian mixtures was 16. The result using 32 Gaussian mixtures was also better than that obtained with 8 Gaussian mixtures. Figure 4 shows the ROC curve for the MFCC-based GMM configuration with the best accuracy. The EER of the MFCC-based GMM method is shown in Figure 5.

HOS method
[...] and have a leptokurtic distribution ($\gamma_4 > 3$). For normal voices, these distributions tended to be skewed to the right and have a platykurtic distribution ($\gamma_4 < 3$), overall. The distributions of pathological voices tended to show larger variation than those of normal voices. A Mann-Whitney rank sum test showed a statistically significant difference between normal and disordered voices for $\gamma_3$ and $\gamma_4$ (p < 0.001). The fivefold cross-validation was used to estimate the performance of each parameter. When the means of the normalized skewness and kurtosis, $\bar{\gamma}_3$ and $\bar{\gamma}_4$, were used to classify normal and pathological voices, average accuracies of 82.31 and 83.67% were obtained, respectively. Table 2 shows the confusion matrix, accuracy (%) including 95% CIs, specificity (%), sensitivity (%), and AUC when the means of the normalized skewness and kurtosis are used to classify normal and pathological voices. The accuracy of the mean of the normalized kurtosis was higher than that of the mean of the normalized skewness.
The ROC curve of the mean of the normalized kurtosis is shown in Figure 4. In Figure 5, the DET curve shows the EER for the mean of the normalized kurtosis. The MFCC-based GMM method outperformed the HOS method.

Two-stage classifier based on GMM-HOS
Table 3 shows the confusion matrix, accuracy (%) including 95% CIs, specificity (%), sensitivity (%), and AUC when the two-stage classifier is used. The results were measured with the same fivefold cross-validation as for the MFCC-based GMM and HOS methods. The best performance, 96.96%, was obtained when 16 Gaussian mixtures and the mean of the normalized kurtosis were used. In general, when the mean of the normalized kurtosis was used as the second classifier, the performance was higher than with the mean of the normalized skewness.
The ROC and DET curves of the two-stage classifier using 16 Gaussian mixtures and the mean of the normalized kurtosis are shown in Figures 4 and 5. The AUC of this method was larger than those of the 16-mixture GMM and the mean of the normalized kurtosis used independently. The EER of the two-stage classifier was 3.04% (Figure 5).
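The accuracy, sensitivity, and specificity reported in the tables follow from the confusion matrix in the usual way. The sketch below assumes that pathological voices are the positive class, which the text does not state explicitly; the function name and the counts in the usage note are hypothetical and are not results from the study.

```python
def detection_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity (in %) from confusion-matrix counts.

    tp: pathological voices labeled pathological   fn: pathological labeled normal
    tn: normal voices labeled normal               fp: normal labeled pathological
    """
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }

metrics = detection_metrics(tp=9, fn=1, tn=8, fp=2)
```

With these toy counts, 17 of the 20 test voices are classified correctly, 9 of the 10 positives are detected, and 8 of the 10 negatives are correctly rejected.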

Conclusion and discussion
In this article, we define a two-stage technique to discriminate pathological from normal voices. The newly proposed model comprises two parts: an MFCC-based GMM algorithm, which provides the primary decision from the LLR scores of the GMMs, and a post-processing HOS analysis block incorporating the means of the normalized skewness and kurtosis. The characteristics of the MFCC for normal and pathological voices are presented in the study of Godino-Llorente et al. [10]. A strong correlation between the HOS coefficients (the normalized skewness and kurtosis) and voice classification is demonstrated (p < 0.001). By introducing these parameters in cases where the MFCC-based GMM algorithm returns uncertain values, the classification can be improved. Following Sáenz-Lechón et al.'s recommendations [5], we utilized a commercially well-known database. Classification performance along with CIs was obtained by a cross-validation strategy based on several partitions. Results are also described by DET curves and AUC. The two-stage classifier outperformed the individual classification schemes. The best performance, 96.96%, is achieved by combining an MFCC-based GMM algorithm with 16 Gaussian mixtures and the mean of the normalized kurtosis. The MFCC-based GMM algorithm with 16 Gaussian mixtures performed at 92.0%, while the means of the normalized skewness and kurtosis classified 82.31 and 83.67% of the voices correctly, respectively. A false decision is occasionally caused by erroneous EM-based GMM estimation in the intersection regions, where the likelihoods are somewhat low. Therefore, we believe that the performance improvement is mainly due to the two-stage classifier resolving the false decisions caused by low log-likelihood values.
Recently, Wang et al. [12] proposed a GMM-SVM classifier with a classification performance of 96.1%. Finally, in this article, we combined the MFCC-based GMM method utilized by Godino-Llorente et al. [8,10,11] and Wang et al. [12] with HOS parameters to obtain a performance of 96.96%.
The automatic classification between normal and pathological voices remains an open problem that calls for reliable algorithms to aid the clinicians. When the information gathered from simple physically informed GMMs is combined with HOS-based parameters, a valuable classifier can be obtained. This two-stage method can be used for the analysis and assessment of voice quality.