# A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices

*EURASIP Journal on Advances in Signal Processing*
**volume 2012**, Article number: 252 (2012)

## Abstract

A two-stage classifier is used to improve the classification performance between normal and pathological voices. A primary classification between normal and pathological voices is achieved by the Gaussian mixture model (GMM) log-likelihood scores. For samples that do not meet the thresholds for normal or disordered voice in the GMM, the final decision is made by a higher-order statistics (HOS)-based parameter. The normalized skewness and kurtosis, and the means of the normalized skewness and kurtosis, were estimated using a sustained vowel /a/ from 53 normal and 173 pathological voices taken from the Disordered Voice Database. A Mel-frequency cepstral coefficients (MFCC)-based GMM, the HOS methods, and a two-stage classifier based on the GMM-HOS were applied to each voice signal. A Mann–Whitney rank sum test was used to detect differences in the means of the HOS-based parameters. A fivefold cross-validation scheme was used to test the classification method. With 16 Gaussian mixtures, the MFCC-based GMM algorithm achieved 92.0% accuracy. When the means of the normalized skewness and kurtosis were used, accuracies of 82.31% and 83.67% were obtained, respectively. The two-stage classifier with 16 Gaussian mixtures and the mean of the normalized kurtosis classified samples with 96.96% accuracy. The proposed two-stage classifier is more accurate than the MFCC-based GMM and HOS methods alone and shows potential for the classification of voices in the clinic.

## Introduction

Speech is integral to day-to-day communication. Speech impediments negatively impact social interactions, which has led to interest in the early detection and treatment of voice disorders. Many researchers have worked towards the goal of automatic and objective classification between normal and pathological voices using minimally invasive methods. A large amount of research has focused on the automatic detection of voice pathologies by means of acoustic analysis, parametric and non-parametric feature extraction, pattern recognition algorithms, and statistical methods [1–15].

Sáenz-Lechón et al. [5] presented an overview of previous classification schemes applied to the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database [16]. They described some methodological concerns to be considered when designing automatic systems for pathological voice detection. They recommended the use of a well-known commercial database, a cross-validation strategy based on several partitions to obtain averaged classification performances with confidence intervals, the reporting of results by means of detection error trade-off (DET) curves, and an investigation of the area under the receiver operating characteristic (ROC) curve.

Pattern classification algorithms such as the Gaussian mixture model (GMM), neural network (NN), and hidden Markov model have received attention as a potential means to discriminate between normal and pathological voices [6–13]. The GMM in particular has been reported as a very successful classification method [10–13]. Characteristic parameters, such as Mel-frequency cepstral coefficients (MFCC), have also become more popular for voice pathology detection [6, 8, 10–12]. Recently, Wang et al. [12] proposed a GMM supervector kernel-support vector machine (GMM-SVM) classifier, which was compared with the GMM classifier as a baseline algorithm. The GMM supervectors were largely effective parameters for the discrimination of normal and pathological voices. A classification accuracy of 96.1% was achieved by SVM classification of the 16 Gaussian GMM supervectors [12].

As an acoustic analysis method, higher-order statistics (HOS) have shown promising results in a number of signal processing applications, and are of particular value when dealing with a mixture of Gaussian and non-Gaussian processes and system nonlinearity [14, 15, 17, 18]. The application of HOS to speech processing has primarily been motivated by its inherent Gaussian suppression and phase preservation properties [17, 18]. Research in the disordered voice field has been based on the assumption that speech has non-zero HOS that is distinct from that of Gaussian noise [18, 19]. Alonso et al. [19] proposed seven new HOS-based parameters that were obtained directly or indirectly from the bispectrum of a voice frame. A success rate of 98.3% was obtained by using both the conventional and the HOS-based parameters with an NN classifier, demonstrating the possibility of automatically discriminating pathological from healthy voices using HOS parameters [19]. Further study of how well each HOS parameter can detect pathological voices, following the methodological designs recommended by Sáenz-Lechón et al. [5], is merited.

In this article, we make two proposals. First, we propose new HOS-based parameters implemented in the time domain. They are the means of the normalized skewness and kurtosis, which are calculated from each frame and averaged over a sentence. HOS-based parameters estimated in the time domain can easily be applied in a real-time environment, in contrast to the Fourier series representation of the HOS parameters in the frequency domain applied by Alonso et al. [19]. Second, we propose a two-stage approach to further improve the accuracy of the classification between normal and pathological voices. The classification system consists of an MFCC-based GMM algorithm, which provides the primary classification from the GMM log-likelihood scores, and an HOS-based parameter as a post-processor.

## Material and methods

### Material

Vocal signals were collected from the MEEI Voice Disorders Database [16]. Fifty-three normal and one hundred seventy-three pathological speakers with a wide range of organic, neurological, traumatic, and psychogenic voice disorders were selected. The selected subset is the same as the one described by Wang et al. [12], so that the results can be compared with that study. Voice samples were collected in a controlled environment and sampled at 50 or 25 kHz with 16 bits of resolution. Patients phonated a sustained /a/ (1–3 s). All voice data were down-sampled to 25 kHz and grouped into training (70% of the data) and test (30%) sets to implement all methods. Each set for the fivefold cross-validation scheme was randomly selected from the subset [10, 12].

### Statistical analysis

Statistical analysis was conducted using Sigma Stat 3.0 (Jandel Scientific, San Rafael, CA, USA). The Mann–Whitney rank sum test was performed to test the differences between normal and pathological voices for the normalized skewness, the normalized kurtosis, and the means of the normalized skewness and kurtosis. A significance level of *p* = 0.05 was used for all measures.
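As an illustration of what the rank sum test computes, the following is a minimal sketch in plain Python. It is not the SigmaStat procedure used in the study: the function name `mann_whitney_u` is hypothetical, the p-value comes from the normal approximation, and no tie or continuity correction is applied, all of which are simplifying assumptions.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-sided p-value by normal approximation)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # tied values share their average 1-based rank
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # assumption: no tie correction
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return u1, p

# Illustrative values only, not data from the study.
u_stat, p_value = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With samples as small as the toy example, an exact test would normally be preferred over the normal approximation.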

### MFCC-based GMM method

The voice samples were analyzed in 40-ms frames, each overlapping the previous frame by 20 ms, and multiplied by a Hamming window as in previous studies [12]. MFCC parameters were extracted and fed into a GMM-based detector that makes the final decision about the absence or presence of pathology. The number of filter banks was 38, and 36-dimensional feature vectors (18 MFCCs plus 18 delta-MFCCs, i.e., their first derivatives) were calculated for every frame, as in Wang et al. [12]. Cepstral mean subtraction was also used during the extraction to reduce the cepstral bias of the recording channel. The Linde–Buzo–Gray algorithm was used for GMM initialization, and GMMs with 8, 16, and 32 mixtures were trained with the expectation-maximization (EM) algorithm to determine the model parameters: mean vectors, covariance matrices, and mixture weights. For an utterance $X = \{x_1, x_2, \dots, x_T\}$, where *T* is the number of frames, the log-likelihood ratio (LLR), obtained by applying Bayes’ rule and disregarding the constant prior probabilities in the log domain, is presented in Equation (1).

where $\Lambda(X)$ is the LLR, and $\lambda_C$ and $\lambda_{\overline{C}}$ are GMM models for normal and pathological voices, respectively. Also *N* and *P* indicate normal and pathological voices, respectively.
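For reference, an LLR between two GMMs is conventionally written as the difference of the two model log-likelihoods. The form below is a reconstruction from the definitions in this section rather than a copy of the original Equation (1); in particular, summing the per-frame log-likelihoods assumes independent frames, which is the usual GMM convention but is an assumption here.

$$
\Lambda(X) = \log p(X \mid \lambda_C) - \log p(X \mid \lambda_{\overline{C}}), \qquad \log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda)
$$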

Figure 1 shows the histogram of the LLR estimated from normal and pathological voices in the training procedure. The decision threshold, $\Lambda_{NP}$, is then set to adjust the tradeoff between rejecting pathological voices (false rejection) and accepting normal voices (false acceptance). In the test procedure, the LLR, $\Lambda(X)$, is compared with the threshold $\Lambda_{NP}(X)$, and the voice is said to be pathological if $\Lambda(X) < \Lambda_{NP}(X)$ and normal if $\Lambda(X) > \Lambda_{NP}(X)$.
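The scoring and threshold rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the GMMs use diagonal covariances, each model is a hypothetical list of `(weight, mean, variance)` tuples, the toy parameters are invented, and a score exactly equal to the threshold (a case the text leaves open) is assigned to the pathological class.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log(sum_k w_k * N(x_t; mean_k, var_k))."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(comps)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total

def llr(frames, gmm_normal, gmm_pathological):
    """Log-likelihood ratio: normal model minus pathological model."""
    return gmm_log_likelihood(frames, gmm_normal) - gmm_log_likelihood(frames, gmm_pathological)

def decide(score, threshold):
    """Normal above the threshold, pathological otherwise (ties are an assumption)."""
    return "normal" if score > threshold else "pathological"

# Toy 2-D models and frames, purely illustrative.
gmm_normal = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
gmm_pathological = [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
score = llr(frames, gmm_normal, gmm_pathological)
label = decide(score, threshold=0.0)
```

In practice the frames would be the 36-dimensional MFCC vectors and the model parameters would come from EM training; neither step is shown here.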

### HOS method

A speech signal, *x*(*n*), which may be normal or pathological, can be expressed as given in Equation (2) [17].

where *s*(*n*) is a non-Gaussian signal generated by the oscillation of the vocal folds and *w*(*n*) is Gaussian noise which can be assumed to be zero in normal voices and not to be zero in pathological voices.
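Read together with the additive energy relation used in the following paragraph, Equation (2) is presumably the standard additive signal-plus-noise model. The line below is a reconstruction on that assumption:

$$
x(n) = s(n) + w(n)
$$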

Pathological voices are corrupted by noise, *w*(*n*), which is directly related to the perceived roughness of the voice [1–3, 20]. If *s*(*n*) and *w*(*n*) are statistically independent, then the energy of *x*(*n*) is the sum of the speech and noise energies: $E_x = E_s + E_w$. Second-order statistics are thus directly affected in an additive way by the presence of noise [17, 19]. However, when HOS analysis is applied to pathological voices, unstable and discontinuous statistics of *x*(*n*) may be estimated because HOS analysis is blind to Gaussian processes. On the other hand, in a normal voice, the HOS of only the non-Gaussian component may be extracted because the Gaussian noise can be assumed to be zero. The variation of the non-Gaussian signal produced by vibration of the vocal folds can be an important clue for the classification of pathological and normal voices.

If *x*(*n*), where *n* = 0, ±1, ±2, …, is a real stationary discrete-time signal and its moments up to order *p* exist, then its *p*th-order moment function is given by Equation (3).

It depends only on the time differences $\tau_1, \tau_2, \dots, \tau_{p-1}$, with $\tau_i = 0, \pm 1, \pm 2, \dots$ for all *i*. Here, *E*{·} denotes statistical expectation; for a deterministic signal, it is replaced by a time summation over all time samples or by time averaging. In addition, if the signal has zero mean, then its cumulant functions (up to order four) are given by Equation (4) [17].
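For orientation, the textbook forms of the moment function and of the cumulants of a zero-mean real stationary signal are given below. They follow the standard HOS literature and the definitions in this section; the superscript notation $m_p^x$, $c_p^x$ is an assumption and may differ from the original Equations (3) and (4).

$$
m_p^x(\tau_1, \dots, \tau_{p-1}) = E\{x(n)\,x(n+\tau_1)\cdots x(n+\tau_{p-1})\}
$$

$$
\begin{aligned}
c_1^x &= m_1^x = 0,\\
c_2^x(\tau_1) &= m_2^x(\tau_1),\\
c_3^x(\tau_1,\tau_2) &= m_3^x(\tau_1,\tau_2),\\
c_4^x(\tau_1,\tau_2,\tau_3) &= m_4^x(\tau_1,\tau_2,\tau_3) - m_2^x(\tau_1)\,m_2^x(\tau_3-\tau_2) - m_2^x(\tau_2)\,m_2^x(\tau_3-\tau_1) - m_2^x(\tau_3)\,m_2^x(\tau_2-\tau_1).
\end{aligned}
$$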

By setting all the lags to zero in the above cumulant expressions, we can obtain the variance, skewness, and kurtosis.
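Setting every lag to zero in the standard zero-mean cumulant expressions gives the following zero-lag values. This is a worked step under the same textbook definitions, written with cumulant symbols (an assumed notation) so as not to clash with the normalized quantities introduced next.

$$
\begin{aligned}
\text{variance:}\quad c_2^x(0) &= E\{x^2(n)\},\\
\text{skewness:}\quad c_3^x(0,0) &= E\{x^3(n)\},\\
\text{kurtosis:}\quad c_4^x(0,0,0) &= E\{x^4(n)\} - 3\left[E\{x^2(n)\}\right]^2.
\end{aligned}
$$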

When estimating HOS from finite data records, the variance of the estimators is reduced by normalizing the input data to have unit variance prior to computing the estimators. Equivalently, the third- and fourth-order statistics are normalized by the appropriate powers of the data variance; thus, we define the normalized skewness and kurtosis as shown in Equations (6) and (7) [17].
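A plausible reading of this normalization for a zero-mean signal is sketched below. The kurtosis is written in its non-excess form because the Results section compares the mean normalized kurtosis with 3; that choice is an assumption, and subtracting 3 would give the cumulant-based (excess) form instead.

$$
\gamma_3 = \frac{E\{x^3(n)\}}{\left[E\{x^2(n)\}\right]^{3/2}}, \qquad \gamma_4 = \frac{E\{x^4(n)\}}{\left[E\{x^2(n)\}\right]^{2}}
$$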

In this article, the normalized skewness and kurtosis are extracted in each frame as shown in Equation (8).

where $x_t$ is the speech sample value of the *t*th frame and *N* is the number of samples in the frame.

The proposed HOS-based parameters are the means of the normalized skewness and kurtosis, $\overline{\gamma}_3$ and $\overline{\gamma}_4$. They are estimated over an utterance from the frame-level values of $\gamma_3$ and $\gamma_4$, as described in Equation (9). As in the MFCC procedure, voice samples were cut into 40-ms frames shifted by 20 ms and multiplied by a Hamming window to extract the HOS-based parameters.

where $\gamma_{3t}$ and $\gamma_{4t}$ are $\gamma_3$ and $\gamma_4$ extracted in the *t*th frame, respectively, and *T* is the number of frames.
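The frame-level extraction and averaging can be sketched as follows in plain Python. Several details are assumptions rather than the article's exact estimators: the sample moments are centered and divided by the frame length, the kurtosis is the non-excess form, and the function names and the synthetic sine input are hypothetical. A frame with zero variance (pure silence) would cause a division by zero and is not handled.

```python
import math

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(x, frame_len, hop):
    """Split x into full-length frames that start every `hop` samples."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def normalized_skew_kurt(frame):
    """Normalized skewness and (non-excess) kurtosis of one frame."""
    n = len(frame)
    mu = sum(frame) / n
    m2 = sum((v - mu) ** 2 for v in frame) / n
    m3 = sum((v - mu) ** 3 for v in frame) / n
    m4 = sum((v - mu) ** 4 for v in frame) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def mean_hos(x, fs, frame_ms=40, hop_ms=20):
    """Average the per-frame values over all frames of the utterance."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = hamming(frame_len)
    stats = [
        normalized_skew_kurt([v * w for v, w in zip(f, win)])
        for f in frame_signal(x, frame_len, hop)
    ]
    count = len(stats)
    return sum(s for s, _ in stats) / count, sum(k for _, k in stats) / count

# Synthetic 100-Hz tone at 8 kHz, used only to exercise the functions.
fs = 8000
x = [math.sin(2.0 * math.pi * 100.0 * i / fs) for i in range(1600)]
gamma3_mean, gamma4_mean = mean_hos(x, fs)
```

Because every step is a running sum over samples, the same computation can be done frame by frame as audio arrives, which matches the real-time motivation given for working in the time domain.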

### Two-stage classifier based on GMM-HOS

The block diagram of the proposed algorithm is shown in Figure 2. In the training phase, the MFCCs are extracted from the voice samples for each analysis frame. $\overline{\Lambda}_N$ and $\overline{\Lambda}_P$ indicate the thresholds of the LLR estimated by each GMM for normal and pathological voices. An example is shown in Figure 3, which plots false acceptance and false rejection versus the LLR threshold. The two curves cross at the equal error rate (EER). $\overline{\Lambda}_N$ and $\overline{\Lambda}_P$ are the LLR values determined when the EER is 25.0%. The thresholds $\overline{\gamma}_{3,\mathit{thre}}$ and $\overline{\gamma}_{4,\mathit{thre}}$ are determined in advance as the values that produce the best results when $\overline{\gamma}_3$ or $\overline{\gamma}_4$ is used alone for pathological voice detection; that is, they are computed from the training data set.

In the test phase, the LLR, $\overline{\Lambda}$, is estimated with the feature vector and the pre-trained GMMs. The primary decision is made by the MFCC-based GMM algorithm. If $\overline{\Lambda}_P \le \overline{\Lambda} \le \overline{\Lambda}_N$, the voice sample is processed using the HOS operator, and the final decision is made after calculating $\overline{\gamma}_3$ and $\overline{\gamma}_4$. The values of $\overline{\gamma}_3$ and $\overline{\gamma}_4$ are used independently to classify normal and pathological voices. When $\overline{\gamma}_3$ is used, the voice is said to be pathological if $\overline{\gamma}_3 < \overline{\gamma}_{3,\mathit{thre}}$ and normal if $\overline{\gamma}_3 \ge \overline{\gamma}_{3,\mathit{thre}}$. When $\overline{\gamma}_4$ is used, the voice is said to be normal if $\overline{\gamma}_4 < \overline{\gamma}_{4,\mathit{thre}}$ and pathological if $\overline{\gamma}_4 \ge \overline{\gamma}_{4,\mathit{thre}}$.
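The decision flow of the two stages can be sketched as a small function. The names `two_stage_decision`, `llr_p`, and `llr_n` are hypothetical, the example thresholds are invented, and the behavior outside the uncertain band (normal above the upper threshold, pathological below the lower one) is inferred from the single-threshold rule of the GMM stage rather than stated explicitly in the text.

```python
def hos_decision_kurtosis(gamma4_mean, gamma4_thre):
    """Second stage with the mean normalized kurtosis."""
    return "normal" if gamma4_mean < gamma4_thre else "pathological"

def hos_decision_skewness(gamma3_mean, gamma3_thre):
    """Second stage with the mean normalized skewness."""
    return "pathological" if gamma3_mean < gamma3_thre else "normal"

def two_stage_decision(llr, llr_p, llr_n, gamma4_mean, gamma4_thre):
    """Return (label, stage) where stage names the classifier that decided."""
    if llr > llr_n:
        return "normal", "gmm"        # confidently normal (assumed rule)
    if llr < llr_p:
        return "pathological", "gmm"  # confidently pathological (assumed rule)
    # llr_p <= llr <= llr_n: the GMM score is uncertain, so defer to HOS.
    return hos_decision_kurtosis(gamma4_mean, gamma4_thre), "hos"

# Illustrative thresholds and scores only.
confident = two_stage_decision(5.0, -1.0, 1.0, gamma4_mean=2.0, gamma4_thre=3.0)
uncertain = two_stage_decision(0.2, -1.0, 1.0, gamma4_mean=3.5, gamma4_thre=3.0)
```

Returning the deciding stage alongside the label makes it easy to count how many test samples actually reach the HOS post-processor.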

## Results

### MFCC-based GMM method

The performance was assessed by averaging the results obtained from the fivefold cross-validation scheme [10, 12]. Table 1 shows the confusion matrix, accuracy (%) including 95% confidence intervals (CIs), specificity (%), sensitivity (%), and area under the curve (AUC) according to the number of Gaussian mixtures. Specificity and sensitivity are the test’s ability to identify negative and positive results, respectively. The accuracy is the proportion of true results (both true positives and true negatives) in the population. The GMM models were trained using 8, 16, and 32 mixtures. The best average accuracy, 92.00%, was obtained with 16 Gaussian mixtures. The result using 32 Gaussian mixtures was also better than the one obtained with 8 Gaussian mixtures. Figure 4 shows the ROC curve of the MFCC-based GMM method for the configuration with the best accuracy. The EER of the MFCC-based GMM method is shown in Figure 5.
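The three summary measures defined above follow directly from the confusion-matrix counts. The sketch below treats pathological as the positive class, which is an assumption about the article's convention, and the counts are invented for illustration rather than taken from Table 1.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp/fn: pathological voices classified correctly/incorrectly (assumed positive class).
    tn/fp: normal voices classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts for one test fold.
accuracy, sensitivity, specificity = classification_metrics(tp=45, fn=5, tn=18, fp=2)
```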

### HOS method

Figure 6 presents the distributions of $\overline{\gamma}_3$ and $\overline{\gamma}_4$ for normal and pathological voices. The distributions of $\overline{\gamma}_3$ and $\overline{\gamma}_4$ for pathological voices tended to be skewed to the left and leptokurtic ($\overline{\gamma}_4 > 3$). For normal voices, these distributions tended to be skewed to the right and platykurtic ($\overline{\gamma}_4 < 3$) overall. The distributions of pathological voices tended to show larger variation than those of normal voices. A Mann–Whitney rank sum test showed a statistically significant difference between normal and disordered voices for $\overline{\gamma}_3$ and $\overline{\gamma}_4$ (*p* < 0.001).

Fivefold cross-validation was used to estimate the performance of each parameter. When $\overline{\gamma}_3$ and $\overline{\gamma}_4$ were used to classify normal and pathological voices, average accuracies of 82.31% and 83.67% were obtained, respectively. Table 2 shows the confusion matrix, accuracy (%) including 95% CIs, specificity (%), sensitivity (%), and AUC when the means of the normalized skewness and kurtosis are used to classify normal and pathological voices. The accuracy of the mean of the normalized kurtosis was higher than that of the mean of the normalized skewness. The ROC curve of the mean of the normalized kurtosis is shown in Figure 4. In Figure 5, the DET curve shows the EER for the mean of the normalized kurtosis. The MFCC-based GMM method outperformed the HOS method.

### Two-stage classifier based on GMM-HOS

Table 3 shows the confusion matrix, accuracy (%) including 95% CIs, specificity (%), sensitivity (%), and AUC when the two-stage classifier is used. The results were measured with fivefold cross-validation, as for the MFCC-based GMM and HOS methods. The best performance, 96.96%, was obtained when 16 Gaussian mixtures and the mean of the normalized kurtosis were used as the classifier. In general, when the mean of the normalized kurtosis was used as the second classifier, the performance was higher than with the mean of the normalized skewness.

The ROC and DET curves of the two-stage classifier using 16 Gaussian mixtures and the mean of the normalized kurtosis are shown in Figures 4 and 5. The AUC of the method was larger than those of the 16-mixture GMM and of the mean of the normalized kurtosis used independently. The EER of the two-stage classifier was 3.04%, as shown in Figure 5.
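An EER such as the one read from the DET curve can be approximated from raw scores by sweeping the decision threshold until the two error rates meet. The sketch below is a simple illustration under assumptions: higher scores mean "more normal", candidate thresholds are taken from the observed scores, the EER is reported as the average of the two error rates at the closest crossing, and the scores are invented.

```python
def equal_error_rate(normal_scores, pathological_scores):
    """Return (eer, threshold) at the point where the two error rates are closest."""
    candidates = sorted(set(normal_scores) | set(pathological_scores))
    best = None
    for thr in candidates:
        # Decision rule: normal if score > thr, pathological otherwise.
        normal_missed = sum(s <= thr for s in normal_scores) / len(normal_scores)
        path_missed = sum(s > thr for s in pathological_scores) / len(pathological_scores)
        gap = abs(normal_missed - path_missed)
        if best is None or gap < best[0]:
            best = (gap, (normal_missed + path_missed) / 2.0, thr)
    return best[1], best[2]

# Illustrative scores only.
eer, eer_threshold = equal_error_rate(
    normal_scores=[2.0, 3.0, 4.0, 5.0],
    pathological_scores=[-1.0, 0.0, 1.0, 2.5],
)
```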

## Conclusion and discussion

In this article, we define a two-stage technique to discriminate pathological from normal voices. The proposed model consists of two parts: an MFCC-based GMM algorithm, which provides the primary decision from the LLR scores of the GMM, and a post-processing HOS analysis block that uses the means of the normalized skewness and kurtosis. The characteristics of the MFCC for normal and pathological voices are presented in the study of Godino-Llorente et al. [10]. A statistically significant difference in the HOS-based parameters (the means of the normalized skewness and kurtosis) between normal and pathological voices is demonstrated (*p* < 0.001). By introducing these parameters in cases where the MFCC-based GMM algorithm returns uncertain values, the classification can be improved. Following the recommendations of Sáenz-Lechón et al. [5], we used a well-known commercial database. Classification performance along with CIs was obtained by a cross-validation strategy based on several partitions. Results are also described by DET curves and AUC.

The two-stage classifier outperformed the individual classification schemes. The best performance, 96.96%, is achieved by combining an MFCC-based GMM algorithm with 16 Gaussian mixtures and the mean of the normalized kurtosis. The MFCC-based GMM algorithm with 16 Gaussian mixtures performed at 92.0%, while the means of the normalized skewness and kurtosis correctly classified 82.31% and 83.67% of samples, respectively. A false decision is occasionally caused by erroneous EM-based GMM estimation in the regions where the two models intersect and both have somewhat low likelihoods. We therefore believe that the performance improvement is mainly because the two-stage classifier resolves the false decisions caused by low log-likelihood values.

Many studies have presented a variety of approaches [1–3, 6–13, 19]. Different datasets and evaluation procedures make it difficult to compare the results of previous studies [5], but several are worth noting. Godino-Llorente and Gómez-Vilda [8] presented experimental results using the learning vector quantization methodology, yielding 96% frame accuracy. Afterwards, Godino-Llorente et al. [10] reported an accuracy of 94.07% with 24 MFCC parameters and a GMM of 6 mixtures. A classification performance of 98.3% was obtained by Alonso et al. [19] using both perturbation parameters and seven HOS-based parameters with an NN classifier. Recently, Wang et al. [12] proposed a GMM-SVM classifier with a classification performance of 96.1%. Finally, in this article, we combined the MFCC-based GMM method utilized by Godino-Llorente et al. [8, 10, 11] and Wang et al. [12] with HOS parameters to obtain a performance of 96.96%.

The automatic classification between normal and pathological voices remains an open problem that calls for reliable algorithms to aid clinicians. When the information gathered from simple physically informed GMMs is combined with HOS-based parameters, a valuable classifier can be obtained. This two-stage method can be used for the analysis and assessment of voice quality.

## References

1. de Oliveira Rosa M, Pereira JC, Grellet M: Adaptive estimation of residual signal for voice pathology diagnosis. *IEEE Trans. Biomed. Eng.* 2000, 47(1):96-104. 10.1109/10.817624
2. Shama K, Krishna A, Cholayya NU: Study of harmonics-to-noise ratio and critical band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology. *EURASIP J. Adv. Signal Process.* 2007, 1997(1):50-59.
3. Godino-Llorente JI, Osma-Ruiz V, Sáenz-Lechón N, Gómez-Vilda P, Blanco-Velasco M, Cruz-Roldán F: The effectiveness of the glottal to noise excitation ratio for the screening of voice disorders. *J. Voice* 2010, 24(1):47-56. 10.1016/j.jvoice.2008.04.006
4. Khadivi Heris H, Seyed Aghazadeh B, Nikkhah-Bahrami M: Optimal feature selection for the assessment of vocal fold disorders. *Comput. Biol. Med.* 2009, 39(10):860-868. 10.1016/j.compbiomed.2009.06.014
5. Sáenz-Lechón N, Godino-Llorente JI, Osma-Ruiz V, Gómez-Vilda P: Methodological issues in the development of automatic systems for voice pathology detection. *Biomed. Signal Process. Control* 2006, 1(2):120-128. 10.1016/j.bspc.2006.06.003
6. Godino-Llorente JI, Aguilera-Navarro S, Gómez-Vilda P: Non-supervised neural net applied to the detection of voice impairment. In *Proceeding of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*. 6, Istanbul, Turkey; 2000:3594-3597.
7. Hadjitodorov S, Mitev P: A computer system for acoustic analysis of pathological voices and laryngeal disease screening. *Med. Eng. Phys.* 2002, 24(6):419-429. 10.1016/S1350-4533(02)00031-0
8. Godino-Llorente JI, Gómez-Vilda P: Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. *IEEE Trans. Biomed. Eng.* 2004, 51(2):380-384. 10.1109/TBME.2003.820386
9. Umapathy K, Krishman S, Parsa V, Jamieson DG: Discrimination of pathological voices using a time-frequency approach. *IEEE Trans. Biomed. Eng.* 2005, 52(3):421-430. 10.1109/TBME.2004.842962
10. Godino-Llorente JI, Aguilera-Navarro S, Gómez-Vilda P: Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. *IEEE Trans. Biomed. Eng.* 2006, 53(10):1943-1953. 10.1109/TBME.2006.871883
11. Godino-Llorente JI, Sáenz-Lechón N, Osma-Ruiz V, Aguilera-Navarro S, Gómez-Vilda P: An integrated tool for the diagnosis of voice disorders. *Med. Eng. Phys.* 2006, 28(3):276-289. 10.1016/j.medengphy.2005.04.014
12. Wang X, Zhang J, Yan Y: Discrimination between pathological and normal voices using GMM-SVM approach. *J. Voice* 2010, 25(1):38-43.
13. Das R: A comparison of multiple classification methods for diagnosis of Parkinson disease. *Expert Syst. Appl.* 2010, 37(2):1568-1572. 10.1016/j.eswa.2009.06.040
14. Nikias C, Mendel J: Signal processing with higher-order statistics spectra. *IEEE Signal Process. Mag.* 1993, 10(3):10-37.
15. Mendel JM: Tutorial on higher-order statistics in signal processing and system theory: theoretical results and some applications. *Proc. IEEE* 1991, 79(3):278-305. 10.1109/5.75086
16. Massachusetts Eye and Ear Infirmary: *Voice Disorders Database, Version. 1.03 [CD-ROM]*. Kay Elemetrics Corp, Lincoln Park, NJ; 1994.
17. Nemer E, Goubran R, Mahmoud S: Robust voice activity detection using higher-order statistics in the LPC residual domain. *IEEE Trans. Speech Audio Process.* 2001, 9(3):217-231. 10.1109/89.905996
18. Lee JY, Minsoo H: Automatic assessment of pathological voice quality using higher-order statistics in the LPC residual domain. *EURASIP J. Adv. Signal Process.* 2009. Article ID 748207, 8 pages
19. Alonso JB, de Leon J, Alonso I, Ferrer MA: Automatic detection of pathologies in the voice by HOS based parameters. *EURASIP J. Appl. Signal Process.* 2001, 2001(4):275-284. 10.1155/S1110865701000336
20. Kent RD, Ball MJ: *Voice Quality Measurement (Thomson Learning)*. New York; 2000.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011–0025153). The author would like to thank Minsoo Hahn and Sangbae Jeong for their guidance, advice, and kind support.

## Additional information

### Competing interests

The author declares that she has no competing interests.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Lee, J.Y. A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices.
*EURASIP J. Adv. Signal Process.* **2012**, 252 (2012). https://doi.org/10.1186/1687-6180-2012-252

DOI: https://doi.org/10.1186/1687-6180-2012-252