- Research
- Open Access
A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices
- Ji Yeoun Lee^{1}
https://doi.org/10.1186/1687-6180-2012-252
© Lee; licensee Springer. 2012
- Received: 5 March 2012
- Accepted: 30 October 2012
- Published: 30 November 2012
Abstract
A two-stage classifier is used to improve the classification of normal and pathological voices. A primary classification between normal and pathological voices is made from the Gaussian mixture model (GMM) log-likelihood scores. For samples that do not meet the GMM thresholds for either normal or disordered voice, the final decision is made by a higher-order statistics (HOS)-based parameter. The normalized skewness and kurtosis, and the means of the normalized skewness and kurtosis, were estimated using a sustained vowel /a/ from 53 normal and 173 pathological voices taken from the Disordered Voice Database. A Mel-frequency cepstral coefficients (MFCC)-based GMM, the HOS methods, and a two-stage classifier based on GMM-HOS were applied to each voice signal. A Mann–Whitney rank sum test was used to detect differences in the HOS-based parameters between normal and pathological voices. A fivefold cross-validation scheme was used to test the classification methods. With 16 Gaussian mixtures, the MFCC-based GMM algorithm achieved 92.0% accuracy. The means of the normalized skewness and kurtosis achieved accuracies of 82.31 and 83.67%, respectively. The two-stage classifier with 16 Gaussian mixtures and the mean of the normalized kurtosis achieved 96.96% accuracy. The proposed two-stage classifier is more accurate than the MFCC-based GMM and HOS methods alone and shows potential for the classification of voices in the clinic.
Keywords
- Pathological voice detection
- Higher-order statistics
- Gaussian mixture model
- Two-stage classifier
Introduction
Speech is integral to day-to-day communication. Speech impediments negatively impact social interactions leading to interest in early detection and treatment of voice disorders. Many researchers have worked towards the goal of automatic and objective classification between normal and pathological voices using minimally invasive methods. A large amount of research has focused on the automatic detection of voice pathologies by means of acoustic analysis, parametric and non-parametric feature extraction, pattern recognition algorithms, and statistical methods [1–15].
Sáenz-Lechón et al. [5] presented an overview of previous classification schemes applied to the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database [16]. They described methodological concerns to be considered when designing automatic systems for pathological voice detection. They recommended using a well-known, commercially available database; a cross-validation strategy based on several partitions, so that averaged classification performances can be reported with confidence intervals; reporting results by means of detection error trade-off (DET) curves; and examining the area under the receiver operating characteristic (ROC) curve.
Pattern classification algorithms such as the Gaussian mixture model (GMM), neural network (NN), and hidden Markov model have received attention as potential means to discriminate between normal and pathological voices [6–13]. The GMM in particular has been reported as a very successful classification method [10–13]. Characteristic parameters, such as Mel-frequency cepstral coefficients (MFCC), have also become more popular for voice pathology detection [6, 8, 10–12]. Recently, Wang et al. [12] proposed a GMM supervector kernel-support vector machine (GMM-SVM) classifier, which was compared with the GMM classifier as a baseline algorithm. The GMM supervectors were effective parameters for the discrimination of normal and pathological voices. A classification accuracy of 96.1% was achieved by SVM classification of the 16-Gaussian GMM supervectors [12].
As an acoustic analysis method, higher-order statistics (HOS) have shown promising results in a number of signal processing applications, and are of particular value when dealing with a mixture of Gaussian and non-Gaussian processes and system nonlinearity [14, 15, 17, 18]. The application of HOS to speech processing has primarily been motivated by its inherent Gaussian suppression and phase preservation properties [17, 18]. Research in the disordered voice field has been based on the assumption that speech has non-zero HOS that are distinct from those of Gaussian noise [18, 19]. Alonso et al. [19] proposed seven new HOS-based parameters that were obtained directly or indirectly from the bispectrum of a voice frame. A success rate of 98.3% was obtained by using both the conventional and the HOS-based parameters with an NN classifier, demonstrating the possibility of automatically discriminating pathological from healthy voices using HOS parameters [20]. Further study of how well each HOS parameter can detect pathological voices under the methodological designs recommended by Sáenz-Lechón et al. [5] is merited.
In this article, we first propose new HOS-based parameters implemented in the time domain. They are the means of the normalized skewness and kurtosis, which are calculated for each frame and averaged over all frames of the utterance. HOS-based parameters estimated in the time domain can easily be applied in a real-time environment, in contrast to the Fourier series representation of the HOS parameters in the frequency domain used by Alonso et al. [19]. Second, we propose a two-stage approach to further improve the accuracy of the classification between normal and pathological voices. The classification system consists of an MFCC-based GMM algorithm, which makes the primary classification from the GMM log-likelihood scores, and an HOS-based parameter as a post-processor.
Material and methods
Material
Vocal signals were collected from the MEEI Voice Disorders Database [16]. Fifty-three normal and one hundred seventy-three pathological speakers with a wide range of organic, neurological, traumatic, and psychogenic voice disorders were selected. The subset is the same as the one described by Wang et al. [12], so that the results can be compared with that study. Voice samples were collected in a controlled environment and sampled at 50 or 25 kHz with 16 bits of resolution. Patients phonated a sustained /a/ (1–3 s). All voice data were down-sampled to 25 kHz and grouped into training (70% of the data) and test (30%) sets to implement all methods. Each set for the fivefold cross-validation scheme was randomly selected from the subset [10, 12].
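The partitioning described above can be sketched as follows. This is a minimal illustration, not the author's procedure: it reads "fivefold" as five independent random 70/30 partitions, and it assumes the split is drawn separately within each class (stratified), which the text does not state. The function name, the labels "N"/"P", and the seed are hypothetical.

```python
import random


def stratified_splits(labels, n_splits=5, train_fraction=0.7, seed=0):
    """Return n_splits (train_indices, test_indices) pairs.

    Assumption: each partition is drawn independently and per class,
    so both classes keep roughly the same 70/30 proportion.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    splits = []
    for _ in range(n_splits):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            cut = round(train_fraction * len(shuffled))
            train.extend(shuffled[:cut])
            test.extend(shuffled[cut:])
        splits.append((sorted(train), sorted(test)))
    return splits


# 53 normal and 173 pathological speakers, as in the subset described above.
labels = ["N"] * 53 + ["P"] * 173
splits = stratified_splits(labels)
```

Performance would then be averaged over the five partitions, which is what allows a mean accuracy with a spread to be reported.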
Statistical analysis
Statistical analysis was conducted using Sigma Stat 3.0 (Jandel Scientific, San Rafael, CA, USA). The Mann–Whitney rank sum test was performed to test the differences between normal and pathological voices for the normalized skewness, the normalized kurtosis, and the means of the normalized skewness and kurtosis. A significance level of p < 0.05 was used for all measures.
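For readers unfamiliar with the rank sum test, the following is a small self-contained sketch of the statistic. It is not the Sigma Stat implementation: it uses the normal approximation without tie or continuity corrections, and the sample values are toy numbers invented for illustration.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns the U statistic of sample a and the approximate p-value.
    """
    pooled = sorted(
        (value, group) for group, sample in enumerate((a, b)) for value in sample
    )
    # Assign average ranks to tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return u_a, p_value


# Toy values only; they are not measurements from the database.
u, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
significant = p < 0.05
```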
MFCC-based GMM method
where Λ(X) is the log-likelihood ratio (LLR), and ${\lambda}_{C}$ and ${\lambda}_{\overline{C}}$ are the GMM models for normal and pathological voices, respectively. N and P indicate normal and pathological voices, respectively.
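The equation to which these definitions belong is not reproduced in this copy. A conventional form that is consistent with the symbols defined above is given here as an assumption, not as the author's exact formulation. The feature sequence $X = \{x_1, \dots, x_T\}$, the mixture weights $w_i$, the component parameters $\mu_i$ and $\Sigma_i$, the number of mixtures $M$, and the decision threshold $\theta$ are notation introduced for this sketch.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i),
\qquad
\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda)

\Lambda(X) = \log p(X \mid \lambda_{C}) - \log p(X \mid \lambda_{\overline{C}}),
\qquad
\text{decide } N \text{ if } \Lambda(X) \geq \theta, \text{ otherwise } P
```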
HOS method
The voice signal is modeled as x(n) = s(n) + w(n), where s(n) is a non-Gaussian signal generated by the oscillation of the vocal folds and w(n) is Gaussian noise, which can be assumed to be zero in normal voices and non-zero in pathological voices.
Pathological voices are corrupted by noise, w(n), which is directly related to the perceived roughness of the voice [1–3, 20]. If s(n) and w(n) are statistically independent, then the energy of x(n) is the sum of the speech and noise energies: E_{x} = E_{s} + E_{w}. Second-order statistics are thus directly affected in an additive way by the presence of noise [17, 19]. However, when HOS analysis is applied to pathological voices, unstable and discontinuous statistics of x(n) may be estimated because HOS analysis is blind to Gaussian processes. In a normal voice, on the other hand, the HOS of the non-Gaussian component alone may be extracted because the Gaussian noise can be assumed to be zero. The variation of the non-Gaussian signal produced by vibration of the vocal folds can therefore be an important clue for the classification of pathological and normal voices.
where x_{t} is the speech sample value of the t-th frame and N is the number of samples.
where γ_{3t} and γ_{4t} are γ_{3} and γ_{4} extracted in the t-th frame, respectively, and T is the number of frames.
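The equations for γ_{3}, γ_{4}, and their means are not reproduced in this copy. The sketch below assumes the standard definitions: the third and fourth central moments of a frame divided by the cube and fourth power of its standard deviation, then averaged over the T frames. Whether the author subtracts 3 from the kurtosis is not known, so no offset is applied. The frame length, hop size, and function names are hypothetical.

```python
import math


def normalized_skewness_kurtosis(frame):
    """Return (gamma_3, gamma_4) for one frame.

    Assumption: gamma_3 = m3 / m2**1.5 and gamma_4 = m4 / m2**2, where m_k is
    the k-th central moment over the N samples of the frame (no "-3" offset).
    """
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m3 = sum((x - mean) ** 3 for x in frame) / n
    m4 = sum((x - mean) ** 4 for x in frame) / n
    if m2 == 0.0:
        raise ValueError("frame has zero variance; HOS are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def mean_hos(signal, frame_length, hop):
    """Average the per-frame gamma_3 and gamma_4 over all T complete frames."""
    starts = range(0, len(signal) - frame_length + 1, hop)
    stats = [normalized_skewness_kurtosis(signal[s:s + frame_length]) for s in starts]
    t = len(stats)
    mean_gamma3 = sum(g3 for g3, _ in stats) / t
    mean_gamma4 = sum(g4 for _, g4 in stats) / t
    return mean_gamma3, mean_gamma4


# Synthetic sinusoid standing in for a sustained vowel (not real voice data).
signal = [math.sin(2.0 * math.pi * k / 100.0) for k in range(400)]
mean_gamma3, mean_gamma4 = mean_hos(signal, frame_length=100, hop=100)
```

A pure sinusoid gives a skewness near 0 and a kurtosis near 1.5 under these definitions, which makes it a convenient sanity check before applying the functions to recorded speech.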
Two-stage classifier based on GMM-HOS
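Going by the description in the abstract and introduction, the two-stage decision can be sketched as follows. This is an illustration under stated assumptions: a high LLR is taken to favor the normal class, the two GMM thresholds and the HOS threshold are hypothetical values that would have to be tuned on training data, and the direction in which the HOS parameter separates the classes is left as a parameter because the text does not state it.

```python
def two_stage_decision(llr, hos_value, normal_threshold, pathological_threshold,
                       hos_threshold, normal_if_hos_below=True):
    """Return "N" (normal) or "P" (pathological).

    Stage 1: accept the GMM decision when the LLR clears either threshold.
    Stage 2: otherwise fall back to the HOS-based parameter.
    Assumes normal_threshold > pathological_threshold.
    """
    if llr >= normal_threshold:
        return "N"
    if llr <= pathological_threshold:
        return "P"
    # The sample lies in the uncertain band between the two thresholds.
    hos_says_normal = (hos_value < hos_threshold) == normal_if_hos_below
    return "N" if hos_says_normal else "P"


# Hypothetical thresholds and scores, for illustration only.
confident = two_stage_decision(2.4, 3.1, 1.0, -1.0, 3.0)
uncertain = two_stage_decision(0.2, 3.1, 1.0, -1.0, 3.0)
```

Only samples in the uncertain band reach the second stage, so confident GMM decisions are left unchanged by the HOS post-processor.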
Results
MFCC-based GMM method
Performance of the MFCC-based GMM method
| Method | Confusion matrix | | Accuracy (%) | Specificity (%) | Sensitivity (%) | AUC (%) |
|---|---|---|---|---|---|---|
| GMM 8 mixtures | 89.58 | 10.42 | 89.35 ± 3.00 | 89.67 | 89.53 | 95.95 |
| | 10.88 | 89.12 | | | | |
| GMM 16 mixtures | 89.58 | 8.33 | 92.00 ± 4.79 | 92.27 | 91.72 | 98.59 |
| | 7.68 | 92.32 | | | | |
| GMM 32 mixtures | 89.58 | 10.42 | 90.31 ± 4.09 | 90.90 | 89.73 | 96.02 |
| | 8.96 | 91.04 | | | | |
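The accuracy, specificity, and sensitivity columns follow the usual definitions for a two-class confusion matrix. The sketch below shows those definitions only; it assumes that "pathological" is the positive class, and the counts are invented for illustration rather than taken from the table.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute percentages from confusion-matrix counts.

    tn, fp: normal voices classified as normal / pathological.
    fn, tp: pathological voices classified as normal / pathological.
    Assumption: pathological is treated as the positive class.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": 100.0 * (tn + tp) / total,
        "specificity": 100.0 * tn / (tn + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }


# Invented counts for a 68-sample test set; not the article's data.
metrics = binary_metrics(tn=14, fp=2, fn=4, tp=48)
```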
HOS method
Performance of the means of the normalized skewness and kurtosis
| Method | Confusion matrix | | Accuracy (%) | Specificity (%) | Sensitivity (%) | AUC (%) |
|---|---|---|---|---|---|---|
| Mean of the normalized skewness | 81.25 | 18.75 | 82.31 ± 5.67 | 83.00 | 81.64 | 89.36 |
| | 16.64 | 83.36 | | | | |
| Mean of the normalized kurtosis | 83.33 | 16.67 | 83.67 ± 6.89 | 83.89 | 83.49 | 92.38 |
| | 16.00 | 84.00 | | | | |
Two-stage classifier based on GMM-HOS
Performance of the proposed two-stage classifiers
| Method | | Confusion matrix | | Accuracy (%) | Specificity (%) | Sensitivity (%) | AUC (%) |
|---|---|---|---|---|---|---|---|
| GMM 8 mixtures | Mean of the normalized skewness (${\overline{\gamma}}_{3}$) | 93.75 | 6.25 | 94.00 ± 1.67 | 94.21 | 93.78 | 99.69 |
| | | 5.76 | 94.24 | | | | |
| | Mean of the normalized kurtosis (${\overline{\gamma}}_{4}$) | 93.75 | 6.25 | 94.00 ± 1.67 | 94.21 | 93.78 | 99.69 |
| | | 5.76 | 94.24 | | | | |
| GMM 16 mixtures | Mean of the normalized skewness (${\overline{\gamma}}_{3}$) | 95.83 | 4.17 | 96.64 ± 4.09 | 97.43 | 95.90 | 99.69 |
| | | 2.56 | 97.44 | | | | |
| | Mean of the normalized kurtosis (${\overline{\gamma}}_{4}$) | 95.83 | 4.17 | 96.96 ± 4.79 | 98.04 | 95.92 | 99.95 |
| | | 1.92 | 98.08 | | | | |
| GMM 32 mixtures | Mean of the normalized skewness (${\overline{\gamma}}_{3}$) | 93.75 | 6.25 | 94.64 ± 1.92 | 95.44 | 93.86 | 99.69 |
| | | 4.48 | 95.52 | | | | |
| | Mean of the normalized kurtosis (${\overline{\gamma}}_{4}$) | 95.83 | 4.17 | 96.00 ± 3.13 | 96.15 | 95.84 | 99.87 |
| | | 3.84 | 96.16 | | | | |
The ROC and DET curves of the two-stage classifier using 16 Gaussian mixtures and the mean of the normalized kurtosis are shown in Figures 4 and 5. The AUC of the two-stage method was larger than that of either the 16-mixture GMM or the mean of the normalized kurtosis used independently. The equal error rate (EER) of the two-stage classifier, read from Figure 5, was 3.04%.
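The EER is the operating point on the DET curve at which the two error rates are equal. A minimal sketch of how it can be estimated from classifier scores follows. It assumes that higher scores favor the normal class and approximates the EER at the observed threshold where the two error rates are closest; the scores are toy values, not outputs of the classifier described here.

```python
def equal_error_rate(normal_scores, pathological_scores):
    """Estimate the EER by sweeping a threshold over the observed scores.

    A sample is classified as normal when its score is >= the threshold.
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    thresholds = sorted(set(normal_scores) | set(pathological_scores))
    best = None
    for threshold in thresholds:
        # Normal voices wrongly classified as pathological.
        normal_error = sum(s < threshold for s in normal_scores) / len(normal_scores)
        # Pathological voices wrongly classified as normal.
        pathological_error = (
            sum(s >= threshold for s in pathological_scores) / len(pathological_scores)
        )
        gap = abs(normal_error - pathological_error)
        if best is None or gap < best[0]:
            best = (gap, (normal_error + pathological_error) / 2.0, threshold)
    return best[1], best[2]


# Toy scores for illustration only.
eer, threshold = equal_error_rate([2, 3, 4, 5], [-1, 0, 1, 2.5])
```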
Conclusion and discussion
In this article, we define a two-stage technique to discriminate pathological from normal voices. The proposed model comprises two parts: an MFCC-based GMM algorithm, which makes the primary decision from the LLR scores of the GMM, and a post-processing HOS analysis block that uses the means of the normalized skewness and kurtosis. The characteristics of the MFCC for normal and pathological voices are presented in the study of Godino-Llorente et al. [10]. A strong relationship between the HOS coefficients (the normalized skewness and kurtosis) and voice class is demonstrated (p < 0.001). By introducing these parameters in cases where the MFCC-based GMM algorithm returns uncertain values, the classification can be improved. Following the recommendations of Sáenz-Lechón et al. [5], we used a well-known, commercially available database. Classification performance, along with confidence intervals, was obtained by a cross-validation strategy based on several partitions. Results are also reported using DET curves and AUC.
The two-stage classifier outperformed the individual classification schemes. The best performance, 96.96%, was achieved by combining an MFCC-based GMM algorithm with 16 Gaussian mixtures and the mean of the normalized kurtosis. The MFCC-based GMM algorithm with 16 Gaussian mixtures alone reached 92.0%, while the means of the normalized skewness and kurtosis classified 82.31 and 83.67% of samples correctly, respectively. A false decision is occasionally caused by erroneous expectation-maximization (EM)-based GMM estimation in the regions where the two models intersect and the likelihoods are somewhat low. The performance improvement is therefore believed to come mainly from the two-stage classifier resolving the false decisions caused by low log-likelihood values.
Many studies have presented a variety of approaches [1–3, 6–13, 19]. Different datasets and evaluation procedures make it difficult to compare the results of previous studies [5], but some reference points are available. Godino-Llorente and Gómez-Vilda [8] reported 96% frame accuracy using the learning vector quantization methodology. Godino-Llorente et al. [10] later reported an accuracy of 94.07% with 24 MFCC parameters and a GMM of 6 mixtures. Alonso et al. [19] obtained a classification performance of 98.3% by using both perturbation parameters and seven HOS-based parameters with an NN classifier. Recently, Wang et al. [12] proposed a GMM-SVM classifier with a classification performance of 96.1%. In this article, we combined the MFCC-based GMM method used by Godino-Llorente et al. [8, 10, 11] and Wang et al. [12] with HOS parameters to obtain a performance of 96.96%.
The automatic classification between normal and pathological voices remains an open problem that calls for reliable algorithms to aid the clinicians. When the information gathered from simple physically informed GMMs is combined with HOS-based parameters, a valuable classifier can be obtained. This two-stage method can be used for the analysis and assessment of voice quality.
Declarations
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011–0025153). The author would like to thank Minsoo Hahn and Sangbae Jeong for their guidance, advice, and kind support.
Authors’ Affiliations
References
- de Oliveira Rosa M, Pereira JC, Grellet M: Adaptive estimation of residual signal for voice pathology diagnosis. IEEE Trans. Biomed. Eng. 2000, 47(1):96-104. 10.1109/10.817624
- Shama K, Krishna A, Cholayya NU: Study of harmonics-to-noise ratio and critical band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology. EURASIP J. Adv. Signal Process. 2007, 1997(1):50-59.
- Godino-Llorente JI, Osma-Ruiz V, Sáenz-Lechón N, Gómez-Vilda P, Blanco-Velasco M, Cruz-Roldán F: The effectiveness of the glottal to noise excitation ratio for the screening of voice disorders. J. Voice 2010, 24(1):47-56. 10.1016/j.jvoice.2008.04.006
- Khadivi Heris H, Seyed Aghazadeh B, Nikkhah-Bahrami M: Optimal feature selection for the assessment of vocal fold disorders. Comput. Biol. Med. 2009, 39(10):860-868. 10.1016/j.compbiomed.2009.06.014
- Sáenz-Lechón N, Godino-Llorente JI, Osma-Ruiz V, Gómez-Vilda P: Methodological issues in the development of automatic systems for voice pathology detection. Biomed. Signal Process. Control 2006, 1(2):120-128. 10.1016/j.bspc.2006.06.003
- Godino-Llorente JI, Aguilera-Navarro S, Gómez-Vilda P: Non-supervised neural net applied to the detection of voice impairment. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 6, Istanbul, Turkey; 2000:3594-3597.
- Hadjitodorov S, Mitev P: A computer system for acoustic analysis of pathological voices and laryngeal disease screening. Med. Eng. Phys. 2002, 24(6):419-429. 10.1016/S1350-4533(02)00031-0
- Godino-Llorente JI, Gómez-Vilda P: Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. IEEE Trans. Biomed. Eng. 2004, 51(2):380-384. 10.1109/TBME.2003.820386
- Umapathy K, Krishman S, Parsa V, Jamieson DG: Discrimination of pathological voices using a time-frequency approach. IEEE Trans. Biomed. Eng. 2005, 52(3):421-430. 10.1109/TBME.2004.842962
- Godino-Llorente JI, Aguilera-Navarro S, Gómez-Vilda P: Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans. Biomed. Eng. 2006, 53(10):1943-1953. 10.1109/TBME.2006.871883
- Godino-Llorente JI, Sáenz-Lechón N, Osma-Ruiz V, Aguilera-Navarro S, Gómez-Vilda P: An integrated tool for the diagnosis of voice disorders. Med. Eng. Phys. 2006, 28(3):276-289. 10.1016/j.medengphy.2005.04.014
- Wang X, Zhang J, Yan Y: Discrimination between pathological and normal voices using GMM-SVM approach. J. Voice 2010, 25(1):38-43.
- Das R: A comparison of multiple classification methods for diagnosis of Parkinson disease. Expert Syst. Appl. 2010, 37(2):1568-1572. 10.1016/j.eswa.2009.06.040
- Nikias C, Mendel J: Signal processing with higher-order statistics spectra. IEEE Signal Process. Mag. 1993, 10(3):10-37.
- Mendel JM: Tutorial on higher-order statistics in signal processing and system theory: theoretical results and some applications. Proc. IEEE 1991, 79(3):278-305. 10.1109/5.75086
- Massachusetts Eye and Ear Infirmary: Voice Disorders Database, Version. 1.03 [CD-ROM]. Kay Elemetrics Corp, Lincoln Park, NJ; 1994.
- Nemer E, Goubran R, Mahmoud S: Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans. Speech Audio Process. 2001, 9(3):217-231. 10.1109/89.905996
- Lee JY, Minsoo H: Automatic assessment of pathological voice quality using higher-order statistics in the LPC residual domain. EURASIP J. Adv. Signal Process. 2009. Article ID 748207, 8 pages
- Alonso JB, de Leon J, Alonso I, Ferrer MA: Automatic detection of pathologies in the voice by HOS based parameters. EURASIP J. Appl. Signal Process. 2001, 2001(4):275-284. 10.1155/S1110865701000336
- Kent RD, Ball MJ: Voice Quality Measurement (Thomson Learning). New York; 2000.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.