Low-dimensional representation of Gaussian mixture model supervector for language recognition
© Yang et al; licensee Springer. 2012
Received: 14 May 2011
Accepted: 27 February 2012
Published: 27 February 2012
In this article, we propose a new feature for SVM-based language recognition by bringing the idea of total variability, used in speaker recognition, to language recognition. We regard the new feature as a low-dimensional representation of the Gaussian mixture model supervector. Building on the total variability (TV) language recognition system, we further propose the multiple total variability (MTV) language recognition system. Our experiments show that the total factor vector carries language-dependent information and that the multiple total factor vector carries more of it.
Experimental results on the 2007 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) databases show that MTV outperforms TV in the 30 s tasks, and that both the TV and MTV systems achieve performance similar to that of state-of-the-art approaches. The best performance of our acoustic language recognition systems can be further improved by including these two new systems in the combination.
Keywords: language recognition; total variability (TV); multiple total variability (MTV); support vector machine; linear discriminant analysis; locality preserving projection
The aim of language recognition is to determine the language spoken in a given segment of speech. It is generally believed that phonotactic and spectral features provide complementary cues [1, 2]. Phone recognizer followed by language models (PRLM) and parallel PRLM (PPRLM) approaches, which use phonotactic information, have shown very successful performance [2, 3]. The acoustic method, which uses spectral features, has the advantage that it does not require specialized language knowledge and is computationally simple. This article focuses on the acoustic component of language recognition systems. The spectral features of speech are collected as independent vectors. These vectors can be extracted as shifted-delta-cepstral (SDC) acoustic features and then modeled by a Gaussian mixture model (GMM); results for this approach were reported in [4]. The approach was further improved by discriminative training with the maximum mutual information (MMI) criterion. Several studies use the support vector machine (SVM) in language recognition to form the GMM-SVM system [5, 6]. In language recognition evaluations, MMI and GMM-SVM are the primary acoustic systems.
Recently, the total variability approach has been proposed in speaker recognition [7, 8]. It uses factor analysis to define a new low-dimensional space, named the total variability space. In contrast to classical joint factor analysis (JFA), the speaker and channel variability are contained simultaneously in this new space. Intersession compensation can then be carried out in the low-dimensional space.
Actually, we can consider the total variability approach as a classical application of probabilistic principal component analysis (PPCA) [9]. The factor analysis of the total variability approach obtains useful information by reducing the dimension of the space of GMM supervectors; that is, all utterances can in fact be well represented in a low-dimensional space. We believe useful language information can be obtained by a similar front-end process, so we introduce the idea of total variability to language recognition. We estimate the language total variability space using the dataset described in Section 5, treating each utterance of a given target language as if it belonged to a different language. The total factor vector is then extracted by projecting an utterance onto the language total variability space. As in speaker recognition, intersession compensation can also be performed well on the low-dimensional total factor vector. In our experiments, two intersession compensation techniques, linear discriminant analysis (LDA) and locality preserving projection (LPP) [10–12], are used to improve the performance of language recognition.
In some previous studies [13, 14], rich information is obtained by using multiple reference models, such as male and female gender-dependent models in speaker recognition. In language recognition, there are generally abundant data for each target language, and the number of target languages is limited. Building on the TV language recognition system [12, 15], we propose the MTV language recognition system, which uses language-dependent GMMs instead of the universal background model (UBM) for language total variability space estimation and total factor vector extraction. Our experiments show that the total factor vector (TV system) carries language-dependent information and that the multiple total factor vector (MTV system) carries more of it.
This article is organized as follows: In Section 2, we give a simple review of total variability, support vector machines, and compensation of channel factors. In Section 3, we apply total variability in language recognition. In Section 4, the proposed language recognition system is presented in detail. Corpora and evaluation are given in Section 5. Section 6 gives the experimental results. Finally, we conclude in Section 7.
2.1 Total variability in speaker recognition
In speaker recognition, unlike classical joint factor analysis (JFA), the total variability approach defines a new low-dimensional space, named the total variability space, which contains the speaker and channel variability simultaneously. The total variability approach thus relaxes the independence assumption between the speaker and channel variability spaces made in JFA speaker recognition [16]. For a given utterance, the speaker and channel variability dependent GMM supervector M is written as
$$M = m_{ubm} + Tw \tag{1}$$
where m_ubm is the UBM supervector, T is the total variability space, and the members of the vector w are the total factors.
We believe useful language information can be obtained by a similar front-end process. Thus we apply total variability to language recognition.
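The total variability model above expresses a supervector as the UBM supervector plus a low-rank offset Tw. The following is a minimal sketch of that relation on toy sizes; the function name `supervector` and all numeric values are illustrative assumptions, not taken from the article.

```python
def supervector(m_ubm, T, w):
    """Return M = m_ubm + T w, with T given as a list of rows."""
    if len(T) != len(m_ubm) or any(len(row) != len(w) for row in T):
        raise ValueError("T must have len(m_ubm) rows and len(w) columns")
    return [m + sum(t_ij * w_j for t_ij, w_j in zip(row, w))
            for m, row in zip(m_ubm, T)]


# Toy sizes only (assumed): a 4-dimensional supervector and 2 total factors.
m_ubm = [0.0, 1.0, 2.0, 3.0]
T = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, -1.0]]
w = [0.5, -0.5]
M = supervector(m_ubm, T, w)
```

In practice w is not known in advance; it is estimated from an utterance, so this sketch only shows the direction from total factors to supervector.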
2.2 Support vector machines
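An SVM classifies an input vector x with a decision function of the standard form
$$f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + d$$
where K is the kernel function, d is a bias term, N is the number of support vectors, t_i is the ideal output, and α_i is the weight for the support vector x_i, with α_i > 0 and $\sum_{i=1}^{N} \alpha_i t_i = 0$. The ideal outputs are either 1 or -1, depending upon whether the corresponding support vector belongs to class 0 or class 1. For classification, a class decision is based upon whether the value f(x) is above or below a threshold.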
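The decision rule described above can be sketched in a few lines. This is an illustrative example that assumes the standard kernel-expansion form of an SVM score with a linear inner-product kernel; the names `linear_kernel` and `svm_decision`, the bias `d`, and the toy support vectors are assumptions made here for illustration.

```python
def linear_kernel(x, y):
    """Linear inner-product kernel K(x, y) = x . y."""
    return sum(a * b for a, b in zip(x, y))


def svm_decision(x, support_vectors, alphas, targets, d, kernel=linear_kernel):
    """Return f(x) = sum_i alpha_i * t_i * K(x, x_i) + d."""
    return sum(a * t * kernel(x, sv)
               for a, t, sv in zip(alphas, targets, support_vectors)) + d


# Toy model (assumed values): two support vectors with ideal outputs +1 and -1,
# positive weights, and sum_i alpha_i * t_i = 0.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.25, 0.25]
targets = [1, -1]
d = 0.0

score_pos = svm_decision([2.0, 0.0], support_vectors, alphas, targets, d)
score_neg = svm_decision([-2.0, 0.0], support_vectors, alphas, targets, d)
```

Comparing each score against a threshold (0 in this toy case) gives the class decision.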
2.3 Compensation of channel factors
In the compensation of channel factors, γ_m(t) is the Gaussian posterior probability of the m-th Gaussian mixture of the universal background model (UBM) for frame t of an utterance. U_m and y(i) carry the intersession compensation related to the m-th Gaussian of the UBM: U_m is the intersession subspace and y(i) is the channel factor vector. In our proposed language recognition system, we use the spectral features after compensation of channel factors.
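As a hedged illustration of how these quantities can be combined, the sketch below assumes the feature-domain form used in the nuisance compensation literature, in which each frame is shifted by the posterior-weighted channel offset, that is, compensated frame = o(t) - sum over m of γ_m(t) U_m y. The frame symbol `o_t`, the function name, and the toy values are assumptions; the article's own equation may differ in detail.

```python
def compensate_frame(o_t, gammas_t, U, y):
    """Subtract the posterior-weighted channel offset from one frame.

    o_t      : feature vector of the frame (length D)
    gammas_t : posterior gamma_m(t) for each of the M Gaussians
    U        : list of M matrices U_m, each D rows by R columns
    y        : channel factor vector of the utterance (length R)
    """
    compensated = list(o_t)
    for gamma, U_m in zip(gammas_t, U):
        for k, row in enumerate(U_m):
            offset = sum(u * y_r for u, y_r in zip(row, y))
            compensated[k] -= gamma * offset
    return compensated


# Toy example (assumed values): D = 2, M = 2 Gaussians, R = 1 channel factor.
o_t = [1.0, 2.0]
gammas_t = [0.75, 0.25]
U = [[[1.0], [0.0]],
     [[0.0], [2.0]]]
y = [2.0]
o_hat = compensate_frame(o_t, gammas_t, U, y)
```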
3 Applying total variability in language recognition
There is only one difference between total variability space T estimation and eigenvoice space estimation in speaker recognition [8, 20]. In eigenvoice estimation, all the recordings of a speaker are considered to belong to the same person. In total variability space estimation, however, a given speaker's entire set of utterances is regarded as having been produced by different speakers. If we likewise suppose that a given target language's entire set of utterances was produced by different languages, a common pool of hidden variables acts as basis factors and represents the utterances from different languages. The process of language total variability space estimation is then exactly the same as the process of total variability space estimation and eigenvoice space estimation in speaker recognition, which is an iterative algorithm. Because the only difference lies in how the data are used, the choice of data is critical. We therefore suggest that all utterances of each target language be used to estimate the language total variability space.
3.1 Language total variability space estimation
For a given utterance, the language and channel variability dependent GMM supervector can also be written as in Equation (1), because language total variability space estimation follows exactly the same process as total variability space estimation and eigenvoice space estimation in speaker recognition. We can regard the total factor vector model as a new feature extractor that projects an utterance onto a low-rank space T to obtain a language and channel variability dependent total factor vector w. Space estimation can be implemented by an iterative algorithm.
3.2 Language-dependent total variability space estimation
In language total variability space estimation, the total variability space is estimated relative to the UBM, which is independent of language, speaker, channel, gender, and environment. Some previous studies [13, 14] show that rich information can be obtained by using multiple reference models. These studies suggest the possibility of using a language-dependent GMM instead of the language-independent UBM in language total variability space estimation. When the total variability space is estimated relative to a language-dependent GMM, we call it a language-dependent total variability space.
3.3 Intersession compensation
Once total factor vectors have been extracted, intersession compensation can be carried out in the low-dimensional space. In our experiments, we use linear discriminant analysis (LDA) and locality preserving projection (LPP) for intersession compensation.
3.3.1 Linear discriminant analysis
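The LDA projection is formed from the eigenvectors of the generalized eigenvalue problem
$$S_b \nu = \lambda S_w \nu$$
where λ is the diagonal matrix of eigenvalues and ν is the eigenvector corresponding to a non-zero eigenvalue. The matrix S_b is the between-class covariance matrix and S_w is the within-class covariance matrix.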
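For reference, S_b and S_w are commonly defined as below when the classes are the target languages. This is a sketch under assumptions: the symbols n_l (number of total factor vectors for language l), w_i^l (the i-th total factor vector of language l), \bar{w}_l (the mean for language l), and \bar{w} (the global mean) are introduced here for illustration, and the article's exact normalization may differ.

```latex
S_b = \sum_{l=1}^{L} (\bar{w}_l - \bar{w})(\bar{w}_l - \bar{w})^{\top},
\qquad
S_w = \sum_{l=1}^{L} \frac{1}{n_l} \sum_{i=1}^{n_l}
      (w_i^{l} - \bar{w}_l)(w_i^{l} - \bar{w}_l)^{\top}
```

With L target languages, S_b has rank at most L - 1, so at most L - 1 eigenvalues are non-zero and the LDA projection has at most that many useful directions.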
3.3.2 Locality preserving projection
Using the LPP transformation matrix A_LPP in Equation (8), the total factor vector w is projected to w', which preserves the local structure of the total factor vectors.
The justification for this choice of weights can be traced back to [23].
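To make the notion of neighbourhood weights concrete, here is a minimal sketch of one common LPP weighting: heat-kernel weights on a symmetric k-nearest-neighbour graph. This particular weighting, the parameters `k` and `t`, and the toy vectors are assumptions for illustration and are not confirmed as the setting used in this article.

```python
import math


def heat_kernel_weights(vectors, k=2, t=1.0):
    """Build a symmetric k-nearest-neighbour weight matrix.

    W[i][j] = exp(-||x_i - x_j||^2 / t) if j is among the k nearest
    neighbours of i (or i among those of j), and 0 otherwise.
    """
    n = len(vectors)
    d2 = [[sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
           for j in range(n)] for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: d2[i][j])[:k]
        for j in neighbours:
            weight = math.exp(-d2[i][j] / t)
            W[i][j] = weight
            W[j][i] = weight
    return W


# Toy data (assumed): two well-separated pairs of 2-D points.
vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
W = heat_kernel_weights(vectors, k=1, t=1.0)
```

Points in the same pair receive a non-zero weight, while points in different pairs receive zero, which is the local structure an LPP projection tries to keep.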
4 The proposed language recognition system
The proposed TV and MTV language recognition systems contain three main processes: spectral feature extraction, total factor vector extraction, and SVM modeling with language score calibration.
4.1 Spectral feature extraction
The spectral feature in the system is 7 Mel-frequency cepstral coefficients (MFCC) concatenated with the shifted-delta-cepstral (SDC) N-d-p-k feature, where N = 7, d = 1, p = 3, and k = 7, for a total of 56 coefficients per frame (7 static MFCCs plus 7 × 7 = 49 SDC coefficients). This representation is selected based upon prior excellent results with this choice; the improvement from adding the direct coefficients, including the C0 coefficient, to this feature vector was studied in [24]. In this article, spectral feature refers to this 56-dimensional feature. Nonspeech frames are eliminated after speech activity detection, and the 56-dimensional spectral feature is extracted. Feature warping [25] and cepstral variance normalization are then applied to the extracted spectral feature so that each feature is normalized to mean 0 and variance 1.
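The 56-dimensional layout can be illustrated with a short sketch that stacks the static coefficients with SDC blocks. It assumes the usual SDC convention, in which block i at frame t is c(t + i·p + d) − c(t + i·p − d) for i = 0, ..., k − 1, and it clamps indices at the utterance edges; the function name and the edge handling are assumptions, not details given in the article.

```python
def mfcc_sdc(cepstra, d=1, p=3, k=7):
    """Stack static cepstra with shifted-delta-cepstral blocks.

    cepstra : list of frames, each a list of N cepstral coefficients
    Returns one vector of N + N*k values per frame.
    """
    n_frames = len(cepstra)

    def frame(idx):
        # Clamp out-of-range indices to the first or last frame (assumed).
        return cepstra[min(max(idx, 0), n_frames - 1)]

    features = []
    for t in range(n_frames):
        stacked = list(cepstra[t])  # N static coefficients
        for i in range(k):
            ahead = frame(t + i * p + d)
            behind = frame(t + i * p - d)
            stacked.extend(a - b for a, b in zip(ahead, behind))
        features.append(stacked)
    return features


# Toy input (assumed): 30 frames of 7 coefficients, each equal to the frame index.
cepstra = [[float(t)] * 7 for t in range(30)]
feats = mfcc_sdc(cepstra, d=1, p=3, k=7)
```

With N = 7 and k = 7 each frame yields 7 + 49 = 56 values, matching the dimension stated above.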
4.2 Total factor vector extraction
In our system, the spectral feature after compensation of channel factors is used. First, the language total variability space and the language-dependent total variability spaces are estimated. Then, we extract the total factor vector as shown in Figure 1. In our experiments, the number of mixtures of the UBM (or GMM) is 1024, and the total variability space T is a rectangular matrix of low rank with dimension (1024 × 56) by 400, that is, 57,344 by 400. The dimension of w is 400.
For an utterance of L frames y_1, ..., y_L, the total factor vector is computed from Baum-Welch statistics, where c = 1, ..., C is the Gaussian index over the C mixture components, P(c | y_t, Ω) is the posterior probability of mixture component c generating the vector y_t, and m_c is the mean of UBM mixture component c.
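The quantities named above enter through per-component sufficient statistics. The sketch below assumes the zeroth-order statistic N_c = Σ_t P(c | y_t, Ω) and the centered first-order statistic F_c = Σ_t P(c | y_t, Ω)(y_t − m_c), as commonly used in the total variability literature; the function name and toy values are illustrative assumptions, and the posteriors are taken as given rather than computed from a UBM.

```python
def baum_welch_stats(frames, posteriors, means):
    """Accumulate zeroth-order and centered first-order statistics.

    frames     : L frames, each a list of D features
    posteriors : L lists of C posteriors P(c | y_t, Omega)
    means      : C UBM component means, each a list of D values
    Returns (N, F): N[c] is a scalar, F[c] is a D-dimensional list.
    """
    n_comp = len(means)
    dim = len(means[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for y_t, post_t in zip(frames, posteriors):
        for c in range(n_comp):
            N[c] += post_t[c]
            for k in range(dim):
                F[c][k] += post_t[c] * (y_t[k] - means[c][k])
    return N, F


# Toy example (assumed values): 2 frames, 2 components, 2 dimensions.
frames = [[1.0, 0.0], [3.0, 2.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5]]
means = [[1.0, 0.0], [2.0, 2.0]]
N, F = baum_welch_stats(frames, posteriors, means)
```

The zeroth-order statistics sum to the number of frames, since the posteriors of each frame sum to one.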
The multiple total factor vector is extracted in a similar way, using a language-dependent GMM instead of the language-independent UBM and a language-dependent total variability space instead of the language total variability space, as in Equation (4). The multiple total factor vector w_MTV is then a combination of w_1, w_2, ..., w_mandarin, ..., w_L, as shown in Figure 1 and Equation (5). In practice, in the multiple total variability language recognition system, the total factor vectors are combined after the intersession compensation described in Section 3.3.
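A minimal sketch of the combination step, assuming that "combination" means concatenating the per-language total factor vectors in a fixed language order after each has been compensated. The function name, the language labels, and the 2-dimensional toy vectors are hypothetical.

```python
def combine_total_factors(per_language_vectors, language_order):
    """Concatenate per-language total factor vectors in a fixed order."""
    combined = []
    for language in language_order:
        combined.extend(per_language_vectors[language])
    return combined


# Toy example (assumed): three languages with 2-dimensional compensated vectors.
vectors = {
    "english": [0.1, 0.2],
    "mandarin": [0.3, 0.4],
    "spanish": [0.5, 0.6],
}
order = ["english", "mandarin", "spanish"]
w_mtv = combine_total_factors(vectors, order)
```

Keeping the language order fixed across training and test utterances matters, because the SVM treats each position of the combined vector as a distinct feature.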
4.3 SVM model and language score calibration
Total factor vectors and multiple total factor vectors are used as SVM features in our proposed TV and MTV systems. Our experiments are implemented using SVMTorch [26] with a linear inner-product kernel function.
A more complex full backend process is given in [6, 28], in which LDA and diagonal covariance Gaussians are used to calculate the log-likelihoods for each target language, improving detection performance. This process transforms the language scores with LDA, models the transformed scores with diagonal covariance Gaussians (one for each language), and then applies the transform in Equation (15).
In this article, we use the backend process of LDA and diagonal covariance Gaussians, because it is superior to log-likelihood ratio normalization in our experiments.
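For readers unfamiliar with log-likelihood ratio normalization, the sketch below shows one widely used form, assumed here for illustration: each language score s_i, treated as a log-likelihood, is replaced by s_i minus the log of the average likelihood of the other languages. Whether this matches the article's Equation (15) exactly is an assumption, and the function name is hypothetical.

```python
import math


def llr_normalize(scores):
    """Return s_i - log(mean_{j != i} exp(s_j)) for each language score."""
    if len(scores) < 2:
        raise ValueError("need scores for at least two languages")
    normalized = []
    for i, s_i in enumerate(scores):
        others = [s for j, s in enumerate(scores) if j != i]
        peak = max(others)  # subtract the maximum for numerical stability
        log_mean = peak + math.log(
            sum(math.exp(s - peak) for s in others) / len(others))
        normalized.append(s_i - log_mean)
    return normalized


# Toy scores (assumed): one language clearly ahead of two others.
flat = llr_normalize([0.0, 0.0, 0.0])
peaked = llr_normalize([math.log(4.0), 0.0, 0.0])
```

Equal scores normalize to zero for every language, while a language that stands out from the rest receives a positive normalized score.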
5 Corpora and evaluation
The experiments are performed using the NIST LRE 2007 evaluation database. There are 14 target languages in the corpora used in this article: Arabic, Bengali, Chinese, English, Farsi, German, Hindustani, Japanese, Korean, Russian, Spanish, Tamil, Thai, and Vietnamese. The task of this evaluation was to detect the presence of a hypothesized target language in each test utterance. The training data were primarily from the Callfriend, Callhome, Mixer, OHSU, and OGI corpora and LRE07Train. The development data consist of LRE03, LRE05, and LRE07Train. We use the equal error rate (EER) and the minimum decision cost value (minDCF) as evaluation metrics.
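As an illustration of the EER metric, the sketch below sweeps a decision threshold over the observed scores and reports the point where the miss rate and the false-alarm rate are closest. It is a simplified approximation under stated assumptions (no interpolation between thresholds, toy scores) and not the official NIST scoring procedure; the function name is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned
    value is the average of miss and false-alarm rates at the threshold
    where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


# Toy scores (assumed): one target trial scores below one non-target trial.
eer_overlap = equal_error_rate([2.0, 1.5, 0.2, 3.0], [0.1, 0.5, -1.0, -0.3])
eer_clean = equal_error_rate([1.0, 2.0], [-1.0, 0.0])
```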
We first experiment with the total variability (TV) language recognition system and then extend it to the multiple total variability (MTV) language recognition system.
Results of the MMI system, GMM-SVM system and the TV and MTV systems with the intersession compensation techniques of LDA and LPP on the NIST LRE07 30 s corpus
Results of the combination of MMI system and GMM-SVM system, and the combination of the MMI system, GMM-SVM system, TV system, and MTV system on the NIST LRE07 30 s corpus
In this article, the multiple total factor vector is proposed for language recognition, building on the use of the total factor vector in language recognition. Our experiments show that the total factor vector carries language-dependent information and, furthermore, that the multiple total factor vector carries more of it. Compared with the popular acoustic systems in language recognition (the MMI and GMM-SVM systems), the two new features contain different language-dependent information. Notably, the proposed features improve on our best acoustic performance, which was obtained by combining the MMI and GMM-SVM systems. In future work, different intersession compensation approaches will be applied to the new features.
This study was partially supported by the National Natural Science Foundation of China (Nos. 10925419, 90920302, 10874203, 60875014, 61072124, 11074275).
[1] Torres-Carrasquillo PA, Singer E, Campbell WM, Gleason T, McCree A, Reynolds DA, Richardson F, Shen W, Sturim DE: "The MITLL NIST LRE 2007 language recognition system". In Ninth Annual Conference of the International Speech Communication Association. Volume 1. Brisbane, Australia; 2008:719-722.
[2] Zissman MA: "Language identification using phoneme recognition and phonotactic language modeling". In IEEE International Conference on Acoustics, Speech and Signal Processing. Volume 5. Institute of Electrical Engineers Inc (IEE), Detroit, USA; 1995:3503-3503.
[3] Yan Y, Barnard E: "An approach to automatic language identification based on language-dependent phone recognition". In ICASSP. Volume 5. IEEE, Detroit, USA; 1995:3511-3514.
[4] Torres-Carrasquillo PA, Singer E, Kohler MA, Greene RJ, Reynolds DA, Deller JR Jr: "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features". In Seventh International Conference on Spoken Language Processing. Volume 1. Citeseer, Denver, USA; 2002:89-92.
[5] Li H, Ma B, Lee CH: "A vector space modeling approach to spoken language identification". IEEE Transactions on Audio, Speech, and Language Processing 2007, 15(1):271-284.
[6] Campbell WM, Campbell JP, Reynolds DA, Singer E, Torres-Carrasquillo PA: "Support vector machines for speaker and language recognition". Computer Speech & Language 2006, 20(2-3):210-229. 10.1016/j.csl.2005.06.003
[7] Dehak N, Dehak R, Kenny P, Brümmer N, Ouellet P, Dumouchel P: "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification". In Tenth Annual Conference of the International Speech Communication Association. Volume 1. Brighton, United Kingdom; 2009:1559-1562.
[8] Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P: "Front-end factor analysis for speaker verification". IEEE Transactions on Audio, Speech, and Language Processing 2011, 19(4):2011.
[9] Tipping ME, Bishop CM: "Probabilistic principal component analysis". Journal of the Royal Statistical Society: Series B (Statistical Methodology) 1999, 61(3):611-622. 10.1111/1467-9868.00196
[10] He X, Niyogi P: "Locality preserving projections". In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. Volume 16. Citeseer; 2003:153-160.
[11] He X, Yan S, Hu Y, Niyogi P, Zhang HJ: "Face recognition using Laplacianfaces". IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27: 328-340.
[12] Yang J, Zhang X, Lu L, Zhang J, Yan Y: "Language recognition with locality preserving projection". In The Sixth International Conference on Digital Telecommunications (ICDT 2011). Budapest, Hungary; 2011:46-50.
[13] Stolcke A, Kajarekar SS, Ferrer L, Shriberg E: "Speaker recognition with session variability normalization based on MLLR adaptation transforms". IEEE Transactions on Audio, Speech, and Language Processing 2007, 15(7):1987-1998.
[14] Ferras M, Leung CC, Barras C, Gauvain JL: "Comparison of speaker adaptation methods as feature extraction for SVM-based speaker recognition". IEEE Transactions on Audio, Speech, and Language Processing 2010, 18(6):1366-1378.
[15] Dehak N, Torres-Carrasquillo PA, Reynolds D, Dehak R: "Language recognition via ivectors and dimensionality reduction". 12th Annual Conference of the International Speech Communication Association 2011, 1: 857-860.
[16] Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: "A study of interspeaker variability in speaker verification". IEEE Transactions on Audio, Speech, and Language Processing 2008, 16(5):980-988.
[17] Cristianini N, Shawe-Taylor J: "Support Vector Machines". Cambridge University Press, Cambridge, UK; 2000.
[18] Castaldo F, Colibro D, Dalmasso E, Laface P, Vair C: "Compensation of nuisance factors for speaker and language recognition". IEEE Transactions on Audio, Speech, and Language Processing 2007, 15(7):1969-1978.
[19] Castaldo F, Cumani S, Laface P, Colibro D: "Language recognition using language factors". In Tenth Annual Conference of the International Speech Communication Association. Volume 1. Brighton, UK; 2009:176-179.
[20] Kenny P, Boulianne G, Dumouchel P: "Eigenvoice modeling with sparse training data". IEEE Transactions on Speech and Audio Processing 2005, 13(3):345-354.
[21] Kuhn R, Junqua JC, Nguyen P, Niedzielski N: "Rapid speaker adaptation in eigenvoice space". IEEE Transactions on Speech and Audio Processing 2000, 8(6):695-707. 10.1109/89.876308
[22] Dempster AP, Laird NM, Rubin DB: "Maximum likelihood from incomplete data via the EM algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 1977, 39(1):1-38.
[23] Belkin M, Niyogi P: "Laplacian eigenmaps and spectral techniques for embedding and clustering". Advances in Neural Information Processing Systems 2002, 1: 585-592.
[24] Burget L, Matějka P, Černocký J: "Discriminative Training Techniques for Acoustic Language". In Proceedings of ICASSP. Volume 1. Toulouse, France; 2006:209-212.
[25] Allen F, Ambikairajah E, Epps J: "Warped magnitude and phase-based features for language identification". In ICASSP 2006 Proceedings, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing. Volume 1. IEEE, Toulouse, France; 2006:201-204.
[26] Collobert R, Bengio S: "SVMTorch: support vector machines for large-scale regression problems". The Journal of Machine Learning Research 2001, 1: 143-160.
[27] Brummer N, van Leeuwen DA: "On calibration of language recognition scores". In IEEE Odyssey 2006: The Speaker and Language Recognition Workshop. Volume 1. San Juan, Puerto Rico; 2006:1-8.
[28] Singer E, Torres-Carrasquillo PA, Gleason TP, Campbell WM, Reynolds DA: "Acoustic, phonetic, and discriminative approaches to automatic language identification". In Eighth European Conference on Speech Communication and Technology. Volume 1. Geneva, Switzerland; 2003:1345-1348.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.